NoETL – Persistent (History) Staging Area (PSA)
After setting up the initial data staging in the previous post, we can load the detected data delta into the historical archive: the Persistent Staging Area (PSA). The PSA is the foundation of the Virtual Enterprise Data Warehouse because all upstream modelling and representation essentially reads from this ‘archive of (data) changes’. The PSA holds all the information that was ever presented to the Data Warehouse, in either structured or unstructured format. To make any upstream virtualization possible, the Data Warehouse load date/time is ‘hardened’ into the Staging Area by the interface that captures the data delta. This is explained in more detail in this post.
Now that the data delta is present, the PSA ETL effectively compares the delta and loads it into the PSA tables that correspond to their Staging Area counterparts (as-is). The generated view shows what data delta can be loaded into the PSA, and the INSERT INTO equivalent inserts the delta directly into the PSA. This means the view needs to include logic to:
- Ensure there is actually a change (hash comparison)
- Prevent reloads (point-in-time reload prevention)
- Process multiple changes in a delta for a single natural key
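To make these three checks concrete, here is a minimal sketch using Python and sqlite3. All table and column names (`STG_Customer`, `PSA_Customer`, `Hash_Full_Record`, `Load_Datetime`) are assumptions for illustration, not the tool's actual conventions; the query stands in for the generated PSA view.

```python
import sqlite3
import hashlib

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Staging Area and PSA tables share the same structure; the load date/time
# has already been 'hardened' by the Staging Area interface.
for table in ("STG_Customer", "PSA_Customer"):
    cur.execute(f"CREATE TABLE {table} (Customer_ID TEXT, Load_Datetime TEXT, "
                "Hash_Full_Record TEXT, Name TEXT)")

def row_hash(*attributes):
    return hashlib.md5("|".join(attributes).encode()).hexdigest()

# Two changes for the same natural key arrive in a single delta.
for lts, name in [("2023-01-01 10:00:00", "Alice"),
                  ("2023-01-01 12:00:00", "Alicia")]:
    cur.execute("INSERT INTO STG_Customer VALUES (?,?,?,?)",
                ("C1", lts, row_hash("C1", name), name))

# Stand-in for the generated 'PSA view': staging rows still to be archived.
PSA_DELTA_VIEW = """
SELECT s.Customer_ID, s.Load_Datetime, s.Hash_Full_Record, s.Name
FROM STG_Customer s
WHERE NOT EXISTS (                       -- reload prevention: this point in
    SELECT 1 FROM PSA_Customer p         -- time is already in the archive
    WHERE p.Customer_ID = s.Customer_ID
      AND p.Load_Datetime = s.Load_Datetime)
  AND s.Hash_Full_Record <> COALESCE((   -- hash comparison against the previous
    SELECT u.Hash_Full_Record            -- record, whether that sits in the PSA
    FROM (SELECT * FROM PSA_Customer     -- or earlier in the same delta
          UNION ALL SELECT * FROM STG_Customer) u
    WHERE u.Customer_ID = s.Customer_ID
      AND u.Load_Datetime < s.Load_Datetime
    ORDER BY u.Load_Datetime DESC LIMIT 1), '(no previous record)')
ORDER BY s.Customer_ID, s.Load_Datetime
"""
delta = cur.execute(PSA_DELTA_VIEW).fetchall()
print(len(delta))  # both changes for 'C1' qualify in one pass
```

Note the UNION ALL in the inner comparison: each staging row is compared against the previous record regardless of whether that record already sits in the PSA or arrived earlier in the same delta, which is what allows multiple changes per key to load in one pass.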
As always, to remain scalable and flexible it is vitally important to be able to process multiple changes for a key in one pass. And of course you must be able to run the ‘ETL’ at any time, even multiple times, without causing problems. We can do this by generating the logic (view, ETL, anything) using the available metadata and patterns. All connections and naming conventions are driven by the information specified in the tool.
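As a toy illustration of this metadata-driven generation (the prefixes and column names are assumptions, not the tool's real conventions), the pattern is written once as a template and rendered per table:

```python
# A toy sketch of pattern-based generation: the pattern is designed once and
# rendered for every table from metadata (names below are assumed).
def generate_psa_view_sql(table_name, natural_key, load_dts="Load_Datetime"):
    """Render the reload-prevention part of a PSA view for one table."""
    return (
        f"SELECT s.* FROM STG_{table_name} s\n"
        f"WHERE NOT EXISTS (SELECT 1 FROM PSA_{table_name} p\n"
        f"                  WHERE p.{natural_key} = s.{natural_key}\n"
        f"                    AND p.{load_dts} = s.{load_dts})"
    )

sql = generate_psa_view_sql("Customer", "Customer_ID")
print(sql)
```

Because the generator is driven purely by metadata, documenting (and testing) the pattern once covers every table it is rendered for.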
The views are now generated and visible as SQL, and they are also created directly in the database, including the change comparison and reload prevention. This principle relies on the data delta already having been calculated properly by the Staging Area ETL; logical deletes, for instance, are already derived and don’t need to be detected again. They are, however, still part of the change comparison, as records may be deleted and then re-inserted.
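One way to see why deletes still take part in the comparison: if the change indicator participates in the record hash (an assumption about the hashing approach, sketched with hypothetical values below), a delete followed by a re-insert of identical attribute values still registers as two distinct changes.

```python
import hashlib

# Sketch (column semantics assumed): the logical-delete / change indicator is
# part of the hashed record, so a delete and a subsequent re-insert of the
# same attribute values both show up as changes.
def row_hash(change_indicator, *attributes):
    payload = "|".join((change_indicator,) + attributes)
    return hashlib.md5(payload.encode()).hexdigest()

hash_insert   = row_hash("Insert", "Alice", "Brisbane")
hash_delete   = row_hash("Delete", "Alice", "Brisbane")  # same attributes
hash_reinsert = row_hash("Insert", "Alice", "Brisbane")

print(hash_insert != hash_delete)    # the delete is a detectable change
print(hash_delete != hash_reinsert)  # and so is the re-insert
```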
These views show exactly what information still needs to be added to the PSA. They are fairly straightforward, but may look complex due to the reprocessing and multiple-change logic. The following diagram explains this in a bit more detail:
It’s all pattern-based and easy to automate, even though it may appear overly complex. But then again, the logic is meant to support many scenarios and is ultimately generic, so you only need to document it once.
With the PSA views generated we can make them act as ETL; let’s first review the output from the generated view.
The output shows the Staging records that have not been processed yet, so we can now run the PSA view as an INSERT INTO statement (and execute it immediately in our case):
When we re-run the PSA view we will not see any results, because the ‘prevent rerun’ logic is doing its job. This fits neatly into the ETL principle of not breaking things when run out of order or multiple times.
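This rerun safety can be demonstrated end-to-end in a small self-contained sketch (table and column names assumed, sqlite3 standing in for the database): the INSERT INTO based on the generated view loads the delta the first time, and a second execution inserts nothing.

```python
import sqlite3
import hashlib

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
for table in ("STG_Customer", "PSA_Customer"):
    cur.execute(f"CREATE TABLE {table} (Customer_ID TEXT, Load_Datetime TEXT, "
                "Hash_Full_Record TEXT, Name TEXT)")

# One delta record in the Staging Area, with a hardened load date/time.
cur.execute("INSERT INTO STG_Customer VALUES (?,?,?,?)",
            ("C1", "2023-01-01 10:00:00",
             hashlib.md5(b"C1|Alice").hexdigest(), "Alice"))

# Stand-in for the generated view: staging rows whose point in time is
# not yet present in the PSA (the 'prevent rerun' logic).
PSA_DELTA_VIEW = """
SELECT s.Customer_ID, s.Load_Datetime, s.Hash_Full_Record, s.Name
FROM STG_Customer s
WHERE NOT EXISTS (SELECT 1 FROM PSA_Customer p
                  WHERE p.Customer_ID = s.Customer_ID
                    AND p.Load_Datetime = s.Load_Datetime)
"""
first_run  = cur.execute("INSERT INTO PSA_Customer " + PSA_DELTA_VIEW).rowcount
second_run = cur.execute("INSERT INTO PSA_Customer " + PSA_DELTA_VIEW).rowcount
print(first_run, second_run)  # 1 0: the rerun inserts nothing
```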
With this, the foundation for upstream (Data Vault, Dimensional) virtualization has been established, and we have an easy-to-configure historical archive with which to try out different models.