Embrace your Persistent Staging Area for Eventual Consistency
If you like your PSA so much…
A colleague of mine asked me this: ‘if you like the Persistent Staging Area (PSA) concept so much, why not embrace it all the way?’. By this, he meant loading upstream layers such as the Data Vault directly from the PSA instead of from a Staging Area. I was a bit resistant to the idea at first, because this would require incorporation of the PSA as a mandatory step in the the load order sequence (as opposed to loading the PSA in parallel). And of course in my view the PSA always was to be something you could decouple from the architecture at some point if you need to scale-out.
But the more I thought about it the more I realised I may have to let go of these ideas, and I decided to give it a proper go and test out these implications. The prime advantage of loading directly from a PSA for me is that it would allow for parallelism on an even higher degree than was previously possible by parallel processing multiple ETLs on the same delta (using hashing or natural business keys).
This requires a bit more explanation, which I will do in the next section.
On a secondary note, using a PSA as ‘source’ for a Data Warehouse is something that is being done already for the Virtual Data Warehouse anyway. It is also conceptually also very similar to certain approaches for storing files (delta or snapshots) in a ‘big data’ environment. I have some scripts to switch between these types of environment to prove this point, something that is still on my list to write about.
But what this script essentially does is replacing a PSA database table with a file archive (zip) in an Azure Data Lake environment, and processing it the same way upstream as would be the case with the database table version. In the Azure Data Lake example the deltas are saved as files and added to a corresponding directory or archive . These can be exposed and handled in different ways, including again in a SQL-like way.
The originals (in Biml) how to generate an interface to Azure Data Lake were pioneered by Peter Avenant from Varigence and credit is for him – and thanks for the inspiration on this topic. More information is here if you are interested.
I always stated that a PSA is not required to be a database table, and these techniques show how easy it is to switch platforms – and the required metadata to generate everything is the same – as per the Virtual Data Warehouse thinking. The message here is that the limitations in scalability of using a PSA are disappearing, it’s all about being flexible and focusing on metadata management instead of platforms.
But, back to the topic of loading your Data Warehouse from a PSA. To explain the thinking and the impact of this idea I need to take a brief sidestep in typical ETL orchestration and management.
ETL on ACID
As a best practice I always state that ‘every delta must be processed in all the relevant tables before it can be released’. This means that the ETL guarantees that all upstream tables have been populated with an available dataset before this, say, Staging Area table can be truncated again and loaded with the next delta. For example, if you load a Satellite table from a specific source (Staging Area table / delta) you need to make sure you also load the corresponding Hub table from that same source.
If you don’t, you will end up in situations where Referential Integrity (RI) is likely not to be achieved. You’re at risk of missing data.
I call this concept ‘ACID on ETL level’ and consider it a best-practice. ACID (Atomicity, Consistency, Isolation, Durability) in short is a set of techniques to manage database transactions and consistency intended to guarantee validity at all times (i.e. broken transactions, outage etc.). This is implemented in Relational Database Management Systems (RDBMS) at a fundamental level, but needs to be present (developed) at ETL ‘application level’ as well.
ETL ACID is driven by (ETL generation) metadata, and managed by the ETL control framework. In the ETL control framework I use two levels are defined: individual ETL processes and groups of related ETL processes (workflows). These concepts are referred to as Modules for individual ETL processes such as Staging, PSA, Hubs, Satellites and Batches as workflows that manage dependencies.
Have a look at the example image below. This shows a Batch (workflow) that populates all Data Vault objects related to a specific source (data delta set). The Batch is the workflow, the Modules are the individual ETL jobs. This is an example of an implementation of this concept in SSIS, but the logic applies to any technology. Note that I have added a Referential Integrity check to make sure no ETLs were missed, the control framework only allows the Batch to be restarted if this check is passed.
Scheduling is traditionally done at Batch level to comply for RI, but every -Batch can essentially run independently of others – on their own schedules.
Interestingly this is all achieved by using the same set of core metadata that is also used for ETL generation. Come talk to me at the Data Modelling Zone or Data Vault Implementation & Automation training for these topics!
This is a solid and proven approach, but arguably has some dependencies still even if packaged up neatly in the Batch concept. By fully embracing the PSA concept, I see opportunities in running even the upstream Modules fully independently. And maybe even more relevant – continuously.
This can be achieved by allowing each upstream ETL process to load from the PSA directly at all times and self-manage a loading window specifically linked to the individual ETL process. Consider the diagram below:
This may not seem very different, but at a fundamental level it is. Loading from a PSA essentially eliminates wait times because related ETLs (the ones that load from the same data delta) need to be completed (successfully). In other words, there is no longer a need to wait for a Staging Area table to be truncated again. However, this also means that the concept of a delta being processed does no longer apply at that level. Previously, you could be sure that if you load a Satellite and Hub from a source all the keys that are in the Satellite are also present in the Hub. With this new approach this all depends on the individual loading frequencies of the ETL processes. However, you know that at some point, eventually, this will be consistent again.
This works because every ETL self-manages a loading window that ‘knows’ from which point onwards the data is ‘ready to go’ for that specific ETL process. The self-managed loading window becomes the Staging Area in this respect, but instead of being a delta for a set of ETLs every ETL now has its own ‘delta’ by virtue of its personal loading window. The Staging Area itself is arguably a first victim of this approach – one of the first thoughts that come up is that, using individual loading windows, you can essentially drop the Staging Area table as a separate table. At least as a mechanism to temporarily store data deltas for processing.
In a way this idea pivots the concept of a Batch / workflow as outlined in the previous section. The ‘dependencies’ still need to be known (also for ETL generation purposes), but these are managed differently now – by making sure all ETLs have been run at some point. I’m working on a queue concept to manage this, as you will still need to guarantee ‘all ETLs are accounted for’, for instance that a Satellite from a certain source also has a corresponding Hub for that source somewhere in the overall set of ETLs that are scheduled.
In the Batch approach this is enforced via RI checks as ‘commit points’ for ETL, but in this fully parallel setup there is very rarely a moment true RI that can be enforced. However, in combination with the Load Date / Time Stamps in the PSA you can derived the last time consistency (RI) was achieved.
This works an ever-increasing high water mark. In other words, the data will eventually be consistent.
I find this a powerful and fascinating concept; everything can continuously be running and lagging with the available data is a timeline where you can ensure you are complete. In the near future I plan to write a bit more about this queue concept from a more technical perspective. Watch this space and hopefully see you at the Data Modelling Zone or any upcoming training events!