Why you really want a Persistent Staging Area in your Data Vault architecture
Recently at the Worldwide Data Vault Conference in Vermont USA (WWDVC) I had many conversations about the Persistent Staging Area (PSA) concept, also known as Historical Staging Area.
I have been using this idea for years and really can’t do without it. I would even go as far as saying you really want a PSA in your architecture. However, there is a common opinion that having a PSA isn’t the best idea, as it introduces a ‘2nd Data Warehouse’ into your architecture.
I fundamentally disagree with this and thought it would be good to explain this in more detail for today’s post. I wrote a post similar to this one back in 2012, but it needs a bit of an update :-).
What is a Persistent Staging Area?
For those new to the discussion, below is a quick introduction of the PSA.
A PSA is a record, or archive, of all data deltas (records) that were ever presented to the Data Warehouse (Staging Area) by its interfaces: historised raw data. Every record that is loaded into the Staging Area – a transient area that is truncated regularly – is sent to the PSA as well as to the upstream (usually Data Vault based) Integration Layer.
Following the Staging Layer architecture patterns: when records are loaded into the Staging Area they will be timestamped with the Load Date/Time Stamp (LOAD_DTS). It is also the role of the Staging Area ETL to capture the Event Date/Time (explained in more detail in the post covering Data Vault ‘time’) as well as logical deletes. Records then go two ways: into the PSA (part of the Staging Layer) and upwards into the Integration Layer.
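The routing described above can be sketched as follows. This is a minimal illustration, not a production pattern: the function name `load_staging` and the use of plain Python lists for the PSA and the Integration Layer feed are assumptions for the sake of the example.

```python
from datetime import datetime, timezone

def load_staging(records, psa, integration_queue):
    """Stamp each incoming delta with a Load Date/Time Stamp (LOAD_DTS)
    and route it two ways: into the PSA (persistent archive) and onwards
    to the Integration Layer. The Staging Area itself stays transient."""
    load_dts = datetime.now(timezone.utc)
    staged = []
    for record in records:
        row = dict(record)
        row["LOAD_DTS"] = load_dts          # Data Warehouse arrival time
        staged.append(row)
    psa.extend(staged)                      # persistent: never truncated
    integration_queue.extend(staged)        # onwards to the Data Vault
    return staged

# Usage: two deltas arrive from a source interface
psa, queue = [], []
load_staging([{"customer_id": 1, "name": "Alice"},
              {"customer_id": 2, "name": "Bob"}], psa, queue)
print(len(psa), len(queue))  # 2 2 - both streams received the same records
```

Note that both destinations receive exactly the same timestamped records, which is what makes the replay described later deterministic.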
The following diagram covers the overview outlined in the above paragraph:
As a quick explanation of the terms in this diagram: the ETL components are the typical loading patterns into the various areas of the Data Warehouse such as ‘Source to Staging’, ‘Staging to Data Vault’ and ‘Staging to PSA’. This is the foundation of the EDW.
The ‘Virtualisation’ arrow indicates the concept of NoETL for which I have written many articles. This is based on the idea that if you can reload & regenerate your entire Data Warehouse (schema and data) from your metadata, you can also opt to use the same metadata to create a view that simulates your Data Warehouse structure. Effectively you are then using that metadata (and model) as a ‘schema on read’ on raw historised data – your Data Warehouse model defines your schema.
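The ‘schema on read’ idea can be sketched as a small metadata-driven view generator. The metadata layout, the table name `SAT_CUSTOMER` and the source `psa.customer` are purely illustrative assumptions; the point is only that the model metadata is enough to emit the DDL.

```python
# Hypothetical metadata describing one Satellite: its PSA source table,
# its business key, and the attributes it carries.
metadata = {
    "SAT_CUSTOMER": {
        "source": "psa.customer",
        "business_key": "customer_id",
        "attributes": ["name", "city"],
    },
}

def virtualise(table_name, meta):
    """Generate a view over the historised raw data in the PSA that
    simulates the Data Warehouse structure - the model metadata acts
    as a 'schema on read'."""
    m = meta[table_name]
    cols = ", ".join([m["business_key"], "LOAD_DTS"] + m["attributes"])
    return (f"CREATE VIEW {table_name} AS "
            f"SELECT {cols} FROM {m['source']};")

print(virtualise("SAT_CUSTOMER", metadata))
```

The same metadata could just as easily drive physical ETL generation, which is exactly why the physical and virtual options stay interchangeable.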
Lastly, the ‘Reinitialisation’ arrow captures the process of physically (re)loading your entire history back into the Staging Area for a full rebuild or gap-fixing of your Integration Layer.
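Reinitialisation itself is conceptually simple: replay the archived deltas back into the (truncated) Staging Area in their original arrival order. A minimal sketch, again using plain lists as stand-ins for the actual tables:

```python
def reinitialise(psa, staging):
    """Replay the full PSA history back into the transient Staging Area,
    in original arrival order, so the Integration Layer can be rebuilt
    from scratch or have gaps repaired."""
    staging.clear()  # the Staging Area is transient; start empty
    for row in sorted(psa, key=lambda r: r["LOAD_DTS"]):
        staging.append(dict(row))
    return staging

# Usage: deltas recorded over time are replayed in LOAD_DTS order
psa = [{"LOAD_DTS": "2020-01-02", "name": "Alice v2"},
       {"LOAD_DTS": "2020-01-01", "name": "Alice v1"}]
staging = reinitialise(psa, [])
print([r["name"] for r in staging])  # ['Alice v1', 'Alice v2']
```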
Why would you want this?
Simply put: because you will make mistakes in your model and this is the most complete and repeatable way to remediate these. More on this later.
So to start off, introducing a PSA does indeed create a 2nd repository of history in your Enterprise Data Warehouse architecture. But whether or not that is a bad thing depends entirely on how you use and approach it. Let’s explore the benefits here:
- You can start straight away! You don’t need to wait for the modelling to be completed. This is an extremely useful feature because as soon as you know which interfaces you need (tables, files, messages) you can start ‘recording’ data deltas into your PSA. This gives you time to properly model out the Integration Layer (Data Vault): data is already being collected ready to go for further handling once your model is complete. When you are happy with your model you can ‘replay’ from your PSA and populate your Integration Layer.
- You can start small. You don’t need to load everything in the Data Vault straight away. If you only model what you need and leave the rest for later you will find there is more room for (modelling) discussions on the things that matter for the scope of your project, as opposed to spending time on elements that may be required later. As an example, if you need some additional attributes later you can reload that Satellite to populate the missing attribute back in time (as opposed to building history from that point onwards, potentially increasing record counts or requiring a bit of rework).
- You will change your mind (and/or make mistakes). Even with all the experience in the world and the best intentions, invariably you look back at models and think of better alternatives. The same ‘replay’ (re-initialisation) mechanism allows you to make changes and truncate and reload Data Vault tables with the same (deterministic) outcome. Hash keys are invaluable here as you can truncate a Hub or Link and then reload without any negative impacts. You don’t want to try this in integer/sequence based designs.
- You can support unexpected use-cases. There are edge cases where a historical archive of source data (including the added Data Warehouse information such as logical deletes) can become really useful. I have been involved with fraud detection projects where it does pay off to have an independent ‘memory’ of your source systems because, well, you can’t trust your source systems directly.
- You can delay some design and architecture decisions. One key point is that you don’t need to commit to a methodology or architecture from the outset of your project. Staging Layer concepts are relevant regardless of the modelling technique you use for your Integration Layer. In some cases you may want to ‘start recording’ in your PSA but start with a small Dimensional Model to prove your cases or because it meets some specific requirements better. If you later decide you want to add an Integration Layer (go from 2-tiered solution design to 3-tiered solution design) you can reuse the same data and even have something to compare with.
- You can’t virtualise your Data Warehouse without an archive containing the historised raw data. Data Warehouse virtualisation brings even more agility and speed to market. I have written about this extensively but the key point is that if you virtualise your Data Warehouse you still need history somewhere.
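The deterministic-reload point above hinges on hash keys. A minimal sketch: because the Hub hash key is a pure function of the business key, truncating and replaying from the PSA yields the exact same keys. The MD5 choice and the strip/uppercase normalisation are common conventions, shown here as assumptions rather than prescriptions.

```python
import hashlib

def hub_hash_key(business_key: str) -> str:
    """Deterministic Hub hash key: the same business key always yields the
    same key, so a truncated Hub can be reloaded without breaking any
    references (unlike sequence-based surrogate keys, which would be
    reassigned on reload)."""
    normalised = business_key.strip().upper()
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()

first_load = [hub_hash_key(k) for k in ("CUST-001", "CUST-002")]
# ... truncate the Hub, then replay the same deltas from the PSA ...
second_load = [hub_hash_key(k) for k in ("CUST-001", "CUST-002")]
print(first_load == second_load)  # True: the reload is deterministic
```

With sequence-based keys the second load would hand out fresh surrogate values, silently orphaning every Satellite and Link row that referenced the old ones.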
Some last thoughts
As with any architecture decision it’s all about options and considerations. So there are some things to ponder over or be aware of. These can be considered ‘cons’ but are not really such a big deal (in my personal opinion).
- In messaging or near Real Time scenarios the Staging Layer is usually skipped, but it is my recommendation to keep a message copy in your PSA regardless (as a parallel, independent stream).
- You will be impacted by source system changes more regularly, and you basically have more objects to manage and maintain. However, this can be mitigated by adding logic to automate schema (and corresponding ETL/ELT) changes. This is perfectly doable as there is no further meaning to be applied here. For instance, you don’t need additional smarts to understand where the information fits in the model, what the business key is etc.
- You don’t necessarily need to make your PSA a database. This is fairly straightforward but good to point out: in smaller deployments you may use a database to record your historised raw data but this is by no means a requirement. It’s perfectly doable to use a file archive with potentially lower costs, less overhead and more scalability.
- Be very careful providing access to your PSA, this is in line with ‘not making your PSA a 2nd Data Warehouse’. Your PSA is there to support your Data Warehouse design and development but has no place in end-user access.
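The schema-automation point above works precisely because the PSA mirrors the source structure one-to-one, so keeping it in sync requires no modelling decisions. A minimal sketch, where the table name `psa_customer` and the column lists are illustrative assumptions:

```python
def schema_changes(psa_columns, source_columns):
    """Compare the source interface structure against the PSA table and
    generate the DDL needed to keep them in sync. No modelling input is
    needed: the PSA simply mirrors the source plus warehouse metadata
    such as LOAD_DTS."""
    added = [c for c in source_columns if c not in psa_columns]
    return [f"ALTER TABLE psa_customer ADD COLUMN {col};" for col in added]

# Usage: the source interface gained an 'email' attribute
print(schema_changes(["customer_id", "name", "LOAD_DTS"],
                     ["customer_id", "name", "email"]))
```

A real implementation would also need to handle data types, renames and dropped columns, and would regenerate the corresponding ETL/ELT from the same comparison; the point is that all of it is mechanical.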
My biggest recommendation is to make sure your design can survive without the PSA. By this I mean to avoid making the PSA an integral part of your architecture. For example, populate your Integration Layer from the Staging Area, not from the PSA (populating the PSA then becomes a parallel side step alongside your Integration Layer ETL). This is important, because when your system reaches maturity, you may decide not to use the PSA anymore (or change its physical deployment).
Another way of saying this is: you can use the PSA to launch your new Data Warehouse – with the option of casting off these boosters once your system reaches escape velocity!