Virtualising your (Enterprise) Data Warehouse – what do you need?
For a while I have been promoting the concept of defining a Historical (raw) Staging Area / archive to complement the Data Warehouse architecture. A quick recap: the Historical Staging Area is really just an insert-only persistent archive of all original data deltas that have been received. One of the great things is that you can deploy this from day one, start capturing changes (data deltas) and never have to do an initial load again.
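As a minimal sketch of the idea, the following Python snippet (using SQLite as a stand-in database; table and column names are purely illustrative) shows an insert-only archive table and one of the 'SCD2' presentation tricks mentioned below: end dates and a current record indicator are derived on the fly with a window function rather than stored, so nothing ever needs updating. This assumes SQLite 3.25+ for window function support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical archive table: every received delta is inserted,
    -- nothing is ever updated or deleted.
    CREATE TABLE hstg_customer (
        customer_code TEXT NOT NULL,   -- business key
        customer_name TEXT,            -- context attribute
        load_dts      TEXT NOT NULL,   -- when the delta arrived
        PRIMARY KEY (customer_code, load_dts)
    );
""")
conn.executemany(
    "INSERT INTO hstg_customer VALUES (?, ?, ?)",
    [("C1", "Acme Ltd",  "2021-01-01"),
     ("C1", "Acme Corp", "2021-02-01"),
     ("C2", "Globex",    "2021-01-15")],
)
# Derive an 'SCD2' presentation: the end date of each record is simply
# the start date of the next record for the same business key.
conn.execute("""
    CREATE VIEW hstg_customer_scd2 AS
    SELECT customer_code,
           customer_name,
           load_dts AS effective_dts,
           COALESCE(LEAD(load_dts) OVER w, '9999-12-31') AS end_dts,
           CASE WHEN LEAD(load_dts) OVER w IS NULL THEN 1 ELSE 0 END
               AS current_record_indicator
    FROM hstg_customer
    WINDOW w AS (PARTITION BY customer_code ORDER BY load_dts)
""")
rows = conn.execute(
    "SELECT * FROM hstg_customer_scd2 "
    "ORDER BY customer_code, effective_dts"
).fetchall()
for r in rows:
    print(r)
```

Because the end dating is virtual, a late-arriving or corrected delta automatically slots into the timeline on the next query.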
In my view this can be extended even further to support other initiatives outside direct Data Warehousing, as explained here. But, similar to the question of how relevant dedicated ETL software still is when you can auto-generate all ETL, the same question might apply to the complete Data Warehouse itself. As I mentioned earlier, errors can still be made even with modern data modelling techniques, and that is why we deploy a Historical Staging Area. It requires additional storage, but the development and maintenance effort is negligible (and it benefits massively from data compression).
The reason for this post is that, in my experience, the ability to reload your Data Warehouse has substantial benefits in many cases. Using a process I call ‘re-initialisation’ we have executed this many times in various projects, using a different technology mix each time. Re-initialisation basically copies all data from the Historical Staging Area into the Staging Area, and the normal processes reload the upstream Data Warehouse layers (e.g. Data Vaults, Dimensional Models) from there.
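The re-initialisation step itself is deliberately simple. A minimal sketch, again with SQLite and illustrative table names: the Staging Area is truncated and the full archived history is replayed into it in original arrival order, after which the regular loaders can rebuild the upstream layers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Illustrative names: an insert-only historical archive and a
    -- transient Staging Area that feeds the normal loaders.
    CREATE TABLE hstg_customer (
        customer_code TEXT, customer_name TEXT, load_dts TEXT);
    CREATE TABLE stg_customer (
        customer_code TEXT, customer_name TEXT, load_dts TEXT);
""")
conn.executemany("INSERT INTO hstg_customer VALUES (?, ?, ?)",
                 [("C1", "Acme Ltd",  "2021-01-01"),
                  ("C1", "Acme Corp", "2021-02-01")])

def reinitialise(conn):
    """Truncate Staging and replay the full history from the archive,
    in arrival order, so the unchanged downstream processes can
    rebuild the Data Warehouse layers from scratch."""
    conn.execute("DELETE FROM stg_customer")
    conn.execute("""
        INSERT INTO stg_customer
        SELECT customer_code, customer_name, load_dts
        FROM hstg_customer
        ORDER BY load_dts
    """)
    return conn.execute(
        "SELECT COUNT(*) FROM stg_customer").fetchone()[0]

print(reinitialise(conn))  # replays every archived delta
```

The key design point is that nothing downstream changes: the same ETL that processes a daily delta also processes the replayed history.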
This is very similar to the trending concept of virtualising the Data Mart (Information Marts in DV2.0): the message there is that your design is sound when you can drop and recreate the Dimensional Model (or other delivery mechanism) from your underlying core Data Warehouse layer (typically Data Vault, sometimes 3NF) exactly as it was. ‘Re-initialisation’ does the exact same thing for your core Data Warehouse layer: you can drop and recreate your Data Vault!
The idea is that, if you can automate all your ETL and reload all your data identically from your Historical Staging Area, what does it take to virtualise not only your Information Mart but the full (Enterprise) Data Warehouse?
A minor disclaimer: I’m fully aware of the performance implications – this post is intended as a conceptual deep-dive to see whether everything works. Perhaps scalability, together with cost and effort reduction, will make this a viable alternative at some point. For now, let’s see how it works and what you need to achieve it.
The way I see it, all the necessary components from the various areas have matured to the point that it’s actually pretty easy to put the pieces together. For the sake of argument, let’s go all the way and create a full EDW with Data Vault as the middle-tier / core Data Warehouse layer:
- A Historical Staging Area / archive with deltas (time-variant but insert-only). There are various nifty tricks to present this in a more ‘SCD2’ form using derived end dates and current record indicators. To virtualise everything you don’t need a dedicated Staging Area for data deltas, as you’re always reading from the full history anyway.
- The Staging, Integration (Data Vault) and Presentation (Information Mart) data models – these contain an enormous amount of metadata that is required for ETL automation, including but not limited to:
  - The business keys
  - Context attributes
  - Multi-active attributes
  - Entity relationships
  - Driving Keys (yes, these can be read from the model too)
  - The type of history presentation (Type 1, 2, 3, 6) in the Information Marts
- Data Vault 2.0; adopting the hash key techniques that Data Vault 2.0 specifies removes a whole bunch of complexities involving sequence generation!
- Mapping Metadata; you still need the basic ETL automation metadata to be available somewhere. This is exactly the same metadata you need to generate ETL processes on any platform.
- ETL Automation concepts; instead of generating ETL processes you are basically generating SQL views. This is a bit different from generating ETLs: with physical ETL you generate more dedicated processes (for instance, one ETL for each Hub per source table), whereas with virtualisation you generate fewer views which union together more data (one view which unions all business keys from all source tables). The underlying metadata, however, is still the same, so it’s not a big tweak.
- An ETL Framework. The ability to virtualise everything still doesn’t absolve you from certain elemental ETL process controls, such as record sources, lineage and runtime IDs. I still want to track what was run and when, if only to be future-proof: you might want to ‘physicalise’ certain elements at some point. This is no different from normal ETL development.
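Putting the mapping metadata and view generation together, here is a minimal sketch of a virtual Hub. All names (the `hub_mappings` structure, table and column names) are hypothetical, and SQLite has no built-in `md5()` function, so one is registered for the example; the hash is computed over an upper-cased business key as a sample standardisation step, not as the definitive DV2.0 rule.

```python
import hashlib
import sqlite3

# Hypothetical mapping metadata: which column in which source table
# carries the business key for the 'Customer' Hub.
hub_mappings = {
    "hub_customer": [
        ("stg_crm_customer", "customer_code"),
        ("stg_sales_order",  "customer_id"),
    ],
}

def generate_hub_view(hub, mappings):
    """Generate a virtual Hub: one view unioning the business keys of
    all mapped source tables, with a DV2.0-style hash key. UNION (not
    UNION ALL) deduplicates keys that arrive from multiple sources."""
    selects = [
        f"SELECT md5({col}) AS {hub}_hk, {col} AS business_key "
        f"FROM {tbl}"
        for tbl, col in mappings
    ]
    return f"CREATE VIEW {hub} AS\n" + "\nUNION\n".join(selects)

conn = sqlite3.connect(":memory:")
# Register an md5() SQL function for the sketch (upper-cased input
# as an example of standardising the key before hashing).
conn.create_function(
    "md5", 1,
    lambda v: hashlib.md5(str(v).upper().encode()).hexdigest())
conn.executescript("""
    CREATE TABLE stg_crm_customer (customer_code TEXT);
    CREATE TABLE stg_sales_order  (customer_id TEXT);
    INSERT INTO stg_crm_customer VALUES ('C1'), ('C2');
    INSERT INTO stg_sales_order  VALUES ('C1'), ('C3');
""")
conn.execute(
    generate_hub_view("hub_customer", hub_mappings["hub_customer"]))
keys = conn.execute(
    "SELECT DISTINCT business_key FROM hub_customer ORDER BY 1"
).fetchall()
print(keys)
```

Note how the same `hub_mappings` metadata could just as easily drive a generator that emits physical ETL instead of a view – which is exactly the "design once, decide later what to physicalise" point below.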
One of the immediate benefits is that you define your metadata only once, and you can generate both your virtual Data Warehouse and your ETL processes from it. So it’s really: design once, then decide which parts to instantiate / physicalise, both in terms of ETL and data storage.
Another great benefit is that it allows you to present something tangible (prepared data) very quickly, while still being able to deploy a full solution without re-engineering. I look forward to completing this personal Proof of Concept.