Virtualising your (Enterprise) Data Warehouse – what do you need?

For a while I have been promoting the concept of defining a Historical (raw) Staging Area / archive to complement the Data Warehouse architecture. A quick recap: the Historical Staging Area is really just an insert-only persistent archive of all original data delta that has been received. One of the great things is that you can deploy this from day one, start capturing changes (data delta) and never have to do an initial load again.
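The mechanics can be sketched in a few lines. This is a minimal, hedged illustration using Python and SQLite; the table name `hstg_customer`, its columns and the CDC operation codes are my own illustrative assumptions, not a prescribed design. The point is simply that the archive only ever receives inserts.

```python
import sqlite3
from datetime import datetime, timezone

# Minimal sketch of an insert-only Historical Staging Area (PSA).
# Table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hstg_customer (
        customer_code TEXT,   -- business key
        customer_name TEXT,   -- context attribute
        load_datetime TEXT,   -- when the delta was received
        cdc_operation TEXT    -- 'I'nsert / 'U'pdate / 'D'elete
    )
""")

def archive_delta(rows, operation):
    """Append received deltas; the PSA is never updated or deleted from."""
    ts = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO hstg_customer VALUES (?, ?, ?, ?)",
        [(code, name, ts, operation) for code, name in rows],
    )

# Day 1: the initial snapshot arrives as inserts - no special initial load.
archive_delta([("C1", "Alice"), ("C2", "Bob")], "I")
# Day 2: one record changes; only the delta is appended.
archive_delta([("C1", "Alice B.")], "U")

history = conn.execute(
    "SELECT customer_code, customer_name, cdc_operation "
    "FROM hstg_customer ORDER BY load_datetime, customer_code"
).fetchall()
```

Because the initial snapshot is captured as ordinary inserts, "day one" deployment and every later delta follow the same code path.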

In my view this can be extended even further to support other initiatives outside direct Data Warehousing, as explained here. But similar to the question of how relevant ETL software still is when you can auto-generate all ETL, the same question might apply to the Data Warehouse itself. As I mentioned earlier, errors can still be made even with modern data modeling techniques, and that is why we try to deploy a Historical Staging Area. It requires additional storage, but the development and maintenance effort is negligible (and it benefits massively from data compression).

The reason for this post is that I have experienced that the ability to reload your Data Warehouse has substantial benefits in many cases. Using a process I call ‘re-initialisation’ we have executed this many times in various projects using a different technology mix. Re-initialisation basically copies all data from the Historical Staging Area into the Staging Area, and the normal processes reload the upstream Data Warehouse layers (e.g. Data Vaults, Dimensional Models) from there.
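Conceptually, re-initialisation is just a truncate-and-copy followed by the normal (generated) loads. A rough sketch of that first step, with illustrative schemas and assuming the regular Staging-to-Data-Vault processes are deterministic:

```python
import sqlite3

# Hedged sketch of 're-initialisation': the Staging Area is emptied and
# repopulated from the complete Historical Staging Area, after which the
# regular loads rebuild the upstream layers. Schemas are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hstg_customer (customer_code TEXT, customer_name TEXT,
                                load_datetime TEXT);
    CREATE TABLE stg_customer  (customer_code TEXT, customer_name TEXT,
                                load_datetime TEXT);
    INSERT INTO hstg_customer VALUES
        ('C1', 'Alice',    '2015-01-01T00:00:00'),
        ('C2', 'Bob',      '2015-01-01T00:00:00'),
        ('C1', 'Alice B.', '2015-02-01T00:00:00');
""")

def reinitialise(conn):
    """Replace the Staging Area content with the complete archived history."""
    conn.executescript("""
        DELETE FROM stg_customer;
        INSERT INTO stg_customer
        SELECT customer_code, customer_name, load_datetime
        FROM hstg_customer;
    """)
    # From here the normal Staging -> Data Vault loads run unchanged;
    # deterministic load logic recreates the same Data Vault.

reinitialise(conn)
count = conn.execute("SELECT COUNT(*) FROM stg_customer").fetchone()[0]
```

The key property is that no special-purpose reload ETL is needed: the everyday incremental processes do the rebuilding.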

This is very similar to the trending concept of virtualising the Data Mart (Information Marts in DV2.0); the message is that your design is right when you can drop and recreate the Dimensional Model (or other delivery mechanism) from your underlying core Data Warehouse layer (typically Data Vault, sometimes 3NF) and end up with exactly the same result. What ‘re-initialisation’ does is the exact same thing for your core Data Warehouse layer: you can drop and recreate your Data Vault!
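To make the drop-and-recreate idea concrete: if the dimension is only a view over the Data Vault, recreating it is trivially deterministic. A tiny illustration with SQLite, where all table, column and view names are my own assumptions and the "current record" rule is a simple latest-load-date pick:

```python
import sqlite3

# A virtual Information Mart: the dimension is a view over the Data Vault,
# so it can be dropped and recreated at will with the same result.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hub_customer (customer_hk TEXT, customer_code TEXT);
    CREATE TABLE sat_customer (customer_hk TEXT, customer_name TEXT,
                               load_datetime TEXT);
    INSERT INTO hub_customer VALUES ('HK1', 'C1');
    INSERT INTO sat_customer VALUES ('HK1', 'Alice',    '2015-01-01'),
                                    ('HK1', 'Alice B.', '2015-02-01');
""")

DIM_VIEW = """
    CREATE VIEW dim_customer AS
    SELECT h.customer_code, s.customer_name
    FROM hub_customer h
    JOIN sat_customer s ON s.customer_hk = h.customer_hk
    WHERE s.load_datetime = (SELECT MAX(load_datetime)
                             FROM sat_customer
                             WHERE customer_hk = h.customer_hk)
"""

conn.executescript(DIM_VIEW)
before = conn.execute("SELECT * FROM dim_customer").fetchall()
# Drop and recreate: a proper design yields exactly the same presentation.
conn.executescript("DROP VIEW dim_customer;" + DIM_VIEW)
after = conn.execute("SELECT * FROM dim_customer").fetchall()
```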

The idea is that if you can automate all your ETL and reload all your data exactly the same from your Historical Staging Area, what does it take to virtualise not only your Information Mart but the full (Enterprise) Data Warehouse?

A minor disclaimer: I’m fully aware of the performance implications – this post is intended as a conceptual deep-dive to see if everything works. Perhaps scalability together with cost and effort reduction will make this a viable alternative at some point. For now, let’s see how it works and what you need to achieve this.

The way I see it all the necessary components from the various areas have matured to the point that it’s actually pretty easy to put the pieces together. For the sake of argument let’s go all the way and create a full EDW with Data Vault as middle-tier / core Data Warehouse layer:

  • A Historical Staging Area / archive with deltas (time-variant but insert-only). There are various nifty tricks to present this in a more ‘SCD2’ form using derived end dates and current record indicators. To virtualise everything you don’t need a dedicated Staging Area for data deltas as you’re always reading from the full history anyway
  • The Staging, Integration (Data Vault) and Presentation (Information Mart) data models – these contain an enormous amount of metadata that is required for ETL automation, including but not limited to:
    • The business keys
    • Context attributes
    • Multi active attributes
    • Entity relationships
    • Driving Keys (yes, these can be read from the model too)
    • Type of history presentation (Type 1,2,3,6) in the Information Marts
  • Data Vault 2.0; adopting the hash techniques that Data Vault 2.0 specifies removes a whole bunch of complexities involving sequence generation!
  • Mapping Metadata; you still need the basic ETL automation metadata to be available somewhere. This is exactly the same metadata you need to generate ETL processes in any platform
  • ETL Automation concepts; instead of generating ETL processes you are basically generating SQL views. This differs slightly from generating ETLs: in that context you generate more dedicated processes (for instance, one ETL for each Hub per source table), whereas in virtualisation you generate fewer views which union together more data (one view which unions all business keys from all source tables). The underlying metadata, however, is still the same so it’s not a big tweak
  • An ETL Framework. The ability to virtualise everything still doesn’t absolve you from certain elemental ETL process controls, such as record sources, lineage and runtime IDs. I still want to track what was run and when, even if only to be future-proof: you might want to ‘physicalise’ certain elements at some point. This is no different from normal ETL development.
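The ‘SCD2’ presentation trick mentioned in the first bullet can be sketched without storing anything: end dates and the current-record indicator are derived from the insert order per business key. In SQL this would typically be a LEAD/window function; here is the equivalent logic in plain Python, with an assumed row layout of (business key, attribute, load date):

```python
from itertools import groupby
from operator import itemgetter

# Deriving an 'SCD2' view over an insert-only PSA: effective and end dates
# plus a current-record flag, computed on the fly rather than stored.
history = [
    ("C1", "Alice",    "2015-01-01"),
    ("C1", "Alice B.", "2015-02-01"),
    ("C2", "Bob",      "2015-01-15"),
]

def as_scd2(rows):
    """Derive (key, attribute, start, end, is_current) per business key."""
    out = []
    rows = sorted(rows, key=itemgetter(0, 2))       # order by key, load date
    for key, grp in groupby(rows, key=itemgetter(0)):
        grp = list(grp)
        for i, (k, attr, start) in enumerate(grp):
            is_last = i == len(grp) - 1
            # End date is the next record's start date; open-ended if current.
            end = "9999-12-31" if is_last else grp[i + 1][2]
            out.append((k, attr, start, end, is_last))
    return out

scd2 = as_scd2(history)
```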
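Two more bullets from the list can be sketched together, under assumed naming conventions of my own: a Data Vault 2.0 style hash key (which is deterministic, so no sequence coordination is needed) and a generator that produces one view per Hub, unioning the business keys from every contributing source table.

```python
import hashlib

def hash_key(*business_key_parts):
    """DV2.0-style hash key: sanitise and concatenate key parts, then hash.
    Deterministic, so parallel loads need no sequence generator."""
    normalised = "|".join(str(p).strip().upper() for p in business_key_parts)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest().upper()

def hub_view_sql(hub_name, key_column, source_tables):
    """Generate a single Hub view that unions keys from all source tables.
    The same metadata could instead drive one dedicated ETL per source."""
    selects = [
        f"SELECT DISTINCT {key_column} FROM {t}"
        for t in source_tables
    ]
    return f"CREATE VIEW {hub_name} AS\n" + "\nUNION\n".join(selects)

sql = hub_view_sql("vw_hub_customer", "customer_code",
                   ["hstg_crm_customer", "hstg_sales_customer"])
```

The same two inputs – business key definitions and source-to-target mappings – drive both the view generator shown here and conventional ETL generation, which is the "design once" point made below.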

One of the immediate benefits is that you define your metadata only once, and you can generate both your virtual Data Warehouse and your ETL processes from it. So it’s really design once, then decide which parts to instantiate / physicalise, both in terms of ETL and data storage.

Another great benefit is that it allows you to present something tangible (prepared data) very quickly, while still being able to deploy a full solution without re-engineering. I look forward to completing this personal Proof of Concept.

Almost there.

Roelant Vos



3 Responses

  1. says:

    I have observed through your blogs that you use the Historical Staging Area with a particular function which is to derive the most accurate record change timestamps as well as Delete records in sources via FULL OUTER JOINS with source data (when CDC doesn’t exist).

    I have also been an advocate of having a permanent staging area rather than a temporary one. The main reason is that there are often cases where the data transformation from source to target requires more context data than what is available in the temporary staging area (i.e. the single record that has changed): for example, the need to associate records via FKs, even to more than one level, or having additional reference data to be able to properly determine a target value to be populated in the Integration Layer. In some more complex transformations, aggregations/filters are even required that look at entire sets of historical data in the source. This view is probably more aligned with ELT (rather than ETL) processing. However, it is always needed when the data model of the source is vastly different from the target business model we want to capture as the Integration Layer.

    I understand that the examples of source queries for an ETL that you are proposing are mostly related to a theoretical idea of virtualising the entire DV model by making use of available historical staging/source data. In my experience, as said earlier, I have actually implemented ETLs many times using mostly logic in the SQL source queries of the ETL, also requiring this permanent staging area.

    On the other side, I understand that DV methods indicate that we would never need a permanent staging area because the Data Vault itself serves the purpose of direct traceability/lineage between source and target. I don’t feel easy with this statement because from my experience it assumes that we are only allowed to make very simplistic ETL transformations, in other words: a) we have to model the Integration Layer/Data Vault in a very similar way to the source system model, and in many cases source systems have very bad models; b) we would never be able to do a transformation where the target data needs to be determined using JOINED data against records not available in the transient staging area.

    What do you think of this?

  2. March 26, 2015

    […] – the History Area. Both are part of the Staging Layer of the solution design. As outlined in this and other posts, the virtual Data Warehouse runs on top of a PSA as this area contains the full […]

  3. April 10, 2015

    […] you adopt the Persistent Staging Area (PSA) / Historical Staging Area (HSTG) you can come back later and add more information from the original content if […]

