Data Vault versus the persistent Staging Area

One of the questions I regularly get during presentations is what the benefits of Data Vault are over a persistent Staging Area. In other words: why go through the effort of defining a Data Vault model when you can receive the same ‘regeneration / recreation’ capabilities with a persistent Staging Area which directly feeds a Dimensional Model or similar presentation model.

First off; in my reference architecture I use both the Data Vault (in the Integration Layer), a non-persistent Staging Area (the Staging Area in the Staging Layer) and a persistent Staging Area (the History Area in the Staging Layer). From my point of view one concept does not exclude the other and everything can be used to complement the EDW. The purpose of the Areas is very different with the Staging Area focusing on retrieving the information and organising delivery to the EDW and in some (real-time) cases  only serves as an archive only needed for Disaster Recovery. In this context the Staging Area is a sideline in the ETL processing with source systems feeding into the Data Vault directly.

The reason for having both models or in other words a decoupling between the ‘unintegrated set of raw data’ (Staging Area) and ‘managed raw data – by business key’ (Data Vault) is important is the way the Data Vault keys are used througout the system to identify the unique entities upstream and provide the audit trail. This also means that everything will become generic and template driven which is the requirement for ETL generation and the ETL itself will become more straightforward. The ETL from the source / Staging Area to the Data Vault also defines the entities and relationships which simplify the creation of the datamarts (which can also now be made template driven).

ETL required to load Data Marts directly is typically complex and it doesn’t help that Staging Area tables are stand-alone and typically have no Foreign Keys (recommended!) whereas Data Vault has streamlined this process and has additional capabilities to manage parallelism and dependencies. Additional arguments are accesability in terms of performance. The persistent Staging Area is optimised to store information as an archive function as best as possible whereas the Data Vault is normalised to a degree around information entitities which can be managend independantly (i.e. splitting of satellites, indexing).

In the end Data Vault is the only methodology that defines a separation of storage, history, structure, auditablity and recovery from presentation (data/information) models. All other approaches force a logical or conceptual data model of the (target) business directly on the ‘storage model’ greatly reducing flexiblitity and increasing complexity of the data model and the ETL required to update it.

 
Roelant Vos

Roelant Vos

Roelant Vos

You may also like...

3 Responses

  1. Hi Roelant, Is it possible to also use persisted staging to lookup business keys?
    That is the best solution I can think of in order to resolve the following problem:
    When tables in the source system are using system generated Ids. For example ProductId, even though there is ProductCode which is the one with business meaning.
    Lets say there is a Sales table that has ProductId, etc.. If I were to use ProductCode in my sales link, I would have to join to the product staging table in order to get the Product Code from the product id. But the issue is when I load to staging I am loading only the deltas. Hence not all products (maybe none) would exist in the product staging table.
    One solution I see is for such tables, load all rows every time. But that may be a waste in large tables. Like products in a retailer for instance.
    So I was thinking to use the persisted staging for those look ups.
    Could you please help me understand what is the standard approach in that case?
    Thank you,
    Waldemar

     
  2. Roelant Vos Roelant Vos says:

    Hi Waldemar. Sorry for the late reply, I still need to get to terms that the comments actually work these days :). I had a re-read of this one as it’s 6 years old and suppose I should add to this post that having the various layers is a way to ‘separate concerns’.

    To get back to your question, the short answer is ‘yes’ but it is a decision that comes with some pros and cons. The reason it can still be a viable option is that it can be very hard to achieve business key lookups if you’re going straight from the source (staging) to a Data Vault model which represents a true holistic view of the business – one with true business concepts defined and things like that.

    Opinions on this vary, and there are various ‘weaker’ / more ‘technical’ models which are then augmented with a ‘business Data Vault’ set of tables that pick up this function. If your target model is more ‘technical’ there is less reason to join tables together to get a business key for any given data set, but arguably it’s less of a holistic business representation which is independent from operational systems.

    I would do this, but do so sparingly – only if you really need it. In fact, there have been few implementation were this didn’t end up in the solution at least a few times.

    There are some scalability issues associated with this approach, and you are likely to have to bring forward complexities related to ‘business’ timeliness which means it is better suited to be applied to some systems than others. These are things you ideally want to address in the delivery / mart areas. The short explanation is that joining on data sets in Staging conceptually has to be done using the time lines that the systems use to get the correct representation, which opens you up to potential data quality issues.

    Another thing to be mindful of is the behaviour of delta detection using joins in Staging. As you already point out, you can’t just join Staging tables so have to rely on Persistent Staging. But due to timing issues you have to basically do a set of joins in both (all) directions to get all deltas. If you only look up a business key it’s less of an issue, but if you add some attributes you would have to cater for the behaviour that a change in any of those attributes is reflected as well, which means you have to ‘union’ two sets of joins if you involve two tables.

    As an example – if you join table A in Staging to table B in Persistent Staging you would have to union this to the result of joining table B in Staging to table A in Persistent Staging.

    Hope this makes sense!

     
  1. June 18, 2016

    […] this and thought it would be good to explain this in more detail for today’s post. I wrote a post similar to this one back in 2012, but it needs a bit of an update […]

     

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.