Data Vault versus the persistent Staging Area

Roelant Vos

Roelant Vos

Roelant Vos

You may also like...

3 Responses

  1. Hi Roelant, Is it possible to also use persisted staging to lookup business keys?
    That is the best solution I can think of in order to resolve the following problem:
    When tables in the source system are using system generated Ids. For example ProductId, even though there is ProductCode which is the one with business meaning.
    Lets say there is a Sales table that has ProductId, etc.. If I were to use ProductCode in my sales link, I would have to join to the product staging table in order to get the Product Code from the product id. But the issue is when I load to staging I am loading only the deltas. Hence not all products (maybe none) would exist in the product staging table.
    One solution I see is for such tables, load all rows every time. But that may be a waste in large tables. Like products in a retailer for instance.
    So I was thinking to use the persisted staging for those look ups.
    Could you please help me understand what is the standard approach in that case?
    Thank you,
    Waldemar

     
  2. Roelant Vos Roelant Vos says:

    Hi Waldemar. Sorry for the late reply, I still need to get to terms that the comments actually work these days :). I had a re-read of this one as it’s 6 years old and suppose I should add to this post that having the various layers is a way to ‘separate concerns’.

    To get back to your question, the short answer is ‘yes’ but it is a decision that comes with some pros and cons. The reason it can still be a viable option is that it can be very hard to achieve business key lookups if you’re going straight from the source (staging) to a Data Vault model which represents a true holistic view of the business – one with true business concepts defined and things like that.

    Opinions on this vary, and there are various ‘weaker’ / more ‘technical’ models which are then augmented with a ‘business Data Vault’ set of tables that pick up this function. If your target model is more ‘technical’ there is less reason to join tables together to get a business key for any given data set, but arguably it’s less of a holistic business representation which is independent from operational systems.

    I would do this, but do so sparingly – only if you really need it. In fact, there have been few implementation were this didn’t end up in the solution at least a few times.

    There are some scalability issues associated with this approach, and you are likely to have to bring forward complexities related to ‘business’ timeliness which means it is better suited to be applied to some systems than others. These are things you ideally want to address in the delivery / mart areas. The short explanation is that joining on data sets in Staging conceptually has to be done using the time lines that the systems use to get the correct representation, which opens you up to potential data quality issues.

    Another thing to be mindful of is the behaviour of delta detection using joins in Staging. As you already point out, you can’t just join Staging tables so have to rely on Persistent Staging. But due to timing issues you have to basically do a set of joins in both (all) directions to get all deltas. If you only look up a business key it’s less of an issue, but if you add some attributes you would have to cater for the behaviour that a change in any of those attributes is reflected as well, which means you have to ‘union’ two sets of joins if you involve two tables.

    As an example – if you join table A in Staging to table B in Persistent Staging you would have to union this to the result of joining table B in Staging to table A in Persistent Staging.

    Hope this makes sense!

     
  1. June 18, 2016

    […] this and thought it would be good to explain this in more detail for today’s post. I wrote a post similar to this one back in 2012, but it needs a bit of an update […]

     

Leave a Reply