Data Vault 2.0 – Introduction and (technical) differences with 1.0

While finalizing the content covering Data Vault implementation, it is time to start looking forward to Data Vault 2.0. For this purpose it makes sense to provide an overview of the differences between the two data modelling approaches.

To keep the scope limited to implementation, I'll consider it sufficient to mention that Data Vault 2.0 (DV2.0) is a complete approach covering not only the modelling (which was already part of DV1.0) but also the layered DWH architecture and a supporting methodology that hooks the Data Vault concepts into methods such as Agile, CMMI, TQM and Six Sigma.

Overall, it not only focuses on getting the 'Integration Layer' right but also provides an end-to-end solution architecture to work with.

DV2.0

From an implementation perspective there is not a lot of change when looking at the methodology, as the various concepts were always meant to be supported by the implementation even in DV1.0. However, there are some changes in the Modelling and Architecture components that impact how we develop ETL for Data Vault 2.0. To provide a bit of an overview, these are listed below (in no particular order):

  • Hashes replace (integer) sequences/identity attributes everywhere as meaningless keys / surrogate keys (see the sketch after this list). This means the Data Warehouse key will now be a hash value (of the business key) and not an integer value generated by the ETL tool or the database. Hashes also provide an integration point to non-RDBMS platforms such as NoSQL databases.
  • Parallel Loading; taking the key generation and corresponding lookups out of the equation enables far greater levels of parallel loading compared to DV1.0. You essentially don't have to do a key lookup for Hub and Link DWH/surrogate keys anymore. These are replaced by hash values, which delivers a massive increase in performance while also reducing complexity.
  • Referential Integrity (RI) is disabled as part of standard loading and replaced by housekeeping. This is not part of the standards, but RI severely conflicts with the new loading patterns (to be discussed later), so consider this my personal view. Due to the increased parallel loading supported by the hashing approach, RI cannot be enforced by the ETL either, so it falls to housekeeping processes to manage RI ('soft' RI).
  • Virtualisation is the goal: rapidly defining Information Marts (the rebranded 'Data Mart') using SQL, similar (and sometimes equivalent) to views.
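To make the first point a bit more tangible, here is a minimal sketch of deriving a Hub key as a hash of the sanitised business key instead of generating or looking up a sequence value. It uses SQL Server syntax (HASHBYTES) and MD5 as one common algorithm choice; the table and column names (STG_Customer, CustomerNumber) are purely illustrative:

```sql
-- Derive the Data Warehouse key as a hash of the business key.
-- UPPER/LTRIM/RTRIM sanitise the key so every process hashes it the same way.
SELECT
    CONVERT(CHAR(32),
            HASHBYTES('MD5', UPPER(LTRIM(RTRIM(CustomerNumber)))),
            2)              AS CustomerHashKey,      -- deterministic surrogate key
    CustomerNumber          AS CustomerBusinessKey
FROM STG_Customer;
```

Because the hash is deterministic, any process (or even a non-RDBMS platform applying the same algorithm) can derive the key independently, which is exactly what removes the dependency on a central key generator.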

So what does this mean for us? It means changing the templates to include a generic hashing mechanism across the board, as you want your hash values to be generated the same way everywhere in the system. It also means removing key lookups from all HUB, LNK, SAT and LSAT templates (a sketch of the resulting Hub load pattern follows below). In terms of information delivery there is no real difference, although it leaves you with the option to either create a view or instantiate the same information towards an 'Information Mart' using an ETL tool.
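As an illustration of what removing the key lookup does to a template, below is a sketch of a Hub load under these assumptions. The object names are again illustrative, and the hash key is assumed to be available in the Staging Area already:

```sql
-- Hub load without a key lookup: the hash key arrives with the staged data,
-- so the pattern reduces to inserting business keys that are not yet known.
INSERT INTO HUB_Customer (CustomerHashKey, CustomerBusinessKey, LoadDateTime, RecordSource)
SELECT
    stg.CustomerHashKey,
    stg.CustomerBusinessKey,
    MIN(stg.LoadDateTime),   -- earliest occurrence of the key in this load
    MIN(stg.RecordSource)
FROM STG_Customer stg
WHERE NOT EXISTS
(
    SELECT 1
    FROM HUB_Customer hub
    WHERE hub.CustomerHashKey = stg.CustomerHashKey
)
-- GROUP BY guards against the same business key appearing in multiple staged rows
GROUP BY stg.CustomerHashKey, stg.CustomerBusinessKey;
```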

For implementation the single biggest change is to create the hashes in the Staging Area, as opposed to letting the processes from Staging to Integration (Data Vault) handle this. This also shows why Data Vault 2.0 requires a broader solution design: it needs the Staging Area. Your interface from Source to Staging therefore has to know its business keys in order to create the hashes in this step (sketched below). It makes sense, though; once the hashes are available in the Staging Area everything else can run in parallel, and it spreads nicely across MPP nodes to boot.
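A sketch of what such a Source-to-Staging step could look like is shown below. It assumes SQL Server syntax, that CustomerNumber is a character column, and that the source and target names (SRC_CRM_Customer, STG_Customer) are placeholders; the full-row hash for change detection is optional but commonly derived in the same pass:

```sql
-- Source-to-Staging load that already computes the hashes, so all downstream
-- Data Vault loads can run in parallel without generating or looking up keys.
INSERT INTO STG_Customer
    (CustomerHashKey, CustomerNumber, CustomerName, FullRowHash, LoadDateTime, RecordSource)
SELECT
    -- Hub key: hash of the sanitised business key
    CONVERT(CHAR(32), HASHBYTES('MD5', UPPER(LTRIM(RTRIM(CustomerNumber)))), 2),
    CustomerNumber,
    CustomerName,
    -- Full-row hash over the descriptive attributes, used for delta detection
    CONVERT(CHAR(32), HASHBYTES('MD5', ISNULL(LTRIM(RTRIM(CustomerName)), '')), 2),
    SYSUTCDATETIME(),
    'SRC_CRM'
FROM SRC_CRM_Customer;
```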


Roelant Vos
