Data Vault 2.0 – Introduction and (technical) differences with 1.0
While finalizing the content covering Data Vault implementation it is time to start looking forward towards Data Vault 2.0. For this purpose it makes sense to provide an overview of the changes between the two data modelling approaches.
To limit the scope to implementation I’ll consider it sufficient to mention that Data Vault 2.0 (DV2.0) is a complete approach covering not only the modelling (that was already part of DV1.0) but also the layered DWH architecture and supporting methodology that hooks the Data Vault concepts into methods such as Agile, CMMI, TQM and Six Sigma.
Overall, it not only focuses on getting the ‘Integration Layer’ right but provides an end-to-end solution architecture to work with.
From an implementation perspective there is not a lot of change when looking at the methodology, as the various concepts were always meant to be supported by the implementation even in DV1.0. However, there are some changes in the Modelling and Architecture components that impact how we develop ETL for Data Vault 2.0. To provide a bit of an overview these are listed below (in random order):
- Hashes replace (integer) sequences/identity attributes everywhere as meaningless keys / surrogate keys. This means the Data Warehouse key will now be a hash value (of the business key) and not an integer value generated by the ETL tool or the database. Hashes also provide an integration point to non RDBMS systems such as NoSQL
- Parallel Loading; taking the key generation and corresponding lookups out of the equation enables far greater levels of parallel loading compared to DV1.0. You essentially don’t have to do a key lookup for Hub and Link DWH/Surrogate keys anymore. The are replaced by hash values and it is a massive increase in performance, while also reducing complexity
- Referential Integrity is disabled as part of standard loading, replaced by housekeeping. This is not part of the standards but RI severely conflicts with the new loading patterns (will be discussed later). Consider this my personal view on this. Due to increased parallel loading supporting by the hashing approach, RI cannot be enforced by ETL either, so it falls to housekeeping processing to manage RI (soft RI)
- Virtualisation is the goal, to rapidly define Information Marts (the new ‘Data Mart’ – rebranded) using SQL, similar (sometimes equivalent) to views
So what does this mean for us? It means changing the templates to include a generic hashing mechanism across the board as you want your hash values to be generated the same way everywhere in the system. It also means removing key lookups for all HUB, LNK, SAT and LSAT templates. In terms of information delivery there is no real difference although it leaves you with the option to either create a view or instantiate the same information using an ETL tool towards an ‘Information Mart’.
For implementation the single biggest change is to create the hashes into the Staging Area, as opposed to letting the processes from Staging to Integration (Data Vault) handle this. This also shows why Data Vault 2.0 requires a broader solution design, it needs the Staging Area. So your interface from Source to Staging has to know it’s business keys to create hashes in this step. It makes sense though; once the hashes are available in the Staging Area everything else can be run in parallel and it spreads nicely across MPP nodes to boot.