Data Vault 2.0 Staging Area learnings & suggestions
With Data Vault 2.0 the Data Vault methodology introduces (amongst other things) a more formalised solution architecture which includes a Staging Area. In the designs as advocated in this blog this Staging Area is part of a conceptual Staging Layer that also cover the Persistent Staging Area (PSA). While updating the documentation in general I updated various sections in the Staging Layer definition and this prompted me to highlight some experiences specifically with implementing the Data Vault 2.0 style Staging Area.
It is worthwhile to note that the methodology of Data Vault 2.0 specifies hash values for intended business keys (Hubs), relationships (Links) and optionally Satellites (the ‘hash diff’) to be calculated by the ETL that loads the data delta into the Staging Area. In the case of the Hub and Link keys this means hashing the business key(s) and in the case of the Satellite the hash difference covers the combination of attributes to track for changes in the Satellites. So these may also be calculated in this process and stored in the Staging Area, and you can opt to do this if you are looking for more ETL performance.
The idea behind this is that the calculation of hash keys needs to be done only once and can be reused many times by upstream processes. The hashes are less required to be calculated ‘on the fly’, which incurs a (debated) performance hit. The downside is that this creates a direct impact between changes in the Integration Layer (Data Vault) models and the Staging Area. If you create a new Hub based on an existing interface (a change in the model) you need to add this attribute in the Staging Area as well and make sure the hash is calculated appropriately. Similarly, your source-to-staging ETL will become more complex as it needs to be aware of upstream business key definitions and in the case of the hash diff which attributes are going to be mapped to a specific Satellite.
My view on how to adopt these concepts is outlined below. Essentially I don’t want the Staging Area to be fully dependent on the Integration Layer Data Vault model so I have made the pre-calculation of all hash keys optional (but preferred) in the Staging Area. I use the naming convention HASH_<target table> for this purpose. So for instance if a Staging Area table maps to both the Customer and Product Hubs there would be two separate attributes HASH_HUB_CUSTOMER and HASH_HUB_PRODUCT available in the Staging Area table. However, if the attributes are not present I want the ETL to generate the hash keys ‘on the fly’. This allows me to quickly deploy the Staging Layer and, if I want more ETL performance, to add Hub and Link hash keys in the Staging Area later.
This way, you are still adopting the hashing wherever it makes sense but don’t create a hard relationship between Staging Area and Data Vault design. It is also sound from an architecture perspective as, by virtue of ETL automation, you can always add a hash key in the Staging Area later and just re-generate the ETLs. No data is lost, only performance is gained. To summarise:
- If a Hub or Link needs to be pre-calculated the attribute can be added to the Staging Area. If these attributes are not present in the Staging Area the ETL that loads the data from the Staging Area into the Data Vault based Integration Layer will generate the hash keys at runtime
- Satellite hashes (or ‘hashdiffs’) are optional and can be added for performance improvements if required. If they are modelled in the Staging Layer the ETL will generate a hash keys for the specified combination of attributes going into the Staging Area. However, if this attribute is not present the ETL from the Staging Area to the Data Vault based Integration Layer will generate the hash values on the fly
- Every Staging Area table always calculates a full row hash. I use the naming convention HASH_FULL_RECORD for this, this is useful if you have a PSA for row/change comparison