Data Vault 2.0 – how to handle Referential Integrity?
I was working on adding some of the automation code to support Data Vault 2.0, and this got me thinking about Referential Integrity (RI) in relation to the modifications that Data Vault 2.0 requires. With Data Vault ‘1.0’, Referential Integrity is always enabled (except for very big systems – let’s leave those out of scope for now – see this older post), and according to the specifications this hasn’t changed in Data Vault 2.0. For Data Vault 1.0 this makes sense, and the ETL takes care of it anyway through the key distribution and lookup mechanisms (if you set them to fail on no hit, that is).
In Data Vault 2.0 you can load your data in an even more parallel fashion, which is a major asset. This is because the key distribution / lookup bottleneck is removed by hashing the business key (and more, if you want performance). Using this approach you can load your Satellites and/or Links prior to, or in parallel with, the Hubs – but this conflicts with having Referential Integrity enforced by the database.
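To illustrate the idea, here is a minimal sketch of the hashing concept. I'm assuming MD5 over a trimmed, upper-cased, delimited business key – a common convention, but the exact normalisation and hash function are a design choice, not something prescribed here:

```python
import hashlib

def hash_key(*business_key_parts, delimiter="|"):
    """Derive a deterministic key by hashing the (normalised) business key.

    Because every ETL process can compute this key independently from the
    business key alone, no lookup against the Hub is needed – which is what
    allows Satellites and Links to load in parallel with (or before) the Hub.
    """
    normalised = delimiter.join(
        str(part).strip().upper() for part in business_key_parts
    )
    return hashlib.md5(normalised.encode("utf-8")).hexdigest().upper()

# The Hub ETL and the Satellite ETL each derive the same key on their own:
print(hash_key("EMP-001"))
```

The normalisation step matters: trimming and upper-casing before hashing ensures that cosmetic differences in the source data don't produce different keys in different processes.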
With Data Vault in general, and in line with Agile Business Intelligence, you want everything to ‘just work’ in the background without requiring much attention. So you don’t want to build mechanisms that require further inspection of ‘reject records’ of any kind. Any such process becomes difficult to manage, because unless you monitor and act on it continuously you never quite know what kind of records will end up there, or when. It also conflicts with the ‘100% of the data, 100% of the time’ maxim, as the reject table risks becoming just another waste bucket for dirty data.
In short, we want to embrace the increased parallel loading (removal of ETL dependencies) that Data Vault 2.0 offers, and we don’t want to reject records in any shape or form.
The lack of lookups in Data Vault 2.0 eliminates the option of enforcing Referential Integrity in the ETL, and enforcing Referential Integrity at the database level requires some kind of detect-and-reject mechanism (including reject tables), so that is not an option either. The only real solution is implementing ‘soft RI’, otherwise known as a ‘housekeeping process’: a check that can be run after specific tables have been updated, or at specific intervals, to make sure the integrity of the system is intact.
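At its core, such a housekeeping check is just an orphan count: child rows whose key has no matching parent row. A minimal sketch, using SQLite for demonstration and hypothetical table/column names (HUB_EMPLOYEE, SAT_EMPLOYEE_DETAILS, EMPLOYEE_HSH – your naming conventions will differ):

```python
import sqlite3

def soft_ri_check(conn, child_table, parent_table, key_column):
    """Count child rows whose key has no matching parent row.

    Returns 0 when referential integrity holds; any positive number
    is the count of orphaned records.
    """
    sql = (
        f"SELECT COUNT(*) FROM {child_table} c "
        f"LEFT JOIN {parent_table} p ON c.{key_column} = p.{key_column} "
        f"WHERE p.{key_column} IS NULL"
    )
    return conn.execute(sql).fetchone()[0]

# Demonstration against an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE HUB_EMPLOYEE (EMPLOYEE_HSH TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE SAT_EMPLOYEE_DETAILS (EMPLOYEE_HSH TEXT, NAME TEXT)")
conn.execute("INSERT INTO HUB_EMPLOYEE VALUES ('A1')")
conn.execute("INSERT INTO SAT_EMPLOYEE_DETAILS VALUES ('A1', 'Alice')")
conn.execute("INSERT INTO SAT_EMPLOYEE_DETAILS VALUES ('B2', 'Bob')")  # orphan

print(soft_ri_check(conn, "SAT_EMPLOYEE_DETAILS", "HUB_EMPLOYEE", "EMPLOYEE_HSH"))
```

Because it is a plain query rather than a database constraint, it never blocks or rejects a load – it simply reports on the state of the model after the fact.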
Conceptually this is implemented at the workflow / Batch level, because the best way to load data for Data Vault is to load everything possible on a per-table basis. This means all Hubs, Links, Satellites and other tables that can (should) be sourced from a single source dataset (Staging Area table) are grouped into a workflow (called a Batch) – which defines the unit of work. This design makes the Batch self-sufficient, because it will create keys when it needs to. For example, if a Staging Area table contains context information for a certain business entity (e.g. personal details for an employee), the Batch will contain both a Hub and a Satellite ETL process, each sourcing its information from the same Staging Area table.
Quite simply, after the Batch completes the Referential Integrity check can be executed – because at that point all information should be complete. If the check fails the workflow should fail as well, as this indicates an issue that requires the state of the information to be preserved for inspection.
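Wiring this into the Batch could look something like the sketch below: run an orphan check for each child/parent pair the Batch touched, and raise on the first violation so the workflow fails and the state is left untouched for inspection. Table and column names are again hypothetical:

```python
import sqlite3

def run_batch_ri_checks(conn, checks):
    """Run soft-RI checks after a Batch completes.

    'checks' is a list of (child_table, parent_table, key_column) tuples.
    Raising on a violation fails the workflow, which preserves the state
    of the data for inspection instead of rejecting records.
    """
    for child, parent, key in checks:
        orphans = conn.execute(
            f"SELECT COUNT(*) FROM {child} c "
            f"LEFT JOIN {parent} p ON c.{key} = p.{key} "
            f"WHERE p.{key} IS NULL"
        ).fetchone()[0]
        if orphans:
            raise RuntimeError(
                f"Soft-RI check failed: {orphans} orphaned row(s) in {child} "
                f"without a parent in {parent}; Batch halted for inspection."
            )

# Demonstration: a Batch that loaded a Hub and a Satellite from one staging table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE HUB_EMPLOYEE (EMPLOYEE_HSH TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE SAT_EMPLOYEE_DETAILS (EMPLOYEE_HSH TEXT)")
conn.execute("INSERT INTO HUB_EMPLOYEE VALUES ('A1')")
conn.execute("INSERT INTO SAT_EMPLOYEE_DETAILS VALUES ('A1')")

# Every Satellite key has a Hub parent, so the Batch passes:
run_batch_ri_checks(conn, [("SAT_EMPLOYEE_DETAILS", "HUB_EMPLOYEE", "EMPLOYEE_HSH")])
print("Batch RI checks passed")
```

The important design choice is that the failure surfaces at the workflow level – nothing is rejected or quarantined, the Batch simply stops and waits for a human to look at it.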
In Data Vault, Referential Integrity issues only surface when there is a flaw in the ETL workflow – for example, when a Hub ETL step has been forgotten somewhere, or ETL has accidentally run out of (process) order. These indicate a problem in the ETL design, and a proper framework combined with unit testing flushes such issues out very quickly.
Data Vault 2.0 specifically states that Referential Integrity is a case-by-case decision, as opposed to the ‘always-on’ approach of Data Vault 1.0. However, I think that with the implementation of ‘soft RI’ the best of both worlds is achieved, in a way that can be implemented as a generic rule across platforms.
When I complete the Data Vault 2.0 templates, I’ll write a follow-up post on the technical / implementation details.