Referential Integrity in Data Vault

Traditionally, Referential Integrity (RI) is not enforced for a Data Warehouse in some approaches. This may result that the Foreign Key (FK) constraint is either not generated at all, or generated but disabled on database level. The latter can improve the efficiency of the RDBMS optimiser depending on the platform used (and should by a case-by-case consideration).

Unlike operational (transactional) systems, data for a DWH is prepared (scrubbed) before it is inserted and this happens in a automated, repeatable and structured way via ETL processes and without any manual data entry. In other words it is presumed to be ‘ok’. Most of the application of database techniques such as indexes and constraints have been geared towards efficient aggregation and querying of data and once the system has been setup there has not been a specific need for RI.

The reason for not enforcing RI on database level is usually that in traditional DWH architectures ETL processes sometimes have to load data in an order that violates RI, which requires the ETL processes to handle this instead. It goes without saying that this requires a strong ETL framework that includes placeholders for unknown records and proper standards and control. Additionally, enforcing RI on database level comes with a performance hit which requires mechanisms to disable RI before loading, and re-applying it after the loading has completed.

For Data Vault the paradigm of ‘letting ETL handle RI’ is no longer valid; instead for Data Vault, RI on database level should be enabled as long as possible. The ‘possible’ points at the RI-related (potential) performance issues. For most infrastructures there will be little cause for concern, but there may come a point where the RDBMS engine and/or hardware cannot sustain the throughput required without massive upgrades to the environment. At this point RI can be switched off to make sure the Data Vault DWH keeps up with the data flowing into the system. This can happen for either very large volumes or terribly undersized RDBMS environments.

This means that even though RI is enabled on database level, ETL should still designed to handle this in case this is turned off at some point in the future.

Roelant Vos

Roelant Vos

You may also like...

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.