Quick and easy referential integrity validation (for dynamic testing)
This post is in a way related to the recent post about generating test data. In a similar spirit, I was looking for ways to make life a bit easier when it comes to validating the outputs of Data Vault ETL processes. Some background is provided in an earlier post on the topic of Referential Integrity (RI), specifically in the context of Data Vault 2.0. In short, by adopting the hash key concepts it is possible to run the ETL processes in any order – the Satellite before the Hub, for instance. Process-wise, this is similar to loading Fact tables before Dimensions in the world of Dimensional Modelling.
This means that traditional RI enforcement is not applicable: the key lookups in the ETL and the database-level constraints need to be removed to allow this type of parallelism. It also means we still need to validate RI at some point – ideally before the data delta is committed to the database, so things can be rolled back if RI is broken after processing the delta.
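The "validate before commit" idea can be sketched as follows: load the delta inside an open transaction, run the RI check, and roll back if orphaned rows are found. This is a minimal illustration using Python and SQLite; the table and column names (HUB_CUSTOMER, SAT_CUSTOMER, CUSTOMER_HSH) are hypothetical placeholders, not the actual model.

```python
import sqlite3

# In-memory database with a hypothetical Hub and Satellite.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE HUB_CUSTOMER (CUSTOMER_HSH TEXT PRIMARY KEY);
    CREATE TABLE SAT_CUSTOMER (CUSTOMER_HSH TEXT);
""")
conn.commit()

try:
    # Load the delta -- here the Satellite row arrives before its Hub row,
    # which is allowed because no FK constraint is enforced on the database.
    cur.execute("INSERT INTO SAT_CUSTOMER VALUES ('ffff')")

    # RI check: any Satellite hash key without a parent Hub record?
    orphan_count = cur.execute("""
        SELECT COUNT(*) FROM SAT_CUSTOMER sat
        WHERE NOT EXISTS (SELECT 1 FROM HUB_CUSTOMER hub
                          WHERE hub.CUSTOMER_HSH = sat.CUSTOMER_HSH)
    """).fetchone()[0]
    if orphan_count:
        raise ValueError(f"RI check failed: {orphan_count} orphaned row(s)")
    conn.commit()
except ValueError:
    # The delta is discarded and the workflow would be flagged as 'failed'.
    conn.rollback()

rows = cur.execute("SELECT COUNT(*) FROM SAT_CUSTOMER").fetchone()[0]
print(rows)  # -> 0: the rollback removed the uncommitted delta
```

In a real workflow the rollback would of course be handled by the scheduler or ETL framework rather than a try/except, but the pattern is the same.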
As a default approach I handle this by adding a final process step to the workflow that validates RI. All data from a specific Staging Area table is loaded to the various Data Vault tables by the dedicated ETL processes, but before completing the workflow an RI check (a SQL query) validates whether everything is in sync for the involved tables and data delta. If not, the workflow is flagged as 'failed'. The SQL itself basically uses the model metadata to determine which tables (should) have outgoing Foreign Keys (FK) and checks whether corresponding parent records exist. For example, a Satellite has an FK to a Hub, a Link has FKs to its related Hubs, and so on.
The logic is very straightforward and I wanted to share it here as an example. I updated my homebrew Data Vault automation tool to generate some end-to-end and partial validation scripts, as shown here. The end-to-end validation checks the full tables; the partial validation applies the same logic only to the available data delta. As with the test data generation, it's nothing fancy but it does the job (although I should probably use NOT EXISTS instead of an OUTER JOIN).
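The partial (delta-only) variant can be sketched as below, this time using the NOT EXISTS form. The delta is restricted by LOAD_DTS here for illustration; the generated scripts would use whatever delta criterion the tool emits, and all names are again hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE HUB_CUSTOMER (CUSTOMER_HSH TEXT PRIMARY KEY);
    CREATE TABLE SAT_CUSTOMER (CUSTOMER_HSH TEXT, LOAD_DTS TEXT);
    INSERT INTO HUB_CUSTOMER VALUES ('a1b2');
    INSERT INTO SAT_CUSTOMER VALUES ('a1b2', '2020-01-01');
    INSERT INTO SAT_CUSTOMER VALUES ('ffff', '2020-01-02');
""")

# Partial check: only inspect the current data delta, and use NOT EXISTS
# instead of an OUTER JOIN to find Satellite rows without a parent Hub.
orphans = cur.execute("""
    SELECT sat.CUSTOMER_HSH
    FROM SAT_CUSTOMER sat
    WHERE sat.LOAD_DTS = '2020-01-02'
      AND NOT EXISTS (
          SELECT 1 FROM HUB_CUSTOMER hub
          WHERE hub.CUSTOMER_HSH = sat.CUSTOMER_HSH
      )
""").fetchall()
print(orphans)  # -> [('ffff',)]
```

Restricting the check to the delta keeps it cheap enough to run as the final step of every workflow, rather than as a periodic full-table scan.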
It is a bit of professional hobbyism, but it works for me: I can now set up the metadata and table structures and immediately generate test data, process it and validate the integrity in one go. Doing this ensures that all metadata is configured correctly and that, for instance, no errors are present in the hash key definitions. It is also easy to swap between the virtual and physical structures, in line with the virtualisation efforts.
From a more traditional ETL perspective: as the workflows themselves can be generated as well, there is always the option to initially generate ETL-based and/or database-based RI validation (key lookups and database constraints) and adjust this later when things stabilise. Various people I spoke with adopt this approach. Quite simply: generate your ETL using the hashing concept but with key lookups and the corresponding dependencies in the workflow, then regenerate it with post-processing RI validation when performance becomes an issue (e.g. when parallelism is required). That makes sense too.