Implementing a Business Data Vault
Over the past few days I have been in a couple of discussion about the concept of the ‘Business Data Vault’. This concept has been around for some time but rarely has it been fully explained. Plainly put, it is a special utilization of the Data Vault concept and provides out of the box audit trailing for the implementation of business rules. As usual, a raw (operational) Data Vault processed first and the ETL processes populating this area identifies new business keys and assigns surrogate keys for these newly discovered business keys. Data captured in the operational Data Vault remains in its pristine, raw form. In other words: data taken from staging area gets re-modelled but its contents do not change.
In a business Data Vault, contents do change. Enterprise wide business rules can be applied to the data that exists in the operational Data Vault. For instance:
- Data cleansing / scrubbing
- Data enrichment
- Data standardisation
- De-duplication
- Validation
The intention of the business Data Vault is to reduce replication of the rules downstream. An example would be the formatting of addresses where abbreviations such as “st” or “rd” get expanded to “street” and “road”, and the validation of phone numbers. And in separating (and keeping!) the state of the data into two physical forms (original and modified), it gives the flexibility of applying multiple sets of rule to a specific data set, in order to suit specific needs.
An example of this approach would be:
In this example the data set starts with employee records which are split between an employee hub table and an employee satellite table in which the history of the employee attributes are tracked. In this case attributes include address information. Since this is the raw data coming from the staging area the contents will be stored in (operational) Data Vault. After the data has been loaded certain enterprise wide business rules are effectuated; this is done in the business Data Vault. This results in the creation of a subset of employees which have certain qualities into an entity ‘special employees’. The relationship of which employee is a special employee is stored in a link relationship table. Also, address information is updated for the employee. Technically this will result in a new satellite table with the updated values while still retaining the link to the original values from the Data Vault.
The same (simplified) example with data for the Data Vault would be:
Employee hub
ID Logical key Source row ID
1 201001 35
2 201004 36
3 201245 37
Employee satellite (raw address data)
ID First name Last name Address Valid start Valid end
1 John Doe 40 George St. 20080101 99991231
2 John Doe 40 George Street 20080101 99991231
3 Peter Smith 70 York St. 20080504 99991231
After applying a deduplication and cleansing business rule the following tables will be populated in the business Data Vault. In this case two records are merged into one using a specific match and survive algorithm. The result is a set of two employees instead of the original three.
Special employee hub(selected by business rules)
ID Logical key Source row ID
200 201004 36
201 201245 37
The relationship between the original data and the cleaned data has to be maintained for audit purposes. This table shows the relationship between the original and the cleansed data:
Employee and special employee link
ID employee ID special employee
1 200
2 200
3 201
Separately the employee history has been corrected to show the correct address information. While this table conceptually resides in the business Data Vault; it still relates to the original surrogate key table EMPLOYEE. There is no need to copy all the historical data, only the corrected attribute(s).
Employee satellite (with correct address data)
ID Address Valid start Valid end
1 40 George Street 20080101 99991231
2 40 George Street 20080101 99991231
3 70 York St. 20080504 99991231
In a similar fashion business rules can be used to detect and tag errors using the error bitmap. This will result in (updated) error bitmap attribute which will be stored in a SCD2 fashion in the business Data Vault, just as every other attribute. The process which selects the data from the business Data Vault can select the data based on the type of errors (if this is a requirement).
Final thoughts:
- New hubs in the business Data Vault will receive new surrogate keys. This is better than reusing the keys that were distributed in the original Data Vault because these keys now have different meanings.
- Technically, I would advise to put both concepts of the Data Vault in the same schema or instance, because they are related. Difference can be made by using naming conventions.
