Implementing a Business Data Vault

Over the past few days I have been in a couple of discussion about the concept of the ‘Business Data Vault’. This concept has been around for some time but rarely has it been fully explained. Plainly put, it is a special utilization of the Data Vault concept and provides out of the box audit trailing for the implementation of business rules. As usual, a raw (operational) Data Vault processed first and the ETL processes populating this area identifies new business keys and assigns surrogate keys for these newly discovered business keys. Data captured in the operational Data Vault remains in its pristine, raw form. In other words: data taken from staging area gets re-modelled but its contents do not change.   

In a business Data Vault, contents do change. Enterprise wide business rules can be applied to the data that exists in the operational Data Vault. For instance:   

  • Data cleansing / scrubbing
  • Data enrichment
  • Data standardisation
  • De-duplication
  • Validation 

The intention of the business Data Vault is to reduce replication of the rules downstream. An example would be the formatting of addresses where abbreviations such as “st” or “rd” get expanded to “street” and “road”, and the validation of phone numbers. And in separating (and keeping!)  the state of the data into two physical forms (original and modified), it gives the flexibility of applying multiple sets of rule to a specific data set, in order to suit specific needs.   

An example of this approach would be:

In this example the data set starts with employee records which are split between an employee hub table and an employee satellite table in which the history of the employee attributes are tracked. In this case attributes include address information. Since this is the raw data coming from the staging area the contents will be stored in (operational) Data Vault. After the data has been loaded certain enterprise wide business rules are effectuated; this is done in the business Data Vault. This results in the creation of a subset of employees which have certain qualities into an entity ‘special employees’. The relationship of which employee is a special employee is stored in a link relationship table. Also, address information is updated for the employee. Technically this will result in a new satellite table with the updated values while still retaining the link to the original values from the Data Vault.  

The same (simplified) example with data for the Data Vault would be: 

Employee hub
ID    Logical key    Source row ID
1        201001               35
2        201004              36
3        201245               37 

Employee satellite (raw address data)
ID     First name     Last name         Address                        Valid start          Valid end
1         John                Doe                     40 George St.              20080101         99991231
2        John                 Doe                     40 George Street       20080101         99991231
3        Peter                Smith                 70 York St.                  20080504         99991231 

After applying a deduplication and cleansing business rule the following tables will be populated in the business Data Vault. In this case two records are merged into one using a specific match and survive algorithm. The result is a set of two employees instead of the original three. 

Special employee hub(selected by business rules)
ID       Logical key    Source row ID
200    201004            36
201     201245             37 

The relationship between the original data and the cleaned data has to be maintained for audit purposes. This table shows the relationship between the original and the cleansed data:  

Employee and special employee link
ID employee      ID special employee
1                               200
2                               200
3                               201 

Separately the employee history has been corrected to show the correct address information. While this table conceptually resides in the business Data Vault; it still relates to the original surrogate key table EMPLOYEE. There is no need to copy all the historical data, only the corrected attribute(s).  

Employee satellite (with correct address data)
ID       Address                      Valid start       Valid end
1          40 George Street    20080101       99991231
2          40 George Street    20080101       99991231
3          70 York St.               20080504       99991231 

In a similar fashion business rules can be used to detect and tag errors using the error bitmap. This will result in (updated) error bitmap attribute which will be stored in a SCD2 fashion in the business Data Vault, just as every other attribute. The process which selects the data from the business Data Vault can select the data based on the type of errors (if this is a requirement). 

Final thoughts: 

  • New hubs in the business Data Vault will receive new surrogate keys. This is better than reusing the keys that were distributed in the original Data Vault because these keys now have different meanings.
  • Technically, I would advise to put both concepts of the Data Vault in the same schema or instance, because they are related. Difference can be made by using naming conventions.
 
Roelant Vos

Roelant Vos

You may also like...

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.