Data Vault implementation preparations – fundamental ETL requirements

Prior to working my way through the end-to-end ETL solution for Data Vault certain fundamentals must be in place. The reference architecture is one of them and this is largely documented as part of this site and corresponding Wiki. The other main component from an implementation perspective are the requirements for ETL. As all concepts have their place in the reference architecture for good reasons they also have tight relationships and changes to one concept can have implications in another. A good example of this is error and exception handling.

Since we try to stay true to the Data Vault principles we make sure we load all the data and apply business rules later. This also means that concepts such as streamlining datatypes may not lead to errors, Disaster Recovery must be in place and so on. The fundamental ETL requirements support these principles by providing atomicity, decoupling and restartability. In the various upcoming posts these requirements will be referred to often as they are as much part of the architecture as anything else and many concepts depend on them.

Though not necessarily related to Data Vault I like to thing that they properly support the ideal.

The requirements are as follows:

1). ETL can always be rerun without the need to manually change settings. This manifests itself in many ways depending on the purpose of the process (i.e. its place in the architecture). An ETL process that truncates a target table to load form another table is the most straightforward example since this will invariably run successfully every time. Another example is the distribution of surrogate keys in the Hub tables since it is checked if keys are already present before they are inserted. This requirement is also be valid for Presentation Layer tables which merge mutations into an aggregate. Not only does this requirement make testing and maintenance easier, it also ensures that no data is corrupted when an ETL process is run by accident.

2). Source data for any ETL can always be related to, or be recovered. This is covered by correctly using the metadata framework and concepts such as the History Area. The metadata model covers the audit trail and its ability to follow data through the Data Warehouse while the History Area enables a new initial load in case of disaster.

3). ETL processes exist in the folder where they pull data to. For instance the ETL that exist in the 200_History_Area folder loads data from the 100_Staging_Area.

4). ETL processes detect whether they should insert records or not, i.e. should not fail on constraints. This is the design decision that ETL handles data integrity (and not the database).

5). The direction of data is always ‘up’. The typical process of data is from a source, to staging, integration and ultimately presentation. No regular ETL process should write data back to an earlier layer, or use access this information using a (key) lookup.

6). ETL processes must be able to process multiple intervals (changes) in one run. This is an important criterion for ETL to be able to be run at any point in time and to support real-time processing. It means that ETL should not only be able to load a single snapshot or change for a single business key but to correctly handle multiple changes in a single dataset. For instance if the address of an employee changes multiple times during the day and ETL is run daily, all changes are still captures and correctly processed in a single run of the ETL process. This requirement also prevents ETL to be run many times  for catch-up processing.

7). ETL processes should automatically recover when failed. This means that if an error has been  detected the ETL automatically repairs the information from the erroneous  run and inserts the correct data along with any new information that has been sourced. This is one of the reasons why ETL must be able to process multiple time intervals.

8). ETL processes contain transformation logic for a single specific function or purpose. This design decision follow ETL best practises to create many ETL processes that each address a specific function as opposed to few ETL processes that perform a range of activities. Every ETL process attempts to execute an atomic functionality. Examples of atomic Data Warehouse processes (which therefore are implemented as separate ETL processes) are key distribution, detecting changes and inserting records and end-dating records.

9). Related to the above definition of designing ETL to suit atomic Data Warehouse processes every ETL process can read from one or more sources but only write to a single target.

Roelant Vos

Roelant Vos

You may also like...

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.