Data Vault implementation A-Z: Staging data (the conceptual side)
This is the first in a planned series of implementation designs for Data Vault in an end-to-end Data Warehouse environment. The positioning of the Data Vault concepts and techniques in the greater design of the system (reference architecture) is documented elsewhere on this site, mainly in the ‘papers’ section. Data Vault by itself does not provide a complete solution for most Data Warehouse purposes, but it does provide a great set of modeling techniques to design the core Data Warehouse component (Integration Layer / System of Record).
In this context the efforts required to stage and/or collect the data, or the steps to present it to the business in a meaningful way, are not directly covered by Data Vault. Nevertheless, these are essential components that need to be addressed in the complete solution. The reference architecture attempts to link these various elements of the Data Warehouse together and, for this reason, defines where certain types of ETL logic should be implemented. As mentioned above, the Staging Area is not a ‘part of’ Data Vault but provides essential functionality for the system to operate as a complete solution. In its own way it supports the same principles as championed by Data Vault.
As defined in the reference architecture the main purposes of the Staging Area are:
- Bring (source) information into the Data Warehouse environment for further processing.
- Streamline / standardise data types to limit the range of data types to handle in the Data Warehouse. Note that this does not transform the contents or type of the data directly (e.g. character to numeric)!
- Determine the optimal interface for each source (system). Ideally this interface is able to detect logical deletes. In most cases the interface requires management of loading windows.
- Identify the Event Date/Time. This is deeply linked to the interface type and defines which date/time value most closely represents the actual change as it happened in the source system.
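The last two purposes can be made concrete with a minimal sketch. The interface names and record fields below are assumptions for illustration only, not part of the Data Vault standard:

```python
from datetime import datetime, timezone

# Hypothetical sketch: which timestamp best approximates the actual change
# depends on the interface type. Interface names and record fields are
# assumptions for illustration.
def derive_event_datetime(interface_type: str, record: dict,
                          load_dt: datetime) -> datetime:
    if interface_type == "cdc":
        # CDC reads the database transaction log, so its timestamp is
        # very close to the moment the change actually happened.
        return record["transaction_log_dt"]
    if interface_type == "push_delta":
        # A delta file often carries the timestamp of the extract.
        return record.get("extract_dt", load_dt)
    # Full-set comparison: the change is only observed at the moment
    # the Data Warehouse runs the comparison.
    return load_dt

tx_dt = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
run_dt = datetime(2024, 1, 2, 3, 0, tzinfo=timezone.utc)
```

The point is that the accuracy of the Event Date/Time degrades as the interface moves further away from the original transaction.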
It needs to be stated that the functions listed above are concepts related to Staging Area design and do not necessarily need to be implemented as a separate physical structure. In more mature implementations information can also be loaded directly into the Data Vault, provided the loading windows are managed properly. This requires every Data Vault table to have its own window.
Having a Staging Area (physical) table does provide some benefits in terms of implementing ETL logic in a single process but this also depends on the capabilities of the software as some ETL platforms provide advanced functionality to reuse logic on various levels. This series of postings will use the physical Staging Area table as example to work through the Data Vault Enterprise Data Warehouse implementation.
Even though this is not extensively covered in the Data Vault modeling technique, the Staging Area is one of the hardest things to get right initially and should be subject to extensive testing to make sure the implementation of these concepts is working correctly. In practice there are only a limited number of interface types, but more often than not technical differences and/or software combinations result in unexpected behaviour during implementation. Generally, interfaces fall into one of the following categories:
- Push mechanisms; the operational system(s) provide and present a dataset (typically a delta) to a location where the Data Warehouse can access it. This includes information delivered as a flat file, Excel sheet or CSV.
- Pull mechanisms; the Data Warehouse directly accesses information from the operational system.
- Replication / Change Data Capture (CDC) and/or message based; only information that has changed is provided to the Data Warehouse.
- Full dataset dumps / full outer join; the Data Warehouse receives a complete set every time and compares it to the previous set to identify any changes that have occurred.
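The last category can be sketched with a small example. Assuming two full extracts keyed by business key (the keys and attributes here are hypothetical), the comparison derives inserts, updates and logical deletes:

```python
# Sketch of a full-dataset comparison ('full outer join' interface):
# compare the previous and current full extracts, keyed by business key,
# to derive the delta. Keys and attributes are illustrative assumptions.
def compare_full_sets(previous: dict, current: dict) -> dict:
    return {
        "insert": {k: v for k, v in current.items() if k not in previous},
        "delete": {k: v for k, v in previous.items() if k not in current},
        "update": {k: v for k, v in current.items()
                   if k in previous and previous[k] != v},
    }

delta = compare_full_sets(
    {"A": {"name": "Ann"}, "B": {"name": "Bob"}},   # yesterday's full set
    {"A": {"name": "Anne"}, "C": {"name": "Cy"}},   # today's full set
)
```

Note that this style of interface is also one way to detect logical deletes, since a key that disappears from the full set can be flagged as deleted.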
Deciding on the correct interface to suit the purpose in terms of availability, requirements and scalability is one of the hardest initial decisions to be made in Data Warehouse design, and largely depends on the availability of the data, company (security) and/or DBA conventions and data volumes. Generally speaking, CDC-based interfaces are ultimately the preferred choice in terms of scalability and level of detail, as they usually capture all changes, as opposed to comparing for changes at certain points in time.
This has big implications for the definition of the Event Date/Time, as this is a grey area that largely depends on the chosen interface. For instance, most CDC mechanisms use the database transaction log to capture and present changed records, which yields a date/time very close to the actual event: the moment a system user pressed ‘enter’ and updated a record in the source system. A full outer join process, on the other hand, only detects the change when the comparison is run by the Data Warehouse, and in this scenario the Event Date/Time might be very different from the actual event. Between these extremes there is a variety of options that represent varying levels of accuracy.
However, the main aspect to bear in mind is that the Event Date/Time is never a source attribute, except in the case of an initial load (and even this depends on the type of source system). The Event Date/Time must be system generated and unchangeable through user input.
In the same line of thinking, CDC or other near real-time mechanisms provide various changes within a single (load) time interval, resulting in a dataset where various ‘points in time’ for a record are identified by the Event Date/Time. In other words, there is one unique row per key for each Event Date/Time. Full outer join mechanisms, on the other hand, only result in a single ‘version’ of the record: the difference detected at the point in time (the Event Date/Time in this case) that the comparison was run.
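This difference can be illustrated with a small sketch (the field names are assumptions): a CDC feed may deliver several rows per key within one load window, each uniquely identified by key plus Event Date/Time, while a full comparison would only ever yield the end state.

```python
from collections import defaultdict

# A CDC feed can deliver several changes for the same key within a single
# load interval; every (key, Event Date/Time) pair is a distinct staged row.
changes = [
    {"key": "A", "event_dt": "2024-01-01T10:00:00", "status": "open"},
    {"key": "A", "event_dt": "2024-01-01T10:05:00", "status": "closed"},
    {"key": "B", "event_dt": "2024-01-01T10:01:00", "status": "open"},
]

# Keep the full history per key, ordered by Event Date/Time ...
history = defaultdict(list)
for change in sorted(changes, key=lambda c: (c["key"], c["event_dt"])):
    history[change["key"]].append(change)

# ... whereas a full-set comparison would only ever observe the end state.
latest = {key: versions[-1] for key, versions in history.items()}
```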
Now that the Event Date/Time has been identified, the only other major operation is to streamline the disparate data types into a more manageable array. This is a very straightforward operation that does not change the contents of the data in any way and should not lead to errors. The only goal is to tidy up the data types to reduce complexity later on.
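As a sketch, streamlining can be as simple as a lookup that collapses the incoming data types into a small, consistent set. The target types below are assumptions, not a prescribed standard:

```python
# Illustrative type-streamlining map: every incoming data type collapses to
# one of a handful of staging types. The chosen targets are assumptions.
TYPE_MAP = {
    "VARCHAR":  "NVARCHAR(100)",
    "CHAR":     "NVARCHAR(100)",
    "SMALLINT": "INTEGER",
    "TINYINT":  "INTEGER",
    "FLOAT":    "NUMERIC(38,10)",
    "REAL":     "NUMERIC(38,10)",
    "DATE":     "DATETIME2(7)",
}

def streamline(source_type: str) -> str:
    # Unknown types fall back to text; the contents themselves are never
    # converted, only the declared type is widened.
    return TYPE_MAP.get(source_type.upper(), "NVARCHAR(100)")
```

Because every mapping only widens the type, the operation is lossless and cannot reject rows.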
As always, ETL processes comply with the fundamental requirements for ETL as defined in the reference architecture (see link). In the case of Source-to-Staging ETL, the target (Staging Area) tables use a truncate/insert mechanism. This summarises what we are trying to achieve; the next post will explain how this is done using Microsoft SSIS (mainly because most of my current projects use SSIS). When a new project using Informatica PowerCenter, Oracle ODI and/or IBM DataStage starts, I will try to keep the same documentation for implementing these concepts using those platforms.
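The truncate/insert mechanism mentioned above can be sketched as follows, using an in-memory SQLite database for illustration; the table and column names are hypothetical:

```python
import sqlite3

# Minimal truncate/insert sketch for a Staging Area table. Each load first
# empties the table, so it only ever holds the most recent delivery.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STG_CUSTOMER (customer_id INTEGER, name TEXT, event_dt TEXT)"
)

def load_staging(rows):
    conn.execute("DELETE FROM STG_CUSTOMER")  # 'truncate' the staging table
    conn.executemany("INSERT INTO STG_CUSTOMER VALUES (?, ?, ?)", rows)
    conn.commit()

load_staging([(1, "Ann", "2024-01-01T10:00:00")])
load_staging([(2, "Bob", "2024-01-02T10:00:00")])  # each run replaces the last
```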