Data Vault ETL Implementation using SSIS: Step 1 – Staging Area ETL

While technically (and conceptually) not really part of Data Vault the first step of the Enterprise Data Warehouse is to properly source, or stage, the data. This is a critical step, and often one of the most difficult to get right. It also has direct impact on the core DWH / Data Vault development, or at least supports its core message and principles. This impact can be summarised by the definition that the purpose of the Staging Area (ETL process) is to capture the date/time of a change in a certain entity as accurately as possible.

As mentioned in previous posts different Staging and/or delta capture mechanics have different ranges of accuracy and this is always the trade-off with complexity but this is essentially outside the scope of Data Vault. For the purpose of this example the Full Outer Join mechanism is used; other alternatives are CDC, push or pull delta mechanisms using various technologies or storage formats.

A lot of solutions / templates are OK as long as they are in alignment with the basic ETL principles and can capture logical / soft deletes.

While not part of Data Vault the Staging Area is still a core component of the Enterprise Data Warehouse and is subject to its (architectural) principles including restartability and the ability to process multiple changes in one run (not really applicable for Full Outer Join as it’s a snapshot, but still).

In the SSIS Control Flow the only action that stands out is the truncation of the target (STG_<table>) table. For an example (the Full Outer Join approach) the  Data Flow contents are as follows:

DV_implementation_step_1_full_outer_join

The operators are as follows:

  • SRC – <source table>. The direct connection to the operational system; since SSIS does not have a native checksum operator (plug-ins are available) the checksum is calculated using a SQL override
  • CNT – Discarded / Inserted / Input. Optional record counts. The counts are stored in local variables and committed to the ETL process metadata after completion of the package execution
  • DCO – Attribute Conversion. Streamlines the data types of the potentially disparate information sources to be in sync with the Data Warehouse standards. Note that this does not change the contents, and may never lead to an error (so no text to date conversions) so it is aligned with the Data Vault message. Conversions are done following the Staging Area design as recorded in the design section of this site
  • SRC – <HSTG table>. The direct connection to the Historical Staging (archive) where all attributes as well as the checksum is queried for comparison. Only the most recent state of each key is selected (maximum Effective Date) and Logical Deletes are filtered out since these do not require comparison (they’re already deleted in the source). The reason the attributes are loaded as well is to support the storage of Logical Deletes following the pattern to record a deleted record in it’s last known state; that’s the information that the Historical Staging provides.
  • MJO – Full Outer Join. A Full Outer Join is a relatively crude but effective way to detect changes, implemented by setting the join type to Full Outer Join and joining on the primary key of the source table (natural key). For instance Customer ID from the source on Customer ID from the Historical Staging even though the latter has a meaningless primary key /sequence
  • DRC – Interpret Join Results. Evaluate the join results to produce the correct recordset, including the Change Data Capture (CDC) event
  • CDS – Filter No Changes. The purpose of the conditional split is to remove unchanged records from the Data Flow. If the Change Data Capture event is ‘No Change’ the records are routed to the CNT_DISCARDED (gracefully removed). Otherwise they are routed towards the destination target.
  • DRC – Standard Values. The core values used for Data Vault later on are defined here:
    – Record Source; the string value of the source system
    – Event Date/Time; for this interface type the value of the moment the join was executed
  • DST – STG_<table>. The destination target (Staging Area table), which is a straight insert of the defined delta recordset

This is one of a distinct set of similar approaches towards preparing and staging information for further loading into the core Data Vault model. This process (and its alternatives) serves the purpose of preparing a delta recordset while defining the Record Source and Event Date/Time.

The above Data Flow setup can be adjusted for better performance by moving the sorting and data type conversion into the SQL override. If the conversion (DCO) is handled outside the SQL the sorter is required for SSIS to process the record in the correct order for the Merge Join. In this example the conversion is explicit, for demonstration purposes, but this does mean the sorting happens in memory and very likely swap space (tempbuffer).

 
Roelant Vos

Roelant Vos

You may also like...

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.