Data Vault ETL Implementation using SSIS: Step 2 – Historical Staging ETL
The archiving process such as the Historical Staging ETL are also not conceptually part of Data Vault but can play an important factor in the complete Enterprise Data Warehouse (EDW). While the Staging Area directly supports the Data Vault message the Historical Staging is really optional and does not impact Data Vault processes. For the sake of completion the ETL overview for the Historical Staging is provided regardless. As with any ETL process within the EDW architecture it is subject to the fundamental ETL requirements.
The Control Flow does not contain any specific ETL logic except, in my case, the handles into the ETL process model. The Data Flow is as follows:
- SRC – <STG table>. The direct connection to the Staging Area; while SSIS does provide a ranking operation I have embedded a ranking mechanism in the SQL override to be able to detect multiple changes for the same natural key. This is important to be able to process multiple changes for the same key in a single run. The query selects all attributes including the checksum that was generated in the Staging Area (or alternatively calculates the checksum here while incorporating the CDC event). The ranking mechanism is implemented as follows:
ROW_NUMBER() OVER (PARTITION BY <Natural Key> ORDER BY <Natural Key>, <Event Date/Time>) AS Row_Number
- CNT – Discarded / Inserted / Input. Optional record counts. The counts are stored in local variables and committed to the ETL process metadata after completion of the package execution.
- LKP – HSTG table for existing record check. This Lookup Operation against the target HSTG table serves as a safety catch to prevent information to be inserted multiple times. It is a lookup against the Natural Key and Event Date/Time. If there is a hit in the lookup the record is discarded (gracefully rejected); otherwise the process will continue (for that record). For performance reasons – it is an extra lookup / cache after all – the recordset in the lookup is restricted to information only available in the Staging Area (i.e. join between Historical Staging and Staging Area on the Natural Key).
- CSP – Only lookup the first record for any business key. The Conditional Split verifies the value of the Row Number. If it has value ‘1’ it means that it’s the first change for that Natural Key and will need to be compared against the most recent record in the Historical Staging. All other records for the same key (other than ‘1’) will be routed directly to the Union without any comparison against existing values, as the Staging Area ETL already covered that these are in fact changes.
- LKP – HSTG target table for attribute comparison. This Lookup operation returns the most recent version of the Natural Key (i.e. maximum Event Date/Time) as well as the checksum for the actual comparison. The join condition can be placed on just the Natural Key as the SQL override already restricts the set to the most recent record for each key.
- CSP – Compare Values. The Conditional Split evaluates the Staging and Historical Staging checksums to define if a change in the records exists. If the checksums are not the same this reflects a change and this leads to an insert into the Historical Staging target table.
- Union All. The purpose of the Union operator (Union All function) is to merge the various data flows for insertion.
- DRC – Standard Values. Any proprietary ETL metadata controls and/or values can be set in this Derived Column Transformation. No Data Vault specific values are set.
- DST – HSTG_<table>. The destination target (Historical Staging table), which is a straight insert of the remaining recordset. Note that this is an insert-only approach (i.e. no end-dating).
This process can be considered as SCD2 but for efficiency purposes and end-dating mechanism is not implemented; it is an insert-only ETL process.