Data Vault ETL Implementation using SSIS: Step 2 – Historical Staging ETL

Archiving processes such as the Historical Staging ETL are not conceptually part of Data Vault either, but they can play an important role in the complete Enterprise Data Warehouse (EDW). While the Staging Area directly supports the Data Vault model, the Historical Staging area is optional and does not impact the Data Vault processes. For the sake of completeness the ETL overview for the Historical Staging is provided regardless. As with any ETL process within the EDW architecture it is subject to the fundamental ETL requirements.

The Control Flow does not contain any specific ETL logic except, in my case, the handles into the ETL process model. The Data Flow is as follows:

[Data Flow diagram – DV implementation step 2: Historical Staging]

  • SRC – <STG table>. The direct connection to the Staging Area. While SSIS does provide a ranking operation, I have embedded a ranking mechanism in the SQL override to be able to detect multiple changes for the same natural key; this is important for processing multiple changes for the same key in a single run. The query selects all attributes including the checksum that was generated in the Staging Area (or alternatively calculates the checksum here while incorporating the CDC event). The ranking mechanism is implemented as follows (a fuller sketch of this query is provided after this list):
    ROW_NUMBER() OVER (PARTITION BY <Natural Key> ORDER BY <Natural Key>, <Event Date/Time>) AS Row_Number
  • CNT – Discarded / Inserted / Input. Optional record counts. The counts are stored in local variables and committed to the ETL process metadata after completion of the package execution.
  • LKP – HSTG table for existing record check. This Lookup operation against the target HSTG table serves as a safety catch to prevent information from being inserted multiple times. It is a lookup against the Natural Key and Event Date/Time. If there is a hit in the lookup the record is discarded (gracefully rejected); otherwise the process continues (for that record). For performance reasons – it is an extra lookup / cache after all – the recordset in the lookup is restricted to the keys that are present in the Staging Area (i.e. a join between the Historical Staging and Staging Area on the Natural Key).
  • CSP – Only lookup the first record for any business key. The Conditional Split verifies the value of the Row Number. If it has the value ‘1’ this means it is the first change for that Natural Key, which needs to be compared against the most recent record in the Historical Staging. All other records for the same key (Row Number other than ‘1’) are routed directly to the Union All without any comparison against existing values, as the Staging Area ETL has already established that these are in fact changes.
  • LKP – HSTG target table for attribute comparison. This Lookup operation returns the most recent version of the Natural Key (i.e. the record with the maximum Event Date/Time) as well as the checksum for the actual comparison. The join condition can be placed on just the Natural Key as the SQL override already restricts the set to the most recent record for each key (see the second sketch after this list).
  • CSP – Compare Values. The Conditional Split evaluates the Staging and Historical Staging checksums to determine whether the record has changed. If the checksums are not the same this reflects a change, which leads to an insert into the Historical Staging target table.
  • Union All. The purpose of the Union operator (Union All function) is to merge the various data flows for insertion.
  • DRC – Standard Values. Any proprietary ETL metadata controls and/or values can be set in this Derived Column Transformation. No Data Vault specific values are set.
  • DST – HSTG_<table>. The destination target (Historical Staging table), which is a straight insert of the remaining recordset. Note that this is an insert-only approach (i.e. no end-dating).
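
To make the ranking mechanism in the SRC query a bit more tangible, a minimal sketch of the SQL override is shown below. The table and column names (STG_Customer, Customer_Code, Customer_Name, CDC_Operation, Event_Datetime, Record_Checksum) are placeholders for illustration only, not taken from an actual model.

    SELECT
        Customer_Code,            -- Natural Key (placeholder name)
        Customer_Name,            -- example source attribute
        CDC_Operation,            -- CDC event (insert / update / delete)
        Event_Datetime,           -- Event Date/Time of the change
        Record_Checksum,          -- checksum generated in the Staging Area
        ROW_NUMBER() OVER (
            PARTITION BY Customer_Code
            ORDER BY Customer_Code, Event_Datetime
        ) AS Row_Number           -- '1' identifies the earliest change per key in this run
    FROM STG_Customer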
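
Similarly, the SQL override for the attribute comparison Lookup can restrict the Historical Staging recordset to the most recent record per Natural Key, so the Lookup join itself only needs the Natural Key. A minimal sketch, again with placeholder names:

    SELECT
        Customer_Code,
        Event_Datetime,
        Record_Checksum
    FROM
    (
        SELECT
            Customer_Code,
            Event_Datetime,
            Record_Checksum,
            ROW_NUMBER() OVER (
                PARTITION BY Customer_Code
                ORDER BY Event_Datetime DESC
            ) AS Recency          -- '1' is the most recent record per key
        FROM HSTG_Customer
    ) most_recent
    WHERE Recency = 1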

This process can be considered a Type 2 Slowly Changing Dimension (SCD2) mechanism, but for efficiency purposes an end-dating mechanism is not implemented; it is an insert-only ETL process.
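
As an illustration of what this insert-only target could look like, a Historical Staging table might be defined roughly as sketched below (placeholder names again); the key on the Natural Key and Event Date/Time is what the existing record check relies on, and there are no end date columns to update.

    CREATE TABLE HSTG_Customer
    (
        Customer_Code        VARCHAR(50)   NOT NULL,  -- Natural Key
        Event_Datetime       DATETIME2(7)  NOT NULL,  -- Event Date/Time of the change
        CDC_Operation        CHAR(1)       NOT NULL,  -- CDC event (I/U/D)
        Record_Checksum      CHAR(32)      NOT NULL,  -- checksum used for change detection
        Customer_Name        VARCHAR(100)  NULL,      -- example source attribute
        ETL_Insert_Datetime  DATETIME2(7)  NOT NULL,  -- ETL process metadata
        CONSTRAINT PK_HSTG_Customer PRIMARY KEY (Customer_Code, Event_Datetime)
        -- No end-dating columns; history is reconstructed by ordering on Event_Datetime
    )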



Roelant Vos


2 Responses

  1. orodriguez says:

    Hi Roelant, a colleague and I are studying different ETL patterns and found your articles very useful.
    I’d like to raise some questions in order to challenge our views and the alternative solutions we are exploring:

    1 – In your pattern, let’s suppose that you read data from a CDC source; in this case, I assume you don’t support records that come out of sequence, because you always compare the incoming record with the latest record in the target. Are out-of-sequence records something that you don’t consider a particular concern when implementing ETL? (By “out of sequence” I mean processing a source batch with records where the timestamp is earlier than that of a batch of records already processed and loaded. Obviously this might be due to issues in CDC delivery from source systems.)

    2 – Can you please explain in more detail the logic of the component “CSP – Only lookup the first record for any business key”? Why is it important to check only that the FIRST record of a set of records with the same NK is an actual delta record to be absorbed, but not the SECOND, THIRD or others?
    I am thinking that, if an identical record comes in and we want to filter it out, this is already covered by the “LKP – HSTG table for existing record check” component.
    On the other hand, if a new record comes from Staging and needs to be compared against the target STG history, you would still need to compare ALL records from the STG source (not only the first one) when the source is delivering data as CDC records with multiple changes for the same NK; alternatively, if your source is not a CDC source (but STG is derived based on FULL JOINS), you may not need to compare any record at all, because there will always be only one record per NK available in Staging.


    • Roelant Vos says:

      Hi Oscar,

      These things are essentially a result of SSIS being relatively less mature than some of its ETL competitors. One of the biggest flaws I see is the lack of unequal join options. You can do an unequal join, but only if you effectively disable caching. I like SSIS for various reasons, but not having the ability to compare information to a point in time easily has always been a bit of a problem. This is one of the reasons a lot of emphasis was placed on the interface being able to at least provide files in the right order. It also links back to the ‘effective date/load_dts’ approach because the order is ultimately determined by your time perspective/selection. Using the event date/time solves a lot of these issues.

      So to come back to your comments;
      1) This is largely dependent on the interface solution, as you point out. The Event Date/Time, when properly implemented, should support this – for instance the LSN (via an lsn-to-date function) or the moment of comparison. This is separate from the business effective dates, which may be in a different order. But the ETL relies on the event date/time for these templates, which is always in the right order except in the case of messaging.
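
      (As a side note: in SQL Server CDC this mapping from LSN to a date/time is available through sys.fn_cdc_map_lsn_to_time. A minimal sketch, assuming a capture instance named dbo_Customer and a placeholder key column, could look like this:)

          SELECT
              ct.__$start_lsn,
              sys.fn_cdc_map_lsn_to_time(ct.__$start_lsn) AS Event_Datetime,  -- LSN mapped to a date/time
              ct.__$operation,                                                -- 1=delete, 2=insert, 3/4=update
              ct.Customer_Code                                                -- placeholder Natural Key column
          FROM cdc.dbo_Customer_CT AS ct
          ORDER BY ct.__$start_lsn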

      2) This also relies on the interfacing; it’s basically a two-part concept. The first step is to prevent (accidental) reruns by simply checking if the exact effective date/time for the NK was already loaded at some point. The second lookup checks if the (new/unloaded) set with a different effective date/time is really a change. This is to support multiple changes in a single run. In tools that have more functionality in this area – for instance PowerCenter, which allows shared caching and point-in-time comparisons in a very standard way – I would compare every delta record to its nearest counterpart instead of only the most recent one. The underlying idea is that the interface already detects multiple changes, which can be genuinely new (i.e. not filtered out by the first lookup). Because of this, we can rely on the fact that these are in fact multiple changes, but it is good to compare the first one (lowest effective date/time) with the information that is already available in the DWH. In truth, I need to set some time apart to combine these two lookups into a single one. It should be possible.

      In any case, it’s a mix of preventing already loaded information, relying on the interface providing the correct sequence using the event date/time, and linking the new data delta with whatever already exists. That last step may be redundant – it probably is – but I need to reserve some time to properly (re)test this!

      Many thanks for your comment and sorry for the late reply

