Data Vault ETL Implementation using SSIS: Step 7 – Link Satellite ETL – part 3 – End Dating
I’m catching up on old drafts within WordPress, and in the spirit of being complete on the older SSIS series felt I should pick this one up and complete it. While most of my focus is on developing the virtualisation concepts I still work a lot with more traditional ETL tools, one of which is Microsoft SSIS.
Recently I merged the metadata models that underpin the virtualisation and SSIS automation and I am retesting everything end-to-end. So, after a long break time and incentive to complete this series.
Please bear in mind that the logic can (probably needs) to be updated to LEAD and LAG instead of the traditional Common Table Expression in SQL Server. Also this logic relies on zero records. Thankfully things are not hard to change if you don’t want zero records or using some of the analytic SQL functions 🙂
For the same reasons as outlined for the SSIS Satelllite End Dating mechanism, Link Satellites End Dating is always implemented as a stand-alone generic ETL process. The only difference is some minor tweaking in the logic to handle the difference between historical and ‘driving key’ based Link Satellites in the same template.
Again, thanks to SSIS’ ability to pass parameters into the Execute SQL Task we are able to provide the Process ID information into this SQL before execution, so it will be a relatively efficient task to run (for a redundant step, that is).
As with Satellite end-dating logic this is a very straightforward process with only one operation in the SSIS Control Flow. Another way of saying is that SSIS is really only used as a scheduling tool, and to support the ETL control framework (this part is not displayed here) as there is a variety of event handlers one can add to an SSIS object. For example, I use pre-and post processing event handlers to do ETL framework activities.
It doesn’t look very exciting in SSIS, but the SQL query used in the displayed Execute SQL Task (EST – displayed below) identifies any records that need to be end-dated and also performs the corresponding updates; the selected Expiry Date (the Effective Date from the next record). This means that this process can be run at any time and closes off all records which need to be closed.
An example SQL is provided a bit further in this post. It’s pretty neat in the sense that this approach supports both Driving Key and normal history templates by selecting the ‘Driving Key’ in the main Common Table Expression as either the Link Key (=normal history) or the real Driving Key (=end dating relationships across the Driving Key). The Driving Key concept is better explained in the virtualisation section, but this SQL works for both templates as it is now.
To understand the example it’s really important to consider the differences between the regular historical and Driving Key behaviour, so I will mention it again: by using the ‘normal’ relationship key (Link SK) as the ‘Driving Key’ you trigger normal history tracking (SCD2) behaviour, and by using the assigned Driving Key (one part of the relationship) you trigger Driving Key history tracking behaviour.
Please note the ‘?’ symbols are mapped to parameters via the ‘Parameter Mapping’ in the Execute SQL Task properties. It’s mapped to the unique identifier of the ETL process (= package name in this case) that executes the SQL. The high dates are inserted by the regular Link-Satellite ETL and already in place from the perspective of End Dating. The SQL looks a bit complex, but as with all things SSIS it’s all generated from metadata (no need to develop packages manually in Visual Studio these days).
WITH MyCTE ( <Link SK>, DRIVING_KEY, -- Note this is an alias that will be used for either the Link Key (for historical LSAT behaviour) or one of the Hub keys in the Link (= the Driving Key) LOAD_DATETIME, -- The Effective Date LOAD_END_DATETIME, -- The End Date RowVersion ) AS ( SELECT A.<Link SK>, B.<Link SK or Driving Key>, --using a one-side of a relationship (=Driving Key) forced this behaviour, otherwise use the Link Key for normal history behaviour A.LOAD_DATETIME, A.LOAD_END_DATETIME, DENSE_RANK()OVER(PARTITION BY B.<Link SK or Driving Key> ORDER BY B.LOAD_DATETIME, B.<Link SK or Driving Key> ASC) RowVersion FROM <Link Satellite table> A JOIN <Link table> B ON A.<Link SK>=B.<Link SK> JOIN ( SELECT B.<Link SK or Driving Key> FROM <Link Satellite table> A JOIN <Link table> B ON A.<Link SK>=B.<Link SK> WHERE A.LOAD_END_DATETIME = '99991231' GROUP BY B.<Link SK or Driving Key> HAVING COUNT(*) > 1 ) C ON B.<Link SK or Driving Key> = C.<Link SK or Driving Key> ) UPDATE A SET A.LOAD_END_DATETIME = B.LOAD_END_DATETIME, A.ETL_INSERT_RUN_ID = ?, A.CURRENT_RECORD_INDICATOR = B. CURRENT_RECORD_INDICATOR FROM <Link Satellite table> A JOIN( SELECT BASE.<Link SK>, CASE WHEN LAG.LOAD_DATETIME IS NULL THEN '19000101' ELSE BASE.LOAD_DATETIME END AS LOAD_DATETIME, CASE WHEN LEAD.LOAD_DATETIME IS NULL THEN '99991231' ELSE LEAD.LOAD_DATETIME END AS LOAD_END_DATETIME, CASE WHEN LEAD.LOAD_DATETIME IS NULL THEN 'Y' ELSE 'N' END AS CURRENT_RECORD_INDICATOR FROM MyCTE BASE LEFT JOIN MyCTE LEAD ON BASE.DRIVING_KEY = LEAD.DRIVING_KEY AND BASE.RowVersion = LEAD.RowVersion-1 LEFT JOIN MyCTE LAG ON BASE.DRIVING_KEY = LAG.DRIVING_KEY AND BASE.RowVersion = LAG.RowVersion+1 WHERE BASE.LOAD_END_DATETIME = '99991231' )B ON A.<Link SK>=B.<Link SK> AND A.LOAD_DATETIME = B.LOAD_DATETIME
Let’s run an example taken from the test set used for testing the virtualisation efforts (the results should be 100% the same). The testcase taken from the sample models is LSAT_CUSTOMER_OFFER. This is the ‘Driving Key’ based history example, and the idea is that a Customer can only have one active offer at a point in time. In other words: the Customer is the Driving Key. If we were to load the normal LSAT to populate the records prior to running this end-dating logic the recordset would look like this:
I joined to the Link to make it a bit clearer what is happening by adding the Customer and Incentive Offer Hash Keys. While it still looks cryptic, you can see now that a Customer has different Incentive Offers at various points in time. Including a ‘re-opening’ where an earlier relationship is enabled again after being superseded earlier (one of the standard testcases). The four rows basically state that there is a zero record for the first relationship between Customer and Incentive Offer, followed by the Load Date/Time for this relationship. Then, the Customer is associated with a different Incentive Offer and lastly the Customer is again associated with the original Incentive Offer.
Obviously, the end-dating process has not run yet. Running this query corrects the timelines:
There you have it – Driving Key based history tracking using the same SQL used for normal history.
For reference, the query that was run was:
WITH MyCTE ( CUSTOMER_OFFER_HSH, DRIVING_KEY, LOAD_DATETIME, LOAD_END_DATETIME, RowVersion ) AS ( SELECT A.CUSTOMER_OFFER_HSH, B.CUSTOMER_HSH, -- The Driving Key A.LOAD_DATETIME, A.LOAD_END_DATETIME, DENSE_RANK()OVER(PARTITION BY B.CUSTOMER_HSH -- Again the Driving Key ORDER BY A.LOAD_DATETIME, B.CUSTOMER_OFFER_HSH ASC) RowVersion FROM LSAT_CUSTOMER_OFFER A JOIN LNK_CUSTOMER_OFFER B ON A.CUSTOMER_OFFER_HSH=B.CUSTOMER_OFFER_HSH JOIN ( SELECT B.CUSTOMER_HSH -- Again the Driving Key FROM LSAT_CUSTOMER_OFFER A JOIN LNK_CUSTOMER_OFFER B ON A.CUSTOMER_OFFER_HSH=B.CUSTOMER_OFFER_HSH WHERE A.LOAD_END_DATETIME = '99991231' GROUP BY B.CUSTOMER_HSH HAVING COUNT(*) > 1 ) C ON B.CUSTOMER_HSH = C.CUSTOMER_HSH -- Again the Driving Key ) UPDATE A SET A.LOAD_END_DATETIME = B.LOAD_END_DATETIME, A.ETL_INSERT_RUN_ID = 0, A.CURRENT_RECORD_INDICATOR = B. CURRENT_RECORD_INDICATOR FROM LSAT_CUSTOMER_OFFER A JOIN( SELECT BASE.CUSTOMER_OFFER_HSH, CASE WHEN LAG.LOAD_DATETIME IS NULL THEN '19000101' ELSE BASE.LOAD_DATETIME END AS LOAD_DATETIME, CASE WHEN LEAD.LOAD_DATETIME IS NULL THEN '99991231' ELSE LEAD.LOAD_DATETIME END AS LOAD_END_DATETIME, CASE WHEN LEAD.LOAD_DATETIME IS NULL THEN 'Y' ELSE 'N' END AS CURRENT_RECORD_INDICATOR FROM MyCTE BASE LEFT JOIN MyCTE LEAD ON BASE.DRIVING_KEY = LEAD.DRIVING_KEY AND BASE.RowVersion = LEAD.RowVersion-1 LEFT JOIN MyCTE LAG ON BASE.DRIVING_KEY = LAG.DRIVING_KEY AND BASE.RowVersion = LAG.RowVersion+1 WHERE BASE.LOAD_END_DATETIME = '99991231' )B ON A.CUSTOMER_OFFER_HSH = B.CUSTOMER_OFFER_HSH AND A.LOAD_DATETIME = B.LOAD_DATETIME