Advanced row condensing for Satellites
When it comes to record condensing, DISTINCT just doesn’t cut it. I’ve been meaning to post about this for ages, as the earliest templates (also posted on this site) were not flexible enough to work in all cases (which is what we strive for).
Record condensing means making sure that the data delta (differential) you process is a true delta for the specific scope: no redundant records are processed into historised tables. If a source table is logically split across various target Data Warehouse entities (Hubs, Links and Satellites, for instance), the change that created the new (change / delta) transaction in the original table is not necessarily a change for a specific target. This is simply because CDC mechanisms typically consider the entire row in the source. Consider the following example:
The example shows that across the key (Customer Code) and date you can’t use a SELECT DISTINCT to minimise the recordset. It may look like you can, but what happens when the name in this example changes back to its original value? Those changes would be lost:
- C17532 for 2016-07-01 was ‘Adam’
- C17532 for 2016-07-03 was changed to ‘Scott’
- C17532 for 2016-07-05 was changed back to ‘Adam’
This ‘A-B-A’ pattern requires additional logic to evaluate changes between rows in date/time order. To add to the complexity: when evaluating rows you also need to account for logical deletes, as these are not changes in the attributes themselves but in process attributes (e.g. the CDC attribute, logical delete indicator or equivalent). This is further explained in the SQL example later in this post.
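The A-B-A effect can be sketched in a few lines of Python. This is a hedged illustration only (the actual implementation in this post is SQL); it contrasts a naive DISTINCT with a LAG-style comparison against the previous row:

```python
# Sketch of the A-B-A problem, assuming rows are already ordered by date per key.
rows = [
    ("C17532", "2016-07-01", "Adam"),
    ("C17532", "2016-07-03", "Scott"),
    ("C17532", "2016-07-05", "Adam"),
]

# A naive DISTINCT on (key, value) collapses the set to two rows,
# silently losing the change back to 'Adam'.
distinct = {(key, value) for key, _, value in rows}

# LAG-style condensing: compare each row with the previous row for the
# same key and keep it only when the value actually changed.
def condense(rows):
    kept, previous = [], {}
    for key, date, value in rows:       # rows must be ordered by date
        if previous.get(key) != value:  # equivalent of LAG() <> current value
            kept.append((key, date, value))
        previous[key] = value
    return kept

print(len(distinct))        # 2 -- the A-B-A change is lost
print(len(condense(rows)))  # 3 -- all three changes survive
```

The window-function approach preserves all three rows because it compares neighbouring rows in order, rather than comparing whole-set uniqueness.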
Why is implementing record condensing worth it?
Row condensing behaviour in ETL processes is applicable in many cases, for example when splitting Satellites into ‘slow changing’ and ‘fast changing’ parts, or when mapping attributes to different target tables.
This concept is directly related to one of the first and fundamental Design Principles: ETL processes must be able to process multiple intervals (changes) in one run, must be able to be run at any point in time, and process whatever data there is to process. The last part of this requirement directly relates to the record condensing concept outlined in this post.
For instance: if the address of an employee changes multiple times during the day and ETL is run daily, all changes should still be captured and correctly processed in a single run of the ETL process.
This requirement prevents ETL processes from having to be run many times for catch-up processing, makes it possible to easily change loading frequencies and generally supports a full reload of the Integration Layer from the Persistent Staging Area. Last but not least, it is a fundamental bit of logic required to support Data Warehouse virtualisation.
We often run a ‘re-initialisation’ run to make sure no deltas have been missed. Re-initialisation is simply copying the entire contents of the Persistent Staging Area and reloading this history against the Data Warehouse / Integration Layer. This approach relies on record condensing being in place, and Data Warehouse virtualisation simply uses this same re-initialisation approach to ‘virtualise’ the Integration Layer without any data being copied.
Record condensing in Data Vault Satellites
To implement record condensing in Satellite logic, you can calculate a full row hash across the in-scope source attributes (‘COMBINED_VALUE’) in a subquery and use the various components (e.g. Load Date/Time Stamp, Event Date/Time and CDC indicators) to figure out what happens within the delta set for a specific target:
COMBINED_VALUE, -- Full row hash across in-scope source attributes only
-- Condition 1: make sure there are differences between the individual rows
CASE
  WHEN LAG(COMBINED_VALUE, 1, 'N/A') OVER (
         PARTITION BY <Business Key>
         ORDER BY <Load Date/Time stamp> ASC, <Event Date/Time> ASC, <CDC Operation> DESC
       ) = COMBINED_VALUE
  THEN 'Same' ELSE 'Different'
END AS VALUE_CHANGE_INDICATOR,
-- Condition 2: make sure the Change Data Capture operation is different between rows
CASE
  WHEN LAG(<CDC Operation>, 1, '') OVER (
         PARTITION BY <Business Key>
         ORDER BY <Load Date/Time stamp> ASC, <Event Date/Time> ASC, <CDC Operation> ASC
       ) = <CDC Operation>
  THEN 'Same' ELSE 'Different'
END AS CDC_CHANGE_INDICATOR,
-- Condition 3: make sure the Load Date/Time stamp is ordered (remove duplicates)
CASE
  WHEN LEAD(<Load Date/Time stamp>, 1, '9999-12-31') OVER (
         PARTITION BY <Business Key>
         ORDER BY <Load Date/Time stamp> ASC, <Event Date/Time> ASC, <CDC Operation> ASC
       ) = <Load Date/Time stamp>
  THEN 'Same' ELSE 'Different'
END AS TIME_CHANGE_INDICATOR
In the overall query the above subsequently gets interpreted as follows to remove the redundant records:
WHERE (VALUE_CHANGE_INDICATOR = 'Different' AND <CDC Operation> IN (<new flag>, <changed flag>))
   OR (CDC_CHANGE_INDICATOR = 'Different' AND TIME_CHANGE_INDICATOR = 'Different')
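To make the interplay of the three indicators concrete, here is a minimal Python sketch of the same logic. It is an illustration under stated assumptions, not the actual implementation: rows are dictionaries carrying a business key, a load timestamp, a CDC operation and a full row hash (COMBINED_VALUE); the flag values 'I', 'U' and 'D' are hypothetical, and a single ordering is used for brevity where the SQL template varies the CDC operation sort direction per condition:

```python
from itertools import groupby

def condense(rows, new_flag="I", changed_flag="U"):
    kept = []
    ordered = sorted(rows, key=lambda r: (r["key"], r["load_dts"], r["cdc"]))
    for _, grp in groupby(ordered, key=lambda r: r["key"]):  # PARTITION BY key
        grp = list(grp)
        for i, row in enumerate(grp):
            # Condition 1: LAG on the full row hash
            prev_hash = grp[i - 1]["hash"] if i > 0 else "N/A"
            value_change = "Different" if row["hash"] != prev_hash else "Same"
            # Condition 2: LAG on the CDC operation
            prev_cdc = grp[i - 1]["cdc"] if i > 0 else ""
            cdc_change = "Different" if row["cdc"] != prev_cdc else "Same"
            # Condition 3: LEAD on the load timestamp (duplicate removal)
            next_dts = grp[i + 1]["load_dts"] if i + 1 < len(grp) else "9999-12-31"
            time_change = "Different" if row["load_dts"] != next_dts else "Same"
            # The WHERE clause from the post
            if (value_change == "Different" and row["cdc"] in (new_flag, changed_flag)) \
               or (cdc_change == "Different" and time_change == "Different"):
                kept.append(row)
    return kept

rows = [
    {"key": "C17532", "load_dts": "2016-07-01", "cdc": "I", "hash": "A"},
    {"key": "C17532", "load_dts": "2016-07-02", "cdc": "U", "hash": "B"},
    {"key": "C17532", "load_dts": "2016-07-03", "cdc": "U", "hash": "B"},  # redundant update
    {"key": "C17532", "load_dts": "2016-07-04", "cdc": "D", "hash": "B"},  # logical delete
]
result = condense(rows)
```

With this sample the redundant update (same hash, same CDC operation as its predecessor) is condensed away, while the logical delete survives even though none of the source attributes changed, exactly the behaviour the second branch of the WHERE clause exists for.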
Many thanks to my colleague Ben Chen for helping to update and test this for the automation logic. The full code can be generated using the Virtualisation application. It is a bit long to post here, but you get the idea!