Advanced row condensing for Satellites

When it comes to record condensing, DISTINCT just doesn't cut it. I've been meaning to post about this for ages, as the earliest templates (also posted on this site) were not flexible enough to work in all cases (which is what we strive for).

Record condensing

Record condensing is making sure that the data delta (differential) you process is a true delta for the specific scope. It's about making sure no redundant records are processed into historised tables. If a source table is logically split across various target Data Warehouse entities (Hubs, Links and Satellites, for instance), the change that triggered the new (delta) transaction in the original table is not necessarily a change for every individual target. This is simply because CDC mechanisms typically consider the entire row in the source. Consider the following example:

[Figure: Record Condensing example showing Customer Code, Name and date values in the source delta]

The example shows that across the key (Customer Code) and date you cannot simply use a SELECT DISTINCT to minimise the recordset. It may look like you can, but what would happen when the name in this example changes back to its original value? These changes would be lost.

  • C17532 for 2016-07-01 was ‘Adam’
  • C17532 for 2016-07-03 was changed to ‘Scott’
  • C17532 for 2016-07-05 was changed back to ‘Adam’

This 'A-B-A' pattern requires additional logic to evaluate changes between rows in date/time order. To add to the complexity: when evaluating rows you also need to cater for logical deletes, as these are not changes in one of the attributes themselves but in the process attributes (e.g. the CDC attribute, logical delete indicator or equivalent). This is further explained in the SQL example a bit further down in this post.
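To make this concrete, the snippet below is a minimal sketch of the example above (the column names are hypothetical). A SELECT DISTINCT over Customer_Code and Name would return only 'Adam' and 'Scott' and lose the change back on 2016-07-05, whereas comparing each row against its direct predecessor keeps all three changes:

-- Minimal sketch of the 'A-B-A' example; column names are illustrative only.
-- SELECT DISTINCT Customer_Code, [Name] over this set would collapse the
-- 2016-07-05 row into the 2016-07-01 row and lose the change back to 'Adam'.
WITH source_delta AS
(
  SELECT * FROM (VALUES
    ('C17532', 'Adam',  CAST('2016-07-01' AS DATE)),
    ('C17532', 'Scott', CAST('2016-07-03' AS DATE)),
    ('C17532', 'Adam',  CAST('2016-07-05' AS DATE))
  ) AS v (Customer_Code, [Name], Event_Datetime)
)
SELECT Customer_Code, [Name], Event_Datetime
FROM
(
  SELECT
    Customer_Code,
    [Name],
    Event_Datetime,
    -- Compare each row with its direct predecessor in date order.
    LAG([Name], 1, 'N/A') OVER
      (PARTITION BY Customer_Code ORDER BY Event_Datetime ASC) AS Previous_Name
  FROM source_delta
) sub
WHERE [Name] <> Previous_Name; -- all three rows survive, including the change back to 'Adam'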

Why is implementing record condensing worth it?

Row condensing behaviour in ETL processes is applicable in many cases, for example when splitting Satellites into 'slow changing' and 'fast changing' parts, when mapping attributes to different tables, and so on.

This concept is directly related to one of the first and fundamental Design Principles: ETL processes must be able to process multiple intervals (changes) in one run, must be able to be run at any point in time, and process whatever data there is to process. The last part of this requirement directly relates to the record condensing concept outlined in this post.

For instance: if the address of an employee changes multiple times during the day and ETL is run daily, all changes should still be captured and correctly processed in a single run of the ETL process.

This requirement removes the need to run ETL processes many times for catch-up processing, makes it possible to easily change loading frequencies, and generally supports a full reload of the Integration Layer from the Persistent Staging Area. Last, but not least, it is a fundamental bit of logic that is required to support Data Warehouse virtualisation.

We often run a 're-initialisation' to make sure no deltas have been missed. Re-initialisation is simply copying the entire contents of the Persistent Staging Area and reloading this history against the Data Warehouse / Integration Layer. This approach relies on record condensing being in place, and Data Warehouse virtualisation simply uses the same re-initialisation approach to 'virtualise' the Integration Layer without any data being copied.
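As an indication, a re-initialisation against a single Satellite can be sketched as follows. All object and column names here are hypothetical, and the view is assumed to implement the condensing logic covered in the next section:

-- Hypothetical re-initialisation sketch: clear the Satellite and reload the full
-- Persistent Staging Area history through a view that applies the condensing logic.
TRUNCATE TABLE dbo.SAT_CUSTOMER;

INSERT INTO dbo.SAT_CUSTOMER
(
  CUSTOMER_HSH,    -- hash key of the Hub business key
  LOAD_DATETIME,   -- Load Date/Time stamp
  CDC_OPERATION,   -- Change Data Capture operation
  COMBINED_VALUE,  -- full row hash across in-scope attributes
  CUSTOMER_NAME    -- in-scope context attribute
)
SELECT
  CUSTOMER_HSH,
  LOAD_DATETIME,
  CDC_OPERATION,
  COMBINED_VALUE,
  CUSTOMER_NAME
FROM dbo.VW_SAT_CUSTOMER; -- condensed selection over the Persistent Staging Area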

Record condensing in Data Vault Satellites

To implement record condensing in Satellite logic you can use a full row hash containing the (in-scope) original source attributes ('COMBINED_VALUE') in a subquery, and use the various components (e.g. the Load Date/Time Stamp, Event Date/Time and CDC indicators) to figure out what happens within the delta set for a specific target:

COMBINED_VALUE, -- Full row hash across in-scope source attributes only
-- Condition 1 -- make sure there are differences between the individual rows
CASE
  WHEN LAG(COMBINED_VALUE, 1, 'N/A') OVER
  (
    PARTITION BY
      <Business Key>
    ORDER BY
      <Load Date/Time stamp> ASC,
      <Event Date/Time> ASC,
      <CDC Operation> DESC
  ) = COMBINED_VALUE
  THEN 'Same'
  ELSE 'Different'
END AS VALUE_CHANGE_INDICATOR,
-- Condition 2 -- make sure the Change Data Capture operation is different between rows
CASE
  WHEN LAG(<CDC Operation>, 1, '') OVER
  (
    PARTITION BY
      <Business Key>
    ORDER BY
      <Load Date/Time stamp> ASC,
      <Event Date/Time> ASC,
      <CDC Operation> ASC
  ) = <CDC Operation>
  THEN 'Same'
  ELSE 'Different'
END AS CDC_CHANGE_INDICATOR,
-- Condition 3 -- make sure the Load Date/Time stamp is ordered (remove duplicates)
CASE
  WHEN LEAD(<Load Date/Time stamp>, 1, '9999-12-31') OVER
  (
    PARTITION BY
      <Business Key>
    ORDER BY
      <Load Date/Time stamp> ASC,
      <Event Date/Time> ASC,
      <CDC Operation> ASC
  ) = <Load Date/Time stamp>
  THEN 'Same'
  ELSE 'Different'
END AS TIME_CHANGE_INDICATOR
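The COMBINED_VALUE referenced in the template is simply a checksum across the in-scope source attributes. A minimal sketch of how it could be derived, assuming SQL Server and hypothetical attribute names, is shown below; delimiters and NULL handling are added so attribute values cannot run into each other:

-- Hypothetical sketch of the full row hash across in-scope attributes only
-- (SQL Server syntax; table and column names are illustrative).
SELECT
  CUSTOMER_CODE,
  CONVERT(CHAR(32),
    HASHBYTES('MD5',
      ISNULL(CUSTOMER_NAME, 'NA') + '|' +
      ISNULL(CONVERT(VARCHAR(8), DATE_OF_BIRTH, 112), 'NA') + '|' +
      ISNULL(GENDER, 'NA')
    ), 2) AS COMBINED_VALUE
FROM dbo.STG_CUSTOMER;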

In the overall query, the three change indicators are subsequently evaluated as follows to remove the redundant records:

WHERE
  (VALUE_CHANGE_INDICATOR = 'Different' AND <CDC Operation> IN (<new flag>, <changed flag>))
  OR (CDC_CHANGE_INDICATOR = 'Different' AND TIME_CHANGE_INDICATOR = 'Different')
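Putting this together, a condensed selection for a (hypothetical) Customer Satellite could look like the sketch below. The staging table, column names and CDC flag values ('Insert', 'Change') are assumptions for illustration; generated code will differ depending on naming conventions:

-- Hypothetical end-to-end sketch of the condensing pattern for a Customer Satellite.
SELECT
  CUSTOMER_CODE,
  LOAD_DATETIME,
  EVENT_DATETIME,
  CDC_OPERATION,
  COMBINED_VALUE,
  CUSTOMER_NAME
FROM
(
  SELECT
    CUSTOMER_CODE,
    LOAD_DATETIME,
    EVENT_DATETIME,
    CDC_OPERATION,
    COMBINED_VALUE,
    CUSTOMER_NAME,
    -- Condition 1: the value differs from the directly preceding row
    CASE
      WHEN LAG(COMBINED_VALUE, 1, 'N/A') OVER
           (PARTITION BY CUSTOMER_CODE
            ORDER BY LOAD_DATETIME ASC, EVENT_DATETIME ASC, CDC_OPERATION DESC) = COMBINED_VALUE
      THEN 'Same' ELSE 'Different'
    END AS VALUE_CHANGE_INDICATOR,
    -- Condition 2: the CDC operation differs from the directly preceding row
    CASE
      WHEN LAG(CDC_OPERATION, 1, '') OVER
           (PARTITION BY CUSTOMER_CODE
            ORDER BY LOAD_DATETIME ASC, EVENT_DATETIME ASC, CDC_OPERATION ASC) = CDC_OPERATION
      THEN 'Same' ELSE 'Different'
    END AS CDC_CHANGE_INDICATOR,
    -- Condition 3: the next row has a different Load Date/Time stamp (removes duplicates)
    CASE
      WHEN LEAD(LOAD_DATETIME, 1, CAST('9999-12-31' AS DATETIME2)) OVER
           (PARTITION BY CUSTOMER_CODE
            ORDER BY LOAD_DATETIME ASC, EVENT_DATETIME ASC, CDC_OPERATION ASC) = LOAD_DATETIME
      THEN 'Same' ELSE 'Different'
    END AS TIME_CHANGE_INDICATOR
  FROM
  (
    SELECT
      CUSTOMER_CODE,
      LOAD_DATETIME,
      EVENT_DATETIME,
      CDC_OPERATION,
      CUSTOMER_NAME,
      -- Full row hash across the in-scope attributes only
      CONVERT(CHAR(32), HASHBYTES('MD5', ISNULL(CUSTOMER_NAME, 'NA')), 2) AS COMBINED_VALUE
    FROM dbo.STG_CUSTOMER
  ) hashed
) evaluated
WHERE (VALUE_CHANGE_INDICATOR = 'Different' AND CDC_OPERATION IN ('Insert', 'Change'))
   OR (CDC_CHANGE_INDICATOR = 'Different' AND TIME_CHANGE_INDICATOR = 'Different');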

Many thanks to my colleague Ben Chen for helping to update and test this for the automation logic. Full code can be generated using the Virtualisation application. It is a bit long to post here, but you get the idea!

 
Roelant Vos
