Data Vault ETL Implementation using SSIS: Step 5 – Satellite ETL – part 2 – SQL selection (wide table exception)

There is more to say on exactly how Satellite selections are made, as originally documented in the Satellite ETL post. In many cases not all records from a Staging Area table are selected, but rather a subset. This creates an interesting issue which I usually refer to as the 'wide table exception'. It leads to a minor change in the SQL selection for Satellite entities. The outline and solution are described in this post.

Imagine the scenario where a subset of attributes from a Staging Area table is loaded to a Satellite type table (while other attributes from the same table are loaded elsewhere).

Wide Satellite scenario

In the Staging Area definitions we already established that we need to be prepared to process multiple changes in one run, which means potentially multiple changed records for the same natural key. In this case we are only interested in the Satellite loading process, even though the same Staging Area dataset is loaded to multiple other tables as well. This creates an interesting loading issue, as the changes in the available CDC-based delta may be unrelated to the attributes that are required by the specific Satellite. This is illustrated in the next diagram:

Wide Satellite scenario - duplication

This means that the Satellite ETL, based on the available data delta, will receive a duplicate (records 1 and 2). Also interesting is that this record is 'not changed' from the perspective of the Satellite. As a side note, this underlines the requirement for each ETL process (the Satellite in this case) to always perform its own change detection and not to blindly rely on the received change delta.
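As a minimal illustration (the column names and values below are purely hypothetical and only serve to make the scenario tangible), assume the Satellite only stores the Name attribute, while the Staging Area delta contains two CDC records for the same key because the Address changed twice:

EVENT_DATETIME       CUSTOMER_ID  NAME  ADDRESS
2015-07-01 10:00:00  CUST-1       John  1 Main Street
2015-07-01 11:00:00  CUST-1       John  2 High Street

Projected onto the attributes the Satellite is interested in (CUSTOMER_ID and NAME), both records carry exactly the same values, so from the Satellite's point of view the second record is a duplicate.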

Other than that, it is a situation that can easily be handled in a generic way. What is needed is a DISTINCT operation to remove these duplicates, and it needs to be present in every Satellite SQL selection. This pushes the original row numbering functionality (used to load multiple changes in the first place) to the outer query, around the DISTINCT selection, because including the row number in the DISTINCT would make every row unique again and undo its effect. An example query for the Satellite SQL selection that handles this scenario as well as all others is listed below:

SELECT
 RECORD_SOURCE,
 EVENT_DATETIME,
 ETL_PROCESS_ID,
 <Business Key>,
 <Attribute(s)>,
 ROW_NUMBER() OVER (PARTITION BY <Business Key> ORDER BY <Business Key>, EVENT_DATETIME) AS ROW_NUMBER,
 <CHECKSUM LOGIC> -- MD5, SHA1 or similar
 (
  CDC_OPERATION,
  <Business Key>,
  <Attribute(s)>
 ) AS CHECKSUM
FROM
 (
  SELECT DISTINCT
   RECORD_SOURCE,
   EVENT_DATETIME,
   ETL_PROCESS_ID,
   CDC_OPERATION, -- required by the checksum in the outer query
   <Business Key>,
   <Attribute(s)>
  FROM <Staging Area>
 ) distinct_selection

When this SQL logic is used, only unique rows (for the selected attributes) will be presented to the Satellite ETL. As we established in this post, the ETL still needs to check whether there really is a change, because even though the information is now distinct it still does not mean it has changed!
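To make that last point concrete, below is a minimal sketch of such a change check. It assumes the Satellite stores the checksum in a CHECKSUM column and has a load date/time column (here called SAT_LOAD_DATETIME); these are placeholder names for illustration and not necessarily the ones used elsewhere in this series. Only rows for which no Satellite row exists yet, or whose checksum differs from the most recent Satellite row, are passed on for insert:

SELECT stg.*
FROM
 (
  <Satellite SQL selection> -- the query shown above
 ) stg
LEFT OUTER JOIN
 (
  SELECT
   <Business Key>,
   CHECKSUM AS EXISTING_CHECKSUM,
   ROW_NUMBER() OVER (PARTITION BY <Business Key> ORDER BY SAT_LOAD_DATETIME DESC) AS RECENCY
  FROM <Satellite>
 ) sat
 ON stg.<Business Key> = sat.<Business Key>
 AND sat.RECENCY = 1 -- compare against the most recent Satellite row only
WHERE sat.EXISTING_CHECKSUM IS NULL -- no row in the Satellite yet for this key
 OR sat.EXISTING_CHECKSUM <> stg.CHECKSUM -- the attribute values have changed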

 
Roelant Vos

4 Responses

  1. says:

    In relation to the sample queries to get data from a source for, let's say, a Satellite ETL, like the one below:

    SELECT
    RECORD_SOURCE,
    EVENT_DATETIME,
    ETL_PROCESS_ID,
    Business Key,
    Attribute(s),
    ROW_NUMBER() OVER (PARTITION BY Business Key ORDER BY Business Key, EVENT_DATETIME) AS ROW_NUMBER,
    CHECKSUM LOGIC -- MD5, SHA1 or similar
    (
    CDC_OPERATION,
    Business Key,
    Attribute(s)
    ) AS CHECKSUM
    FROM
    (
    SELECT DISTINCT
    RECORD_SOURCE,
    EVENT_DATETIME,
    ETL_PROCESS_ID,
    Business Key,
    Attribute(s)
    FROM <Staging Area>
    ) distinct_selection

    I have 2 questions:

    1. Isn’t the DISTINCT clause almost superfluous in any case, because of the inclusion of the Event_DateTime per Business Key, which will make every record distinct already? I think the logic requires something like making the records distinct (excluding the Event_DateTime), plus something that allows for scenarios where the data changes for a record and then changes back to a previous identical set of values in the same batch, so that both changes are relevant for capture. In other words, a change is distinct from another one if the subset of attributes of interest (excluding timestamps) is different from the immediately previous set for the same NK.

    2. In the same section on Satellites, I want to refer to this paragraph: “When this SQL logic is used, only unique rows (for the selected attributes) will be presented to the Satellite ETL. As we established in this post, the ETL still needs to check whether there really is a change, because even though the information is now distinct it still does not mean it has changed!”

    Is what you are implying that there will be a need to process one record at a time for each NK, to ensure that any new record is in fact different from the existing target record? Are you able to achieve this in the INSERT/UPSERT SQL statement? What is your suggestion?

     

    • Roelant Vos says:

      Hi,

      Sorry for the late reply, but here are my two cents:

      1) What I’m attempting to do is basically condense the changes for a set of attributes that is smaller than the set in the source (Staging Area table). Because the Staging Area table contains the (unique) changes for the source natural key, the grain is accurate. But since a Satellite may contain a subset of attributes from that same table, you may get ‘phantom’ changes without the DISTINCT subquery. This happens when the change as recorded in the Staging Area relates to an attribute you’re not interested in from the perspective of the Satellite. It’s worthwhile taking this into account because you want to maintain the flexibility to split Satellites into multiple (smaller) ones, for instance based on the rate of change. In other words, if the Satellite contains the same attributes as the Staging Area the DISTINCT is superfluous, but if it has fewer attributes it is needed to avoid storing too many rows / to condense the set. This mechanism relies on the CDC function being handled by the Staging Area; at that point you have a delta to process further.

      2) With that last paragraph I meant to point out that even if you perform a DISTINCT operation to ‘condense’ the changes into proper changes (per NK / datetime), because of the limited number of attributes this may still not be a change for that particular Satellite. So you still (always) want to use checksums to prevent processing redundant rows. For instance, if you don’t record the ‘address’ attribute in your Satellite, but it is available in the Staging Area (and therefore under CDC), there will be a record present in the Staging Area table when the address changes. If you assume the address changes three times, but nothing else does, you will have three records in your delta. These records will contain the other values as well, even though they have not changed. The query will look at the attributes that are relevant to the Satellite and condense this to a single record. But that single record is still not a ‘change’ for the Satellite, so the checksum will still be required!

      Many thanks for your reply!

       
  2. says:

    Hi Roelant, thank you for your answers,
    In relation to point 1).

    I do understand the purpose of identifying delta changes across a subset of attributes as a need to populate any Satellite, thank you for that explanation anyway. My main observation and concern was due to the inclusion of the field EVENT_DATETIME in the SELECT DISTINCT clause in your source queries.
    I think having that DATETIME attribute involved will destroy the possibility of finding distinct changes, because from your Staging source, the EVENT_DATETIME will always force a distinct record.

     
    • Roelant Vos says:

      Hi Oscar, sorry, I get your point now. You’re right; the Event Date/Time will always be a unique value, so this part of the logic needs to be adjusted to handle this better! My first reaction is to add a MIN(Event Date/Time) clause, but I haven’t tested it yet. Great observation!
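      Just to make that idea concrete, an untested sketch along those lines would replace the SELECT DISTINCT subquery with a GROUP BY over the Satellite attributes, taking the earliest Event Date/Time for each distinct set of values:

      SELECT
       RECORD_SOURCE,
       MIN(EVENT_DATETIME) AS EVENT_DATETIME, -- earliest event for this set of values
       ETL_PROCESS_ID,
       <Business Key>,
       <Attribute(s)>
      FROM <Staging Area>
      GROUP BY
       RECORD_SOURCE,
       ETL_PROCESS_ID,
       <Business Key>,
       <Attribute(s)>

      Note that, as you point out, this still collapses a value that changes and then changes back to a previous identical set within the same batch; handling that properly would require logic that compares each record against its immediate predecessor per key rather than a plain GROUP BY.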

       
