Data Vault ETL Implementation using SSIS: Step 5 – Satellite ETL – part 2 – SQL selection (wide table exception)
There is more to say on exactly how Satellite selections are made as originally documented in the Satellite ETL post. In many cases not all records from a Staging Area table are selected, but rather a subset. This creates an interesting issue to which I usually refer to as the ‘wide table exception’. It leads to a minor change in the SQL selection for Satellite entities. The outline and solution are described in this post.
Image the scenario where a subset of attributes from a Staging Area table is loaded to a Satellite type table (and some others too).
In the Staging Area definitions we already established that we need to be prepared to process multiple changes in one run, which means potentially multiple changed records for the same natural key. In this case we are only interested in the Satellite loading process, even though the same Staging Area dataset is loaded to multiple other tables. This creates an interesting loading issue as the changes in the available CDC based delta itself may be unrelated to the attributes that are required by the specific Satellite. This is illustrated in the next diagram:
This means that the Satellite ETL, based on the available data delta, will receive a duplicate (record 1 and 2). Also interesting is that this record is ‘not changed’ from the perspective of the Satellite. As a side note this underlines the requirement for each ETL process – the Satellite in this case – to always perform their own change detection and not to blindly rely on the received change delta.
Other than that, it is a situation that can be easily handled in a way that makes it generic. What is needed is a DISTINCT function to remove these duplicates, and it needs to be present in every Satellite SQL selection. This pushes the original functionality related to the row numbering (to load multiple changes in the first place) ‘around’ this distinct selection because the ordering undoes the distinct selection. An example query for Satellite SQL selection that handles both this scenario, but all others as well is listed below:
SELECT RECORD_SOURCE, EVENT_DATETIME, ETL_PROCESS_ID, <Business Key>, <Attribute(s)>, <ROW_NUMBER() OVER (PARTITION <Business Key> BY ORDER BY <Business Key>, EVENT_DATETIME)> AS ROW_NUMBER, <CHECKSUM LOGIC> -- MD5, SHA1 or similar ( CDC_OPERATION, <Business Key>, <Attribute(s)> ) AS CHECKSUM FROM ( SELECT DISTINCT RECORD_SOURCE, EVENT_DATETIME, ETL_PROCESS_ID, <Business Key>, <Attribute(s)> FROM <Staging Area> ) distinct_selection
When this SQL logic is used only unique rows (for the selected attributes) will be presented the Satellite ETL. As we established in this post the ETL still needs to check if there really is a change, because even though the information is now distinct it still does not mean it’s changed!