Data Vault ETL Implementation using SSIS: Step 7 – Link Satellite ETL – part 1 – Regular History
From an ETL development perspective the Data Vault Link Satellite (LSAT) ETL template is the most complex, and also conceptually the most challenging. This is mainly because the behavior of the ETL depends on the type of history that is tracked. To define proper LSAT ETL templates for automation I classify them as follows:
- Regular / Historical Link Satellites that track the changes for attributes of a relationship over time. This is very similar to Satellite ETL processing, with the difference that the attributes provide context for a (business) relationship (Link) and not a business entity (Hub). This type of ETL template sources the data delta from the Staging Area the same way as most of the Data Vault ETL processes that track changes (e.g. Satellites)
- Driving Key Link Satellites that track the changes between, or across, relationships. This means that (other) relationships are closed (end dated – conceptually) if a new relationship is created. Another way to look at this is seeing the Driving Key as ‘enforcing’ the ‘one’ side of a constructed one-to-many relationship data-wise inside the (model-wise many-to-many) Link. But, this is an implementation post and not a conceptual post so I’ll leave it at that for now
For data modelling this means that if you want to track both of these types of history for a relationship, you need two separate tables. This post covers the first defined template: how to load a data delta from the Staging Area into a Link Satellite table, just capturing changes over time for relationship (context) attributes: regular history.
Bear in mind that this approach embraces the design decision to record context information in Link Satellites and not in the Link entities themselves. Arguably, you can define various types of Links such as Transactional Links to record information, but for the purpose of this implementation overview I tend to follow Dan Linstedt’s direction to define Links as a unique list of business key pairs / integration points and store everything else in Link Satellites.
In Data Vault (1.0) the Link Satellite behaves largely like the Satellite (explained here), except that it requires the various Hub Key Lookups and consequently the Link Key Lookup to retrieve all required information (keys); specifically the Link Key of the parent Link entity.
Similar to the Satellite approach, the end-dating (expiry date/time) functionality is implemented as a separate (ETL) process. This is for various reasons including reusability, parallelism and the option to remove the logic altogether. For this reason the Link Satellite process is ‘Insert Only’, but it will contain an EXPIRY_DATETIME attribute which will later be used by the end-dating ETL. As the end-dating logic is redundant information added for performance / selection reasons, it is designed to be potentially removed without re-engineering. Alternatively it can be virtualised in views or the BI semantic layer.
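Because the expiry date/time is fully derivable from the effective date/times, virtualising it is straightforward. A minimal sketch of such a view, using a hypothetical LSAT_CUSTOMER_OFFER table and illustrative column names:

```sql
-- Sketch only: table / column names are assumptions, not part of the template.
-- The expiry of a row is the effective date/time of the next row for the same
-- Link key, or the high date when no newer row exists.
CREATE VIEW vw_LSAT_CUSTOMER_OFFER AS
SELECT
    LNK_CUSTOMER_OFFER_SK,
    EFFECTIVE_DATETIME,
    LEAD(EFFECTIVE_DATETIME, 1, CAST('9999-12-31' AS DATETIME))
        OVER (PARTITION BY LNK_CUSTOMER_OFFER_SK
              ORDER BY EFFECTIVE_DATETIME) AS EXPIRY_DATETIME,
    OFFER_STATUS,   -- illustrative context attributes
    OFFER_DISCOUNT
FROM LSAT_CUSTOMER_OFFER;
```

With a view like this the physical EXPIRY_DATETIME column (and the end-dating ETL) can be dropped without re-engineering the insert-only load.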
The Control Flow specifies two separate Data Flows, both of which are conceptually related to the Link Satellite process but each of which creates its own records:
- Regular Satellite processing (load from the Staging Area)
- Initial records creation (handle time-lines), also known as ‘zero-records’
The Data Flow is as follows:
The first Data Flow Task (DTF – LSAT<Table Name>) is documented here:
- SRC – STG_<Table Name>. The direct connection to the Staging Area has the following characteristics in the selection query:
- Checksum generation; the query generates a checksum across the business key attributes in the Staging Area table. This is always more than one (1) business key as this covers a relationship (Link). I typically use SHA1 or MD5 for this
- Distinct selection; because in most cases the attributes selected are a subset of all the attributes in the Staging Area table, a distinct selection is implemented by default. The DISTINCT needs to be applied in an inline view / subquery, because the analytical (window) SQL functions would otherwise interfere with the de-duplication
- Attributes; the attributes that are required to be stored as context need to be added to the query
- Ranking; while SSIS does provide a ranking operation I have embedded a ranking mechanism in the SQL override to detect multiple changes for the same business key. This is important in order to be able to process multiple changes for the same key in a single run. The ranking mechanism is implemented as follows:
```sql
ROW_NUMBER() OVER (PARTITION BY <Business Keys (multiple)>
                   ORDER BY <Business Keys (multiple)>, <Event Date/Time>) AS ROW_NUMBER
```
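Putting the above together, the SQL override for the source could look roughly as follows. All table and column names are illustrative, and the hashed column list is an assumption: here the checksum spans the context attributes so that it supports the checksum comparison later in the flow (adjust the hashed list to the template’s exact checksum definition; the business keys can be included as well):

```sql
-- Sketch of the Staging Area selection query (names are assumptions).
SELECT
    CUSTOMER_CODE,
    OFFER_CODE,
    EVENT_DATETIME,
    OFFER_STATUS,
    OFFER_DISCOUNT,
    CHECKSUM,
    -- Ranking: number the changes per relationship so multiple changes
    -- for the same key can be processed in a single run.
    ROW_NUMBER() OVER (PARTITION BY CUSTOMER_CODE, OFFER_CODE
                       ORDER BY CUSTOMER_CODE, OFFER_CODE, EVENT_DATETIME) AS ROW_NUMBER
FROM
(
    -- The DISTINCT sits in an inline view, away from the window function.
    SELECT DISTINCT
        CUSTOMER_CODE,
        OFFER_CODE,
        EVENT_DATETIME,
        OFFER_STATUS,
        OFFER_DISCOUNT,
        HASHBYTES('MD5',
            ISNULL(OFFER_STATUS, 'N/A') + '|' +
            ISNULL(CAST(OFFER_DISCOUNT AS VARCHAR(20)), 'N/A')) AS CHECKSUM
    FROM STG_CUSTOMER_OFFER
) sub;
```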
- CNT – Discarded / Inserted / Input. Optional record counts. The counts are stored in local variables and committed to the ETL process metadata after completion of the package execution
- LKP – HUB_<Table Name 1, 2 etc.>. These (multiple) Lookup operations against the corresponding Hub table are required to obtain the Data Warehouse (Hub) Keys. It is a lookup against the Business Key only
- LKP – LNK_<Table Name>. This Lookup against the parent Link entity is executed using the combination of Hub Keys that were obtained in the previous Hub Lookups. This Lookup is needed to make sure we can associate the information to the correct relationship (the Link), and to make sure the relationship already exists
- LKP – LSAT table for existing records. This Lookup operation against the target Link Satellite table serves as a safety catch to prevent information from being inserted multiple times. It is a lookup against the Link Key and Event Date/Time. If there is a hit in the lookup the record is discarded (gracefully rejected); otherwise the process will continue (for that record). For performance reasons – it is an extra lookup / cache after all – the record set in the lookup is restricted to only the Link Key and the Event Date/Time
- CSP – Only lookup the first record for any business key. The Conditional Split verifies the value of the Row Number. If it has value ‘1’ it means that it is the first change for that relationship (combination of Business Keys) and it will need to be compared against the most recent record in the Link Satellite table. All other records for the same key (other than ‘1’) will be routed directly to the Union without any comparison against existing values, as the Staging Area ETL has already verified that these are in fact changes
- LKP – LSAT table for comparison. This Lookup operation returns the most recent version of information based on the Link Key (i.e. maximum Event Date/Time) as well as the checksum for the actual comparison. The join condition can be placed on just the Link Key as the SQL override already restricts the set to the most recent record for each key from the Link Satellite
- CSP – Compare Values. The Conditional Split evaluates the Staging and Link Satellite checksums to determine whether a change in the records exists. If the checksums are not the same this reflects a change, which leads to an insert into the Link Satellite target table
- Union All. The purpose of the Union operator (Union All function) is to merge the various data flows for insertion
- DRC – Standard Values. Any proprietary ETL metadata controls and/or values can be set in this Derived Column Transformation. No Data Vault specific values are set
- DST – LSAT_<Table_Name>. The destination target Link Satellite table, which is a straight insert of the remaining recordset. Note that this is an insert-only approach (i.e. no end-dating)
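For reference, the chain of lookups and conditional splits above corresponds roughly to the following set-based SQL. This is a sketch only: all table and column names are illustrative, the checksum is assumed to already be present in the Staging Area query, and the SSIS-specific behaviour (caching, graceful rejection, counts) is not represented:

```sql
-- Set-based sketch of the first Data Flow (names are assumptions).
INSERT INTO LSAT_CUSTOMER_OFFER
    (LNK_CUSTOMER_OFFER_SK, EFFECTIVE_DATETIME, CHECKSUM, OFFER_STATUS, OFFER_DISCOUNT)
SELECT
    lnk.LNK_CUSTOMER_OFFER_SK,
    stg.EVENT_DATETIME,
    stg.CHECKSUM,
    stg.OFFER_STATUS,
    stg.OFFER_DISCOUNT
FROM STG_CUSTOMER_OFFER stg
-- LKP - HUB: obtain the Hub keys via the business keys
JOIN HUB_CUSTOMER hc ON hc.CUSTOMER_CODE = stg.CUSTOMER_CODE
JOIN HUB_OFFER    ho ON ho.OFFER_CODE    = stg.OFFER_CODE
-- LKP - LNK: obtain the Link key via the combination of Hub keys
JOIN LNK_CUSTOMER_OFFER lnk
  ON  lnk.HUB_CUSTOMER_SK = hc.HUB_CUSTOMER_SK
  AND lnk.HUB_OFFER_SK    = ho.HUB_OFFER_SK
-- LKP - LSAT existing records: safety catch against double inserts
WHERE NOT EXISTS
    (SELECT 1 FROM LSAT_CUSTOMER_OFFER lsat
     WHERE lsat.LNK_CUSTOMER_OFFER_SK = lnk.LNK_CUSTOMER_OFFER_SK
       AND lsat.EFFECTIVE_DATETIME    = stg.EVENT_DATETIME)
-- CSP: only the first change per key (ROW_NUMBER = 1) is compared against
-- the most recent stored checksum; later changes are inserted directly
AND (stg.ROW_NUMBER > 1
     OR NOT EXISTS
        (SELECT 1
         FROM (SELECT TOP 1 CHECKSUM
               FROM LSAT_CUSTOMER_OFFER
               WHERE LNK_CUSTOMER_OFFER_SK = lnk.LNK_CUSTOMER_OFFER_SK
               ORDER BY EFFECTIVE_DATETIME DESC) cur
         WHERE cur.CHECKSUM = stg.CHECKSUM));
```

Note that the inner joins also enforce the prerequisite that the Hubs and the Link have been loaded before the Link Satellite runs.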
The second step in the Data Flow covers the creation of Dummy Records if required:
This process is very simple in design; its purpose is only to insert a zero (dummy) record to initiate the timeline for any relationship that was inserted into the Link Satellite. The dummy values cover the defaults such as ‘Unknown’, ‘Not Applicable’, -1 or similar. The effective date can be set to 1900-01-01 (or a similar low date) and the expiry date (optional) can be set to 9999-12-31.
There is no requirement to look up the correct expiry date, as the End Dating logic will handle this in its own separate / dedicated ETL process. So, similar to the regular processing, this step can be labelled ‘insert-only’. From a design perspective it is recommended to let each Link Satellite process handle its own dummy records, as opposed to driving this from the Link process. The reason is that in this approach the Link ETL does not need to be updated when a new Link Satellite entity is added to the model.
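The zero-record step can be sketched in SQL as follows (again with illustrative table, column and default values; the expiry date column is optional, as noted above):

```sql
-- Sketch of the zero-record ('dummy record') step: insert a 1900-01-01 row
-- for every Link key in the Link Satellite that does not have one yet.
INSERT INTO LSAT_CUSTOMER_OFFER
    (LNK_CUSTOMER_OFFER_SK, EFFECTIVE_DATETIME, EXPIRY_DATETIME,
     CHECKSUM, OFFER_STATUS, OFFER_DISCOUNT)
SELECT DISTINCT
    lsat.LNK_CUSTOMER_OFFER_SK,
    '1900-01-01',   -- start of the timeline
    '9999-12-31',   -- optional; otherwise left to the End Dating process
    'N/A',          -- dummy checksum / default values below
    'Unknown',
    -1
FROM LSAT_CUSTOMER_OFFER lsat
WHERE NOT EXISTS
    (SELECT 1 FROM LSAT_CUSTOMER_OFFER zero
     WHERE zero.LNK_CUSTOMER_OFFER_SK = lsat.LNK_CUSTOMER_OFFER_SK
       AND zero.EFFECTIVE_DATETIME    = '1900-01-01');
```

Because the guard checks for the 1900-01-01 row per Link key, the statement is idempotent and can safely run after every regular Link Satellite load.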