Data Vault ETL Implementation using SSIS: Step 4 – Link ETL
In the Data Vault workflow the Link is the next object to typically be loaded after the Hub has been processed. This can be done in parallel with the standard Satellites but this will be covered in a future posting about the workflow/scheduling of Data Vault ETL.
The Link ETL comes down to select the distinct pairs of business keys from the respective Staging Area table, check if they’re there and insert them if they’re not there already. By its nature this ETL process is already compliant with the fundamental ETL requirements. Similar to the Hub ETL you can run it every time and any time without adding additional complexity. Also similar to the Hub ETL the string value of the Record Source will be resolved to an ID (Record Source ID).
In the SSIS implementation there is no additional logic in the Control Flow except the usual ETL process metadata handles. The Data Flow is as follows:
- SRC – <STG table>. The direct connection to the Staging Area which selects the distinct Business Key, Record Source and Event Date/Time. The aggregator further flattens the dataset to one record for each business key pair (per Record Source). This is necessary because the Event Date/Time typically occurs multiple times for each key pair, and requires to be loaded from the Staging Area. Effectively the minimum date is selected for the unique combination of business keys
- CNT – Discarded / Inserted / Input. Optional record counts. The counts are stored in local variables and committed to the ETL process metadata after completion of the package execution.
- DRC – Placeholder Management. In Link type tables not every business key value has to be present for the processing to continue. Placeholders are used in this case (-1) values only, indicating that keys were not present for the specific record at that particular point in time.
- DRC – Standard Values. Any proprietary ETL metadata controls and/or values can be set in this Derived Column Transformation. No Data Vault specific values are set.
- LKP – Record Source. This Lookup operation resolves the string value that was essentially hard coded in the Staging Area ETL to an ID (integer) value that is to be stored in the Link table as the Record Source ID. The lookup condition is on the Record Source (code) and the lookup has to fail if there is no hit.
- LKP – Hub table <all involved Hubs>. The Hub keys are queried here based on the provided source Business Key attribute and the Record Source ID that was retrieved in the previous Lookup operation. All Hub keys together will form the lookup input for the Link key.
- LKP – Target table to check for existing records. The existing records Lookup operation is the real Link key lookup, which is executed joining the key pair of attribute(s) which act as Business Key(s). The Record Source ID is not part of the join condition as the source relationship is already managed in the Hub entity. If no record is found the record will remain in the Data Flow to be inserted, otherwise it is discarded (gracefully rejected).
- DST – LNK_<table>. The destination target Link table, which is a straight insert of the remaining recordset. Note that this is an insert-only approach.
The Link ETL does not have to worry about the managing (end dating) of relationships, driving keys and many-to-many relationships, that logic is built in the Link-Satellite templates.