Data Vault ETL Implementation using SSIS: Step 3 – Hub ETL – part 1 – overview
The Staging Area and Historical Staging, both part of the conceptual Staging Layer, are not directly related to Data Vault although at least the Staging Area ETL paves to way with the definition of the Event Date/Time. The Hub however is the first true Data Vault template to be implemented in SSIS. Please note that this is true for DV1.0; for DV2.0 the Staging Area is incorporated into the design a bit more.
The Hub ETL comes down to select the distinct set of business keys from the respective Staging Area table, check if they’re there and insert them if they’re not there already. By its nature this ETL process is already compliant with the fundamental ETL requirements. You can run it every time and any time without adding additional complexity.
Arguably the biggest difference between the concept and implementation is the addition of the Record Source in every Hub table for auditing purposes. The string value of the Record Source will be resolved to an ID (Record Source ID) in the ETL process from the Staging Area to the Hub. In most cases the Record Source (ID) will be the same for all records in the Hub, but it can provide a mechanism to handle situations where the same key has a different meaning between sources.
In the SSIS implementation there is no additional logic in the Control Flow except the usual ETL process metadata handles. The Data Flow is as follows:
- SRC – <STG table>. The direct connection to the Staging Area where the distinct selection of the business key including the Event/Date Time and Record Source are queried. Since the Staging Area can contain multiple changes for the same natural key (i.e. Event Date/Time can occur more than once for the key) the minimum (MIN) value for the Event Date/Time is select in a SQL override GROUP BY statement. Without this grouping duplicate records would be inserted which invalidate the unique constraint which should be set on the Hub Business Key.
- CNT – Discarded / Inserted / Input. Optional record counts. The counts are stored in local variables and committed to the ETL process metadata after completion of the package execution.
- DRC – Standard Values. Any proprietary ETL metadata controls and/or values can be set in this Derived Column Transformation. No Data Vault specific values are set.
- LKP – Record Source. This Lookup operation resolves the string value that was essentially hard coded in the Staging Area ETL to an ID (integer) value that is to be stored in the Hub table as the Record Source ID. The lookup condition is on the Record Source (code) and the lookup has to fail if there is no hit.
- LKP – Target table to check for existing records. The existing records Lookup operation is the real Hub key lookup, which is executed joining the attribute(s) which acts as Business Key(s). If no record is found the record will remain in the Data Flow to be inserted, otherwise it is discarded (gracefully rejected).
- DST – HUB_<table>. The destination target Hub table, which is a straight insert of the remaining record set. Note that this is an insert-only approach.
One of the role of the Hub concept is to handle key distribution; in this example this is handled by having the HUB meaningless / Surrogate Key or ID as an identity column on database level (SQL Server). This is because SSIS does not have a standard sequence generator operation. Additionally, a Unique Index (Unique Key) is placed on the Business Key(s) to prevent duplicates. The correct implementation of the ETL would prevent this, but better safe than sorry and it provides performance gains as this is the field most used in lookups.
As per Microsoft best practices the Primary Key is defined as a non-clustered (not the default) index and the Unique Key is created as a clustered index.
This is an overview of the standard Hub processing and it works in most scenarios. Having said that the exception of the rule are documented here: Hub ETL exceptions.