Virtualising your Data Vault – Hubs and Links
With Data Vault, the Hub ETLs are usually the first to be developed – they are very easy to build once your model is complete! And it was the case with creating these virtualised ETL templates as well. Because Hubs and Links are so similar I covered them both in this post.
In this virtualisation Proof of Concept I used the automation metadata I normally use for automating SSIS, Data Services and Powercenter ETL development. Using this information (great for comparing the end results!) I wrote a small C# application to generate database views following the approach outlined in the SQL example below. I got a bit carried away and integrated it with SQL Server using the SMO API as well!
Compared to regular ETL / DWH automation, Hubs and Links are handled slightly differently in the virtualized DWH as you are essentially combining various ETLs into a single Hub or Link view.
For various reasons related to ease of management, consistency and parallel loading you would normally create a separate ETL process for each table that contains the Business Key you need. This allows you to keep your workflows independent and ‘self supporting’. So, for instance if you have defined an ‘Employee’ Hub that can receive its Business Keys from 10 different source tables, in normal ETL development you would create 10 separate ETL processes.
But as a virtual approach – using a view – you end up with a single Hub or Link view which includes the distinct union of the 10 (sub)queries. Technically speaking the iteration through your metadata is slightly different to achieve this: instead of looping once through the combination of the Source and Target tables as you do in regular ETL automation you now perform an additional loop to accommodate a per-source subquery.
As an end result it’s very easy to reconcile because the number of views is exactly the same as the number of entities in your Data Vault model – quite easy to understand what’s happening this way.
An overview of what I learned, tips and tricks and other things to be aware of:
- It makes sense to cast the source business keys to their Data Vault / Integration Layer counterparts (CAST <source attribute> as <Hub Business Key) in the subqueries / inline view. This is because the source attribute name for your Business Key potentially differs between the source tables and it’s easier to handle the hashing and grouping in one go with consistent attribute names. Casting the correct target name as a result of every subquery in the union makes life very easy – and it’s a minor tweak compared to existing automation logic
- Using subqueries / inline views allows you to easily accommodate specific logic such as concatenation, composite keys and pivoting which are all part of the standard metadata
- The Event Date/Time is usually synonymous to a Load Date/Time. However I define the Event Date/Time as the closes you can get to the actual change. It is ideally generated by CDC mechanism as I feel the moment of ETL execution should be decoupled from the effective dates. For the Hubs you select the MIN date/time because you try to capture the first time the Business Key was presented to the Data Warehouse
- The MIN statement used above doubles as a DISTINCT operation, so no need to specify this further
- You need a source that has full history; a Historical Staging Area. Only this, combined with proper interfacing guarantees the correct timelines; the result needs to be deterministic
- For Links, make sure you add ordering in the Business Key selection. It may be quite obvious but I got caught out by having different hashes in the Links and Link-Satellites because I hadn’t specified the order. Remember SQL is not deterministic by default 🙂
- I define Links as integration points: distinct combination of key pairs
An example SQL for the virtual Hub is available here:
SELECT <Hash algorithm covering the Business Key> AS <Hub>_SK, MIN(<Event Date/Time>) AS <Event Date/Time>, <Record Source>, <Hub Business Key attribute> -- This is the target attribute name for the Business Key, as it is known in the Hub table FROM ( SELECT <Source business key attribute> AS <Hub Business Key attribute>, OMD_RECORD_SOURCE, MIN(<Event Date/Time>) AS <Event Date/Time>, FROM <Source table 1> WHERE <Source business key attribute> IS NOT NULL GROUP BY <Source business key attribute>, <Record Source> UNION SELECT <Source business key attribute> AS <Hub Business Key attribute>, <Record Source>, MIN(<Event Date/Time>) AS <Event Date/Time>, FROM <Source table 2> WHERE <Source business key attribute> IS NOT NULL GROUP BY <Source business key attribute>, <Record Source> UNION... -- You can have as many unions as you have Business Keys in your source tables ) HUB_selection GROUP BY <Hub Business Key attribute>, <Record Source>
Regarding Links the process is very similar, you only need to replace the <Hub_SK> with a <Link_SK> and combine the Business Keys into a single Link hash. And of course both (or more) keys need to be queried in the subqueries to enable this.
So, what I did is:
- Query each <Source business key attribute> AS <Hub Business Key attribute) for every single Hub key
- Hash the combination of the Business Keys in the outer query: this is your Link SK / Link DWH Key
- Hash every single Business Key individually, these are used to join back to the outlying Hubs
And you’re done!