Virtualising your Data Vault – Satellites
Once you have nailed the fundamental metadata requirements and prerequisites for Data Vault ETL automation, changing the automation output in terms of target platform or (ETL) tool is relatively easy. This is especially true for Satellites as their implementation in a virtualised setting is usually 1-to-1 with their ETL instantiated counterparts.
To make the distinction: Hubs are handled differently for virtualisation, as you are essentially combining various ETL processes into a single Hub view. For example, an ‘Employee’ Hub that can receive its business keys from 10 different source tables would typically result in 10 separate ETL processes if you use an ETL tool.
With a virtual approach, using a view, you end up with a single Hub view that contains the distinct union of the 10 (sub)queries. Technically speaking, the iteration through your metadata is different when generating a Hub view. A Satellite is easier in that sense, especially if you adapt from existing ETL automation efforts.
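To illustrate the Hub idea, here is a minimal sketch using Python and SQLite. The table and column names (`hstg_hr_employees`, `hstg_payroll_staff`, and so on) are made up for the example, and only two source tables stand in for the 10 mentioned above; record source and load date columns are omitted for brevity.

```python
import sqlite3

# Two hypothetical historised source tables that both supply the
# 'Employee' business key.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE hstg_hr_employees (employee_id TEXT, name TEXT);
CREATE TABLE hstg_payroll_staff (staff_number TEXT, salary REAL);

INSERT INTO hstg_hr_employees VALUES ('E001', 'Alice'), ('E002', 'Bob');
INSERT INTO hstg_payroll_staff VALUES ('E002', 50000), ('E003', 60000);
""")

# One view replaces what would otherwise be separate ETL processes per
# source: UNION (as opposed to UNION ALL) deduplicates the keys across
# the subqueries.
cur.executescript("""
CREATE VIEW hub_employee AS
SELECT employee_id AS employee_key FROM hstg_hr_employees
UNION
SELECT staff_number FROM hstg_payroll_staff;
""")

rows = cur.execute(
    "SELECT employee_key FROM hub_employee ORDER BY employee_key").fetchall()
print([r[0] for r in rows])  # E002 appears only once despite two sources
```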
The interesting point about virtualising a Satellite is the handling of the ‘zero record’. In typical ETL tool development this is implemented as a post-process or separate step, which adds a record to initialise the timelines for the business entity context (the history of a specific Hub key in the Satellite). Using views, of course, this has to be implemented differently. Zero records, although derived, are invaluable in simplifying the upstream processing / querying of time-variant information.
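The value of the zero record is easiest to see with a point-in-time lookup. Below is a minimal sketch (SQLite via Python; the `sat_employee` table and its columns are illustrative, not from the post): a lookup at a date before the first real change finds nothing unless the timeline has been initialised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE sat_employee (
  employee_key TEXT, effective_datetime TEXT, name TEXT);
INSERT INTO sat_employee VALUES ('E001', '2020-05-01', 'Alice');
""")

# Point-in-time query: which context applied to E001 on 2020-01-01?
pit = """
SELECT name FROM sat_employee
 WHERE employee_key = 'E001' AND effective_datetime <= '2020-01-01'
 ORDER BY effective_datetime DESC LIMIT 1
"""
before = cur.execute(pit).fetchall()
print(before)  # [] -> no context exists before the first real change

# The zero record initialises the timeline back to the 'beginning of time',
# so the same query now always finds a row.
cur.execute("INSERT INTO sat_employee VALUES ('E001', '1900-01-01', NULL)")
after = cur.execute(pit).fetchall()
print(after)   # [(None,)]
```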
Some minor disclaimers:
- You need a source that has full history: a Historical Staging Area. Only this, combined with proper interfacing, guarantees the correct timelines. Querying a source system directly using its available effective dates has the same drawbacks it always had
- The Event Date/Time is usually synonymous with a Load Date/Time. However, I define the Event Date/Time as the closest you can get to the actual change. It is ideally generated by a CDC mechanism, as I feel the moment of ETL execution should be decoupled from the effective dates. This doesn’t matter for the SQL below though, as this date/time is set by the processes loading data into the Historical Staging Area
- The metadata is the same as originally posted for tool-based automation, the only real change is the union in the subquery for the zero records
- This example is geared towards Data Vault 2.0: using hashes instead of sequences. It’s a small but important tweak!
- Expiry date/time and Current Record Indicator are derived – thanks to Kent Graziano for sending the snippet. I did change it to SQL Server style 🙂
I have posted example code below. Let me know what you think!
```sql
SELECT
  <Hash algorithm covering the Business Key> AS <Hub>_SK,
  <Event Date/Time> AS EFFECTIVE_DATETIME,
  COALESCE(
    LEAD(<Event Date/Time>) OVER
      (PARTITION BY <Business Key, including any multi-active attributes if required>
       ORDER BY <Event Date/Time>),
    CAST('9999-12-31' AS DATETIME)
  ) AS EXPIRY_DATETIME,
  CASE
    WHEN (RANK() OVER
      (PARTITION BY <Business Key, including any multi-active attributes if required>
       ORDER BY <Event Date/Time> DESC)) = 1
    THEN 'Y' ELSE 'N'
  END AS CURRENT_RECORD_INDICATOR,
  <Business Key>,
  <Record Source>,
  <Attributes> AS <Satellite name>, -- I use AS statements to map the source naming to the target (Satellite) naming
  CAST(ROW_NUMBER() OVER
    (PARTITION BY <Business Key, including any multi-active attributes if required>
     ORDER BY <Business Key, including any multi-active attributes if required>,
              <Event Date/Time>) AS INT) AS ROW_NUMBER, -- Not necessarily required, I use it to instantly see the number of changes per business key in time
  <Hash algorithm covering the context attributes> -- Added to be future proof in case we need to 'physicalise', we might need row comparisons later
FROM
(
  SELECT DISTINCT -- No changes here from normal selection, you always need a DISTINCT subquery
    <Event Date/Time>,
    <Record Source>,
    <Business Key>,
    <Attributes>
  FROM <Historical Staging table>
  UNION
  SELECT DISTINCT -- The zero records!
    '1900-01-01' AS <Event Date/Time>,
    'Data Warehouse' AS <Record Source>,
    <Business Key>,
    NULL AS <Attributes> -- The number of attributes in the UNION must match
  FROM <Historical Staging table>
) sub
```
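To show the template in action, here is a runnable version using SQLite through Python (window functions require SQLite 3.25+). The hash columns are left out because SQLite has no built-in hash function, and the table and column names (`hstg_employee` and so on) are illustrative; everything else follows the pattern above, including the zero-record union.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE hstg_employee (
  event_datetime TEXT, record_source TEXT, employee_id TEXT, name TEXT);
INSERT INTO hstg_employee VALUES
  ('2020-01-10', 'HR', 'E001', 'Alice'),
  ('2020-03-05', 'HR', 'E001', 'Alice B.');

-- The Satellite as a view: zero records unioned in, expiry derived
-- with LEAD, current-record flag derived with RANK.
CREATE VIEW sat_employee AS
SELECT
  event_datetime AS effective_datetime,
  COALESCE(
    LEAD(event_datetime) OVER
      (PARTITION BY employee_id ORDER BY event_datetime),
    '9999-12-31') AS expiry_datetime,
  CASE WHEN RANK() OVER
         (PARTITION BY employee_id ORDER BY event_datetime DESC) = 1
       THEN 'Y' ELSE 'N' END AS current_record_indicator,
  employee_id, record_source, name
FROM (
  SELECT DISTINCT event_datetime, record_source, employee_id, name
    FROM hstg_employee
  UNION
  SELECT DISTINCT '1900-01-01', 'Data Warehouse', employee_id, NULL
    FROM hstg_employee
) sub;
""")

rows = list(cur.execute(
    "SELECT effective_datetime, expiry_datetime, "
    "current_record_indicator, name "
    "FROM sat_employee ORDER BY effective_datetime"))
for row in rows:
    print(row)
# ('1900-01-01', '2020-01-10', 'N', None)
# ('2020-01-10', '2020-03-05', 'N', 'Alice')
# ('2020-03-05', '9999-12-31', 'Y', 'Alice B.')
```

Note how the timelines chain without gaps: the zero record expires exactly when the first real change becomes effective, and only the latest row per key is flagged as current.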