Virtualising your Data Vault – regular and driving key Link Satellites
Virtualising the EDW core integration layer by applying Data Vault concepts turned out to be a very useful and achievable exercise. So achievable even, that it only requires three posts to present an idea on how this all works. The Hubs and Links are already covered in the first post, and the Satellites in the second. It’s now time for the remaining primary entities: the Link Satellites.
What’s the driving key?
As explained in this post I make the distinction between tracking ‘regular’ history or ‘driving key’ history in dedicated Link Satellites. The regular history Link Satellite tracks the history for each individual relationship – as present in the Link. In the meantime, the driving key Link Satellite tracks the history from the perspective of the driving key: the ‘one side’ of a relationship. From a technical perspective the difference is in how the zero records and end dating are handled. Before I go in to the details of how to query these kinds of concepts I have prepared a quick example below which I hope clarifies these differences between the Link Satellite types.
In this example you track the (history of the) relationship between a Person and an Organisation. Typical context information that is tracked for each relationship could for instance be the contract details. An example of this would be when your terms of employment change: this type of information is tracked over time in a regular / historical Link Satellite.
Another way of tracking information is what happens between relationships. By assigning the Driving Key to Person in your model, as in this example, you essentially want to achieve that the previous relationship between a Person and a specific Organisation is closed when the Person moved to a new Organisation. This is illustrated as follows:
As is visible in the diagram above you can see that relationship ‘1’ is closed when relationship ‘3’ comes into existence on 2013-01-02. If, how and/or when this type of history is applied is a modeling decision and as this is an implementation post we can move on from here.
The key is to understand that the dynamics are different, and therefore the implementation (handling) in ETL. Having said that, a virtualised approach makes this concept a lot easier to explain and test because it almost automatically supports typical issues such as reopening closed relationships (what happens when John goes to work for TopCorp again) and issues related to record condensing in the Staging Area and corresponding timing. Generally speaking, it is a lot easier getting this to work using a virtual approach than it is using ETL software.
Virtualising the regular history Link Satellite
Going back to the overview of setting up your virtualised Data Vault EDW it is sufficient to state that the virtualised Regular / Historical Link Satellite is to a large extent similar to the virtualised Satellite. The main querying differences between Satellites and regular Link-Satellites are:
- Instead of hashing the Business Key (Hub) in the outer query, you now hash the combination of the Business Keys. This is your Link SK / Link DWH Key
- Any row numbering and/or partition by clauses now incorporate all Business Keys
It’s as simple as that, which makes sense because the overall approaches are the same (including how zero records and expiry dates are handled).
Virtualising the Driving Key Link Satellite
In the example code as posted below the ‘<Business Key A>’ attribute is designated as the Driving Key. Again the disclaimers:
- You need a source that has full history; a Historical Staging Area. Only this, combined with proper interfacing guarantees the correct timelines. Querying a source system directly using available (source) effective dates has the same drawbacks as it always had. Generally speaking you want to be deterministic: you should be able to re-generate your Data/Information Marts with exactly the same results regardless when you run everything, and the same should apply for your Data Vault
- The Event Date/Time is synonymous to a Load Date/Time in this example. I define the Event Date/Time as the closest you can get to the actual change. It is ideally generated by CDC mechanisms as I feel the moment of ETL execution should be decoupled from the effective dates. Doesn’t matter for the SQL below though, as this date/time is set by the processes loading data into the Historical Staging Area so it can also be the newly styled DV2.0 LDTS (timestamp of inserting in the database – not the ETL processing date/time)
- This example is geared towards Data Vault 2.0: using hashes instead of sequences. It’s a small but important tweak!
- Expiry date/time and Current Record Indicator are derived – thanks again to Kent Graziano for sending the snippet
The SQL below creates the Driving Key Link Satellite in one go. This is the most challenging aspect of the virtualised Data Vault as both the end dating and zero record handling are different from anything so far. The SQL statement below adds a zero record only for the first time a relationship of which the Driving Key is part of is available.
SELECT <Hash algorithm covering the combined Business Keys> AS <Link>_SK, <Event Date/Time> AS EFFECTIVE_DATETIME, COALESCE ( LEAD ( <Event Date/Time> ) OVER(PARTITION BY <Business Key A> ORDER BY <Event Date/Time>), CAST( '9999-12-31' AS DATETIME) ) AS EXPIRY_DATETIME, CASE WHEN ( RANK() OVER (PARTITION BY <Business Key A> ORDER BY <Event Date/Time> DESC)) = 1 THEN 'Y' ELSE 'N' END AS CURRENT_RECORD_INDICATOR, <Record Source>, CAST( ROW_NUMBER() OVER (PARTITION BY <Business Key A> ORDER BY <Business Key A>, <Event Date/Time>) AS INT) AS ROW_NUMBER FROM ( SELECT <Event Date/Time>, <Record Source>, <Source attribute name for Business Key A> AS <Business Key A>, <Source attribute name for Business Key B> AS <Business Key B>, COALESCE( LAG ( <Source attribute name for Business Key B>)
OVER(PARTITION BY <Source attribute name for Business Key A> ORDER BY <Event Date/Time>),0 ) AS PREVIOUS_FOLLOWER_KEY FROM <Historical Staging table> UNION SELECT <Event Date/Time>, <Record Source>, <Business Key A>, <Business Key B>, 0 AS PREVIOUS_FOLLOWER_KEY FROM ( SELECT '1900-01-01' AS <Event Date/Time>, 'Data Warehouse' AS <Record Source>, <Source attribute name for Business Key A> AS <Business Key A>, <Source attribute for Business Key B> as <Business Key B>, DENSE_RANK() OVER (PARTITION BY <Source attribute name for Business Key A> ORDER BY <Event Date/Time>, <Source attribute name for Business Key A> ASC) ROWVERSION FROM <Historical Staging table> ) dummysub WHERE ROWVERSION=1 ) sub WHERE <Business Key B> != PREVIOUS_FOLLOWER_KEY
It is important to point out that the final WHERE clause includes the ‘following key’, not the Driving Key! The end result looks like the screenshot below. I highlighted the timeline for a single Driving Key across relationships.
The individual Business Keys are not visible here, but easily added to the query if you prefer.
This concludes the experiments to completely virtualise the Data Vault!