Virtualising your Data Vault – regular and driving key Link Satellites

Virtualising the EDW core integration layer by applying Data Vault concepts turned out to be a very useful and achievable exercise. So achievable even, that it only requires three posts to present an idea on how this all works. The Hubs and Links are already covered in the first post, and the Satellites in the second. It’s now time for the remaining primary entities: the Link Satellites.

What’s the driving key?

As explained in this post I make the distinction between tracking ‘regular’ history or ‘driving key’ history in dedicated Link Satellites. The regular history Link Satellite tracks the history for each individual relationship – as present in the Link. In the meantime, the driving key Link Satellite tracks the history from the perspective of the driving key: the ‘one side’ of a relationship. From a technical perspective the difference is in how the zero records and end dating are handled. Before I go in to the details of how to query these kinds of concepts I have prepared a quick example below which I hope clarifies these differences between the Link Satellite types.

In this example you track the (history of the) relationship between a Person and an Organisation. Typical context information that is tracked for each relationship could for instance be the contract details. An example of this would be when your terms of employment change: this type of information is tracked over time in a regular / historical Link Satellite.

Another way of tracking information is what happens between relationships. By assigning the Driving Key to Person in your model, as in this example, you essentially want to achieve that the previous relationship between a Person and a specific Organisation is closed when the Person moved to a new Organisation. This is illustrated as follows:

Driving Key in Data Vault

As is visible in the diagram above you can see that relationship ‘1’ is closed when relationship ‘3’ comes into existence on 2013-01-02. If, how and/or when this type of history is applied is a modeling decision and as this is an implementation post we can move on from here.

The key is to understand that the dynamics are different, and therefore the implementation (handling) in ETL. Having said that, a virtualised approach makes this concept a lot easier to explain and test because it almost automatically supports typical issues such as reopening closed relationships (what happens when John goes to work for TopCorp again) and issues related to record condensing in the Staging Area and corresponding timing. Generally speaking, it is a lot easier getting this to work using a virtual approach than it is using ETL software.

Virtualising the regular history Link Satellite

Going back to the overview of setting up your virtualised Data Vault EDW it is sufficient to state that the virtualised Regular / Historical Link Satellite is to a large extent similar to the virtualised Satellite. The main querying differences between Satellites and regular Link-Satellites are:

  • Instead of hashing the Business Key (Hub) in the outer query, you now hash the combination of the Business Keys. This is your Link SK / Link DWH Key
  • Any row numbering and/or partition by clauses now incorporate all Business Keys

It’s as simple as that, which makes sense because the overall approaches are the same (including how zero records and expiry dates are handled).

Virtualising the Driving Key Link Satellite

In the example code as posted below the ‘<Business Key A>’ attribute is designated as the Driving Key. Again the disclaimers:

  • You need a source that has full history; a Historical Staging Area. Only this, combined with proper interfacing guarantees the correct timelines. Querying a source system directly using available (source) effective dates has the same drawbacks as it always had. Generally speaking you want to be deterministic: you should be able to re-generate your Data/Information Marts with exactly the same results regardless when you run everything, and the same should apply for your Data Vault
  • The Event Date/Time is synonymous to a Load Date/Time in this example. I define the Event Date/Time as the closest you can get to the actual change. It is ideally generated by CDC mechanisms as I feel the moment of ETL execution should be decoupled from the effective dates. Doesn’t matter for the SQL below though, as this date/time is set by the processes loading data into the Historical Staging Area so it can also be the newly styled DV2.0 LDTS (timestamp of inserting in the database – not the ETL processing date/time)
  • This example is geared towards Data Vault 2.0: using hashes instead of sequences. It’s a small but important tweak!
  • Expiry date/time and Current Record Indicator are derived – thanks again to Kent Graziano for sending the snippet

The SQL below creates the Driving Key Link Satellite in one go. This is the most challenging aspect of the virtualised Data Vault as both the end dating and zero record handling are different from anything so far. The SQL statement below adds a zero record only for the first time a relationship of which the Driving Key is part of is available.

SELECT
   <Hash algorithm covering the combined Business Keys> AS <Link>_SK,
   <Event Date/Time> AS EFFECTIVE_DATETIME,
   COALESCE (
      LEAD ( <Event Date/Time> ) OVER(PARTITION BY <Business Key A> ORDER BY <Event Date/Time>),
      CAST( '9999-12-31' AS DATETIME)
   ) AS EXPIRY_DATETIME,
   CASE
      WHEN ( RANK() OVER
      (PARTITION BY <Business Key A> ORDER BY <Event Date/Time> DESC)) = 1
      THEN 'Y'
      ELSE 'N'
   END AS CURRENT_RECORD_INDICATOR,
   <Record Source>,
   CAST(
      ROW_NUMBER() OVER (PARTITION  BY
         <Business Key A>
      ORDER BY
         <Business Key A>,
         <Event Date/Time>) AS INT)
   AS ROW_NUMBER
FROM
(
   SELECT
      <Event Date/Time>, 
      <Record Source>,
      <Source attribute name for Business Key A> AS <Business Key A>,
      <Source attribute name for Business Key B> AS <Business Key B>,
      COALESCE(
      LAG ( <Source attribute name for Business Key B>)
      OVER(PARTITION BY <Source attribute name for Business Key A>
      ORDER BY <Event Date/Time>),0
      ) AS PREVIOUS_FOLLOWER_KEY
   FROM <Historical Staging table>
   UNION
   SELECT
      <Event Date/Time>, 
      <Record Source>,
      <Business Key A>,
      <Business Key B>,
      0 AS PREVIOUS_FOLLOWER_KEY
   FROM 
   (
      SELECT
         '1900-01-01' AS <Event Date/Time>,
         'Data Warehouse' AS <Record Source>,
         <Source attribute name for Business Key A> AS <Business Key A>,
         <Source attribute for Business Key B> as <Business Key B>,
         DENSE_RANK() OVER (PARTITION BY
            <Source attribute name for Business Key A>
            ORDER BY <Event Date/Time>, <Source attribute name for Business Key A> ASC) ROWVERSION
      FROM <Historical Staging table>
   ) dummysub
   WHERE ROWVERSION=1
) sub
WHERE
<Business Key B> != PREVIOUS_FOLLOWER_KEY 

It is important to point out that the final WHERE clause includes the ‘following key’, not the Driving Key! The end result looks like the screenshot below. I highlighted the timeline for a single Driving Key across relationships.

Driving Key Virtualisation Evidence

The individual Business Keys are not visible here, but easily added to the query if you prefer.

This concludes the experiments to completely virtualise the Data Vault!

 
Roelant Vos

Roelant Vos

You may also like...

6 Responses

  1. Roelant Vos Roelant Vos says:

    Driving Keys are one of the lesser know concepts, let alone understood in the context of technical implementation. If you disagree or have comments, discussions please let me know!

     
  2. says:

    Regarding Driver Keys: n my last project we face the need to understand and to make patterns for driver keys on Link Satellites and Link effectivity tables. One of the interesting things we concluded is that in some cases, the same target entity, requires different driver keys, and it all depends how the data was captured in the source systems. Obviously this also varies depending how integrated your target Data Vault model is and how decoupled from specific source systems is.

    For example a table “Link Employee to Team”, where the primary key of the Link is:

    – Employee ID (surrogate of the Hub Employee) + Team ID (surrogate of the Hub Team). The satellite associated to the Link was created to capture the Effective and valid dates of the associations between Employee and Team.

    So in our business the source data existed in 2 different tables (from 2 different systems):

    – In some cases, the source system and business rule allowed for an Employee of certain area to belong to multiple teams. The driver key in this case matched the entire full key of the target Link table. This happened mostly with the Customer Care people, where teams where small and dynamic and people were supporting multi-function operations.

    – In some other areas, the data came from an Employee table where the assigned Team could only be one, i.e. the driver key was Employee Id.

    – I could assume that there could even be the case of a source system capturing the reverse relationship, i.e. for a Team one employee (the manager), and if this corresponds to the same concept we need in our target, then the driver key will be reversed (i.e. Team Id).

    What do you think of this?

     
    • Roelant Vos Roelant Vos says:

      Driving keys essentially force a one-to-many relationship in an otherwise many-to-many entity. In my designs it requires a dedicated LSAT table to properly capture the timelines.

      True many-to-many relationships are notoriously annoying to build; there’s no set rule; you have to find a custom way to properly handle the ending of relationships in the LSAT. This almost always involves some source attribute to play a role in pointing which record in the many-to-many set needs to be expired. Another alternative to end date relationships in a many-to-many table is doing full compares between the relationships. It’s kind of brute force, but it does work.

      You could always opt to create separate LSATs for scenarios (filtering on sets); not always an option to do so – but it is an option.

       
  1. July 24, 2018

    […] Vos , R. (2014, June 10). Virtualising your Data Vault – regular and driving key Link Satellites. Retrieved from An expert view on agile Data Warehousing: http://roelantvos.com/blog/?p=1162 […]

     

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.