Driving Keys and relationship history, one or more tables?
Handling Driving Key type mechanisms is one of the more challenging elements of Data Vault modelling. It’s not necessarily difficult to implement this concept, but more how to interpret this and get the right information out of the Data Vault again. In this post I’ll explore two ways of storing relationship history over time:
- Using a separate table to store Driving Key information and a separate table to store normal relationship history (type 2) and,
- Merging both types of history in the same table
In this simplified example we track the history of an employee that works at different companies over time. Periodically a survey is performed to record the mood, or employment satisfaction, of the employee and this information is ultimately recorded in the Data Warehouse.
For data modelling purposes the Employee business entity is defined as ‘driving’ which means the ETL logic enforces that an employee cannot be associated with multiple companies at the same point in time. When an employee moves to another company the ETL process will end-date the relationship to the (at that point) previous company. In other words: Employee is the driving entity (’Driving Key’).
First let’s pursue the first approach see how the solution pans out when modelling out two separate tables to capture relationship history.
On the initial load, on 2012-02-02, the first record arrives in the Data Warehouse. Because no previous records exist this will lead to the creation of a new Employee and a new Company Hub record. The ETL will also create the relationship between the two Hubs in the Link table as a new record. From a relationship history perspective we record that ‘Ben’ is ‘happy’ at company ‘SmallCorp’, and from a driving key perspective exactly the same records are stored; meaning ‘Ben’ works at ‘SmallCorp’. Of course this can be easily derived from the relationship history table as well at this stage. Note that by design a zero record has been added to initiate Data Warehouse history timelines – stating that we acknowledge the state of a business entity in time but know nothing about it from the start of the timeline to the first available record.
The second record arrives on 2014-03-04. Here Ben is still working at SmallCorp, but he is now sad. This doesn’t change anything in the Hub or Link tables but does indicate a change in the relationship history. Ben used to be happy, but is now sad working at that particular company. Because Ben still works at the same company no change occurs in the Driving Key based Employment LSAT table.
In the third iteration, on 2014-06-05, Ben decides to move to a new company BigCorp and is very excited about the move. This means that a new Hub record is created to add ‘BigCorp’ to the Company Hub, and also that a new relationship is established between Ben and BigCorp in the Link table. In the relationship history this is logged as a new set of records describing the new relationship using a new zero record and the new entry for 2014-06-05 with mood ‘excited’. Effectively this states that previously nothing was known about how Ben thought about BigCorp, but we (the Data Warehouse) start knowing things for this particular relationship from 2014-06-05 onwards.
Consequently the most recent state of things about Ben’s relationship to his previous company SmallCorp is still that he is ‘sad’. We don’t know anything else from this perspective (e.g. the specific relationship from Ben to SmallCorp’) so the most recent information we have is still there and active.
In the Driving Key based employment table however a change is triggered because if we look from the ‘employment’ perspective the relationship between Ben and Smallcorp is now closed (end-dated) because the relationship between Ben and BigCorp has become active. This is all relatively easy to trigger in ETL as posted previously here.
After this, on 2014-07-07, Ben realises he made a mistake and moves back to SmallCorp. Nothing changes in the Hubs and Link because all information already was available in the Data Warehouse from the previous records. This does means we can add new information about the relationship between Ben and SmallCorp: he is now ‘glad’.
From the employment point of view we will now effectively re-open the relationship and track this in time. Because Ben now works at SmallCorp again the relationship with BigCorp is closed. Still, the most recent information we have on Ben’s relationship to BigCorp is that he is excited – it’s the latest information the Data Warehouse has received.
As a last entry in the Data Warehouse on 2014-08-08 we learn that Ben regrets his move, which triggers a new record describing his relationship to SmallCorp but otherwise doesn’t change things.
This example gives an idea on how to track two different perspectives over time; the history of the individual relationships and the history from a Driving Key perspective. It makes it easy to query these perspectives: if you want to know when someone was employed where, or what the most recent state is you can query the Driving Key table. However if you’re interested in the most recent information (or point in time of course) of what an employee thinks of a specific company the relationship history can be queried directly.
However, as a second option it is also possible to merge these tables into a single Link Satellite. If we play out the same scenario the result would look as follows:
This shows the timeline for both perspectives; combining the change in relationship using the Driving Key and history as captured for the individual relationships. The changes in the data that flow from this design are highlighted in the diagram, and the following becomes clear:
- The change over time for the relationship is still the same (2014-06-05 and 2014-07-07)
- However this causes the relationship between Ben and SmallCorp (Link SK 10) to be end-dated differently. In the previous example the end date would have been 2014-07-07 as this is the date/time some new information about this particular relationship emerges
Because of this the way to query information changes a bit. For instance, to see the latest state of the ‘mood’ for a given employee and company you can’t use the high end date or current row indicator anymore. Instead you need to query ‘give me the state of the information for the highest available date/time’, which in this case would return ‘Excited’ for Link SK 11 as this is the most recent information we know about Ben and BigCorp.
Similarly there is now a gap if you specifically are looking for relationship history for Link SK 10 (e.g. the mood an employee was in for a specific company) there is no value between 2014-06-05 and 2014-07-07 . This means a point-in-time query would return NULL if the query happened to be for a point in time between those dates
This behaviour is nothing that can’t be managed, and the table structure can be split and merged without losing data at any point in time. However, as you have seen the information may be structured differently to get the easiest answer to your query focus. For me I tend to use the split tables as I find it easier to directly query what I am looking for with the same patterns and can add the required date math into the Information Marts anyway if I need to.
This does have a processing and storage penalty, but in most cases that is less of a concern and from an architecture perspective this is sound as the option to remodel (and generate ETL) without losing data in the future remains open.