A brief history of time in Data Vault
To quote Ronald Damhof in yesterday’s twitter conversation: ‘There are no best practices. Just a lot of good practices and even more bad practices’.
Sometimes I feel Data Vault lacks a centrally managed, visible, open forum to monitor standards. And, more importantly, the evolution of these standards over time. And, even more importantly, why these standards change over time.
It varies (in space and time) where sensible discussions regarding these standards take place, but lately useful content can be found on Twitter (#datavault) and on LinkedIn (‘Data Vault Discussions’ group). This post aims to capture one of these discussions (time variance) in relationship to the broader lines of thought in this weblog. Doing this, I try to put some perspective in the evolution of handling time in Data Vault in it simplest form of using the LOAD_DTS.
The LOAD_DTS (load date/time stamp) concept was introduced in the original publication of Dan Linstedt’s book ‘Supercharge your Data Warehouse’ and appears in the various downloadable demos / code (Powercenter) from LearnDataVault as well as in other places. This attribute is part of every Data Vault entity, and is the date/time that drives the effective dating in the Satellite/Link Satellite entities. With this I mean to say with that it is ‘part of the key’.
Regardless of which date/time attribute is part of the key, it is always true that you want (need) to store various different perspectives (date/times) as context. This is including, but not limited to:
- The date/time the ETL batch / workflow has started (across all ETL processes)
The date/time the ETL was processed (ETL process date / Job Run Date/time) – at what time was the ETL executed that did something with the data
- The moment of change in the source, as accurately as possible – closest to the real change made by a user (the Event Date/Time concept) – always computed and non-tamperable / derived for instance by database log transactions
- All kinds of date/time stamps as appear in, or are tracked by, the operational system. These are generally modeled as regular context attributes (Satellite or Link-Satellite attributes)
- The moment of insertion in the database – there might be delays in ETL execution and actual inserts. An auditor might want to know when the data was loaded (and changed)
Information can be presented / delivered following any of the available timelines in the data using Data/Information Marts (front-room / Presentation Layer) regardless how it is stored in Data Vault (back room / Integration Layer).
Generally speaking the issues related to time-variance center around which attribute is to ‘act’ as effective date, because there are so many perspectives to cater for. The reality is that on a practical level you typically only assign one of these fields to be part of the key. Similarly, you aim to work towards a robust back room, so you are more flexible in delivering potentially different ‘front room’ views on the information. A more robust back room tends to rely on system-driven effective dates (as opposed to dates that are delivered by source systems, and hence are not under the control of the DWH/BI team).
Out of the palette of options, which one should it be? Or to put this differently, against which record should I compare my new record?
This is the source of remarks about the ‘shoehorning’ of date/time into a database, and the comments that databases in general are inadequate at handling temporal information. Only one selected timeline is answered by the database.
For more in-depth conceptual musings I refer to Martijn Evers’ blog, which is ‘light years’ ahead (to stay true to the theme) on n-temporal designs.
In the beginning…
The LOAD_DTS in its original definition (Supercharge your Data Warehouse) is the execution date/time for the full workflow/batch to load the (entire) Data Vault. The original designs here use a CONTROL_DATE mechanism (table) that creates a single execution moment which is picked up by every ETL process that is part of the batch. This design is typically geared towards a daily run, or at best at micro-batch level depending on the total runtime of the batch.
The underlying idea is that a LOAD_DTS that is created as batch date/time stamp (in the control table) is one of the few fields that are completely under control of the Data Warehouse. So, from the perspective of the Data Warehouse you reduce risk/increase robustness by decoupling dependencies from data that is outside your (direct) control. Examples of data elements that are generally outside of direct control (and therefore risky) are source dates, and to an extent even system generated date/time stamps created by the operational systems.
Another reason for this approach is the support for integration of data from different time zones. The LOAD_DTS concept as defined above is decoupled from the time zone, as it issues a central date/time from the perspective of the Data Warehouse.
Having said that, and for this purpose, the definition of the LOAD_DTS as above works fine and saves you from many issues later on. However, the drawbacks of this approach are:
- You are limited in your batch windows, and can’t really run parts of the solution independently
- You can’t load multiple changes for the same natural key in one pass
Specifically the second drawback is a big one. An ETL process should really be able to load multiple changes (for the same source key) in one go. This is sensible for many scenarios ranging from reloading history from an archive to picking up near real-time changes. To be brief: ETL should load the available data regardless of when the process is executed, and it should (re)process the information in the same state (although the insert-into-database date/time stamp metadata is obviously different).
I never forget one of the first assignments I was tasked with at the start of my career: write a looping process that simulates the (re)run of an ETL process to rebuild/simulate a daily run that rebuilds the original state. If you want to catch up on daily runs; you really want to run the ETL once instead of x number of times until you’re up to date. Image what that means for transaction based transaction log based Change Data Capture (CDC) records! This has led to the instruction of the Event Date/Time concept to drive the effective dates (‘act’ as LOAD_DTS).
Intermediate – event date/times as ‘real’ date/time stamps
At some point, because of the limitations of the original LOAD_DTS as discussed in the previous section, some of us (including myself) started to define the Event Date/Time as the new LOAD_DTS and therefore the effective date.
The Event Date/Time concept is defined as the immutable (‘non-tamperable’) moment the change is recorded; as close as you can get to the ‘real’ event (a record changing in the source). The exact Event Date/Time value depends on the interface type; the best option for this needs to be selected depending on the available interfacing options.
For instance, a Full Outer Join comparison to detect a data delta is as accurate (close to the real event) as the moment you run the comparison. The Event Date/Time here is the moment of comparing the two sets (and only produces a single change per key). The Event Date/Time you receive from a Full Outer Join comparison is relatively far away from the ‘real’ change. Similarly a transaction-log based CDC interface provide all changes at almost the exact same moment the ‘real’ change occurs – and is still system generated.
This is also the approach of letting the ETL framework/process controls handle concepts such as loading windows, delta selection and generally start/end times for ETL processes. The underlying principle is to not let ETL execution date/times to drive effective dates, so it’s different from the original LOAD_DTS explanation.
The LOAD_DTS in its original meaning as insert date/time can still be incorporated, but needs to be reconstructed using the available ETL process metadata. You can derive this, as it’s a point-in-time view on the available information from the perspective of the ETL Batch execution. Another fundamental concept behind this approach is to decouple the ability to loading the ETL with the timelines in the data.
The great aspect of this approach is that you get a true time slice of the data, and one which is (still) not influenced by users because the Event Date/Time is not a source attribute! This approach requires proper focus on correct interfacing design, but levels the playing field for all upstream ETL logic. It is reliable and predictable: it doesn’t matter when you run your ETL. Regardless whether you run your processes in the morning or afternoon, the results will always be the same.
It is clear that defining the Event Date/Time as LOAD_DTS achieves this, and that you still need to relate the start/end information of your ETL process separately as part of additional ETL framework/process metadata.
There are also downsides to this:
- Depending on your technical architecture and (types of) interfaces you may run into time zone issues. If you use a replicated database (with CDC enabled) or similar in your own environment it’s not an issue. But if your CDC database is in another time zone, or the server changes time zones, you may get key violation issues and generally experience incorrect order of recorded time. This may even happen in daylight saving time scenarios if you’re particularly unlucky
- Due to this, the above approach is usually not sustainable in a decentral and global Data Vault solution. To an extent, this can be handled using concepts such as applying UTC (Coordinated Central Time). But, this is only achievable if you can sufficiently control the interfacing – which is not always the case
In general applying time zone business logic into your initial Data Vault interfacing ETL is a very risky approach, and is recommended to be avoided (this is generally agreed upon). This is the reason why the LOAD_DTS was defined in its original form in the first place.
LOAD_DTS restyled in DV2.0
As part of updates and improvements in the Data Vault concepts as part of ‘Data Vault 2.0’ Dan Linstedt has suggested a revised definition of LOAD_DTS: the date/time the record was received (inserted by) the database.
In terms of solution design, this can mean the date/time the record was received by the Staging Area or the Data Vault (or both). In my designs I tend to favour ‘locking in’ this date/time in the Staging Area so I can propagate these values from there. The reason is that this allows me to truncate and rebuild the entire Data Vault from a Historical Staging Area in exactly the same way, and still satisfies the auditing requirement to show when data was received by the solution. Various other posts in this blog refer to the reasoning to do this kind of truncate & reload. It also makes it a bit easier to prevent records to be accidentally loaded more than once, as we shall see in the next section. For the record, Dan defines this LOAD_DTS as being set into the Data Vault so that is the standard to relate things against. Similarly, a persistent Staging Area of any form is currently not part of the Data Vault definitions as published by Dan.
In any case, this means the LOAD_DTS is not an ETL process date/time (or control date for that matter). And as an ‘automatic/triggered’ attribute this would support the processing of multiple changes for the same natural key (in a single dataset). Also, the LOAD_DTS is not part of the change detection mechanisms.
Using this definition of the LOAD_DTS will always make the record unique, which (I feel) requires you to define additional mechanisms such as checksums to prevent from reloading the same context data more than once. Without these checks in place even accidentally rerunning the same ETL twice might result in duplicate data. To further complicate things: how this check is defined in itself depends on at which point the LOAD_DTS is locked in: in the Staging Area or the Data Vault!
If the LOAD_DTS is set into the Staging Area (before loading this into the Data Vault), the following applies:
You can truncate and rebuild your entire Data Vault from an archive (Historical Staging Area, Hadoop/HDFS file delta), and are able to reproduce this in the exact same way (e.g. representing the same LOAD_DTS)
- You can prevent loading multiple loads (duplicates) as outlined above by comparing on the LOAD_DTS you set in the Staging Area. If you already lock in the date/time in the Staging Area, you can just check on this value (and the natural key) to prevent reloads
- You also may want to log the moment the record is inserted into the Data Vault
- In full near Real Time interfaces the Staging Area may be skipped, which obviously means the LOAD_DTS cannot be set on the way in to Staging. There are various ways to work around this if required, depending on the architecture (virtualisation being one of them) but it is a consideration that needs to be noted
- The same LOAD_DTS (for the Staging Area table) is can be consistently applied in all upstream tables that are sourced from that Staging Area table
If the LOAD_DTS is set into the Data Vault (Integration Layer), the following applies:
- If you truncate and rebuild your Data Vault from an archive, your effective dates will be different each time you do so
- Preventing loading duplicates as outlined above requires additional smarts. You cannot depend on the LOAD_DTS to prevent reruns, so you need additional information to prevent records from being reloaded. In normal situations checksum mechanisms can be applied to compare information between new (Staging) and existing/current (Data Vault) information. This is typically executed against the most recent (no pun intended) record in the Satellite
- However, the above only works when a single key delta is processed. When (re)loading multiple changes per natural key in a single pass, every record also needs to be compared against… something. This may be the Event Date/Time, or a (system) ROW ID, or both
- Similar to the previous bullet, if you reload historical data against an existing historical data (e.g. Satellite) set you need to be able to do a point in time comparison, for which you need another date/time than the LOAD_DTS (I use the Event Date/Time for this)
- Using the pattern this pattern generally allows you to omit Staging (and Historical Staging) areas in your architecture
These are themselves not new concepts; but apparently if you need to make a comparison for late arriving data or to throw a historical data set against an existing historical data set it is an interesting thought that you still need the Event Date/Time in the Satellite/checksum to make the correct comparison.
In practical terms, in addition to catering for the various date/time stamps outlined in the introduction, this means you need to consciously add the Event Date/Time as it is not a ‘source attribute’ – it does not appear in the source table but needs to be provided by the interface. An example is the database transaction log timestamp as mentioned in the previous section.
Without this information, there is no way of recognizing if information has truly changed. A simple example is that you move back to your original address; the record in itself is completely identical to a record that was processed earlier – except for the Event Date/Time. If you omit this in your checksum or comparison, the record would be discarded and the change would be missed. I hope it is clear that this is not the case when you compare a single record against the most recent record, but in the event of processing multiple changes and/or reloads/merges of historical data in a single pass.
But we’re still not there yet. By applying this solution to the new LOAD_DTS concept the requirement to load multiple changes per key, as well as the requirement to reload your Data Vault from an archive in exactly the same way are almost completely satisfied. The risk is: you are still exposed to the risk of impact by time zones as in the previous section. Similar to the Event Date/Time approach, the DV2.0 LOAD_DTS approach requires additional metadata to truly make the record unique. Similar to the definition of each Event Date/Time, the ‘additional thing’ that makes the record unique needs to be defined on a per-case (interface) basis. Options that come to mind are the delivery date on the file system in combination with the Event Date/Time for flat files. But perhaps a source ROW ID, or a database log sequence number can be added – or a combination of both.
Anything that makes the record unique, and what is available (and ideally immutable) again depends on the interface options. There does not seem to be a one-size-fits-all solution, although in my personal opinion incorporating the LOAD_DTS to be set in the Staging Area is the best solution.
An additional issue that might occur is that, if you use a trigger based mechanism to create this type of LOAD_DTS on database level, the system may not be able to cope well with handling large volumes of data in one go. For instance for an initial load. This is usually OK, but performance may vary depending on the technical architecture.
Conclusion and temporal design/modeling
As is clear from the outline so far there are quite a lot of considerations to take into account; which all form part of the discussion about ‘temporal design’. It has always been one of the options to model time differently for each Satellite, even outside of the alternatives explained here. The temporal modeling issue comes down to figuring out exactly what will make a record unique for that particular source/interface combination.
The real underlying discussion is balancing out standards and architectures versus per-interface design decisions. In other words, how many patterns do we want to support? Should we have one pattern for Batch processing and a separate one for Real-Time processing? This is where opinions vary, and I’m of the opinion that it is OK to select different interface patterns to support generic upstream processing. I use the Staging and Historical Staging Area to achieve this, and can support the various options as outlined in this post using this approach.
In the best outcomes, for instance when you have control over the interface as with a replicated/log based CDC solution in the DWH environment, you can be very specific and even rely on the Event Date/Time. But in most other scenarios you need to add discriminating attributes to the key (of the Satellite) as discussed in the previous section – which comes back to information modeling and fairly deep understanding of what each date/time stamp means, and if and how it can be used for this purpose.
I am going to test the new DV2.0 LOAD_DTS approach further, by using a combination of the Event Date/Time and database Log Sequence Number to the unique record and see if this works as I expect it to. I’m aware that in the CDC world the Event Date/Time can be derived from the Log Sequence Number, but want to see it as a date/time stamp anyway.
In terms of solutions, it seems we will end up with a ‘per-interface’ set of options with different levels of risk associated to the interface and corresponding control.