When a full history of changes is too much: implementing abstraction for Point-In-Time (PIT) and Dimension tables
When changes are just too many
When you construct a Point-In-Time (PIT) table or Dimension from your Data Vault model, do you sometimes find yourself in the situation where there are too many change records present? This is because, in the standard Data Vault design, tiny variations when loading data may result in the creation of very small time slices when the various historised data sets (e.g. Satellites) are combined.
There is such a thing as too much information, and this post explains ways to remediate this by applying various ‘time condensing’ mechanisms. Applying these techniques can have a significant impact on the performance and ease of maintenance of PIT and Dimension tables, and is worth looking into.
This post builds on some of the concepts that were covered in earlier posts. Please have a look at the following fundamental concepts first, or alternatively meet me at the Data Vault Implementation training at http://www.datavault.de to discuss in person :-). The main relevant concepts are:
- When is a change a change? This post outlines the concepts and effects of row condensing, and why these types of additional checks are needed to avoid record sets with (potentially) many redundant rows.
- Row condensing for Satellites. This post explains how row condensing works in its most elaborate form.
- Creation of Point-In-Time (PIT) tables. This post covers the merging of multiple time-variant (historised) data sets to create a full view of history.
When new data is loaded as part of a Data Vault architecture, every record is time-stamped upon entry with the ‘Load Date/Time Stamp‘ (LDTS). This approach ensures a consistent way of ‘effective dating’ data that is fully within the control of the Data Warehouse environment (e.g. it cannot be impacted by how OLTP systems behave).
By using this Load Date/Time Stamp as effective date in the Data Vault model, you contribute significantly to ensuring the ‘back-room’ of the Data Warehouse / ETL ‘just works’ so you can concentrate on the ‘front-room’ where most of the business logic is applied.
Because the Load Date/Time Stamp is used as the effective date in Satellites, the time at which an interface picks up new changes from the source systems becomes relevant; in other words, the order in which interfaces are received. This should not be confused with the implementation of parallelism and independent loading, as it is still perfectly possible to absorb data in any order of loading. The point I want to make is that even though data can be received and processed in a different order (e.g. independently), there are still effects to be aware of in terms of ETL behaviour.
Side effects of independent loading patterns
For instance, if you receive a transaction for a purchase (unoriginally called ‘Purchase Transaction’ here) that includes a customer identifier (Customer ID) and some transaction information, you will be able to fully load this information, even the relationship to the Customer. This is because the Customer ID will generate a new Hub record (assuming this is the Business Key and the specific Customer ID didn’t yet exist) even though nothing else is known about the customer. If, a bit later, the Customer Details become available, the same Customer ID will be presented to the Hub. In this case no new key is created, since the key is already present; it was delivered by the Purchase Transaction data. The additional context data, such as Name, Date of Birth etc., will be loaded to a Satellite as normal. The end result is that a transaction is created and linked to a customer (key) which has certain context attributes.
When the files are received the other way around the end result in terms of available data is the same. That is, the Customer key is created, the context attributes are available for that key and relationships to transactions would have been created. However, in this scenario the Customer Details will be the first to present the Customer ID and will trigger the creation of the Hub record (and the Satellite will be loaded as per usual again). When the Purchase Transaction is presented to the Data Warehouse, it will detect that the Customer ID is already present and will not create a new Hub record. This is BAU for the Data Warehouse, but the point I’m trying to make is that the Load Date/Time Stamps will be different. And this has an impact when you combine the tables into a single time-variant data set as is the case with a PIT or Dimension.
Consider the following example:
In this small example you can already see that combining information into a single (time-variant) set will produce different results because of the tiny difference in receiving the data (look at the Load Date Time Stamp). From a ‘time’ perspective (for the temporal purists: along the perspective of the LDTS) the finished ETL will produce the following results:
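The effect of arrival order on the number of time slices can be sketched in a few lines of Python. This is a simplified illustration with made-up timestamps, not the post’s actual example data:

```python
from datetime import datetime

def combined_slices(hub_ldts, sat_ldts):
    """One time slice starts at every distinct Load Date/Time Stamp
    when the Hub and Satellite are merged along the LDTS timeline."""
    return sorted({hub_ldts, sat_ldts})

# Scenario 1: the Purchase Transaction arrives first and creates the Hub
# record; the Customer Details Satellite row follows 15 minutes later.
scenario_1 = combined_slices(datetime(2017, 6, 5, 9, 10),
                             datetime(2017, 6, 5, 9, 25))

# Scenario 2: Customer Details arrive first, so the Hub and the Satellite
# share the same LDTS and the 'key only' slice never exists.
scenario_2 = combined_slices(datetime(2017, 6, 5, 9, 25),
                             datetime(2017, 6, 5, 9, 25))

print(len(scenario_1), len(scenario_2))  # 2 slices versus 1 slice
```

The two scenarios end up with the same data, but a different number of time slices, purely because of the order of arrival.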
Why wouldn’t you just leave the Hub date/time out of the selection, you may think? Please consider this is a simple example to show the behaviour of the ETL in combination with the Load Date/Time Stamp approach. We’re still investigating the pattern.
This effect is multiplied when you consider multiple sources of information being combined around the ‘Customer’ business concept as in the following example.
Because it is very likely that the above three source tables are presented to the Data Warehouse at slightly different intervals, there will be minute (tiny) differences which, while not visible in the individual Satellites, will create NULL values in the combined time-variant result. Again, this should be considered BAU for the Data Warehouse; it works as intended. Have a look at the following example output, considering that, even though these workflows run independently, the data from the source tables arrives in the following order:
- Customer Email (at 2017-06-05 09:10)
- Customer Address (at 2017-06-05 09:25)
- Customer Details (at 2017-06-05 09:26)
This translates into the following behaviour and results when combining the three time-variant / historised sets (Satellites).
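The behaviour can be mimicked with a small Python sketch; the Satellite names, attributes and timestamps below are illustrative stand-ins for the example above:

```python
from datetime import datetime

# One row per Satellite, loaded in the order listed above.
sats = {
    "email":   [(datetime(2017, 6, 5, 9, 10), {"email": "jane@example.com"})],
    "address": [(datetime(2017, 6, 5, 9, 25), {"city": "Berlin"})],
    "details": [(datetime(2017, 6, 5, 9, 26), {"name": "Jane"})],
}

# The combined timeline gets one change point per distinct LDTS.
points = sorted({ts for rows in sats.values() for ts, _ in rows})

def as_at(ts):
    """Combined row effective at 'ts': the most recent attribute values per
    Satellite, or NULL (None) for Satellites that have no row yet."""
    row = {}
    for rows in sats.values():
        loaded = [attrs for t, attrs in rows if t <= ts]
        row.update(loaded[-1] if loaded else dict.fromkeys(rows[0][1], None))
    return row

timeline = [(ts, as_at(ts)) for ts in points]
# Three slices: at 09:10 only the email is known (city and name are NULL),
# at 09:25 the name is still NULL, and only the 09:26 slice is complete.
```

Three Satellite rows that each contain a single change already produce three time slices, two of which carry NULLs purely because of arrival order.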
At this stage I need to add a disclaimer for some of the temporal purists around: this is the effect that happens when you choose to carry the Load Date / Time Stamp forward as (Satellite) Effective Date from your Data Vault model into the next layer. Of course, you have the option to choose any other timeline (= available attribute) to represent data against. But for all intents and purposes the standard Load Date / Time Stamp is considered ‘cookie cutter’ for this article.
Condensing changes, over time
As the above example highlights, in general the more time-variant data sets you combine, the smaller the time slices become and, consequently, the more records you will have in your result.
The approaches covering parallelism and independent loading briefly outlined here support a wide range of options to manage the Data Warehouse and ETL environment in a flexible and robust way. Data Vault intends to support ‘100% of the data 100% of the time’, but as a side effect there may sometimes be too many changes available when the complete view is presented back to the user, especially once you add the effect these minute timing differences have on the size of the record set.
Is the user really interested in seeing that it took a millisecond for address data and an email address to be added, and that there was a tiny moment when the email wasn’t available yet? For longer periods maybe, but these changes are driven by technical behaviour. Another way of thinking about this is that while there may definitely be functional requirements to show fewer rows, there is definitely a technical consideration as well.
So, minute timing differences will create new rows. This is expensive, and may not be required. To counter this, additional layers of ‘time condensing’ can be applied in conjunction with the already required record condensing mechanism – something that every time-variant merge needs.
Time condensing is different from record condensing. Record condensing focuses on removing redundant rows from the record set (e.g. a different time, but no change in values), while time condensing is about removing rows (keeping one) within a defined period of time even if they are genuine changes (the values differ across rows). I have categorised the types of time condensing as follows:
- Frequency condensing. This is an approach to ‘group’ changes by set time period (minute, day, week, year etc.).
- Continuous gap condensing. This approach aims to assess the ‘real’ changes by the occurrence of change frequency.
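To make the distinction concrete, here is a minimal Python sketch of record condensing on its own (illustrative data): it removes only the value-identical repeat, never a genuine change, whereas the two time condensing approaches explained below may remove genuine changes as well:

```python
# Each tuple is (effective date, attribute value) for one key, sorted by time.
rows = [
    ("2015-02-03 09:00", "Berlin"),
    ("2015-02-03 10:00", "Berlin"),   # redundant: later time, same value
    ("2015-02-03 11:00", "Hamburg"),  # genuine change: always kept
]

# Record condensing: keep a row only if its value differs from the previous one.
condensed = [rows[0]] + [r for prev, r in zip(rows, rows[1:]) if r[1] != prev[1]]
print(condensed)  # the 10:00 'Berlin' repeat is gone, both changes remain
```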
Let me start explaining this with another disclaimer: the level of relevance for condensing changes at PIT level is really about what makes sense for the use-case. For certain scenarios, such as financial transactions, every change may be relevant. On the other hand, tracking changes in address details may only require looking at the changes at day level.
As you can imagine, this really depends on the intended use of the information. The requirements for fraud detection and for supporting (electronic) direct marketing are very different! In short: time condensing removes changes at a level of abstraction, which means it warrants careful consideration, should be discussed with business subject matter experts and requires a sound understanding of the data and its behaviour.
Frequency condensing is applied by selecting a defined time period to condense changes for. Only the most recent change for that time period will be retained, and the others will be removed.
Consider the following example:
Depending on the level of abstraction you are interested in, you will end up with fewer or more rows. The ‘minute’ level of abstraction will remove two out of the total of eight records (2015-02-03 22:01:52 and 2015-02-03 23:25:00) simply because another change record is available within the scope of one minute. This is one way to remove the record from the combined result that is potentially created by the Hub key in the Purchase Transaction example.
I have implemented this using a SQL window function (FIRST_VALUE): partitioning the set by the designated key and date part, and taking the first (highest) value when ordering the effective dates from high to low.
The following example has implemented this at ‘hour’ level:
If you add a simple WHERE clause where the ‘Time Condenser’ value equals the Effective Date, you will have effectively condensed the changes by hour. Retaining only the last record for each range is a design decision, which arguably (and easily) could be changed to retain the first iteration instead. There are many ways of achieving this.
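As a self-contained illustration, the pattern can be tried out in SQLite (version 3.25 or later for window functions) from Python. The table and column names are made up for this sketch, and strftime stands in for the date-part extraction:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sat (customer_id TEXT, effective_date TEXT, city TEXT);
    INSERT INTO sat VALUES
      ('C1', '2015-02-03 22:01:12', 'Berlin'),
      ('C1', '2015-02-03 22:01:52', 'Hamburg'),
      ('C1', '2015-02-03 23:25:00', 'Munich');
""")

# FIRST_VALUE per key and hour, ordering the effective dates high to low,
# then keeping only rows where the 'time condenser' equals the effective
# date: i.e. only the most recent change within each hour survives.
rows = con.execute("""
    SELECT customer_id, effective_date, city
    FROM (
        SELECT *,
               FIRST_VALUE(effective_date) OVER (
                   PARTITION BY customer_id,
                                strftime('%Y-%m-%d %H', effective_date)
                   ORDER BY effective_date DESC
               ) AS time_condenser
        FROM sat
    )
    WHERE time_condenser = effective_date
    ORDER BY effective_date
""").fetchall()
print(rows)  # the 22:01:12 row is condensed away in favour of 22:01:52
```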
Continuous gap condensing
Credits go to my esteemed colleague Zaiben Chen here for sparring with me on this. The idea behind continuous gap time condensing is that a preset range is a relatively arbitrary decision and is ruled by a ‘hard’ border. For instance, 22:01:59 and 22:02:00 will be two separate rows when condensing at minute level, even though the real gap is only 1 second. This also depends on the implementation of date parts at RDBMS level, but that goes into too much detail for this post (worth looking into; try SELECT DATEDIFF(MINUTE, '2017-01-01 16:59', '2017-01-01 17:01') if this sparks your interest).
A potential solution is the ‘continuous gap’ approach, which looks at the patterns of changes to derive the ‘true change’. This works best in a batch environment where ETL runs on more typical schedules, with ‘spikes’ of changes happening at various intervals. You could argue that the overall pattern represents ‘real changes’ and the ‘spikes’ represent more technical changes caused by ETL behaviour, but this is of course dependent on the individual setup.
In any case, continuous gap condensing works by calculating the gap between individual changes, and removing rows where the gap is smaller than a certain threshold / level of abstraction. The obvious advantage is that this is not impacted by a ‘hard border’ as is the case with the frequency condensing. Consider the example below.
This can be implemented by again looking at the next row relative to the key and calculating the difference between the effective dates, for instance via a LEAD function over the effective date, partitioned by the key and ordered by the effective date in ascending order. As with most of these things, there are other ways to achieve this.
The gap threshold can be easily adjusted by manipulating the value in the WHERE clause.
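A minimal sketch of this in SQLite (again 3.25+ for window functions; names and data are illustrative): LEAD fetches the next effective date per key, and the WHERE clause drops any row whose successor falls within the threshold:

```python
import sqlite3

THRESHOLD_SECONDS = 60  # the gap threshold / level of abstraction

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sat (customer_id TEXT, effective_date TEXT, city TEXT);
    INSERT INTO sat VALUES
      ('C1', '2015-02-03 22:01:59', 'Berlin'),
      ('C1', '2015-02-03 22:02:00', 'Hamburg'),
      ('C1', '2015-02-03 23:25:00', 'Munich');
""")

rows = con.execute("""
    SELECT customer_id, effective_date, city
    FROM (
        SELECT *,
               LEAD(effective_date) OVER (
                   PARTITION BY customer_id
                   ORDER BY effective_date
               ) AS next_date
        FROM sat
    )
    WHERE next_date IS NULL  -- the most recent change is always kept
       OR strftime('%s', next_date) - strftime('%s', effective_date) >= ?
    ORDER BY effective_date
""", (THRESHOLD_SECONDS,)).fetchall()
print(rows)  # 22:01:59 is condensed into 22:02:00 (a 1-second gap),
             # even though the two rows fall in different minutes
```

Note how the 22:01:59 row is removed despite sitting in a different minute than its successor, which is exactly the case the ‘hard border’ of frequency condensing gets wrong.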
Implementing time condensing when merging time-variant record sets
If you consider the original PIT / Dimension implementation, it becomes very easy to add (toggle) time condensing at various levels of abstraction. Since the only information you need for time condensing is the effective dates themselves, you can implement this very early in the process and possibly reduce the size (and I/O) quite drastically. To summarise: time condensing needs to be implemented after the range selection, and before the creation of the timeline against which the context is joined.
It is also important to note that any time condensing needs to be implemented before record condensing in order to avoid redundant rows in the final result. Any removal of records may mean that the remaining rows no longer represent real changes when you look at their values, which would then require another record condensing step. This is why time condensing (and any other condensing logic you may want to add) needs to be implemented first.
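The ordering argument can be demonstrated with a small Python sketch (illustrative data, frequency condensing at minute level): every input row is a genuine change, yet after time condensing two adjacent survivors carry the same value, which only a subsequent record condensing pass can remove:

```python
from itertools import groupby

# (effective time, value) for one key, sorted; every row is a genuine change.
rows = [
    ("08:59:00", "Berlin"),
    ("09:00:10", "Hamburg"),   # not the last change of minute 09:00: dropped
    ("09:00:50", "Berlin"),
]

def frequency_condense(rows):
    """Keep only the last row per minute (the 'HH:MM' prefix)."""
    return [list(grp)[-1] for _, grp in groupby(rows, key=lambda r: r[0][:5])]

def record_condense(rows):
    """Drop rows whose value equals that of the previous surviving row."""
    return [rows[0]] + [r for prev, r in zip(rows, rows[1:]) if r[1] != prev[1]]

after_time = frequency_condense(rows)   # 08:59 'Berlin' and 09:00:50 'Berlin'
final = record_condense(after_time)     # the now-redundant 09:00:50 row goes
print(after_time, final)
```

Running record condensing first (or not at all at the end) would leave two consecutive ‘Berlin’ rows in the result.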
An example of this is added here (in a few days you will be able to generate this from the Virtual EDW software as well). It is best seen in the context of the full PIT logic as explained in this post, because only a section is shown here; for instance, the SQL below does not show the required record condensing logic.
As indicated earlier in this post, it is important to fully understand the behaviour of your data and to involve business users in making these decisions. Understanding the behaviour of the data includes many things, but in this context most notably understanding the effect scheduling has on the creation of data in the Data Warehouse. If, for instance, your Data Warehouse runs on a daily schedule with a single snapshot of changes, applying gap condensing for 10 minutes (or even an hour) could make sense; logically, this would only remove changes caused by the behaviour of the loading patterns. This is different when the complete solution runs continuously in 5-minute micro-batches…
Hopefully this post presents some points to help you make the right decision, and to implement it as well.
I’ve almost wrapped the implementation of this in the Virtual EDW prototyping software. Watch this space!