End dating, or not, in Data Vault

This post captures the progression of my understanding about ‘end dating’ records in Data Vault Satellites (and Link-Satellites too, if used).

The concept of end dating is ingrained in day-to-day data warehousing, and generally pretty well understood. It might seem a less interesting thing to write about, but it’s actually a wide-ranging topic covering many perspectives. I hope the options and considerations in this article are of use to someone out there.

End dating is defined here as the setting of an ‘expiry’ date for a given ‘effective date’. This is typically implemented by taking the effective date of the next record, ordered by effective date, as the expiry date of the current one.

Setting the expiry date creates a visible per-record ‘timeline’, or ‘date range’. This is visualised in the diagram below.
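By way of illustration, persisting the expiry date could look like the sketch below (SQL Server syntax, using a hypothetical Satellite `sat_customer` with `hub_customer_key`, `effective_date` and `expiry_date` columns; all names are illustrative, not a prescribed model):

```sql
-- Recalculate each record's expiry date as the effective date of the
-- next record for the same key; '9999-12-31' marks the open-ended row.
WITH ordered AS (
    SELECT
        expiry_date,
        LEAD(effective_date, 1, '9999-12-31') OVER (
            PARTITION BY hub_customer_key
            ORDER BY effective_date
        ) AS new_expiry_date
    FROM sat_customer
)
UPDATE ordered
SET expiry_date = new_expiry_date;
```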

For brevity, this post doesn’t address various potential timelines, what timeline to select, zero records etc. Suffice it to say, end dating can be applied to any timeline. Typical Data Vault implementations these days contain at least two timelines, an assertion (technical) timeline and a state (business/functional) one. I’m also not going into the ‘inclusive’ versus ‘exclusive’ argument here for now.

For more details on this please have a look at the gentle introduction to bitemporal data challenges.

Insert only?

For a long time, I have been in the camp that prefers to not implement physical (persisted) end dating in a Data Vault solution, favouring an insert-only approach instead.

End dating comes with a resource cost. It requires compute and more importantly I/O to inscribe the updates to disk.

By itself, the resource cost may not be prohibitive. But it often turns out that the end-dates that are written to Satellite records are not further used in downstream processes, which can make end dating somewhat wasteful. Whether this applies or not depends on the selected pattern for getting data out of the Data Vault, and there are approaches where end dates are helpful.

Understanding how to query the Data Vault to deliver data for consumption helps in deciding how to manage end dating.

When creating downstream time-variant objects, such as dimensions or historical extracts, usually only a few columns from each Satellite object are selected – not the entire Satellite. This change in column scope, as part of merging various time-variant tables (Satellites), means that the timelines of the resulting deliverable have to be recalculated anyway.

This may defeat the purpose of spending the resources on inscribing the end dates in the first place.

The column scope concept applies when you implement a truly virtual data delivery, or want to create a full type-2 dimension or similar in one pass (and incrementally process it on an ongoing basis) for maximum flexibility. Basically, anywhere you query the Data Vault using date-range joins.

If all you’re doing is creating snapshots, however, then having the end-date in the Satellites does help. You’re actually using the end dates to pinpoint the value as per the snapshot date. Of course, snapshots are less scalable, have more overhead, and don’t support the virtual data warehouse paradigm.
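The snapshot lookup itself is straightforward when the expiry date is available. A minimal sketch, again with hypothetical names (`sat_customer`, a `@snapshot_date` parameter) and assuming an inclusive effective date with an exclusive expiry date:

```sql
-- Pinpoint the Satellite record that was valid as per the snapshot date.
SELECT s.hub_customer_key,
       s.effective_date,
       s.expiry_date,
       s.customer_name
FROM sat_customer AS s
WHERE s.effective_date <= @snapshot_date
  AND s.expiry_date    >  @snapshot_date;
```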

But they’re also easier to implement and effective.

Implementing a true insert-only Data Vault, with no end dates anywhere whatsoever (derived or persisted), is harder than it may initially appear, but it is definitely achievable. It requires a specific query pattern when performing the date-range joins, which by itself is pretty neat. I’ll write a separate post on this shortly.

Driving keys and end dating

Another consideration for end dating is how to make data from driving key based Satellites easily available and accessible to users.

The ‘driving key’ is a special mechanism through which you ‘force’ a relationship between two Hubs to be end-dated. For example, if you take a ‘Driver’ Hub and a ‘Car’ Hub, you might want to enforce that a Car can only ever have one active Driver. So, when a new relationship (Link record) arrives where the Car has a new Driver, the relationship from the Car to the previous Driver is end-dated. The Car is the driving key in this example: the static Hub in the relationship.

It’s a business rule, a decision you might want to make when there really is no better way to close a relationship. For example, when there are no data points available in the data itself to achieve the same result.

Without adding the end-date to the driving key (Link-)Satellite, anyone that wants to query data from the Data Vault needs to ‘know’ what driving key to apply. Only then can users calculate the correct end date and corresponding date ranges that are necessary to understand when each relationship was valid at a point in time.

Driving keys continue to cause problems even for experienced Data Vault practitioners. Needless to say, keeping this open for interpretation and mistakes is an area of concern, and often requires a better solution than simply doing ‘insert-only’.

There are suitable alternatives if you want to implement the driving key mechanism in different ways. These are worth a look if you’re going for insert-only and want to mitigate the issue outlined here.

Implementation choices…

Various projects I’ve seen implement persistent end dating as a separate process, which runs independently of the process that inserts new records. This can cause many issues, usually timing-related: the end-dating process has not yet completed, but data is already being selected downstream.

When implementing end dating as part of the loading procedure (‘ETL’), make sure that end dating is part of the procedure that does the inserts as well, not implemented as a separate process. Both process steps together form a single functional unit of work, which should be executed as a single module. If, for whatever reason, something goes wrong, the inserts and updates should be rolled back to the original state, as it was before starting the process that failed.
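As a sketch of this single unit of work (SQL Server syntax, with illustrative staging and Satellite names), the insert and the end-date update can be wrapped in one transaction so a failure rolls back both:

```sql
BEGIN TRY
    BEGIN TRANSACTION;

    -- Step 1: insert the new delta records from staging (names illustrative).
    INSERT INTO sat_customer (hub_customer_key, effective_date, expiry_date, customer_name)
    SELECT stg.hub_customer_key, stg.effective_date, '9999-12-31', stg.customer_name
    FROM stg_customer AS stg;

    -- Step 2: end-date the now-superseded records in the same unit of work.
    WITH ordered AS (
        SELECT expiry_date,
               LEAD(effective_date, 1, '9999-12-31') OVER (
                   PARTITION BY hub_customer_key
                   ORDER BY effective_date
               ) AS new_expiry_date
        FROM sat_customer
    )
    UPDATE ordered
    SET expiry_date = new_expiry_date;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- Any failure rolls back the inserts and updates together,
    -- returning the Satellite to its original state.
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW;
END CATCH;
```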

A final personal pet peeve I have with end dating is that it is often applied on the technical / assertion timeline (the load date timestamp in Data Vault, or LDTS). Conceptually, the LDTS should only ever be used for filtering; that is, to limit the selection of data for a correct ‘as-was’ representation of the data. Having an end date for the LDTS might lead users to believe it’s OK to perform date-range joins against this timeline, which will invariably give odd results.

There is some value in having an LDTS end date, perhaps, if users want to understand what (state of) data was received at a point in time. Or, to quickly query the most recently received data point. These are relatively minor concerns, and easily addressed by training and documentation. It’s sometimes unfortunate that so many date range / end dating examples use the technical timeline, though.

End dating on the fly

A much-used argument for insert-only is that you can easily derive an end date ‘on the fly’. A simple window function (SQL Server syntax example) will give you the effective date of the next row, ordered by the date/timeline you are organising your data against:

LEAD(<date>, 1, '9999-12-31') OVER (PARTITION BY <key> ORDER BY <date>)

Or, to get the ‘current record’:

LEAD('N', 1, 'Y') OVER (PARTITION BY <key> ORDER BY <date>)

Proponents of insert-only will argue you can easily add this to your queries when you need the end date. Others will argue that it may not be feasible to ask users of the Data Vault to understand and/or do this.

Nevertheless, things become complicated when the driving key is involved again. The window function from above is still usable, but the ‘key’ requires a join to the parent Link (or Hub, in ELM-style keyed instances) to figure out which Hub keys form the relationship – and then pick the ‘right one’ to add to the PARTITION BY.
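Continuing the Driver/Car example, a sketch of this join could look as follows (assumed, illustrative model: a Link `link_driver_car` with `link_key`, `hub_car_key` and `hub_driver_key`, and a Link-Satellite `lsat_driver_car` keyed on `link_key` and `effective_date`):

```sql
-- Derive the end date for a driving-key relationship on the fly.
-- The join to the parent Link is needed to obtain the driving key
-- (the Car) for the PARTITION BY.
SELECT
    l.hub_car_key,
    l.hub_driver_key,
    ls.effective_date,
    LEAD(ls.effective_date, 1, '9999-12-31') OVER (
        PARTITION BY l.hub_car_key      -- the driving key
        ORDER BY ls.effective_date
    ) AS expiry_date
FROM lsat_driver_car AS ls
JOIN link_driver_car AS l
  ON l.link_key = ls.link_key;
```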

Compute cost also comes back into play here, because you will have to perform a more expensive operation multiple times for something you could have inscribed once. The counter-argument is that it may not be required as often as perceived. But I’ll leave that for now, as this has been covered earlier on.

Views – an option

The concern of users having to write complex SQL can be mitigated by implementing a view layer on top of an insert-only Data Vault, as seen in various implementations.

This addresses several concerns. You can still do insert-only and satisfy anyone that prefers to see date ranges and/or is concerned about the driving keys.
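A minimal sketch of such a view, reusing the hypothetical `sat_customer` Satellite from earlier, so users get date ranges without writing the window function themselves:

```sql
-- Expose derived end dates over an insert-only Satellite.
-- The underlying table stays insert-only; the date range is virtual.
CREATE VIEW vw_sat_customer AS
SELECT
    hub_customer_key,
    effective_date,
    LEAD(effective_date, 1, '9999-12-31') OVER (
        PARTITION BY hub_customer_key
        ORDER BY effective_date
    ) AS expiry_date,
    customer_name
FROM sat_customer;
```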

Computed columns can seem like a solution too, but they don’t usually work for the driving key pattern, because this requires a join to the parent tables – which is typically not supported, unless you implement the driving key using one of the alternative patterns that does not require a join. And even for regular Satellites, computed columns are hard to implement, requiring custom functions which can slow things down.

The view layer can work well, aside from performance/scalability issues in larger systems. It does mean that there are more objects in the database to look after.

But problems have been reported on platforms whose business model focuses on cheap storage and (more) expensive compute.

When deciding on how to tackle end dating, therefore also consider the compute cost involved in running the window functions. For a platform such as Snowflake, it’s often easier/better to persist the end-dates so that the cost is incurred only once. Of course, you can still follow the full insert-only approach here too and not have any end-dates at all – but if you’re implementing snapshots then persisting the end dates seems to be the best overall option.

Summary, and recommendations

For now, my preferred solution design is still insert-only, without any end dating whatsoever. It’s the most flexible and scalable solution, and in my mind the most elegant one. But it comes with a complexity overhead.

To support the insert-only paradigm I am avoiding the traditional driving key LSAT approach, in favour of an alternative that directly refers to a Hub. This is sometimes referred to as a foreign key Link, or one-to-many Link. It’s a Link that refers to a Hub directly and contains the effective date history.

If you use a pay-per-compute platform and have decided that end-dates are ‘in’, it generally helps to persist end dates to optimise cost.

In all cases, when you use snapshots to deliver data out of the Data Vault it also helps to have end-dates available – either in a view or persisted in the Satellite directly.

A final consideration is that having end dates in the Satellites sometimes simply makes the Data Vault easier to work with. There is a certain comfort in having a visual of the timelines for a record, or across time for a key, without having to write specific code for it. The same comfort applies to being able to directly select ‘the most recent record’.

You can have a great Data Vault without any end dates, and it is arguably the solution with the most technical beauty. But in the real world various considerations apply, and team skills/experience, technology, and patterns used all play a role in deciding what end dating approach works best.

Roelant Vos

Ravos Business Intelligence admin
