Deterministic dimension keys in a virtual Data Vault
Summary
By extending the Data Vault patterns with a row number, called a ‘data delivery key’, it is possible to define deterministic keys for downstream use, such as the dimension keys for a dimensional model.
This approach leverages the order of data that is already established in the Data Vault, by virtue of the inscription timestamp (load date time stamp) and the batch row order number (known as the inscription record id or source row id).
By adding a ‘0’ before each (integer / bigint) data delivery key value and including a checksum digit, the in-scope data delivery keys can be concatenated into a string that uniquely identifies the change record across the involved tables.
This concatenated value can be interpreted as an integer / bigint for use as a dimension key, and the same combination of data will always yield the same key.
Since this value is deterministic and the order / sequence is guaranteed, it can also be used to issue dimension keys that start at 0 and increment from there, if required.
And, of course, the concatenated value can be hashed into a GUID or binary checksum that can act as dimension key. In this scenario, additional sanding elements (e.g. ‘|’) can be added to the concatenation for another layer of collision protection.
In all cases, the dimension will always provide the same results without requiring any data persistence in a PIT table or equivalent.
This enables the delivery of a fully virtual data solution, providing maximum flexibility and ease of maintenance.
Update 2023-02-07: some really good detail has come out of public discussions on this topic, and there are various similar (though not exactly the same) approaches which are worth looking into. Most notably:
- A post by Patrick Cuba on this topic: https://www.snowflake.com/blog/point-in-time-constructs-and-join-trees/
- Considerations made by the fictional ‘FastChangeCo’ company in a simulation by Dirk Lerner: https://tedamoh.com/en/data-vault/243-data-vault-dimension-ids
Delivering a dimensional model from Data Vault
When it comes to delivering data out of the Data Vault, anything goes. The goal is to deliver data in a format that is fit for consumption – which can be anything. But, for all intents and purposes the quintessential approach for delivery is a Dimensional Model.
In simple terms, a dimensional model means defining central fact tables that contain metrics, and supporting dimension tables that can describe these measures – and can be used to aggregate the metrics against. These tables can possibly be grouped to represent a ‘data mart’, which may contain multiple fact- and dimension tables all related to the same subject area or domain.
It is a long-time tried and tested approach with which most Business Intelligence (reporting) and Online Analytical Processing (OLAP) software products are familiar.
The standard pattern to provide data from a Data Vault in a way that can easily be merged in a persistent data mart is relatively straightforward, and can be used to deliver the necessary data into any downstream format using the same approach. Details on how to do this can be found here and here.
Using the data set that results from this standard selection pattern, data can be persisted in a data mart in many ways. This can be, for example, by implementing a dimensional model. Or, data can be persisted in a Point-In-Time (PIT) table and used either for persisted or virtual data marts.
These are the most common approaches for delivering data out of a Data Vault model.
To stay close to the original dimensional model implementation, dimension keys are often integer values. Traditionally, the dimension key is issued first and persisted in the dimension table. The fact table load would then ‘look up’ this and any other relevant dimension keys to create the unique key combination for the fact table.

Integer values typically require less storage space than character fields, and allow for fast joining of fact- and dimension tables.
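As a simplified illustration of this traditional lookup pattern (a sketch only; all table and column names are made up for the example), a fact load might resolve its dimension keys like this:
-- Hypothetical example of a fact load looking up persisted dimension keys.
-- All table and column names are illustrative.
INSERT INTO FACT_SALES (CUSTOMER_DIM_KEY, PRODUCT_DIM_KEY, SALES_AMOUNT)
SELECT
  dc.CUSTOMER_DIM_KEY,
  dp.PRODUCT_DIM_KEY,
  stg.SALES_AMOUNT
FROM STG_SALES stg
JOIN DIM_CUSTOMER dc ON dc.CUSTOMER_CODE = stg.CUSTOMER_CODE
JOIN DIM_PRODUCT dp ON dp.PRODUCT_CODE = stg.PRODUCT_CODE;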
When using virtual marts on top of a PIT table, the keys are issued and persisted there.
Assigning dimension keys for a virtual mart
So far, the approach requires persistence of data somewhere after the Data Vault model to be able to generate these dimension keys.
Either the dimension is persisted, and the keys are generated there. Or, the necessary data is persisted in a PIT table which can also facilitate the distribution of the keys for use in a virtual mart.
Rebuilding / reloading any of these will almost certainly lead to different keys being generated. And, having new keys in one of the dimension tables usually means doing a reload of the fact table as well.
What if you are building a truly virtual mart straight on top of the Data Vault model without persisting data anywhere? What if we could deliver a way to consistently, deterministically generate numeric dimension table keys? Every time we drop and rebuild either the PIT or dimensional model we would then get the same results.
After all, the approach to load data from a Data Vault for the presentation / delivery layer is the same anyway for a PIT table or a dimensional model.
You can do this in one go.
What better way to test this than creating a view (representing the dimension) on top of the Data Vault model? If this problem can be solved using a view, it certainly is possible using tables and loading processes that populate these tables with data.
This mindset of ‘virtual data warehousing’ is a red thread through most of this blog, my training, and the available open-source software that support it. In my designs everything except the Persistent Staging Area (PSA) is a view – including the Data Vault model.
Does this mean that everything will always have to be virtual? It doesn’t, but the fact that it can be done means that the patterns and supporting (code generation) templates fully work. If running a view presents the data correctly, it is relatively straightforward to create the target table, generate the data logistics code, run it and then query the target table – and get the same results.
In my view, a view and a data logistics (‘ETL’) process are interchangeable, different flavours of technical implementation that can be generated at will, whenever it makes sense, and swapped if and when required.
A view is usually just easier, especially to start with.
But, to truly deliver a virtual approach the (view) results must be deterministic. This means that the dimension keys that are issued must always be the same each time you run the view(s) that represent the dimensional model.
In order to be deterministic, the order of data must be known and coded. An elegant way to do so presents itself in the Data Vault itself.
Here, data is already stored in an order that is unambiguous and immutable – as per the time of arrival or ‘inscription timestamp’ (the ‘load date time stamp’ in Data Vault).
In Data Vault, the combination of the inscription / load date time stamp and batch row processing order guarantees that data can be processed in the same order at any time – even when multiple changes for a key are processed in a single step or batch.
This batch row processing order column is known as the inscription record id, or sometimes source row id. I’ll use the term inscription record id for this going forward.
Each row represents a unique data change as per the time of arrival.
Please note that standard Data Vault only mentions the load date time stamp, but this may not be unique enough in some cases – hence the addition of the inscription record id. But that’s for another time.
What matters now is that this behaviour can be used to issue deterministic dimension (or other forms of data delivery) keys, by adding a simple extension to the pattern: the data delivery key.
Delivery keys
In the Data Vault, data is stored as per the time of arrival. It is the job of the data solution to ensure correct replaying of this history and thus ensure that the order is maintained forever. This is part of the unbreakable ‘back room’ of the solution, where everything is designed to ‘just work’.
Complexity ensues in the step towards the presentation layer – the dimensional model in this case.
As part of the data delivery, data must be reorganised against a suitable business (or ‘state’) timeline. This is represented in the example below via the ‘state timestamp’ column – but how this pattern works is beyond the scope of this post.
For now, I would go as far as to say that in almost all cases reporting data against the technical (inscription / assertion) timeline will put the solution at risk of providing incorrect results from a business perspective.
So, we need to ensure that the data order is kept safe before we make this transition. Again, this can be done in the Data Vault by adding a few minor tweaks to the pattern that can be easily generated.
For this experiment, I have added a row number to each Data Vault entity – the data delivery key.
An example of how this would look in a Data Vault is provided below:

The solution is very simple, and modifying the code generation templates to do so is a piece of cake.
A Data Vault Hub or Link that is extended with a data delivery key, would look like this:

The template modification only required the following logic to be added:
ROW_NUMBER() OVER
(
ORDER BY <Inscription Timestamp>, <Business Key(s)>
)
AS DATA_DELIVERY_KEY
Note that the zero record (‘unknown’) is issued a 0 key.
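As a minimal sketch of what such an extended Hub could look like as a view (the schema, table and column names are assumptions for the example, not a prescribed standard):
-- Illustrative only: a virtual Hub extended with a data delivery key.
-- The zero ('unknown') record would be issued key 0 separately; for example it
-- could be added via a UNION, or by letting it sort first and subtracting 1.
CREATE VIEW DELIVERY.HUB_CUSTOMER AS
SELECT
  CUSTOMER_HASH_KEY,
  CUSTOMER_CODE,
  INSCRIPTION_TIMESTAMP,
  ROW_NUMBER() OVER (ORDER BY INSCRIPTION_TIMESTAMP, CUSTOMER_CODE) AS DATA_DELIVERY_KEY
FROM DATAVAULT.HUB_CUSTOMER;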
For a Satellite, the solution is very similar. The main thing to consider is including the inscription record id (order of arrival for each data logistics process) in case the inscription / load date timestamp itself is not unique (due to precision issues in some databases).

Once again, the template and corresponding code generation logic is extremely straightforward:
ROW_NUMBER() OVER
(
ORDER BY <Inscription Timestamp>, <Inscription Record Id>
)
AS DATA_DELIVERY_KEY
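Applied to a Satellite, a minimal sketch of the resulting view (again with assumed table and column names) could be:
-- Illustrative only: a virtual Satellite extended with a data delivery key.
CREATE VIEW DELIVERY.SAT_CUSTOMER AS
SELECT
  CUSTOMER_HASH_KEY,
  INSCRIPTION_TIMESTAMP,
  INSCRIPTION_RECORD_ID,
  CUSTOMER_NAME,
  ROW_NUMBER() OVER (ORDER BY INSCRIPTION_TIMESTAMP, INSCRIPTION_RECORD_ID) AS DATA_DELIVERY_KEY
FROM DATAVAULT.SAT_CUSTOMER;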
These data delivery key values all represent unique change records in the Data Vault. Each key is unique for the table, and each record in the table represents a unique data change as recorded in the data solution.
Why again would you consider issuing these delivery keys in the Data Vault, and not in the PIT or dimensional model? Why this deviation from existing practices?
One answer is to experiment, to see if anything can be done better. Another answer is that being able to issue deterministic dimension keys allows a greater degree of flexibility in refactoring the model; it gives more freedom to drop and recreate PIT and presentation layer tables.
It also opens up the option to not implement a PIT table at all, and create deterministic dimensions straight on the Data Vault. Loading dimensions using inscription timestamp-based load windows is a very efficient and fast approach for data processing that doesn’t require a PIT.
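As a sketch of what such a load window could look like (the variable, table and column names are assumptions for the example):
-- Illustrative only: restricting the dimension load to new inscriptions.
DECLARE @LoadWindowStart DATETIME2 =
  (SELECT ISNULL(MAX(INSCRIPTION_TIMESTAMP), '1900-01-01') FROM MART.DIM_CUSTOMER);

SELECT *
FROM DELIVERY.DIM_CUSTOMER
WHERE INSCRIPTION_TIMESTAMP > @LoadWindowStart;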
Even if you do use a PIT, there is no harm in rebuilding it because the underlying Data Vault delivery keys will guarantee the outcome will be the same.
An ability to drop and rebuild parts of the solution at will makes it much easier to manage over time.
Building the dimension key
When the Data Vault can provide these unique data delivery keys, it is time to use these to construct a dimension key when bringing together the necessary columns to define the dimension.

The dimension key is really only a string concatenation of all data delivery keys (that identify unique data changes). This string of numbers can be saved and interpreted as a numeric value.
The challenge is to ensure uniqueness across the combination of keys, to avoid collisions.
For example, 105 + 11 should not be evaluated the same as 10 + 511 when concatenating the values. While it is very tempting to use a sanding value here (e.g. ‘|’), as we would do in regular hashing, I’m trying to avoid this because the result cannot be represented as an integer value.
We can’t really use a sanding value unless we’re going to hash the result anyway, but the goal here is to deliver a numeric value to act as dimension key. But of course, if you are looking to deliver a hash value or GUID as dimension key, this is a very practical solution.
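For example, a sketch of such a hashed dimension key over the sanded concatenation (assuming SQL Server's HASHBYTES function and made-up table aliases) could be:
-- Illustrative only: hashing the sanded concatenation of data delivery keys.
HASHBYTES('SHA2_256',
  CONCAT(
    CAST(hub.DATA_DELIVERY_KEY AS VARCHAR(20)), '|',
    CAST(sat.DATA_DELIVERY_KEY AS VARCHAR(20)), '|',
    CAST(lnk.DATA_DELIVERY_KEY AS VARCHAR(20))
  )
) AS DIM_CUSTOMER_KEY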
If an integer value is required, part of the solution is to make the combination of values as unique as possible.
The trick is to add a ‘0’ (zero) before each data delivery key to avoid most collisions, and add a checksum digit at the end.
Consider the following examples when joining multiple Data Vault tables together:

The first 0 will be dropped when interpreting the key as a numeric value.
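In SQL terms, and assuming made-up aliases for the joined Data Vault tables, the concatenation could be sketched as:
-- Illustrative only: a '0' is added before each data delivery key.
-- The checksum digit (see below) is appended afterwards, and the whole string is
-- then interpreted as a number, at which point the leading zero is dropped.
CONCAT(
  '0', CAST(hub.DATA_DELIVERY_KEY AS VARCHAR(20)),
  '0', CAST(sat.DATA_DELIVERY_KEY AS VARCHAR(20)),
  '0', CAST(lnk.DATA_DELIVERY_KEY AS VARCHAR(20))
) AS DIMENSION_KEY_STRING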
Adding the 0 works in most, but not all cases. Consider the example below, kindly provided by Christian Haedrich who has been invaluable in brainstorming these things:

This is why the value must also be extended with a simple checksum. Various approaches can apply, including the sum or product of all values. The possibility of creating a really large number looms here, so I am looking for a simple algorithm that, together with the added 0, provides a good protection against collision.
You can create different variations of the same theme here, but I opted for an algorithm that multiplies the key order against the number of numeric positions in each key.
In the above example that would be (for each row):
- (1 * 1) + (2 * 1) + (3 * 3) = 12
- (1 * 1) + (2 * 3) + (3 * 1) = 10
So for the first row, the first key has 1 digit, the second key has 1 digit, and the third key has 3 digits, each multiplied against its position.
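A sketch of this checksum in SQL, with the key positions hard-coded for three tables and the same made-up aliases as before, could be:
-- Illustrative only: checksum as the sum of (key position * number of digits).
(1 * LEN(CAST(hub.DATA_DELIVERY_KEY AS VARCHAR(20)))) +
(2 * LEN(CAST(sat.DATA_DELIVERY_KEY AS VARCHAR(20)))) +
(3 * LEN(CAST(lnk.DATA_DELIVERY_KEY AS VARCHAR(20)))) AS CHECKSUM_VALUE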
The result then would become:

So far, so good. But won’t these numbers get really large? What would happen if each table would contain many millions of records?
For example, as shown below:

This will certainly cause issues.
In SQL Server, a bigint can store values from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. The largest numeric precision (38) can store values up to 10^38 (+/-100,000,000,000,000,000,000,000,000,000,000,000,000).
For reference, the generated key above contains 65 positions! And the checksum digit has not even been added yet.
Suffice it to say, there are definite limitations for this approach. If some of your Satellites contain almost a billion records or more, virtual marts may not be the way to go anyway. And thankfully there are viable alternatives.
Of course, the most obvious approach is to then hash the concatenated value. That works, and provides a deterministic dimension key also. It’s not really what we set out to deliver, but it’s definitely an option, especially for large data sets.
If integer values are still required, you can also issue a sequence id by relying on the order of the concatenated dimension key. That would reset the seed and yield lower integer values.
Generating some code for SQL Server for this, the result looks like:

In this case, the ‘dimension key derived’ is the concatenated string of data delivery keys, and the ‘dimension key issued’ is a sequence number leveraging the order.
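A minimal sketch of such a view (names are assumptions; the ‘dimension key derived’ column holds the concatenated value discussed above) could look like this:
-- Illustrative only: issuing a compact, deterministic integer key by relying on
-- the order of the derived (concatenated) dimension key.
CREATE VIEW DELIVERY.DIM_CUSTOMER AS
SELECT
  ROW_NUMBER() OVER (ORDER BY DIMENSION_KEY_DERIVED) AS DIMENSION_KEY_ISSUED,
  DIMENSION_KEY_DERIVED,
  CUSTOMER_CODE,
  CUSTOMER_NAME,
  STATE_TIMESTAMP
FROM DELIVERY.DIM_CUSTOMER_DERIVED;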
Additional benefits
Aside from being able to truly deliver a completely deterministic virtual data warehouse which is generated from metadata – and therefore quick to modify – there may be some additional benefits which are not immediately obvious.
In some databases and configurations, having a sequential integer value as a clustered index can deliver a net performance optimisation. Hash keys are notoriously harsh on clustered indexes, and while the business key is a good candidate for this, it sometimes helps to have a truly sequential value to place the clustered index on.
Another advantage applies to some front-end / Business Intelligence (BI) software, especially where you want to have the same dimension keys across environments. This is nigh impossible with a non-deterministic approach.
Some BI tools are known to persist keys in a caching component and / or may use these for filtering. In scenarios such as these, it can be convenient to rely on a deterministic dimension key distribution concept.
It may not always apply, but it’s good to have options.
