Using a Natural Business Key – the end of hash keys?
Do we still need Hash Keys?
I have been thinking about the need for hash keys for almost a year now, ever since I went to the Data Vault Day in Germany (Hamburg) at the end of 2016. During this one-day community event, the topic of stepping away from hash keys was raised in one of the discussions after a case study. Both the presentation and the following discussion were in German, so I probably only got half of it (and that’s being generous), but the problem that was raised was related to re-keying issues – where the OLTP system recycles its keys (Primary and Business Keys, to keep things simple).
While this is not specifically an issue with hash keys themselves, the message was that hash keys made handling this issue harder, because of the need to join in the ‘real’ values to make sense of the information. You can’t easily see what you did to ‘make things unique’.
It got me thinking about the various pros and cons of hash keys a bit more. After all, the hash key is nothing more than a representation of the original business key value.
Impacts of hash keys
The introduction of hash keys as part of Data Vault 2.0 was awesome, and credit goes to Dan Linstedt for making this an established practice. Data Warehouse architectures have traditionally relied on the generation of sequence (integer) values to manage the Data Warehouse keys – the meaningless key. In the olden days, you had to make sure you updated your ‘key tables’ first before loading any tables that make reference to these tables (e.g. with Foreign Keys). For instance, a Dimension table had to be loaded before the Fact table.
This dependency also implies that when you have to reload your key table, you also need to reload all the tables that depend on this table via the Foreign Key. This is the concept of the Data Warehouse key distribution, which typically happened as part of an ETL process that also incorporates other concepts.
Data Vault, with its separation of concerns, has isolated the key distribution concept and embedded it into the Hub entity. This has separated the concept from other mechanics such as handling changes over time. In DV 1.0 the key distribution mechanism was still implemented as a seeded (sequence) integer value, but in DV 2.0 this concept and its implementation have been upgraded to using a hashed value of the business key. Because of their deterministic nature, hash keys made it possible to load ‘Facts’ before ‘Dimensions’ (Satellites before Hubs), thus removing loading dependencies. The hash keys also made it possible to truncate (and reload) your Dimension without having to reload your Fact table as well, and to load data in essentially any order. Using hash keys has opened up our collective minds to parallel loading, order-independent processing and easy scale-out.
But, hash keys have their own issues as well. They appear ‘random’ (which they are not, of course) to the typical user, and need to be joined to the Hub tables (when looking at context / Satellites etc.) to retrieve the original meaningful values. To the RDBMS this apparent ‘randomness’ is reflected as well in the sense that hash output does not have a sequential nature – which wreaks havoc on indexes.
If you consider SQL Server, for instance, the default for a Primary Key is to have a Clustered Index, which means the order in which the data are stored on disk is forced by the key (definition). The table on disk is the Clustered Index. So, if your hash key is the Primary Key, the order of the data (on disk) will need to be continuously changed because of this ‘random’ nature, causing significant issues related to page splits and index fragmentation. This can cripple your performance after a single update of a Data Vault table.
Arguably you can change the index to a Non-Clustered Index, but this will mean your table becomes a heap unless you have a clustered index on another attribute. And the hash value is still the key you use to join tables together…
Hash keys also tend to use up a lot of space, which is reflected in I/O costs when working with the data. In my experience, Link tables in DV2.0 are regularly one of the larger tables in the Data Warehouse (bigger than most Satellites), especially when they include more than two business concepts.
Each hash key, depending on the algorithm used, is typically 32 (MD5) or 40 (SHA1) characters – all of which are used (and don’t compress really well). Hash collision is yet another relevant topic, but I won’t discuss it here. Have a look at Roland Bouman’s excellent article on the hash key and its algorithms. Lastly, there is a CPU processing overhead to using hash keys – the algorithm has to be applied to every business key and combination of business keys. Data Vault provides pointers to remediate this by suggesting values can be ‘pre-hashed’ in the Staging Area, the source system (platform) or via distributed calculations. This limits the amount of time the algorithm has to be applied and / or allows processing to be distributed, but the calculation still has to be applied somewhere.
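To make the mechanics concrete, here is a minimal sketch (in Python, with made-up attribute values) of how a hash key is typically derived: the business key attributes are sanded and concatenated, and the hashing algorithm is applied to the result.

```python
import hashlib

def hash_key(*business_key_parts, delimiter="|"):
    """Derive a deterministic hash key from (a combination of) business key attributes.

    NULLs become empty strings and values are trimmed before hashing, so the
    same business key always produces the same hash key, wherever it is computed.
    """
    prepared = delimiter.join(
        "" if part is None else str(part).strip()
        for part in business_key_parts
    )
    return hashlib.md5(prepared.encode("utf-8")).hexdigest()  # 32 characters; SHA1 would give 40

# The calculation can be applied anywhere (Staging Area, source platform, ...)
# and will always agree, which is what enables order-independent loading:
key_in_staging = hash_key("PLN001", "A")
key_in_satellite_load = hash_key("PLN001", "A")
```

Pre-hashing in the Staging Area simply means running this calculation once, earlier in the pipeline, instead of in every downstream load.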
Evolution of these and other concepts is supported by continuous thinking, challenging and contributing to a growing knowledge base, to deepen our understanding of why we’re doing things the way we do. We do this so we can make the right decisions for our specific scenario: options and considerations. The consideration in this case is: would there be a way to maintain the functionality the hash key has introduced while making it work better? At least in some cases?
Do we need a Data Warehouse key at all?
First of all, we need to revisit why we have a Data Warehouse key and the corresponding key distribution concept to begin with. Do we need a Data Warehouse key at all, as opposed to just using the natural key? The short answer is still ‘yes’, but with the introduction of hash keys this may be in need of clarification again.
Sometimes it is forgotten why using sequence IDs for Data Warehouse keys has been around for such a long time, and discussions around not using hash keys tend to lead to solutions that suggest using the source natural key directly (as in, no separate Data Warehouse key attribute). This almost reverts back to the early days of Data Warehousing, but the same problems are still around so this can’t be a good thing.
In reality pairing a ‘source key’ to a Data Warehouse key is still considered best practice for various reasons including, but not limited to, having a consistent and fast way of joining (traditionally on integer values), avoiding variable length problems of keys, solving uniqueness and duplication problems, handling concatenation, composite and hard-coded values. Having a Data Warehouse key separate from the identified key of the source system also allows you to adapt when required, for when there are issues with the source key (e.g. re-keying, duplication across systems and many more).
If we agree that we still need a Data Warehouse key separate from the business key (and I strongly believe we do), there are two main established alternatives:
- Using a sequenced surrogate key (usually an integer value)
- Using a hash equivalent of the (combination of) identified keys
I would like to suggest a third option here: a ‘Natural Business Key’. This is the combination of the keys required to identify a given business concept, as a concatenated value with the required sanding applied. Basically, it is a hash key without the actual hashing applied: a single attribute created from the concatenation of the required attributes (including hard-coded values) and sanding delimiters. I’ve looked at it in some detail and can’t really fault it.
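As a sketch (the attribute values are made up for illustration), the Natural Business Key is simply the sanded concatenation, with the hashing step left out:

```python
def natural_business_key(*business_key_parts, delimiter="|"):
    """Build a Natural Business Key: the concatenation of the business key
    attributes with a sanding delimiter, but without any hashing applied."""
    return delimiter.join(
        "" if part is None else str(part).strip()
        for part in business_key_parts
    )

# A composite business key of a Plan Code and a Plan Suffix:
print(natural_business_key("PLN001", "A"))  # -> PLN001|A
```

Note that the key stays human-readable, and its length follows the underlying business key values rather than being fixed by the algorithm.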
Let’s investigate the benefits the hash keys provide, and consider if there are other ways to realise these without losing functionality:
- Parallel / independent processing. Due to their deterministic nature, hash keys allow child tables to be loaded before parent tables. Sequence IDs don’t support this, but in principle this can be achieved with the Natural Business Key too.
- Cross-platform joining. Sequence IDs don’t support this as they are non-deterministic, but if we convert the natural business key to an agreed fixed format (e.g. char, varchar, nvarchar) this would work as well.
- Reloading / re-initialisation / refactoring / virtualisation. Same as above. Both hash keys and Natural Business Keys are deterministic, so both produce the same results when rebuilding a table from a Persistent Staging Area.
- Pre-calculation / distributed calculation. The Natural Business Key would need less calculation than a hash key. Concatenation and NULL handling will be needed regardless, but the Natural Business Key is completed at that stage whereas the hash value requires the algorithm to be applied as well.
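The first point in the list above can be illustrated with a small sketch (hypothetical table structures and column names): because the key is derived from the business key alone, a Satellite row can be created before the corresponding Hub row exists, and the keys will still line up.

```python
def derive_key(plan_code, plan_suffix, delimiter="|"):
    # Deterministic: no key lookup against the Hub (parent) table is required.
    return delimiter.join([plan_code.strip(), plan_suffix.strip()])

satellite = []  # child table, loaded first
hub = []        # parent table, loaded later (or in parallel)

satellite.append({"MEMBERSHIP_PLAN_KEY": derive_key("PLN001", "A"), "PLAN_NAME": "Gold"})
hub.append({"MEMBERSHIP_PLAN_KEY": derive_key("PLN001", "A")})

# Referential integrity still holds even though the child was loaded first:
assert satellite[0]["MEMBERSHIP_PLAN_KEY"] == hub[0]["MEMBERSHIP_PLAN_KEY"]
```

The same sketch works with a hash key; the point is that both are deterministic functions of the business key, unlike a sequence ID.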
It seems worth it to look into this a bit further.
Investigating Natural Business Keys
My assumption was that, at the end of the day, the hash key would be faster on average, taking into account the occurrence of some very large (combinations of attributes for) business keys. Some business keys would be small, but some would be very large. Since the hash output length is always the same, it arguably could still yield a net performance benefit in comparisons.
But I wanted to be more scientific about this, and it wouldn’t be hard to calculate the average field length of the hash value compared to the natural business key. The logic and metadata to figure this out are already available, so I did what every Data Warehouse virtualiser would do: I tweaked the generation code in the Virtual Data Warehouse application to create a ‘Natural Business Key’ using the same ETL generation metadata as is used to generate the regular Data Vault views and ETL.
The version which uses the Natural Business Key as the Data Warehouse key can then be deployed as a separate version next to the version of the Data Warehouse that uses hash keys. The virtualisation approach allows the two instances (versions) of the Data Warehouse to be active simultaneously, which makes it easy to compare.
Of course, strictly speaking this is not really necessary, since the length and storage requirements of hash keys are standard and can be easily derived. But I did it anyway, just because it’s cool that it can be done, and it took only 30 minutes to implement this new feature, (regression) test it, validate that the referential integrity still holds up, generate the outcomes and deploy. This updated version of the Virtual EDW tool can be downloaded here as always.
When you look at the output, in this case of a Hub, it looks as per the below (MEMBERSHIP_PLAN_HSH):
The above example shows the Natural Business Key as the single attribute of the ‘Plan Code’ and ‘Plan Suffix’ – an example of a composite Business Key. Note the ‘|’ (pipe) sanding value. In some deployments this value is already stored in a concatenated way, but I personally like to maintain the original values. The key with all the zeroes is the generic Hub zero record that is automatically generated.
For comparison’s sake the original hash value version looks like this:
I created a validation script that calculates the average length of all the Data Warehouse keys in all tables within the Natural Business Key version, and compared this to the version that uses a hash value as the Data Warehouse key, using the same approach. When running these scripts against the two Data Warehouse versions of my sample model and comparing the results, the difference in required space is quite large. The results are below.
The output shows that within the sample model that I use for various things, there is no instance where, on average, the length of the Natural Business Key is higher than the hash value. The hash value is always 32 characters (in this case MD5 was used), so the average is, well, 32. The sizes / lengths for the Natural Business Key are significantly lower, which will translate into a reduction in storage requirements and better I/O without loss of functionality.
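The comparison itself is straightforward; a simplified sketch with made-up sample keys shows the idea behind the validation script:

```python
# Hypothetical sample of (composite) business keys from a Hub:
sample_business_keys = [("PLN001", "A"), ("PLN2", "B"), ("PLANCODE123", "XY")]

# The Natural Business Key length varies with the underlying values...
nbk_lengths = [len("|".join(parts)) for parts in sample_business_keys]
average_nbk_length = sum(nbk_lengths) / len(nbk_lengths)

# ...while an MD5 hash key is a fixed 32 hex characters, whatever the input.
average_md5_length = 32

print(f"Average NBK length: {average_nbk_length:.1f} vs MD5: {average_md5_length}")
```

In the actual experiment this calculation runs against the real Hub and Link tables of both Data Warehouse versions rather than a hard-coded sample.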
The following graphs show the relative size per entity type. As expected the Hub and corresponding Satellites are smaller as they only contain a single Business Key. Links and Link Satellites are a bit larger, since by definition more than one business key is present.
This is only a small sample of course, but looking at the systems I’ve developed there would not be too many cases where the Natural Business Key would be bigger than the standard values of the hash key. As a result, I now think the Natural Business Key would have an overall net performance benefit. This may differ from your experience, or maybe there are cases where there are just really large values to concatenate which probably defeats the purpose.
But my gut feel is that in most cases, this may actually be a more efficient way of setting up the Data Vault. I can’t really fault it.
Final thoughts – options and considerations as always
In SQL Server terms, the hash key approach benefits from a fill factor with sufficient head room to keep index fragmentation in check. A Natural Business Key solution would need this as well, as it is still largely non-sequential – although a bit less so than the hash equivalent (and obviously more so than the sequence ID). The Natural Business Key does allow for a bit of planning around this, as you can leverage the partially sequential nature of some business keys (scenario permitting). If you have to concatenate multiple attributes, you can use this to your advantage by placing the more volatile attributes later in the concatenation order.
I haven’t specifically considered Massively Parallel Processing (MPP) solutions, where the hash key could make for a good distribution attribute across nodes. And this is OK, because as always (and I’m probably sounding like a broken record here), if you have adopted the Persistent Staging Area concept (why wouldn’t you?) you have the option to change your mind later – worst case.
Having said that, I’m optimistic the Natural Business Key option can be considered in MPP environments as well. In the end, we’re looking into options and considerations here and we as specialists need to take our platform’s strengths and weaknesses into account to define the right solution – as always. Maybe we now have more alternatives to the key distribution concepts, which we could apply where it makes sense.
I can even imagine a mix where in some cases hashing works and in some cases the Natural Business Key works. It’s worth thinking about and using our experience to make the right decision for the given scenario.
I would like to take this a bit further by looking at some production models and incorporating things like the usage of query paths as well in order to step away from relying on an average as a performance indicator. But this is the last stop on the hobby train before we reach obsession, so I will get off here and return to the day job. Next time.
Update – incorporation of feedback
The topic has raised some discussions, and I have added the main themes below. An interesting consideration to add here is that it seems to be a common experience that it is fairly easy to get information into a Data Vault (although Clustered Indexes and hashes cause issues as outlined above – the scalability issues are evident), but it is sometimes problematic to get information out of it again. Indexing strategies may need to be geared towards the latter, and in some deployments this is why people have reverted back to using integer values. In this case you would take a hit in terms of loading dependencies, but querying becomes easier again.
I personally am still a big advocate of the deterministic nature of both hash keys and Natural Business Keys (virtualisation!), so my 2 cents would be to adopt one of these. Having said that, there are many other considerations to make, and the main ones (the feedback) are listed here:
- Apparently various teams are already either implementing or investigating the Natural Business Key approach, and there is even a ‘mixed-mode’ deployment.
- A number of people are using other storage types for hash keys. It is worthwhile to consider storing the hash key in a binary / raw format. The binary format will effectively halve the required storage space (e.g. 16 bytes for MD5). If using hash keys this would be a very worthwhile consideration. I’ve looked into GUIDs as well, but felt the savings in storage did not outweigh the added complexities when joining to other platforms (e.g. the required conversion).
- There seems to be a feeling that while the few very big Data Warehouse solutions (the multi-petabyte ones) benefit from adopting hash keys, the many smaller deployments out there benefit from a simpler approach such as the Natural Business Key. Opinions strongly vary on this, as some of the largest systems have reverted to a seeded value as the Data Warehouse key. The key consideration is to spend a good amount of thought on these implications upfront – and keep sharing experiences with the community!
- Using a Natural Business Key can potentially be mixed with other techniques, such as using hash equivalents for the few very large business keys (the outliers) to get the best performance mix. Opinions vary on this, as this introduces another design decision to manage but on the other hand can be automated as well.
- Dan Linstedt pointed out that the definition of a Hub does not prescribe using hash keys, a matter of separating concepts from physical design. This supports the message in this post that the various viewpoints should therefore be considered as options in the technical architecture.
- Just in case it wasn’t clear enough in this article: Natural Business Keys would be a single attribute containing the concatenated values of the business key attributes including the sanding value, handling of character sets and data type conversions (including date / time). They need to be Unicode. This is 100% the same as the preparation work required for hashing the business key (i.e. to create the hash key), only without the actual hashing algorithm applied.
- I’d like to point to an article posted late last year by Rogier Werschkull that touches on the same topics and provides some additional perspective: https://www.linkedin.com/pulse/hash-keys-enterprise-datawarehousing-pitfalls-rogier-werschkull. These discussions and other articles hopefully contribute further to the growing knowledge base.
- When I talk about hash keys I don’t include hash diffs (the hashed value across attributes used for change comparison). In my view hash diffs are (still) extremely useful to be able to support a single comparison as opposed to a multi-column comparison. While we’re talking about hash diffs; it was brought up that the diff doesn’t preserve the attribute order. This is true of course, just pointing it out here.
- In cases where encryption is a requirement, hashing is the default choice. It’s why it’s there in the first place. It may be a requirement in case certain core data sets (key sets such as Hubs) need to be stored in a central location whereas the ‘real’ data needs to be localised.
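To illustrate the binary storage point raised in the feedback above, a quick sketch: the raw digest takes half the bytes of its hexadecimal representation.

```python
import hashlib

digest = hashlib.md5("PLN001|A".encode("utf-8"))
hex_key = digest.hexdigest()  # 32-character string, as typically stored in a char column
raw_key = digest.digest()     # the same hash as 16 bytes, for a binary / raw column

print(len(hex_key), len(raw_key))  # -> 32 16
```

The trade-off is that the binary value is harder to eyeball and may need conversion when joining across platforms, much like the GUID case mentioned above.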
In the end, all of this is still being investigated. Hopefully this information will assist in making a balanced decision about the right technical architecture fit for your use-case.