Using (and moving to) raw data types for hash keys

Making hash keys smaller

A few months ago I posted an article explaining the merits of the ‘natural business key’, which can make sense in certain situations, and why, from a more generic perspective, this is something the Data Warehouse management system (‘the engine’) should be able to figure out automatically and change on the fly when required.

That article used the common approach of storing hash values in character fields (i.e. CHAR(32) for MD5 and CHAR(40) for SHA-1), and the main feedback has been that this is not the most efficient way to store hash values in a typical RDBMS. In other words, comparing a 32-byte character hash key with an integer natural business key is not entirely fair.

Instead, if the output of the hashing algorithm is stored in its raw (binary) form, only half the bytes are required to achieve the same outcome. This is important to be aware of, as it is likely to influence the decision, whether made by the designer or by the ‘engine’ algorithm, between hash keys and natural business keys.

More generally, adopting binary data types instead of character types for hash values is a significant improvement in Data Warehouse design, because you effectively halve the storage on disk and, with it, the I/O required to read the same data!
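
A quick way to see the difference is to compare the DATALENGTH of both representations (using the same example value as later in this article):

SELECT
  DATALENGTH(CONVERT(CHAR(32), HASHBYTES('MD5','Roelant'), 2)) AS CharacterBytes, -- 32
  DATALENGTH(HASHBYTES('MD5','Roelant')) AS BinaryBytes                           -- 16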

Typical storage behaviours

In a typical ‘by the book’ Data Vault solution (in my experience at least) the CHAR(32) hash keys amount to roughly half of the total volume of the complete solution, with Links being among the biggest tables. From a physical point of view the Link tables consist almost entirely of hash keys: the Link hash key and the individual Hub hash keys.
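
For illustration, a Link between two hypothetical Hubs could look as follows when binary hash keys are used; the table and column names are made up for this example:

CREATE TABLE dbo.LNK_CUSTOMER_ORDER (
    CUSTOMER_ORDER_HSH BINARY(16)   NOT NULL PRIMARY KEY, -- the Link hash key
    CUSTOMER_HSH       BINARY(16)   NOT NULL,             -- Hub hash key
    ORDER_HSH          BINARY(16)   NOT NULL,             -- Hub hash key
    LOAD_DATETIME      DATETIME2(7) NOT NULL,
    RECORD_SOURCE      VARCHAR(100) NOT NULL
)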

These tables are often larger than Satellites in terms of storage space. Satellites may contain more rows by virtue of the historisation, but they also compress better because of the denormalisation / redundancy. This of course assumes some form of compression can be used, which depends on the available technical configuration and licensing. Most people would agree that compression is a high-impact improvement in any Data Warehouse environment, considering the CPU overhead versus the I/O gains, so for these investigations compression has been enabled.
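
As a sketch, enabling page compression on an existing (hypothetical) Satellite looks like this in SQL Server; the expected savings can be estimated up front with the standard sp_estimate_data_compression_savings procedure:

-- Estimate the expected savings first
EXEC sp_estimate_data_compression_savings
    @schema_name = 'dbo',
    @object_name = 'SAT_CUSTOMER', -- hypothetical Satellite table
    @index_id = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE'

-- Rebuild the table with page compression enabled
ALTER TABLE dbo.SAT_CUSTOMER REBUILD WITH (DATA_COMPRESSION = PAGE)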

These examples were created in a SQL Server environment, so your mileage may vary. Hopefully the above helps to paint the picture of how these technical decisions relate to each other. In any case there is a solid business case for switching to binary / raw storage of hash keys, and I recommend adopting this if you haven’t done so already.

Raw and character hash keys

First, let’s look at the classic / typical way to generate a hash key in SQL Server:

CONVERT(CHAR(32),HASHBYTES('MD5',ISNULL(RTRIM(CONVERT(NVARCHAR(100),[<Business Key>])),'N/A')+'|'),2)

What happens here is that the Business Key value is converted to a text type, hashed into a binary value, and lastly converted to character (CHAR(32)). The ‘2’ convert style renders the binary value as hexadecimal characters without the ‘0x’ prefix that style ‘1’ would include.
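
The effect of the style parameter can be seen by comparing styles 1 and 2 side by side:

SELECT CONVERT(CHAR(34), HASHBYTES('MD5','Roelant'), 1) -- 0x70DACD9612153FD11894B1F07F454217
SELECT CONVERT(CHAR(32), HASHBYTES('MD5','Roelant'), 2) -- 70DACD9612153FD11894B1F07F454217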

For example if we run the following code:

SELECT CONVERT(CHAR(32),HASHBYTES('MD5',ISNULL(RTRIM(CONVERT(NVARCHAR(100),'Roelant')),'N/A')+'|'),2)

The result would be C65BE84D225435AB4667DD5E8172DEDB.

If we run similar code, but without the conversion to CHAR(32), it looks as follows:

SELECT HASHBYTES('MD5',ISNULL(RTRIM(CONVERT(NVARCHAR(100),'Roelant')),'N/A')+'|')

This has the output 0xC65BE84D225435AB4667DD5E8172DEDB. Notice the difference: the leading ‘0x’ characters, which indicate that this is a binary value.

It is perfectly doable to switch between the character and binary representations of the hash key in case you want to move towards a binary approach. Just convert the value back to binary using the same style that was used before:

SELECT CONVERT(BINARY(16),'C65BE84D225435AB4667DD5E8172DEDB',2)

This will return the binary value that can be stored in the corresponding BINARY(16) data type: 0xC65BE84D225435AB4667DD5E8172DEDB.

As you can see, the character and binary types can be switched out, which is good news for existing deployments. In most cases (except for virtualised solutions) this will require a rebuild of the ETL (to remove the CHAR conversion) and some scripts to apply the above changes to all tables, but this is easy enough to develop.
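
The migration script itself can be a simple column swap per table. A minimal sketch for one (hypothetical) Hub is shown below; in practice any keys, indexes and relationships on the hash key column would need to be re-created as well:

-- Add the new binary column and convert the existing character values (style 2)
ALTER TABLE dbo.HUB_CUSTOMER ADD CUSTOMER_HSH_BIN BINARY(16) NULL
GO
UPDATE dbo.HUB_CUSTOMER
SET CUSTOMER_HSH_BIN = CONVERT(BINARY(16), CUSTOMER_HSH, 2)
GO
-- After verification, drop the CHAR(32) column and rename the binary one
ALTER TABLE dbo.HUB_CUSTOMER DROP COLUMN CUSTOMER_HSH
GO
EXEC sp_rename 'dbo.HUB_CUSTOMER.CUSTOMER_HSH_BIN', 'CUSTOMER_HSH', 'COLUMN'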

Final thoughts

The ideal data types for hash keys are based on the deterministic outcomes of the selected hashing algorithm:

  • 16 bytes for MD5 (and the earlier MD versions) – 32 characters as a string, i.e. CHAR(32)
  • 20 bytes for SHA-1 / SHA – 40 characters as a string, i.e. CHAR(40)
  • 32 bytes for SHA-2 (SHA2_256 in SQL Server) – 64 characters as a string, i.e. CHAR(64)

You can easily check this by running a simple query such as this:

SELECT HASHBYTES('MD5', 'Roelant')

The hash representation is visible as 0x70DACD9612153FD11894B1F07F454217. If you then look at the storage requirement you can see this only takes up 16 bytes:

SELECT DATALENGTH(HASHBYTES('MD5', 'Roelant'))

This shows that the data storage is indeed 16 bytes. If we were to store this hash value as character we would store the text without the leading ‘0x’, which is again 32 characters.
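
The same check confirms the lengths for the other algorithms that HASHBYTES supports:

SELECT DATALENGTH(HASHBYTES('SHA1',     'Roelant')) -- 20 bytes
SELECT DATALENGTH(HASHBYTES('SHA2_256', 'Roelant')) -- 32 bytes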

In both the binary and the character configuration there is no need to use a variable-length type. Hash outputs are deterministic and will not vary in length, and avoiding the var types saves you two bytes of overhead per value. All these things add up in Data Warehouse environments, so spending some time to further investigate this is a worthwhile investment.
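
To make this concrete, a (hypothetical) Hub using the fixed-length binary type would be declared along these lines:

CREATE TABLE dbo.HUB_CUSTOMER (
    CUSTOMER_HSH    BINARY(16)    NOT NULL PRIMARY KEY, -- fixed length, not VARBINARY(16)
    CUSTOMER_ID     NVARCHAR(100) NOT NULL,             -- the natural business key
    LOAD_DATETIME   DATETIME2(7)  NOT NULL,
    RECORD_SOURCE   VARCHAR(100)  NOT NULL
)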

Roelant Vos


3 Responses

  1. zsombor says:

    I like this approach a lot, Roelant, using the binary type. I do it similarly: I do the hash computation in the schema of the staging table, as a non-persisted computed column for both the keys and the diffs. It is easier to manage than in the ETL.

    My key formula looks like this:
    HashDiff AS CAST(
    HASHBYTES('SHA1',
    COALESCE(CAST(Column1 AS VARBINARY(4)), 0x0) + 0x00
    + COALESCE(CAST(Column2 AS VARBINARY(3)), 0x0) + 0x00
    ) AS BINARY(20))
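
    As a sketch of how this could look in a staging table definition (the table and column names, and the VARBINARY lengths, are placeholders rather than the commenter's actual code):

    CREATE TABLE stg.CUSTOMER (
        Column1  INT           NULL,
        Column2  NVARCHAR(100) NULL,
        HashDiff AS CAST(
            HASHBYTES('SHA1',
            COALESCE(CAST(Column1 AS VARBINARY(4)), 0x0) + 0x00
            + COALESCE(CAST(Column2 AS VARBINARY(200)), 0x0) + 0x00
            ) AS BINARY(20)) -- non-persisted computed column, as described above
    )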

  2. Another tip for using hashes in a DV architecture:
    I have designed a scalar function as part of my generation framework that uses the metadata of a satellite table to automatically generate a HASHBYTES clause with the required converts and delimiters. This clause is then injected into and used by generated stored procedures that take care of processing and end-dating satellite rows.
    Another benefit of having this separate scalar function is that it can also be used to update the hash of active satellite rows whenever you add a new column to a satellite that already exists in production. Adding a new column to an existing satellite will obviously invalidate the current hash, and would therefore cause all active rows to be end-dated during the next run if the hash value is not corrected.
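
    A minimal sketch of what such a metadata-driven helper might look like (this is an illustration, not the commenter's actual code, and it assumes SQL Server 2017+ for STRING_AGG):

    CREATE FUNCTION dbo.GetHashExpression (@SchemaName SYSNAME, @TableName SYSNAME)
    RETURNS NVARCHAR(MAX)
    AS
    BEGIN
        DECLARE @Columns NVARCHAR(MAX)

        -- Build the delimited, sanitised column list from the table metadata
        SELECT @Columns = STRING_AGG(
            CAST('ISNULL(RTRIM(CONVERT(NVARCHAR(100),' + QUOTENAME(COLUMN_NAME) + ')),''N/A'')' AS NVARCHAR(MAX)),
            '+''|''+') WITHIN GROUP (ORDER BY ORDINAL_POSITION)
        FROM INFORMATION_SCHEMA.COLUMNS
        WHERE TABLE_SCHEMA = @SchemaName AND TABLE_NAME = @TableName

        -- Return the complete HASHBYTES clause, ready to be injected into generated code
        RETURN 'HASHBYTES(''SHA1'',' + @Columns + '+''|'')'
    END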

