Unknown keys (zero keys or ghost keys) in Hubs for DV2.0

I am still working towards capturing the generation (using BIML in SSIS) and virtualisation (using views / SQL) of the Presentation Layer (in a Dimensional Model). But before we get there, some topics need to be addressed first. One of these is the requirement to have ‘unknown’ keys available in the Hubs. Thankfully, this is one of the easiest concepts to implement.

The basic idea is that you create a dummy record in the Hub which represents the unknown (NULL or ‘empty’) business key.

Why do we need this at all? It’s primarily to make sure that referential integrity is intact when Link records are inserted where one of the Business Keys is NULL. The fact that some of the Hub keys in a Link are NULL does not mean you should exclude that record entirely; it just means that, at that point in time, the information is incomplete for some reason. And we need to be able to handle 100% of the data 100% of the time.

There are various reasons this can happen. Some examples:

  • The attribute that has been identified as the Business Key is declared as an optional Foreign Key in the source system. For instance, when ‘XYZ’ is true the business key is populated; otherwise it remains NULL
  • The source system declares the attribute as mandatory, but this declaration is broken or not adequately enforced (there is an error in the source application that allows NULLs when it shouldn’t)

As an example, you can imagine there is a Link relationship between Customer, Product and Promotion. If this is modelled as a 3-way Link there may be occasions where Customer and Product are known, but Promotion is NULL.

With the unknown key in place in the Promotion Hub you will be able to confirm that the ‘committing’ of the Staging Area delta to the Data Vault is complete, as the final check of a Batch is the validation of Referential Integrity before the delta is released. Obviously, this also applies when you run periodic Referential Integrity checks across the complete Data Vault (or parts of it).
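
To make this concrete: below is a minimal sketch of such a check, using hypothetical table and column names for the Customer/Product/Promotion example above. Any rows returned are Link records whose Promotion key has no matching Hub record.

-- Referential Integrity check (sketch, hypothetical names): any rows
-- returned are Link records without a corresponding Hub record.
SELECT l.PROMOTION_HSH
FROM LNK_CUSTOMER_PRODUCT_PROMOTION l
LEFT OUTER JOIN HUB_PROMOTION h ON h.PROMOTION_HSH = l.PROMOTION_HSH
WHERE h.PROMOTION_HSH IS NULL;

With the unknown key record in place, a NULL Promotion key hashes to the unknown Hub key and this check passes.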

The underlying idea behind all this is that you can only ‘close’ an ETL run once all received data delta (in the Staging Area) has been applied to all relevant Integration Layer tables. As outlined earlier, each source data set is a self-contained unit that loads into all tables, to ensure independent and parallel execution. As such, the last step in the workflow (a Batch in my terminology) is to make sure that for that source set all created Hub, Link and Satellite information is consistent.

Interestingly, in my view (and some will disagree) adding Hub unknown keys does not mean that you can rely on inner joins to combine Hubs and Links together. I always use outer joins, primarily to handle potential timing issues related to parallel execution. The fact that you can run everything in parallel may mean that at certain points in time there are brief inconsistencies – which are soon rectified by the corresponding Batches. But my point is you don’t want to lose records when you just happen to execute your query in that brief interval.
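
As an illustration, querying the hypothetical 3-way Link together with its Hubs would look something like this sketch (again with assumed names):

-- Outer joins from the Link to its Hubs, so no records are lost if a Hub
-- record has not landed yet during parallel execution (hypothetical names).
SELECT l.*, hc.CUSTOMER_ID, hp.PRODUCT_CODE, hpr.PROMOTION_CODE
FROM LNK_CUSTOMER_PRODUCT_PROMOTION l
LEFT OUTER JOIN HUB_CUSTOMER hc ON hc.CUSTOMER_HSH = l.CUSTOMER_HSH
LEFT OUTER JOIN HUB_PRODUCT hp ON hp.PRODUCT_HSH = l.PRODUCT_HSH
LEFT OUTER JOIN HUB_PROMOTION hpr ON hpr.PROMOTION_HSH = l.PROMOTION_HSH;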

The decision to handle NULL values is considered a ‘hard’ business rule, which is implemented on the way into the Data Vault.

So how to do this? It’s very easy whichever way you choose, but it’s important to note that in my designs the Business Key attribute is always NVARCHAR (unicode text). This is to further future-proof the Data Warehouse solution: you don’t know how the next system will handle the business key…
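
For reference, a minimal Hub structure following these conventions could look like the sketch below; the table and column names are hypothetical, not a prescribed standard.

-- Hypothetical Hub table; the business key is NVARCHAR to remain
-- future-proof across source systems.
CREATE TABLE HUB_PROMOTION (
  PROMOTION_HSH CHAR(32) NOT NULL PRIMARY KEY, -- MD5 hash of the business key
  PROMOTION_CODE NVARCHAR(100) NOT NULL, -- the (unicode) business key
  LOAD_DATETIME DATETIME2 NOT NULL,
  RECORD_SOURCE VARCHAR(100) NOT NULL
);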

This makes the implementation of the unknown Hub key extremely easy. I use the generated hash key for NULL as the Hub Key, as this is already handled by the hashing built into the ETL generation templates.

The fun part is that this key is the same for all Hubs! So all you need to do is run a script that iterates through your Hub tables and inserts what is effectively the same hard-coded record. In my approach I create this record with Hub Key ‘48C436D1FBE939CB4E2732836808CE90’ and Business Key ‘Unknown’, but it’s up to you how you want to handle this as long as it’s consistent.
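
For a single Hub this boils down to something like the sketch below, reusing the hypothetical HUB_PROMOTION structure from earlier; the iteration over all Hub tables can then be scripted or generated.

-- Idempotent insert of the unknown key record (hypothetical names).
INSERT INTO HUB_PROMOTION (PROMOTION_HSH, PROMOTION_CODE, LOAD_DATETIME, RECORD_SOURCE)
SELECT '48C436D1FBE939CB4E2732836808CE90', N'Unknown', SYSDATETIME(), 'Data Warehouse'
WHERE NOT EXISTS (
  SELECT 1 FROM HUB_PROMOTION
  WHERE PROMOTION_HSH = '48C436D1FBE939CB4E2732836808CE90'
);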

How did I get to that hash? By putting NULL into the hashing logic, as that’s what the generation logic does automatically:

-- Replicate the ETL hashing for a NULL business key: NULL is replaced
-- by 'NA', the delimiter '|' is appended and the MD5 hash is rendered
-- as a 32-character hexadecimal string.
SELECT
  CONVERT(CHAR(32), HASHBYTES('MD5',
    ISNULL(RTRIM(CONVERT(NVARCHAR(100), NULL)), 'NA') + '|'), 2)
  AS EXAMPLE;

As always, there are options and considerations:

  • You can opt to include this logic as part of your Hub ETL. I used to do this (briefly), but don’t anymore as leaving it out improves performance a bit. It is a one-off process which is only required when you deploy a new Hub (or truncate the table)
  • In the fully virtualised Data Warehouse this is easily implemented as an additional single-record UNION to the Hub SQL, as shown in the sketch below
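
A minimal sketch of such a virtualised Hub follows, assuming a hypothetical STG_PROMOTION staging table; the hashing expression mirrors the one shown above.

-- Virtualised Hub with the unknown key appended as a single-record UNION
-- (hypothetical table and column names).
CREATE VIEW HUB_PROMOTION AS
SELECT
  CONVERT(CHAR(32), HASHBYTES('MD5',
    ISNULL(RTRIM(CONVERT(NVARCHAR(100), PROMOTION_CODE)), 'NA') + '|'), 2) AS PROMOTION_HSH,
  PROMOTION_CODE
FROM STG_PROMOTION
WHERE PROMOTION_CODE IS NOT NULL
UNION
SELECT '48C436D1FBE939CB4E2732836808CE90', N'Unknown';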

Lastly, there is potentially more to this than only a NULL or unknown key. There is actually a taxonomy that can be defined to handle various ‘flavours of unknown’, for example (captured as a reference table in the sketch after this list):

  1. Missing (-1 key) as a root node / supertype of all ‘missing’ information. This is the typical unknown key, but may encompass:
    1. Missing values (-2 key): parent of all missing values. This can be ‘Unknown’ or ‘Not Applicable’:
      1. Not Applicable (-3 key)
      2. Unknown (-4 key)
    2. Missing attribute (-5 key): parent of all missing values due to columns not being there (yet). This can for instance be split into:
      1. Missing source attribute (-6 key), which can be used when the source fails to supply the attribute/column
      2. Missing target attribute (-7 key), which may be used for time-variant information presented before the availability of the attribute
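
As an illustration, this taxonomy could be captured in a self-referencing reference table along the lines of the sketch below; the table and column names are hypothetical.

-- Hypothetical reference table for the ‘flavours of unknown’ taxonomy.
CREATE TABLE REF_MISSING_TYPE (
  MISSING_KEY INT NOT NULL PRIMARY KEY,
  PARENT_MISSING_KEY INT NULL REFERENCES REF_MISSING_TYPE (MISSING_KEY),
  MISSING_TYPE_DESCRIPTION NVARCHAR(100) NOT NULL
);

INSERT INTO REF_MISSING_TYPE VALUES
  (-1, NULL, N'Missing'),
  (-2, -1, N'Missing values'),
  (-3, -2, N'Not Applicable'),
  (-4, -2, N'Unknown'),
  (-5, -1, N'Missing attribute'),
  (-6, -5, N'Missing source attribute'),
  (-7, -5, N'Missing target attribute');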

Deciding between the various more detailed types of ‘unknown’ is a business question, answered by looking at how the source system is designed, and it falls in the category of ‘hard business rules’: one of the few examples of these in Data Vault methodology.

 
Roelant Vos