Why a separate surrogate key table?
A question I often (and recently) get is why you should create a separate table to manage the surrogate keys. This is compared to star- or snowflaked models where the surrogate key (and distribution) is part of the table. For Data Vault the Hub entity is the place where the keys are managed. There is rationale for splitting this process into separate tables:
- A surrogate key table makes it easy to identify the records in the different tables. A generated DWH ID is true for a particular source item or element and this is always true in all the DWH tables.
- Splitting the processes makes it easy to generate the corresponding ETL processes.
- Future proof: a separate surroagate key entity can adequatly handle the redistribution of source keys.
A DWH generated ID in the Hub always points to the same record. In a star or snowflake every change will lead to a new surrogate key; the key and the contents are unlreated. Creating a similar mechanism to the Hub in a snowflake is not recommended because you would always need to use the natural key to determine the correct record. A Hub handles this only ones; when the key is first encountered in the datawarehouse.
There are benefits when it comes to developing because of the extra options for ETL generation. The more you split things up the more generic the steps will become and the easier it will be to generate the processes. A separate SK table is the easiest to generate: you only need to know the source key which is easy: it’s the PK in the source table and you know all you need to generate your ETL.
Using a separate SK table also has other benefits. For instance when source systems reissue their keys. Validity dates and indicators are usually added to the SK table, this is because source systems may reuse their own keys after a while. Depending on the business this could mean that the new key will actually be something completely different and therefore should not be mixed with the history but rather start tracking its own history. An example would be the re-use of bank account numbers. A bank account number (logical key) will be created as a record in the SK table. At some point the bank account is closed and a logical delete is received which closes the record (i.e. sets the end date to now and the actual flag to ‘N’). If the same customer decides to re-open the account this will result in the hub record being opened again. It’s still the same instance. But, after some time (2 years?) the bank account number may be reissued to a different customer. From that point on, any data regarding that bank account will have to lead to a different DWH_ID. As a result you will have the same <source>_ID twice with a different DWH_ID (while meaning something different).