Designing reference tables for the Data Warehouse

In a typical Data Warehouse it is common to introduce additional descriptive information that is not provided by the operational systems feeding data into the Data Warehouse. However, the exact positioning and implementation of this reference data can cause confusion. Especially when applying this concept in the Integration Layer.

Reference data is additional contextual or descriptive information that is not provided by the source system. Examples are descriptions for industry standard codes. This information can be provided through an external source or as user-managed data. Most documentation regarding reference data positions it as a separate entity which sits in a core EDW layer (Integration Layer), but is not really part of the design.

In these descriptions the reference table does not follow normal design standards and can be viewed as a (logical) foreign key to historical information. For example, if a specific code is present in tables containing history a reference table can be added to provide a description for that code. In this example the code is a character field that acts logically as a Foreign Key. Resolving this code to an identifier is optional but does not fundamentally change the concept.

In some documentation this reference table is a non-historical (Type 1) table meaning that when a description (or any other attribute part of the reference) changes it will change directly for all instances of the Foreign Key it relates to. Because of the Foreign Key role these changes are applied for the historical records as well.

It is also possible to store reference data as a historical reference table (Type 2) so that changes are tracked over time. When using this approach the most sensible thing to do is to follow the same design guidelines as the rest of the Data Warehouse. For instance if the core Data Warehouse layer is modelled using Data Vault the same structure can be used to store reference data as well (using Link intersection entities to function as the Foreign Key role).

This seems like a much more sensible approach towards managing this information since a Type-1 view of the reference data can be configured for the Presentation Layer at any time. For end users this has the same effect as changing a value in a non-historical table (i.e. they only see the most recent value) but for the Data Warehouse there is the additional value of having more presentation options and information about the (changing) nature of the reference data while also staying conform to the Data Warehouse modeling approach.

In the end it would even be possible to change the configuration to a non-historical reference table and back again since the raw archive of changes will always be available in the History Area (if implemented). With this in mind there does not seem to be a good reason to store reference data as non-historical (Type 1) in the Integration Layer.

 
Roelant Vos

Roelant Vos

You may also like...

1 Response

  1. March 28, 2016

    […] to make is how to model this to begin with. Does it make sense to model out the codes as its own (unconnected / reference) Hub / Satellite? There are a couple of advantages to […]

     

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.