Data Vault ETL Implementation using SSIS: Step 3 – Hub ETL – part 2 – SQL selection

The outline ETL for Data Vault Hub style ETL using SSIS has been documented in ‘Hub implementation’ and ‘Hub exceptions’. As a minor addition I would like to dive a bit deeper into the selection logic associated with Hub ETL. Following the experiences ‘from the trenches’ there are basically two ways to create SQL for a Hub selection from the Staging Area;

  1. Straight-up selection of the business key
  2. Selection including a composite business key (e.g. assembled from two or more source attributes)

Conceptually (and technically) the differences are only slight, but it does effect how this type of ETL can be automated because the composite business key SQL is slightly more complex.

To select a normal business key in the easiest scenario (one), I use the following:

SELECT
<Business Key - source> AS <Business Key - Hub> ,
<Record Source>,
MIN(Event Date/Time) AS INSERT_DATETIME
FROM <Staging Area table>
WHERE <Business Key - source> IS NOT NULL
GROUP BY
<Business Key - source>,
<Record Source>

Why is there no DISTINCT statement? This is because the GROUP BY already acts as a distinct selection by nature of the grouping. The Record Source indication is only occurring in one type / value for every staging data set so this does not impact the distinct/grouping. The NULL filtering is done explicitly; there is no use for empty business keys in a Hub table as there is already a placeholder value available.

The minimum Event Date/Time is selected because conceptually this is the first time a Hub entry is presented to the Data Warehouse. A single Staging Area data set may contain many business key values, but if we have to select one we will select the one with the lowest event date/time.

Lastly, I have used the AS statement to name the source attribute (the attribute name of the business key in the Staging Area data set) as the (target) Hub central business key name. The primary reason is that this makes the lookup easier later on, but it also is consistent if the business key requires to be composed from more than one source attributes.

If the business is composite as in scenario two, the SQL logic looks like this:

SELECT
ISNULL(<Business Key source attribute>, '')+ISNULL(<Business Key source attribute>, '') AS <Business Key>,
<Record Source>,
MIN(Event Date/Time) AS INSERT_DATETIME
FROM <Staging Area table>
WHERE ISNULL(<Business Key source attribute>, '')+ISNULL(<Business Key source attribute>, '') != ''
GROUP BY
ISNULL(<Business Key source attribute>, '')+ISNULL(<Business Key source attribute>, ''),
<Record Source>

In this scenario it becomes clear why the AS statement is used, as the target Business Key is always different that the source attributes! The logic is also slightly different, but achieves the same result: a distinct selection of business keys. The only difference is the handling of NULL values as the business key may still be valid if one (or more) of the source attributes are NULL, as long as the combination isn’t.

Needless to say, this applies to all scenarios where business keys needs to be prepared for Hub lookups; so not only for Hubs but also in Satellites, Links and Link-Satellites.

 
Roelant Vos

Roelant Vos

You may also like...

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.