Generating a Data Vault Hub using standard SQL
This is the first post in the ‘simple ETL generation series‘, where we apply standard metadata and technology to speed up Data Warehouse development.
In this post we’ll start with generation the Extract, Transform and Load (ETL) process for the simplest version of a Data Vault Hub entity.
As a reminder, the pattern for loading a Hub is to:
- Select the distinct Business Key(s)
- Select the minimum Load Date / Time Stamp (LDTS) from the source (Staging Area of Persistent Staging Area)
- Optionally, generate a hash of the Business Key (this is the Data Vault standard – but I personally prefer to use Natural Business Keys if possible)
- Check (look up) if the Business Key (or hashed equivalent) already exists. Insert if this is not the case, or discard the record(set) if it does
This is a relatively simplified view of the Hub ETL process and considerations that are relevant, but it will help us to quickly generate a solid ETL process to work with for now.
For example, as part of a Hub loading process we need to be able to handle (or make decisions on) composite keys, concatenated keys, hard-coded key components and potentially data type conversion.
However, for the purposes of explaining the concepts I’ll start with the most straightforward approach.
What metadata do I need?
To generate a Hub ETL process we need to understand the relationship (mapping) between the source table and the Hub. We also require the definition of the Business Key as it is captured in the source and the name of the corresponding target attributes.
In other words: a mapping of the table and key attributes. That’s it!
The TEAM interface provides this information via the interface view (or corresponding JSON file) INTERFACE_SOURCE_HUB_XREF. You can use this interface straight away if you have installed TEAM and generated the sample data.
For ease of demonstration I have also included this sample data as a DDL/DML file here.
The example we’re using is displayed below. This is the result of running the following query:
WHERE [HUB_NAME] IN ('HUB_CUSTOMER', 'HUB_INCENTIVE_OFFER', 'HUB_SEGMENT') -- The simplest examples (i.e. no complex keys)
The ETL generation using this metadata should deliver us six individual ETL processes in SQL – one for each row in the metadata.
For the ETL generation process these four key metadata elements are stored in local variables:
- <targetTable>; the name of the target Hub table
- <targetBusinessKey>; the name of the the target Hub business key attribute
- <sourceTable>; the name of the source table
- <sourceBusinessKey>; the definition (attributes) of the business key as it is known in the source table
Getting started with the Hub pattern
We can automate the generation of Hub ETL processes by creating a loop (cursor in SQL) that iterates over the metadata set and applies the information retrieved from each row to the intended target Hub ETL template / pattern.
During this process way various evaluations need to be done, and depending on individual preferences different parameters and variables can be used for this. Examples of required parameters are the various standard attribute names (i.e. Load Date / Time Stamp attribute name) and names of used databases and schemas.
The TEAM application is able to manage and configure these from within the metadata management environment, but again for ease of explanation I have added these as hard-coded parameters to the example SQL file.
Specifically, to generate Hub ETL processes the following parameters are used:
Parameters / constants:
- <targetDatabase> the target database name
- <targetSchema>; the target schema name
- <sourceDatabase>; the source database name
- <sourceSchema>; the source schema name
- <ldtsAttribute>; the name of the Load Date / Time attribute in the target Hub table
- <recordSource>; the name of the Record Source attribute
- <etlProcessId>; the name of the ETL Process Control attribute in the target Hub table. This is a placeholder to hook in an ETL control framework (every solution needs one!)
Additional parameters can be used, of course, but the demonstration is created with the above and uses conventions for others. Conventions in this context simply are standard ways in which certain things are always done or handled. Conventions are applied by code and don’t require specific metadata or parameters.
An example of a convention that can be used in the ETL generation process is deriving the Hash Key from the name of the corresponding Hub table. The convention here is that the name of the Hash Key in the model is always the same as the table name minus the table prefix ‘Hub_’ and with the key suffix ‘_HSH’ added. In other words; if we know the Hub table name we can infer the name of the Hash Key by virtue of the convention.
Some people argue that adopting conventions is a bit of a subversive way to enforce opinions about doing things a certain way (i.e. an opinionated solution). However, I hope that the examples show how easy it is to adjust patterns if and when required.
The TEAM application and pretty much any Data Warehouse Automation / ETL generation software suite allow for customisation of most parameters and conventions.
The defined input and conventions ultimately should generate SQL output that matches the Hub loading pattern. I have defined this target pattern in SQL as:
INSERT INTO <targetDatabase>.<targetSchema>.<targetTable>
) AS <targetHashKeyName>,
stg.<sourceBusinessKey> AS <targetBusinessKey>,
MIN(stg.<ldtsAttribute>) AS <ldtsAttribute>,
-1 AS <etlProcessId>,
FROM <sourceDatabase>.<sourceSchema>.<sourceTable> stg
LEFT OUTER JOIN <targetDatabase>.dbo.<targetTable> hub
ON stg.<sourceBusinessKey> = <targetBusinessKey>
WHERE stg.<sourceBusinessKey> IS NOT NULL
AND hub.<targetBusinessKey> IS NULL
There are various ways how the Hub pattern can be implemented, and this is one of them. Once we’ve setup the generation process this is easy to modify if and when required.
At this stage we have the inputs defined both in data and constants / parameters and we have defined the output we’re aiming for – so we can start looking at the generation process itself.
Defining the ETL generation process
To describe the ETL generation process itself I have used a UML Activity Diagram. This style of diagram shows the various steps involved and has clear start- and end states.
The Activity Diagram below shows the generation process in a simple way including the in- and outputs.
The first step is to pick up the next row from the metadata table (as long as there are rows left to process). The row is then evaluated for the local variables such as the source- and target table and key names.
The only real interpretation that is applies is the convention to derive the name of the Hash Key. This is done after the Hub table name is evaluated, by removing the ‘HUB_’ table prefix and adding a ‘_HSH’ suffix.
Another convention that is applied is that the Hub ETL process inherits the Load Date / Time Stamp (LDTS) from the Staging Layer – for maximum ability to refactor (the Virtual Data Warehouse).
The diagram can also be downloaded or edited here:
Running the Hub ETL generation process
The above process can be implemented in SQL as a cursor that iterates over the INTERFACE_SOURCE_HUB_XREF metadata row set. For each row the Hash Key is evaluated and the variables are added to the template, which is then presented back to the user.
The following SQL script can be used for this. It is basically a big loop to construct text and report this back to the user – a string builder.
DECLARE @targetDatabase VARCHAR(100) = '[DVI_200_Integration_Layer]';
DECLARE @targetSchema VARCHAR(100) = 'dbo'
DECLARE @sourceDatabase VARCHAR(100) = '[DVI_100_Staging_Area]';
DECLARE @sourceSchema VARCHAR(100) = 'dbo'
DECLARE @loadDateTimeAttribute VARCHAR(100) = 'LOAD_DATETIME'
DECLARE @etlProcessIdAttribute VARCHAR(100) = 'ETL_INSERT_RUN_ID'
DECLARE @recordSourceAttribute VARCHAR(100) = 'RECORD_SOURCE'
-- Variables / metadata (from the metadata database)
DECLARE @targetTable VARCHAR(100);
DECLARE @sourceTable VARCHAR(100);
DECLARE @targetBusinessKey VARCHAR(MAX);
DECLARE @sourceBusinessKey VARCHAR(MAX);
-- Variabels / local
DECLARE @pattern VARCHAR(MAX);
-- The complete selection / generated output
DECLARE @targetHashKeyName VARCHAR(100);
-- The derived name of the Hash Key
DECLARE hub_cursor CURSOR FOR
WHERE [HUB_NAME] IN ('HUB_CUSTOMER', 'HUB_INCENTIVE_OFFER')
-- The simplest examples (i.e. no complex keys)
FETCH NEXT FROM hub_cursor
INTO @targetTable, @targetBusinessKey, @sourceTable, @sourceBusinessKey
WHILE @@FETCH_STATUS = 0
SET @targetHashKeyName = REPLACE(@targetTable,'HUB_','') +'_HSH';
SET @pattern = '-- Working on mapping to ' + @targetTable + ' from source table ' + @sourceTable+CHAR(13)+CHAR(13);
SET @pattern = @pattern+'USE '+@sourceDatabase+CHAR(13)+CHAR(13);
SET @pattern = @pattern+'INSERT INTO '+@targetDatabase+'.'+@targetSchema+'.'+@targetTable+CHAR(13);
SET @pattern = @pattern+'('+@targetHashKeyName+', '+@targetBusinessKey+', '+@loadDateTimeAttribute+', '+@etlProcessIdAttribute+', '+@recordSourceAttribute+')'+CHAR(13);
SET @pattern = @pattern+'SELECT'+CHAR(13);
SET @pattern = @pattern+' HASHBYTES(''MD5'','+CHAR(13);
SET @pattern = @pattern+' ISNULL(RTRIM(CONVERT(NVARCHAR(100), stg.'+@sourceBusinessKey+')),''NA'')+''|'''+CHAR(13);
SET @pattern = @pattern+' ) AS '+@targetHashKeyName+','+CHAR(13);
SET @pattern = @pattern+' stg.'+@sourceBusinessKey+' AS '+@targetBusinessKey+','+CHAR(13);
SET @pattern = @pattern+' MIN(stg.'+@loadDateTimeAttribute+') AS '+@loadDateTimeAttribute+','+CHAR(13);
SET @pattern = @pattern+' -1 AS '+@etlProcessIdAttribute+','+CHAR(13);
SET @pattern = @pattern+' stg.'+@recordSourceAttribute+''+CHAR(13);
SET @pattern = @pattern+'FROM '+@sourceTable+' stg'+CHAR(13);
SET @pattern = @pattern+'LEFT OUTER JOIN '+@targetDatabase+'.dbo.'+@targetTable+' hub ON stg.'+@sourceBusinessKey+' = '+@targetBusinessKey+CHAR(13);
SET @pattern = @pattern+'WHERE stg.'+@sourceBusinessKey+ ' IS NOT NULL'+CHAR(13);
SET @pattern = @pattern+' AND hub.'+@targetBusinessKey+' IS NULL'+CHAR(13); SET @pattern = @pattern+'GROUP BY'+CHAR(13); SET @pattern = @pattern+' HASHBYTES(''MD5'','+CHAR(13);
SET @pattern = @pattern+' ISNULL(RTRIM(CONVERT(NVARCHAR(100),stg.'+@sourceBusinessKey+')),''NA'')+''|'''+CHAR(13); SET @pattern = @pattern+' ),'+CHAR(13);
SET @pattern = @pattern+' stg.'+@sourceBusinessKey+','+CHAR(13);
SET @pattern = @pattern+' stg.'+@recordSourceAttribute+'';
FETCH NEXT FROM hub_cursor
INTO @targetTable, @targetBusinessKey, @sourceTable, @sourceBusinessKey
The file can be downloaded from here as well:
The resulting SQL can be copy/pasted and executed straight away, which will load the target Hub tables from the designated sources. This can further automated with immediate execution of each output of course, which can be added to DevOps pipelines etc. etc. Once you start automating there is always more to do.
In any case, an example of the output is listed below:
-- Working on mapping to HUB_CUSTOMER from source table STG_PROFILER_CUSTOMER_OFFER
INSERT INTO [200_Integration_Layer].dbo.HUB_CUSTOMER
(CUSTOMER_HSH, CUSTOMER_ID, LOAD_DATETIME, ETL_INSERT_RUN_ID, RECORD_SOURCE)
) AS CUSTOMER_HSH,
stg.CustomerID AS CUSTOMER_ID,
MIN(stg.LOAD_DATETIME) AS LOAD_DATETIME,
-1 AS ETL_INSERT_RUN_ID,
FROM [100_Staging_Area].[dbo].[STG_PROFILER_CUSTOMER_OFFER] stg
LEFT OUTER JOIN [200_Integration_Layer].dbo.HUB_CUSTOMER hub
ON stg.CustomerID = CUSTOMER_ID
WHERE stg.CustomerID IS NOT NULL
AND hub.CUSTOMER_ID IS NULL
Later in this series we will add additional complexity to manage specific edge cases and investigate techniques such as template engines.
While the provided SQL cursor example is in a way a templating engine of sorts, a full coding templating engine is more flexible and can be configured to also generate ETL for many different output platforms, compilers and technologies (i.e. SSIS using Biml, Powercenter, SAS etc.).
Watch this space!