Staging Layer

The ‘architecture’ section of this weblog serves as an introduction to the Data Integration Framework Github repository (also see the collaboration section). This repository contains various sets of documentation that together constitute a reference Data Warehouse (DWH) architecture, data integration patterns and supporting metadata and control frameworks.

It contains detailed specifications and approaches to expand on the overview provided here.

Introduction

The main purpose of the Staging Layer is to load source data into the DWH environment for further processing (the process from source-to-staging). In other words, the Staging Layer is responsible for the physical movement of data from the source platform onto the DWH platform. While this sounds straightforward, it often introduces specific challenges due to the required connections / interfaces to other (feeding / source) systems.

In terms of structure, there is usually no major difference between the originating data format and the way it is stored in the Staging Layer. However, there are already some concepts built into this layer that are foundational to the rest of the system.

Examples of this are the definition of the correct event date/time, load date/time, delta selection and streamlining of the data types. No alterations / transformations to the data contents or its structure will be done.

The Staging Layer typically consists of two areas, an optional Landing (or ‘Staging’) Area and a Persistent Staging Area (PSA). In some cases, the functionality of both these areas can be combined into a single data flow.

The Landing / Staging Area

The Landing, or ‘Staging’, Area is non-persistent and is therefore emptied every time a (delta selection from a) table is loaded during the ETL process: a truncate/insert mechanism. In this step the source data is copied into the DWH environment. The purpose of the Landing Area is to:

  • Perform Change Data Capture (CDC) and delta selection
  • Load source data into the Data Warehouse environment
  • Format (streamline) data types
  • Define the correct event date/time going forward

CDC and delta selection

The way data is provided to the Landing Area depends very much on the type of source application, the company guidelines and general availability. Different ways of approaching and acquiring data in general are:

  • Replication of source records
  • Change Data Capture (CDC) mechanisms of the source database system (if available, depends on the DBMS)
  • ‘Push mechanism’; source applications provide a periodical full or delta dump in flat file/CSV or through a direct connection into the Staging Area (landing) tables
  • ‘Pull mechanism’; the Data Warehouse accesses the source tables and selects the delta using a database link or connection, usually based on creation and alteration dates in the source system (a minimal sketch of this follows after the list)
  • Subscription to Enterprise Service Bus (ESB) / messaging
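
As an illustration of the pull mechanism, the sketch below selects a delta from a source table based on its last-alteration timestamp. This is a minimal, hypothetical example: the table and column names (SOURCE_CUSTOMER, UPDATE_DTS) and the use of sqlite3 as a stand-in for the actual source connection are assumptions for illustration only.

```python
import sqlite3
from datetime import datetime

def pull_delta(connection: sqlite3.Connection, last_extract_dts: datetime):
    """Select all records changed since the previous extraction run."""
    cursor = connection.execute(
        # UPDATE_DTS is the (assumed) alteration timestamp maintained by the source system
        "SELECT * FROM SOURCE_CUSTOMER WHERE UPDATE_DTS > ?",
        (last_extract_dts.isoformat(),),
    )
    return cursor.fetchall()
```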

The preference is to use existing interfaces as much as possible, especially if the company has implemented an Enterprise Service Bus (ESB) which is regarded as one of the available source ‘applications’. This removes a lot of discussion regarding definitions and generally simplifies the design and development of the Data Warehouse.

From a Data Governance perspective care must be taken to verify whether a system really is the proper source of the data. If not, a new interface must be configured to another application. Generally, all data should be obtained from the source as-is, without alteration. Different methods to retrieve changes in data have varying degrees of accuracy and history of changes, and this is reflected in the definition of the event datetime.

The event datetime is the moment a change occurred in the data, and ultimately drives the management of timelines (effective and end dates) in the Data Warehouse. For example: true database log mining CDC gives a timestamp close to the moment a user changes a value in an application (the difference typically being a slight network latency), whereas a pull mechanism only gives a snapshot of the information at the moment the query is run.
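
A minimal sketch of this idea: if a log-based CDC timestamp is available it is used as the event datetime, otherwise the extraction (snapshot) moment is the best available approximation. The function name and parameters are illustrative assumptions, not part of any framework.

```python
from datetime import datetime
from typing import Optional

def derive_event_datetime(cdc_log_dts: Optional[datetime],
                          extraction_dts: datetime) -> datetime:
    """Use the CDC log timestamp when available; otherwise fall back to the
    snapshot/extraction moment, which is all a pull mechanism knows."""
    return cdc_log_dts if cdc_log_dts is not None else extraction_dts
```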

Loading the source data

Loading the data to the Staging Area should be done using a truncate/insert mechanism, making the ETL process ‘rerunnable’ (compliant with basic ETL rule 1 in the Design Patterns section). To streamline processing, the data set in the Staging Area table always contains a delta (as opposed to a full copy of the source table). If the Data Warehouse needs to derive the delta itself (for instance by executing a full outer join between the most recently received and the previously received full data sets) this must be done in a supporting set of tables. This way the Staging Area table contains the same kind of data for further processing into the History Area or the Operational Data Vault (Integration Layer), regardless of the type of interface.
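
The sketch below illustrates this delta derivation under the assumption that full copies are delivered and keyed on the natural key; it mimics the full outer join by comparing the previous and current sets and classifying each record as an insert, update or (logical) delete. All names and structures are hypothetical.

```python
def derive_delta(previous: dict, current: dict) -> list:
    """Compare two full data sets (keyed on the natural key) and return the delta."""
    delta = []
    for key, record in current.items():
        if key not in previous:
            delta.append(('I', key, record))   # new record
        elif record != previous[key]:
            delta.append(('U', key, record))   # changed record
    for key, record in previous.items():
        if key not in current:
            delta.append(('D', key, record))   # record no longer present in the source
    return delta

# Example usage with hypothetical customer records
previous = {1: {'name': 'Ann'}, 2: {'name': 'Bob'}}
current  = {1: {'name': 'Ann'}, 2: {'name': 'Rob'}, 3: {'name': 'Cee'}}
print(derive_delta(previous, current))
# [('U', 2, {'name': 'Rob'}), ('I', 3, {'name': 'Cee'})]
```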

Formatting (streamlining) data types

Source system attributes tend to vary wildly because every source system is different. To keep the system manageable the source data will be mapped to data types through a categorisation. I tend to use the following:

  • Character types such as CHAR or VARCHAR with a length <= 100 will become a Unicode string type (i.e. NVARCHAR(100) or equivalent).
  • The rest of the character attributes will become a Unicode string type (i.e. NVARCHAR (4000) or equivalent).
  • All numeric values will be mapped to a large numeric value such as NUMBER in Oracle or NUMERIC(38,20) in SQL Server.
  • Integer values are retained as-is. This exception to the numeric rule above is made for practical reasons, to avoid decimal zeroes being added to key values.
  • Dates, date/times and so forth will be mapped to a high precision date/time attribute such as DATE in Oracle or DATETIME2(7) in SQL Server.

This leaves you with a small number of data types to work with.
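
A minimal sketch of this categorisation, assuming SQL Server style target types; the source type names recognised here are assumptions, and the cut-off values follow the rules above.

```python
def streamline_data_type(source_type: str, length: int = 0) -> str:
    """Map a source data type to one of a small set of streamlined staging types."""
    source_type = source_type.upper()
    if source_type in ('CHAR', 'VARCHAR', 'NCHAR', 'NVARCHAR'):
        return 'NVARCHAR(100)' if length <= 100 else 'NVARCHAR(4000)'
    if source_type in ('INT', 'INTEGER', 'SMALLINT', 'BIGINT'):
        return source_type          # integers are retained as-is
    if source_type in ('NUMERIC', 'DECIMAL', 'FLOAT', 'NUMBER'):
        return 'NUMERIC(38,20)'     # generic large numeric
    if source_type in ('DATE', 'DATETIME', 'TIMESTAMP'):
        return 'DATETIME2(7)'       # high precision date/time
    return 'NVARCHAR(4000)'         # safe default for anything unrecognised
```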

Implementation overview

Technically speaking the Staging Layer contains the following elements:

  • Source and target definitions of the source files/tables and staging tables
  • Scripting necessary for flat file handling
  • Staging area mappings
  • ETL software folder structure
  • At minimum, the unique ETL process identifier (ETL_RUN_ID or similar) is added as an extra attribute to each Staging Area table. If the record source is not part of the ETL Framework to which this unique ETL identifier links, then a record source attribute must be added to the table as well. This provides all the runtime ETL control and audit information required. No redundant attributes are stored because direct querying (for troubleshooting) is unlikely
  • Staging Area database schema, including all database objects

ETL processes are designed to ‘pull’ data from one layer to the next (basic ETL rule 3), so in this case a mapping resides in the staging layer and pulls data from the source (systems). The source folders are prefixed ‘000_’ because they are outside the architecture. These folders contain the source table and file definitions. The staging folder in the ETL tooling is prefixed ‘100_’ because it’s the first step in the architecture. This folder contains the staging area table definitions, the staging area mappings and all other relevant objects.

A note about the three-digit prefix: every architectural layer gets its own range with plenty of room in between to incorporate sub-steps or helper steps. Especially for the Presentation Layer this proves very useful since it’s possible to have many data/information marts with intermediate summary tables.

Opportunities for generation

Once the source definitions are in place, the staging layer can be fully generated. Some ETL tools allow source definitions to be generated from source file or table structures as well; in that case, everything can be generated dynamically (a minimal sketch follows after this list).

  • The table definitions can be derived from the source table definitions
  • Datatype conversions / formatting is done according to the same rule in every situation, so this too can be generated
  • The load strategy will typically be the same for all the staging tables
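
As a sketch of this generation approach, the example below derives a Staging Area table definition from simple source metadata, reusing the streamline_data_type function from the earlier sketch. The generated DDL dialect, the control attributes included (ETL_RUN_ID, LOAD_DTS, CDC_INDICATOR) and the metadata structure are illustrative assumptions; the exact set depends on your own framework.

```python
def generate_staging_table(source_table: str, columns: list) -> str:
    """Derive a Staging Area table definition from source metadata.
    'columns' is a list of (name, source_type, length) tuples."""
    lines = [f"CREATE TABLE STG_{source_table} ("]
    # Control attributes added to every Staging Area table
    lines.append("  ETL_RUN_ID     INTEGER NOT NULL,")
    lines.append("  LOAD_DTS       DATETIME2(7) NOT NULL,")
    lines.append("  CDC_INDICATOR  CHAR(1) NOT NULL,")
    for name, source_type, length in columns:
        lines.append(f"  {name} {streamline_data_type(source_type, length)},")
    lines[-1] = lines[-1].rstrip(',')   # no trailing comma on the last column
    lines.append(")")
    return "\n".join(lines)

print(generate_staging_table("CUSTOMER",
                             [("CUSTOMER_CODE", "VARCHAR", 10),
                              ("CUSTOMER_NAME", "VARCHAR", 200)]))
```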

Relationship to DV2.0

It is worthwhile to note that the Data Vault 2.0 methodology specifies that hash values for intended business keys (Hubs) and relationships (Links) are to be calculated by the ETL that loads the data delta into the Staging Area. Optionally, the combination of attributes to track for changes in the Satellites may also be calculated in this process and stored in the Staging Area. This means the calculation of hash keys needs to be done only once and can be reused many times by upstream processes. The downside is that this creates a direct dependency between the Integration Layer (Data Vault) model and the Staging Area: if you create a new Hub based on an existing interface (a change in the model) you need to add this attribute in the Staging Area as well and make sure the hash is calculated appropriately.

My view on how to adopt these concepts is outlined below. Essentially I don’t want the Staging Area to be fully dependent on the Integration Layer Data Vault model so I have made the pre-calculation of all hash keys optional (but preferred). I use the naming convention HASH_<target table> for this purpose. So for instance if a Staging Area table maps to both the Customer and Product Hubs there would be two separate attributes HASH_HUB_CUSTOMER and HASH_HUB_PRODUCT available in the Staging Area table.

  • If a Hub or Link key needs to be pre-calculated, the corresponding hash attribute can be added to the Staging Area. If these attributes are not present in the Staging Area, the ETL that loads the data from the Staging Area into the Data Vault based Integration Layer will generate the hash keys on the fly
  • Satellite hashes (or ‘hashdiffs’) are still optional and can be added for performance improvements if required; if they are modelled in the Staging Layer the ETL will generate a hash key for the specified combination of attributes. If this attribute is not present the ETL from the Staging Area to the Data Vault based Integration Layer will generate the hash values on the fly
  • Every Staging Area table always calculates a full row hash. I use the naming convention HASH_FULL_RECORD for this
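
A minimal sketch of these hash calculations, assuming MD5 as the hash function and a simple trim/uppercase/delimiter approach to standardise the input values; the column names (CUSTOMER_CODE, PRODUCT_CODE) and record structure are illustrative only.

```python
import hashlib

def calculate_hash(*values) -> str:
    """Hash a combination of attribute values in a deterministic way.
    Values are trimmed, upper-cased and concatenated with a delimiter before hashing."""
    sanded = '|'.join(str(v).strip().upper() if v is not None else '' for v in values)
    return hashlib.md5(sanded.encode('utf-8')).hexdigest().upper()

# Hypothetical staging record mapping to the Customer and Product Hubs
record = {'CUSTOMER_CODE': 'C-001', 'PRODUCT_CODE': 'P-042', 'QUANTITY': 3}
source_values = [record['CUSTOMER_CODE'], record['PRODUCT_CODE'], record['QUANTITY']]

record['HASH_FULL_RECORD']  = calculate_hash(*source_values)            # full row hash
record['HASH_HUB_CUSTOMER'] = calculate_hash(record['CUSTOMER_CODE'])   # Hub business key hash
record['HASH_HUB_PRODUCT']  = calculate_hash(record['PRODUCT_CODE'])    # Hub business key hash
```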

The Persistent Staging Area: an historical archive

The Persistent Staging Area (PSA) is an optional area within the Staging Layer. The PSA creates a historical view of the source data; all source data is captured as-is and stored in a type 2 fashion. While this means some extra ETL development, it can be added for the following reasons:

  • To keep an archive of the source tables in an easy to manage way
  • Provide the opportunity -if worst comes to worst- to completely remodel parts of the Data Warehouse without losing information. This is most important for the Integration Layer. All loading mechanisms, including a correct representation of history, are untouched and the system can be rebuilt or reloaded from there
  • Increased audit and tracking information available (based on source data in its own model)
  • Possible double-function as an ODS

Implementation overview

The PSA captures all the source data as-is and stores it in a type 2 fashion. This is an insert-only type of ETL process; end-dating may be added but is optional and typically not implemented in this area of the architecture. This means that the structure of the tables in the PSA is for the most part identical to that of the tables in the Staging Area, with the addition of the following attributes:

  • A primary key, which is a hash value of the Natural Key and the Load Date/Time
  • The unique ETL process identifier (ETL_RUN_ID or similar) – this attribute is also present in the Staging Area but needs to reflect the ETL process that inserted records in the PSA

The ETL process loading the data from the Staging Area to the PSA checks the data based on the Natural Key and the date/time information and compares the attribute values. This comparison is executed by comparing both the HASH_FULL_RECORD and the CDC_INDICATOR attributes. This can result in:

  • No history record is found => insert a new record in the history table
  • A record is found and the source attribute values differ from the history attribute values => close the old record and insert a new one
  • A record is found but the source attribute values are exactly the same as the history attribute values => ignore
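
A minimal sketch of this comparison, assuming the most recent PSA record per Natural Key is available and that the HASH_FULL_RECORD and CDC_INDICATOR attributes (as calculated earlier) drive the comparison; the record structures and return values are illustrative.

```python
from typing import Optional

def determine_psa_action(staging_record: dict, latest_psa_record: Optional[dict]) -> str:
    """Decide how to process a Staging Area record when loading it into the PSA,
    comparing HASH_FULL_RECORD and CDC_INDICATOR against the most recent version."""
    if latest_psa_record is None:
        return 'INSERT'                # no history record found: insert a new record
    if (staging_record['HASH_FULL_RECORD'] != latest_psa_record['HASH_FULL_RECORD']
            or staging_record['CDC_INDICATOR'] != latest_psa_record['CDC_INDICATOR']):
        return 'CLOSE_AND_INSERT'      # values differ: close the old record (if end-dating
                                       # is implemented) and insert a new one
    return 'IGNORE'                    # identical to the most recent version: skip
```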

Opportunities for generation

Both the PSA table structure and ETL processes that load the tables can be fully generated from metadata:

  • The table definitions can be derived from the Staging Area table definitions
  • All PSA tables will get the same extra set of attributes as described above
  • The lookup performed to check for changes will always be from the Natural Key (the primary key of the source table) to the corresponding PSA table