Minimal metadata requirements for ETL Automation / Virtualisation (and prerequisites)

At the Worldwide Data Vault conference in Vermont, USA, I presented the steps to automate the ETL development for your end-to-end Data Warehouse. We put a lot of thought into what the absolute minimum of metadata would be that you need to feed into the automation logic, as most of the details can already be ‘read’ from the data model (and the corresponding data dictionary or system tables).

Data Vault 2.0 defines a complete solution architecture covering three main layers (a 3-tiered EDW solution design): Staging, Integration (Data Vault) and Presentation. We can skip the first layer straight away, as it requires no metadata beyond what is already available in the model / data dictionary!

Before we begin, it is important to note that there are some prerequisites. You need to have the foundation in terms of approach and standards in place before you can start any (agile) project:

  • An ETL Framework covering ETL process controls, templates, workflows and similar
  • Modeling standards and naming conventions
  • ETL templates in the desired ETL platform
  • Technical Architecture (sizing, servers, I/O, memory etc.)
  • Solution Architecture (validation for design and architecture decisions such as 2 or 3-layered solutions, interfacing approach and agreements etc.)

These things do not go away, but you only need to do them right once. The great thing is that standardisation is required for automation (without it, automation simply doesn’t work), and automation in turn enforces the standards! It’s a win-win.

In the patterns outlined below you have the option to hash the values on the way into the Staging Area (DV2.0), or when moving from Staging to the Data Vault (DV1.0).
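To make the hashing step concrete, here is a minimal sketch of how a business key hash could be computed. The sanitisation rules (trim, upper-case, a `-1` placeholder for NULLs, a `|` delimiter) and the choice of MD5 are assumptions for illustration — your own standards may differ, which is exactly why they need to be defined up front:

```python
import hashlib

def hash_business_key(*key_parts, delimiter="|", null_placeholder="-1"):
    """Hash a (possibly composite) business key using assumed sanitisation rules."""
    # Sanitise each part: trim whitespace, upper-case, substitute NULLs
    sanitised = [
        str(p).strip().upper() if p is not None else null_placeholder
        for p in key_parts
    ]
    return hashlib.md5(delimiter.join(sanitised).encode("utf-8")).hexdigest().upper()

# The same key hashes identically regardless of incidental whitespace or casing
hash_business_key(" cust-001 ", "AU") == hash_business_key("CUST-001", "au")
```

Whether this runs on the way into Staging or between Staging and the Data Vault, the function itself stays the same — only the place in the template where it is applied changes.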


The pattern to load Hub entities is:

  • Select the distinct Business Key and minimum event date/time
  • Hash the business key
  • Perform a DWH key lookup (or SQL equivalent) on the Business Key
  • Insert if it doesn’t exist yet

The metadata requirements are thus as follows:

  • The source table name
  • The target Hub table name
  • The business key as it is known in the source (attribute name)
  • Any combination information such as ‘pivot’, ‘concatenate’ or ‘compose’ to handle exceptions in the source business key mapping

The rest can be derived from the model. This means a simple table is sufficient to cover all source-to-target mappings.
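As a sketch of what that simple mapping table enables, the snippet below renders the Hub load pattern (distinct business key, minimum event date/time, key lookup, insert if new) as SQL from one metadata row. The column and naming conventions (`_HSH_KEY` suffix, `LOAD_DATETIME`, `EVENT_DATETIME`) are hypothetical stand-ins for whatever your own standards prescribe:

```python
# Hypothetical metadata row from the Hub mapping table
hub_mapping = {
    "source_table": "STG_CUSTOMER",
    "target_hub_table": "HUB_CUSTOMER",
    "business_key": "CUSTOMER_CODE",
}

def generate_hub_load(m: dict) -> str:
    """Render the Hub load pattern as an INSERT ... SELECT statement."""
    hub, src, bk = m["target_hub_table"], m["source_table"], m["business_key"]
    hk = hub + "_HSH_KEY"  # hash key column name per (assumed) naming convention
    return (
        f"INSERT INTO {hub} ({hk}, {bk}, LOAD_DATETIME)\n"
        f"SELECT stg.{hk}, stg.{bk}, MIN(stg.EVENT_DATETIME)\n"
        f"FROM {src} stg\n"
        f"WHERE NOT EXISTS (SELECT 1 FROM {hub} hub WHERE hub.{hk} = stg.{hk})\n"
        f"GROUP BY stg.{hk}, stg.{bk}"
    )
```

Everything in the generated statement except the three metadata values comes from the model and the conventions, which is the whole point: the template carries the pattern, the metadata only carries the mapping.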


The pattern to load Link entities is:

  • Select the distinct combination of Business Keys
  • Hash this combination and the individual keys, assign a placeholder (dummy) if NULL
  • Perform a DWH key lookup (or SQL equivalent) on the combination of Business Keys
  • Insert if it doesn’t exist yet

Most of the information can again be read from the model, and from the Hub mapping, as this already contains the source-to-target map for the Business Key (and Hub key). So the required metadata is as follows:

  • The source table name
  • The target Link table name
  • Any additional (degenerate) attributes you may want to add to the Link definition, again as attribute names

That’s it! This is possible because the source/Hub relationship for the specific Link can be reused from the Hub automation metadata.
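The reuse of the Hub metadata can be sketched as a simple lookup: every Hub that is fed from the same source table contributes its business key to the Link. The metadata structures below are hypothetical, continuing the simple mapping-table idea from the Hub section:

```python
# Hub mappings recorded earlier (assumed structure)
hub_mappings = [
    {"source_table": "STG_ORDER", "target_hub_table": "HUB_CUSTOMER",
     "business_key": "CUSTOMER_CODE"},
    {"source_table": "STG_ORDER", "target_hub_table": "HUB_PRODUCT",
     "business_key": "PRODUCT_CODE"},
]

# The Link mapping itself only needs the source, the target and any degenerate attributes
link_mapping = {
    "source_table": "STG_ORDER",
    "target_link_table": "LNK_CUSTOMER_PRODUCT",
    "degenerate_attributes": ["ORDER_LINE_NUMBER"],
}

def resolve_link_keys(link: dict, hubs: list) -> list:
    """Derive the Link's business keys by reusing the Hub automation metadata."""
    return [h["business_key"] for h in hubs
            if h["source_table"] == link["source_table"]]

resolve_link_keys(link_mapping, hub_mappings)  # ['CUSTOMER_CODE', 'PRODUCT_CODE']
```

This is why the Link mapping table can stay so small: the key combination is derived, never typed in twice.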


The pattern to load Satellite entities is:

  • Select the distinct set of the applicable attributes (and hash them for comparison)
  • Hash the Business Key (de facto the Hub lookup)
  • Map the source-to-target attributes

The required metadata is:

  • The source table name
  • The target Satellite table name
  • Any multi-active (multi-variant) keys you may want to add, as attribute names

As with the Link approach, the Hub information, if required (DV1.0), can be reused from the Hub metadata because the source table for the Satellite and the Hub is the same.
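The first Satellite step — hashing the applicable attributes for comparison — is often implemented as a full-row hash (a ‘hash diff’). Here is a minimal sketch under the same assumed sanitisation conventions as before (MD5, `|` delimiter, empty string for NULLs):

```python
import hashlib

def hashdiff(record: dict, attributes: list, delimiter="|") -> str:
    """Hash the concatenated (sanitised) attribute values for change detection."""
    values = [
        str(record[a]).strip() if record.get(a) is not None else ""
        for a in attributes
    ]
    return hashlib.md5(delimiter.join(values).encode("utf-8")).hexdigest().upper()

old = {"NAME": "Jane", "CITY": "Brisbane"}
new = {"NAME": "Jane", "CITY": "Sydney"}
hashdiff(old, ["NAME", "CITY"]) != hashdiff(new, ["NAME", "CITY"])  # change detected
```

Comparing one hash instead of every attribute keeps the generated Satellite ETL identical regardless of how many attributes the Satellite carries — only the attribute list, read from the model, changes.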

The process for Link-Satellites is similar in many ways, so it’s not worth explaining further 🙂 As you can see, the amount of required metadata is very limited, as most information can be read from the models and applied using the patterns. For Data Warehousing, the modelers are the new developers!

We partner with dedicated software such as AnalytixDS Mapping Manager to record this metadata in a user- (business-) friendly way. Different approaches work, as long as you force yourself to limit the metadata to the absolute minimum you really need to feed into the templates.

Roelant Vos
