The Data Warehouse Time Machine: synchronising model and automation metadata versions
I’ve completed a fairly large body of work that I’ve been meaning to do for a long time: how to automatically version the Data Warehouse data model in sync with the version of the ETL automation metadata.
Although versioning models and code is relevant (but rarely implemented) in the traditional ETL area, this requirement to becomes very real when moving to a virtualised Data Warehouse / integrated model approach (Data Vault 2.0 in my case). This is covered in more detail in earlier posts but suffice it is to say that by removing a physical, or instantiated, core Data Warehouse layer in the design you need a way to preserve the audit trail and support any Data Mart and / or Business Intelligence components that draw information from this Data Warehouse layer.
Simply put: your Data Warehouse can’t just have one implementation the one day and a different one the next. Well, it actually can: if you incorporate synchronised version control.
In my mind this is very similar to how (canonical) message formats or (SOA) services are sometimes developed to be backwards compatible. In this world you need to allow for adapters (’users’) to change their configuration when a change in the message / service is deployed. For this purpose sometimes one or more older versions are supported for some time. This is more or less what I’ve been working on for ETL and Data Warehouse models and their Data / Information Marts.
In a practical sense: what I am looking for is a way to cycle through previous and current versions and see the model and metadata change (over time – no pun intended) while you look at your designs. A Data Warehouse time machine.
This itself also serves another purpose, which is decoupling the existing dependency on the implemented data model (the table in the database) for ETL automation. Many automation efforts read from the data dictionary / catalog to generate output, but to fully support the above concepts what is needed is a way to capture the model at a point in time. For the virtualisation development efforts this is really useful as it allows you to generate ‘ETL’ (Views, Insert Statements, ETL outputs e.g. packages, mappings, anything) without having a physical model in place.
Not only that, but you can basically have a fully functioning Enterprise Data Warehouse up and running without physical tables. (more information about where ‘history’ lives is here). As of recently, even though the virtual (view-based) Data Warehouse was working fine it was still required a physical model to be present for automation.
This is now something of the past; the only database objects are views now – everything is virtual – which I think is really cool! This is a real-world application of these concepts used in various projects, sometimes as prototyping but also as a smaller production deployment.
A quick overview of how this works is shown in the next few screenshots. The first thing to do is to import the model metadata (via a modelling tool or by reverse-engineering the table structures):
I disabled the current grid-like representation of the model as I intend to create a more visual presentation. In any case since I already have a few versions created I am working on ‘2.1’ and uploaded whatever is currently in the database. The next step is to go to the screen that manages the ETL automation (this replaces the Excel sheet that was previously used). As you can see the software has already snapped to v2.1. The slider on this screen will later be used to cycle through the view output.
The screen that is created to maintain the automation metadata is where the typical Data Vault 2.0 metadata can be maintained. I added some logic to add colour coding and syntax highlighting to some extent, but one of the fun features is that the software calculates the expected Data Vault output. This metadata was previously imported from a file (it’s stored as XML) – and is linked to v2.1 of the model.
When changes are made there is the option to save these into a newer version:
The highlighted slider allows for cycling through the versions so you can see current and past definitions presented in the grid at runtime. We can now look what the difference in the Data Vault 2.0 model is when we cycle through the version, starting with viewing the results of the latest change.
And we can query the virtual Data Warehouse the same way as the traditional deployment with the exact same content:
Finally we cycle to version 2.1 to see how it used to be:
As is visible in the output as insert in the image above due to the consistent definition of business keys the number of records is down to three.
Not having to rely on a data model that is (physically) available in the database allows for even more easier prototyping, and so far it’s working great!