New frontiers – Data Warehouse and Data virtualisation

Automation and Data Warehouse virtualisation?

Why Data Warehouse virtualisation? What is Data Warehouse virtualisation anyway? To cover this it is best to explain how I ended up focusing on it, and then backtrack to more formal definitions. Essentially, I got interested in virtualisation concepts because I have been working on automating data integration (ETL – Extract, Transform and Load) processes for years, and during that period I noticed the Data Warehouse industry maturing when it comes to modelling, design and capabilities. To explain how this is connected to virtualisation we have to take a short sidestep into the world of ETL automation.

At present, many of the most successful Data Warehouse solutions automate a large portion of their ETL processes from metadata. I refer to this as model-driven design: the ability to ‘read’ your physical data models and generate the ETL processes required to populate them. There are many exciting trends here, ranging from the translation of conceptual FCO-IM models into physical Data Vault (or other) models, to Data Warehouse automation software, to custom approaches that translate logical models into physical models and then apply ‘the patterns’.

These (ETL / loading) patterns are generic and repetitive processes that you apply over and over again for different sources and targets. That this is now achievable is to me one of the clearer signs that the Data Warehouse industry is finally meeting expectations, and life will get a bit better for all ‘data plumbers’ out there.

Talking about the patterns a bit more: they can be implemented using the various APIs provided by ETL software vendors or third parties. The common denominator is that the real Intellectual Property is the data model and the metadata that maps sources against this data model – not the ETL development effort. We literally drop and rebuild the ETL over and over again using the model metadata. In this context, however, patterns are only fit for purpose if they are generic enough.
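
To make this concrete, below is a minimal sketch of such a generic loading pattern in plain SQL. The table and column names (stg_customer, dwh_customer_keys and so on) are purely hypothetical; in a metadata-driven setup these names are substituted per generated instance while the shape of the statement never changes.

```sql
-- A generic 'load new business keys' pattern. The statement is identical for
-- every source/target pair; only the object and column names differ, and these
-- are substituted from the model metadata rather than written by hand.
INSERT INTO dwh_customer_keys (customer_business_key, load_datetime, record_source)
SELECT stg.customer_number,
       MIN(stg.load_datetime),
       MIN(stg.record_source)
FROM stg_customer stg
WHERE NOT EXISTS (
    SELECT 1
    FROM dwh_customer_keys tgt
    WHERE tgt.customer_business_key = stg.customer_number
)
GROUP BY stg.customer_number;
```

In Data Vault terms (more on this below) this particular example would be a Hub load, but the same idea applies to every other pattern.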

The role of the ‘hybrids’

Enter contemporary techniques such as Data Vault, Anchor and Head & Version. What these techniques bring to the table is the definition of these generic loading patterns. Data Vault is the most prominent of these concepts, but they are fairly similar.

Data Vault is a hybrid data modelling technique that focuses on loose coupling of data entities. The premise is that things (data, requirements) change all the time, so you might as well make ‘change’ a core element of your design and model. To this end Data Vault uses a hub-and-spoke approach to define a small number of core entity types: Hubs, Links and Satellites. A business concept such as ‘Customer’ or ‘Product’ is defined as a Hub, which in effect is the integration point between various data sets. All context (descriptive information) is stored in Satellite entities and all relationships are recorded in Link entities.
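
As a very rough illustration, and nothing more than a sketch with hypothetical names and simplified keys (a real design would also consider hash keys, multi-part business keys and so on), the three core entity types could look like this in plain SQL:

```sql
-- Hub: one row per business key for the 'Customer' concept.
CREATE TABLE hub_customer (
    customer_business_key  VARCHAR(50)   NOT NULL PRIMARY KEY,
    load_datetime          TIMESTAMP     NOT NULL,
    record_source          VARCHAR(100)  NOT NULL
);

-- Satellite: descriptive context for the Hub, tracked over time.
CREATE TABLE sat_customer_details (
    customer_business_key  VARCHAR(50)   NOT NULL REFERENCES hub_customer,
    load_datetime          TIMESTAMP     NOT NULL,
    record_source          VARCHAR(100)  NOT NULL,
    customer_name          VARCHAR(100),
    customer_segment       VARCHAR(50),
    PRIMARY KEY (customer_business_key, load_datetime)
);

-- Link: a relationship between Hubs, e.g. Customer and Product.
CREATE TABLE lnk_customer_product (
    customer_business_key  VARCHAR(50)   NOT NULL REFERENCES hub_customer,
    product_business_key   VARCHAR(50)   NOT NULL,  -- would reference hub_product (omitted for brevity)
    load_datetime          TIMESTAMP     NOT NULL,
    record_source          VARCHAR(100)  NOT NULL,
    PRIMARY KEY (customer_business_key, product_business_key)
);
```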

Techniques such as Data Vault are referred to as ‘hybrid’ because the degree of normalisation differs between entity types. Some entities are fairly denormalised (e.g. Satellites), some are highly normalised (e.g. Hubs, Links). Data Vault achieves this flexibility partly through these entity types, but also by pushing business rules towards the delivery of information. This leads to a robust ‘back-room’ that just works and a flexible ‘front-room’ that can be customised depending on the requirements. Along the way the various Data Warehouse concerns are addressed, including key distribution, time variance and structure. It is impressive to see how manageable the load processes become when the various functions are split and spread across these layers / steps / ETLs.

One of the interesting things in Data Vault is the notion of being able to recreate Data Marts (referred to as ‘Information Marts’ in Data Vault) with a deterministic outcome. You basically drop and recreate the marts and get the exact same output – including the time variance. This relatively simple starting point cascades into a variety of design decisions, which essentially boil down to integrating data in a non-destructive way: business rules that actually change the contents are only applied later in the process, towards the delivery of information, after you have made sure your data is recorded properly and in the right place.

To put it differently: you know you did a good job if you can truncate your tables and reload them without any difference before and after. Being able to achieve this also means it is no longer always necessary to persist data in the marts (except perhaps for performance reasons), so you can opt to simply use views to simulate, or expose, a specific dimensional model or any other format that is fit for purpose. In this context this is referred to as ‘virtualisation’.
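
A minimal sketch of such a virtualised delivery, reusing the hypothetical Hub and Satellite from earlier: a type-2 style customer dimension exposed purely as a view, which can be dropped and recreated at will with a deterministic outcome.

```sql
-- A 'virtual' customer dimension: no persisted table, just a view over the
-- Data Vault. Effective/expiry dates are derived deterministically from the
-- Satellite history, so dropping and recreating the view gives the same output.
CREATE VIEW dim_customer AS
SELECT hub.customer_business_key,
       sat.customer_name,
       sat.customer_segment,
       sat.load_datetime AS effective_datetime,
       COALESCE(LEAD(sat.load_datetime) OVER (
                    PARTITION BY sat.customer_business_key
                    ORDER BY sat.load_datetime),
                TIMESTAMP '9999-12-31 00:00:00') AS expiry_datetime
FROM hub_customer hub
JOIN sat_customer_details sat
  ON sat.customer_business_key = hub.customer_business_key;
```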

Is ETL still relevant?

But wait a minute… if we can generate all ETL into the Data Warehouse using model-driven design and metadata, and use views to ‘virtualise’ the delivery of information, do we need ETL tools at all? Are they still relevant? Well, maybe in some scenarios, but certainly not as the default solution. There are still places where dedicated integration software is relevant, but in many cases you can do without it. There are a few different drivers here:

  1. Modern Data Warehouse approaches are more pattern-based and lend themselves much better to generic ETL processes. Gone are the monolithic, complex ETLs; everything is now broken up into a large number of small ‘atomic’ processes. These are so simple that they hardly need individual documentation: a description of the template suffices, as all ETLs are based on it. This also means you use far fewer of the functions ETL tools have to offer.
  2. Another driver is that the balance between the performance of the ETL ‘server’ and the database platform has shifted massively towards the database side. You no longer need many tweaks to get things running, and database performance is easily scalable and quick to configure. ETL vendors recognised this years ago and have been working on ‘pushdown’ technology, which effectively generates SQL that is executed in the database as opposed to logic running in the ETL engine. The key message is that current RDBMS platforms handle this kind of workload well.

These two things together support various related ideas which ‘tap’ into virtualisation. For instance, you can generate ETL processes ‘on the fly’, which works if it is supported by a fully fledged ETL control framework and built-in (metadata) versioning. Another interesting spin-off is dropping and rebuilding complete Data Warehouse environments, especially in a cloud context. There are some great use cases where a cloud-based drop-and-recreate Data Warehouse reduces the Total Cost of Ownership many times over and allows the development team to get started straight away. I’m sure many of you have been in situations where you had to wait for hardware to arrive and be configured – not to mention the challenges in scalability. The last suggestion is almost a combination of the two: creating (database) views that simulate your entire Data Warehouse.
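
That last idea can be taken quite literally. As a hypothetical sketch, assuming a persistent history of source deltas (the Persistent Staging Area discussed later in this article), even a Data Vault entity such as the Hub can be simulated as a view rather than loaded as a physical table:

```sql
-- In a fully virtual variant the hub_customer entity is not a physical table
-- at all, but a view over the persistent history of source deltas
-- (psa_customer is a hypothetical insert-only archive with a hardened
-- load date/time per delta).
CREATE VIEW hub_customer AS
SELECT psa.customer_number        AS customer_business_key,
       MIN(psa.load_datetime)     AS load_datetime,
       MIN(psa.record_source)     AS record_source
FROM psa_customer psa
GROUP BY psa.customer_number;
```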

For me this almost comes full circle, having started as a developer mostly using SQL and then working with ETL software more and more over time. ETL can still have a place in capturing (very) specific business rules, or as a means to interface with the wide variety of available data sources. ETL tools have long been exceptional at providing connectivity, but this is now being challenged by concepts such as Data Virtualisation and ‘Integration Clouds’.

Virtualisation

There are many forms of virtualisation; pretty much anything can be ‘virtualised’ in Information Technology. A fairly broad way to look at it is as instantiating something on demand rather than creating an actual physical version of it. This covers many things, including infrastructure, the storage fabric and of course hypervisors. One specific example, Data Virtualisation, covers technology that provides access to all kinds of data from various applications, so people can use the data without worrying about where it is stored or how to get to it. For the Data Warehouse an immediate practical application is that this provides a uniform, SQL-friendly interface to otherwise heterogeneous data sources – which effectively means that the interfacing can be SQL-based as well!
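
As a small illustration of what this could look like for a consumer, assuming (and this is just an assumption) that the virtualisation platform exposes each connected system as a schema (crm and erp are hypothetical names here), the interfacing really is just SQL:

```sql
-- Joining data from two different underlying systems as if they were local
-- tables; the virtualisation layer takes care of connectivity and translation.
SELECT c.customer_number,
       c.customer_name,
       SUM(o.order_amount) AS total_order_amount
FROM crm.customer c
JOIN erp.sales_order o
  ON o.customer_number = c.customer_number
GROUP BY c.customer_number,
         c.customer_name;
```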

Generally speaking, Data Virtualisation and Data Warehouse Virtualisation want the same thing: to support self-service access to data and information for the wider business. Data Virtualisation provides access to data, but doesn’t provide a schema or model. Data Warehouses provide this model, to an extent, and Data Vault is exceptionally well suited to define this ‘lens’ while still allowing flexibility when things change.

To summarise, what we need from a modern Data Warehouse is that:

  • It is pattern-based, because ETL needs to be automated (e.g. generation of ETL, metadata injection)
  • It provides a schema / model that assists in interpreting the data
  • It pushes as many business rules as possible towards the users

I basically want to be able to drop and recreate my Data Warehouse the same way as I would do with the Data Marts.

What about the history?

But… what about the history? If you truncate your Data Warehouse, and we know that source systems are prone to data issues and ‘cannot be trusted’, where does the history live? This is where a concept called the Persistent Staging Area comes in. In my designs I favour a Staging Area that you can truncate: its job is to land data deltas into your environment, your realm of influence. Each data delta is then ‘archived’ into a persistent historical archive. This concept originated (for me at least) as a way to support model changes in the early stages of model design. Even with Data Vault you’re bound to make modelling mistakes – or at least decisions you would later have made differently. My point is: if you keep your original data deltas somewhere, you can always rebuild.
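
A minimal sketch of this landing-and-archiving pattern, again with hypothetical names: the Staging Area is truncated every run, while the Persistent Staging Area only ever receives inserts.

```sql
-- Transient staging: truncated every run, holds only the latest delta.
TRUNCATE TABLE stg_customer;
-- (the delta load from the source into stg_customer happens here)

-- Persistent Staging Area: an insert-only archive of every delta ever
-- received. Nothing is updated or deleted here, so anything built on top of
-- it can always be rebuilt from scratch.
INSERT INTO psa_customer (customer_number, load_datetime, source_row_sequence,
                          record_source, customer_name, customer_segment,
                          is_logically_deleted)
SELECT stg.customer_number, stg.load_datetime, stg.source_row_sequence,
       stg.record_source, stg.customer_name, stg.customer_segment,
       stg.is_logically_deleted
FROM stg_customer stg;
```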

The Data Lake again???

The Persistent Staging Area is almost a dump of everything that happens in the data landscape, and it can span different environments / technologies. Sounds familiar? It is very similar to the data lake, data swamp, data landfill or data sewer (naming it the way supporters and detractors would, in that order). There definitely are similarities here, but also small yet important differences. One of the differences is that there is a small but fairly rigid set of rules the interfaces need to comply with. You must be able to load multiple changes (for a key) in one go, be scalable, streamline data types and capture logical deletes. One of the most important things is to ‘set’ or ‘harden’ the date/time. Without setting the date/time early there is no deterministic process to run: you don’t want different results depending on whether you run in the morning or the evening.
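
To make these rules a bit more tangible, here is a sketch of what they could look like at landing time, with hypothetical names (src_customer_delta stands for whatever the interface delivers): the date/time is hardened once, multiple changes per key are kept apart with a sequence number, and a change indicator is turned into a logical-delete flag.

```sql
-- Landing a delta while complying with the interface rules: the load date/time
-- is 'hardened' here and reused unchanged downstream, multiple changes for the
-- same key are ordered deterministically, and logical deletes are captured as
-- a flag rather than by physically removing rows.
INSERT INTO stg_customer (customer_number, load_datetime, source_row_sequence,
                          record_source, customer_name, customer_segment,
                          is_logically_deleted)
SELECT src.customer_number,
       CURRENT_TIMESTAMP,                                 -- hardened load date/time
       ROW_NUMBER() OVER (PARTITION BY src.customer_number
                          ORDER BY src.source_change_datetime),
       'CRM',                                             -- record source
       src.customer_name,
       src.customer_segment,
       CASE WHEN src.change_type = 'D' THEN 'Y' ELSE 'N' END
FROM src_customer_delta src;
```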

These rules apply a small but meaningful amount of standardisation as data comes in. They also mean that whatever you do next, you can rebuild it in another way and still be auditable. This way, you are not even committed to a particular Data Warehouse solution: whatever we do next simply adds ‘schema’ to the ‘data lake’.

The full concept is defined in the diagram below:

[Diagram: Data Architecture]

The difference between Data Warehouse Virtualisation and Data Virtualisation becomes a bit clearer now, and both concepts can work together. Data Virtualisation provides the generic access layer, while the Data Warehouse ensures the usual boxes are ticked: it virtually ‘interfaces’ with these systems, maintains history adequately and applies the ‘schema’ to the data. You still need a separate historical view of the data, as operational systems don’t always store history correctly (or at all), and you still need to capture soft / logical deletes.

Virtualisation needs versioning

There is one last thing that needs to be mentioned: Data Warehouse virtualisation needs versioning. Think of it this way – if you can drop and refactor your Data Warehouse based on (changes in) your metadata, then the reports and analytics built on top of it are very likely not only to change structurally but also to report different numbers. To remain auditable in this context, you need to make sure your metadata is versioned, so you can ‘roll out’ your Data Warehouse in versions that correspond to Business Intelligence or analytics output. End-to-end version control that ties the model, the ETL and the output together has been in place in some environments, but when you are able to deliver a new ‘view’ at the click of a button it becomes a mandatory requirement. Concepts can be borrowed from other areas such as message formats / canonicals or SOA processes: most of these environments can keep at least one prior version active, allowing all subscribers some time to become compatible with the newer version. In any case, it is important to be able to ‘go back’ to previous versions of your virtualised Data Warehouse if numbers need explaining, and this all comes down to metadata version control.
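
As a final sketch, with hypothetical names and purely to illustrate the idea: record every generated release of the (virtual) Data Warehouse in a small metadata table, and let the delivered views carry the version they were generated from, so output can always be traced back and a prior version can stay active alongside the new one.

```sql
-- Register every generated release of the (virtual) Data Warehouse.
CREATE TABLE dwh_model_version (
    model_version       INTEGER       NOT NULL PRIMARY KEY,
    generated_datetime  TIMESTAMP     NOT NULL,
    metadata_reference  VARCHAR(200)  NOT NULL  -- e.g. a snapshot or commit id of the model metadata
);

-- Each release gets its own set of generated views, tagged with the version...
CREATE VIEW dim_customer_v2 AS
SELECT customer_business_key, customer_name, customer_segment,
       effective_datetime, expiry_datetime,
       2 AS model_version
FROM dim_customer;   -- reusing the earlier virtual dimension as the v2 definition

-- ...while a 'current' view simply points at the active version. The previous
-- version (dim_customer_v1, not shown) can stay available for a while so that
-- subscribers have time to move across.
CREATE VIEW dim_customer_current AS
SELECT * FROM dim_customer_v2;
```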

It is worth considering the Persistent Staging Area: it may be the time and money saver that further reduces the time it takes to deliver data to end users. And, considering that this is key to adding value from data, it is certainly worth giving it a go!