Some Q&A on Data Warehouse Virtualisation
I receive a fair number of questions on the Data Warehouse Virtualisation ideas and wanted to respond and discuss them via this post. I don’t have all the answers, but can share my views and expectations.
When it comes to DWH Virtualisation and the Persistent Staging Area (PSA), the questions generally fall into two categories:
- Isn’t it too slow? How about performance? Surely users don’t want to wait for hours to see results?
- Why bother with Data Vault (or similar) at all? Can’t we just skip this or fall back on Kimball’s techniques? What value do we get from having an intermediate hyper-normalised layer?
What about performance?
If you have billions upon billions of rows of streaming data every second, virtualisation is not going to work (unless perhaps you have access to unlimited resources). I think this is fairly obvious. However, when it comes to performance in general, and virtualisation / PSAs in particular, it is important to remember three things:
- Generally speaking it’s always about balancing ‘performance’ with ‘complexity’. The more performance is tuned, the more complex solutions tend to become – which comes with a maintenance overhead. Sometimes it works out better to sacrifice some performance in favour of simpler solutions.
- Not every solution uses ‘big data’. Many environments are relatively modest in size and can (still) have big business impact, and it wouldn’t be correct to assess concepts only from the perspective of the few big deployments. Scalability is important to understand and consider, but not everyone needs to adopt techniques to scale out to massive sizes.
- Last but most importantly, and in addition to the previous point, it is really all about flexibility and metadata management. If you can virtualise the various layers on top of a PSA – essentially having a logical Data Warehouse – it is relatively easy to switch between deployments and technical architectures. There is no lock-in and you can let the engine figure out how to best leverage various available and connected environments.
In other words, the argument that virtualisation doesn’t work because it does not perform on massive data sets is neither here nor there. The conversation should be about how the concepts that are embodied in virtualisation / logical Data Warehousing can be supported in different environments, so you can evolve with your technical infrastructure and requirements.
Metadata management and generation, in other words.
There are many ways to deploy the solutions, but the metadata is the same. The best example I can give is using the same metadata to create views (=virtualisation) and persist these (as either indexed / materialised views or physical ETL processes) if query times exceed certain thresholds. When this is not enough, the exact same metadata is used to move files to an MPP platform and interpreted using scalable (cloud) techniques.
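To make this concrete, here is a minimal sketch of the idea that a single piece of mapping metadata can be rendered as either a view or a persisted table, with the choice driven by a query-time threshold. The metadata structure, names (`psa.customer`, `dwh.customer`) and the threshold value are all hypothetical, purely for illustration:

```python
def generate_deployment(mapping: dict,
                        avg_query_seconds: float,
                        threshold_seconds: float = 30.0) -> str:
    """Emit DDL for the same mapping metadata as either a virtual
    (view) or persisted (table) object, based on observed query times.
    The mapping format here is a hypothetical example, not a standard."""
    select = "SELECT {cols} FROM {src}".format(
        cols=", ".join(mapping["columns"]),
        src=mapping["source"],
    )
    if avg_query_seconds <= threshold_seconds:
        # Fast enough: deploy virtually, no data is duplicated.
        return f"CREATE VIEW {mapping['target']} AS {select};"
    # Too slow: persist exactly the same logic as a physical table.
    return f"CREATE TABLE {mapping['target']} AS {select};"

# Example mapping metadata (hypothetical names).
mapping = {
    "source": "psa.customer",
    "columns": ["customer_key", "customer_name"],
    "target": "dwh.customer",
}

print(generate_deployment(mapping, avg_query_seconds=5))    # deploys as a view
print(generate_deployment(mapping, avg_query_seconds=120))  # deploys as a table
```

The point is not the generator itself but that the mapping metadata never changes; only the rendering decision does.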
With these techniques you can distribute processing over many nodes, calculate the result, move this into a cube and power everything down again (except the cube). I’m not saying this is the one-size-fits-all solution, but it can be the right one in some cases (with extremely low TCO).
These are real examples, which hopefully help to explain that purely creating views for virtualisation’s sake is not the end goal – the flexibility is.
Performance requirements are some of the ‘environmental parameters’ that drive the deployment – how the engine decides what to generate, and how!
What value do we get from having an intermediate layer?
Please consider this earlier post on how various concepts together support virtualisation and more.
When you are at the point that you have broken up all data integration components into tiny processes – which are all independent and allow for incremental extension of the data platform – and can throttle and distribute infrastructure resources (CPU, memory…) to rebuild everything from the raw original files / transactions (PSA)… does it still make sense to develop a normalised core DWH layer (i.e. Data Vault)?
I have a lot of good things to say about Data Vault. For example I believe Data Vault has been instrumental in breaking up DWH techniques into modular components, enabling ETL generation and shaping the mindset that DWH systems need to be designed for change. I think it has a place in DWH architectures for some time to come, but that the focus shifts more from modelling discussions on what is basically a physical model towards becoming commodity components that are part of ‘the engine’ as outlined earlier.
This also means that conversations on hash keys versus business keys or ‘do we need Hubs or not’ are more and more irrelevant. Why? Because the physical implementation / these kinds of technical decisions can be driven from (design) metadata as well. This is fully in line with the Data Vault thinking, which for example defines a Hub as the embodiment of the business concept but doesn’t prescribe using hash keys or not…
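As a sketch of what ‘driven from design metadata’ could look like, the snippet below renders a Hub table definition where the key strategy (hash key versus natural business key) is just a metadata flag rather than a modelling decision. All names, column types and the DDL shape are hypothetical assumptions for illustration:

```python
def generate_hub_ddl(hub: str, business_key: str,
                     key_strategy: str = "hash") -> str:
    """Render a Hub table; the hash-vs-business-key choice is an input
    parameter (design metadata), not a hard-coded modelling rule.
    Column types and layout here are illustrative assumptions."""
    if key_strategy == "hash":
        # Surrogate hash key (e.g. an MD5 of the business key),
        # with the business key carried alongside it.
        key_column = f"{hub}_hkey CHAR(32) NOT NULL"
        extra = f"  {business_key} VARCHAR(100) NOT NULL,\n"
    else:
        # The natural business key doubles as the primary key.
        key_column = f"{business_key} VARCHAR(100) NOT NULL"
        extra = ""
    return (
        f"CREATE TABLE hub_{hub} (\n"
        f"  {key_column} PRIMARY KEY,\n"
        + extra
        + "  load_datetime TIMESTAMP NOT NULL,\n"
          "  record_source VARCHAR(100) NOT NULL\n"
          ");"
    )

print(generate_hub_ddl("customer", "customer_number", "hash"))
print(generate_hub_ddl("customer", "customer_number", "natural"))
```

Both outputs embody the same business concept; only the physical key implementation differs, which is exactly why the hash-versus-business-key debate can be pushed down into the engine.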
I like the ‘unit of work’ style thinking and the natural ‘subject area’ definition of context related to a business concept (business key), and I think this abstraction assists in getting a better grip on information. I am placing my bets on improving conceptual / logical models that can be automatically generated as Data Vault models (for example) in the background with the optimal technical architecture fit.
I accept that this level of design is not readily available yet, but it’s almost there.
My view is that there is no real intrinsic value in the layer itself, and focusing on the physical DWH model – especially coming from a performance balancing point of view – as the core of the solution will experience diminishing returns. However, breaking up the architecture into layers helps in abstracting out complexity and consequently keeps things manageable. And, as I mentioned a few times here, we don’t necessarily design for these layers anymore.
The methodologies to date have assisted greatly in compartmentalising the various required concepts but should really be viewed as an engine that focuses on making things work, and as a stepping stone towards more advanced ways to define schema (taxonomies, glossaries).
If you imagine the inverse, for instance having a massive PSA with a Kimball-style Star Schema on top of it, I can imagine the complexities at enterprise scale. Well, not really just imagine, because I have experienced this in reality. I for one prefer having an abstraction layer to ‘put information in the right context’ first and apply logic a bit later.
Having said that, I used to recommend that smaller deployments start this way and consider adding an intermediate DWH core layer later (because the PSA supports this). But given the thinking outlined in this and previous posts, this too will be absorbed by ‘the engine’.