Data Visualisation, Data Warehousing and Big Data: one pitch to rule them all
Of the concepts that have emerged over the last few years, the ‘Data Lake’ is not one of my favourites. Although it has to be said I have had a lot of fun with the various Data Lake parodies – which I’ll not repeat here! While I am on board with the cheap redundant storage concept, it is clear that data management is still needed in this day and age (more than ever, really) and that concepts like Data Lakes do not change this – whatever vendors may be saying.
Especially misleading are the pitches in the ‘Big Data’ space claiming that all you need to do is get all of your data into a single (HDFS) environment, plug in some proprietary adapter to ‘query’ the files and you’re good to go! These pitches sometimes go as far as saying that everything else (such as Data Warehousing, for instance) is a waste of time and money. The end result can invariably be labelled a ‘Data Wasteland’, ‘Data Landfill’, ‘Data Sewer’ or worse.
Similarly, I’m not a big fan of the traditional Data Visualisation pitch, sometimes called ‘Departmental BI’, either. More often than not these quick-to-deploy, good-looking dashboards and reports bite off more than they can chew in an enterprise setting. This leads to a re-introduction of the classic problem of differing reports being produced by various reporting applications working with *roughly* the same data. In many cases these approaches start out as ‘throw away’ work to ‘get an idea’, but are permanently instantiated before you know it.
And then there is the Data Warehouse, which has the reputation of being so brittle and slow to deploy that it cannot operate at the ‘speed of business’.
But all these concepts do have their function in the Information Management domain. There is a lot to be said for adopting HDFS to store raw semi-structured and unstructured data, and even regular structured data exported in file format (the ‘Data Lake’). Similarly, Data Visualisation is great for creating one-off or infrequent analyses and reports, and of course for quickly providing insights and ideas while exploring the data. You don’t want to build standardised reporting for everything! And the Data Warehouse and Business Intelligence area provides the proper enterprise-grade distributed reporting, analysis, accountability, integration and standardisation around the (disparate) information across the organisation.
Surely there is a way to make this work together?
In my opinion the answer lies in the adoption of the persistent (Historical) Staging Area concept (also known as Historical Staging or the History Area). This basically adopts the fundamentals of a Data Warehouse and extends them to the visualisation and ‘big data’ (in the Hadoop/HDFS sense) areas of application. In my standard reference I define the Staging Layer of an Enterprise Data Warehouse as encompassing the Staging Area and the History Area. The Staging Area makes sure data deltas are received from various sources, and the History Area acts as an archive of these deltas. Both the Staging Area and the Historical Staging Area store data in ‘raw’ format, meaning they retain the structure in which the information was received from the provider (no modelling required).
In a traditional Data Warehouse context this means that all data delta arrives via the Staging Area, and is simultaneously archived in the History Area and forwarded to the upstream core Data Warehouse layer(s). In this context the Historical Staging Area does not necessarily even need to be a database; it can just as well be a file archive or anything that makes sense. The key thing is that it records all data delta and can be re-instated / read back at a later point.
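To make the mechanics concrete, here is a minimal Python sketch of the two steps described above: detecting the data delta between two raw snapshots, and archiving that delta with an event date/time. The function names, record layout and change labels are my own illustrative choices, not part of any standard.

```python
from datetime import datetime, timezone

def detect_delta(previous, current):
    """Compare two keyed snapshots and return the data delta.

    `previous` and `current` map a business key to the raw record
    exactly as received from the provider (no modelling applied).
    """
    delta = []
    for key, record in current.items():
        if key not in previous:
            delta.append({"key": key, "change": "insert", "record": record})
        elif previous[key] != record:
            delta.append({"key": key, "change": "update", "record": record})
    for key, record in previous.items():
        if key not in current:
            delta.append({"key": key, "change": "delete", "record": record})
    return delta

def archive_delta(history, delta, event_dt=None):
    """Append the delta to the History Area, stamped with the event date/time."""
    event_dt = event_dt or datetime.now(timezone.utc)
    for row in delta:
        history.append({**row, "event_dt": event_dt})
    return history

# Hypothetical snapshots for illustration
previous = {"42": {"status": "open"}}
current = {"42": {"status": "closed"}, "43": {"status": "open"}}
delta = detect_delta(previous, current)
history = archive_delta([], delta)
```

Note that the archive stores the record as-is; only the change type and event date/time are added, which is exactly what keeps the History Area ‘raw’.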
The Staging Area takes care of the Event Date/Time concepts and interfacing in general, solving various issues in the process. The great thing about this approach is that:
- It can be fully automated / rapidly deployed (that is true for both the Staging and Historical Staging Area)
- Manages data delta / event date/times. If you pay just a little bit of upfront attention to interfacing your systems properly, you are set for whatever the data future brings
- Provides a fall-back for design issues. You can always reload your Data Warehouse if (big) mistakes are made; you only need to make sure you capture the real event date/time. This is truly ‘agile’.
- Prepares you for full-on virtualisation. More on that later
Basically, what you get is a standardised foundation that you can use for all kinds of purposes – as described above. You can quickly create a data visualisation / exploration set of reports without impacting performance on the operational systems and with time-variant information if you need it. More importantly, this approach allows you to formalise the findings into a full Business Intelligence delivery and ramp up to a Data Warehouse approach on the same original data without rework and without losing traceability / auditability. This means that any upfront effort is reusable without re-engineering.
Similarly, using modern approaches such as Data Vault 2.0 you can model parts of the data into a Data Warehouse model and use hashing concepts to link back to unstructured data in the archive. You’re not dumping data in a data lake or landfill, but neatly planting the seeds on your (data) farm using state-of-the-art equipment!
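The hashing idea can be sketched in a few lines of Python. This is only an illustration: the choice of MD5, the ‘|’ delimiter and the trim/uppercase normalisation are common Data Vault 2.0 conventions, but conventions vary, and the function name is my own.

```python
import hashlib

def hash_key(*business_key_parts, delimiter="|"):
    """Derive a deterministic hash key from business key parts.

    Because the hash depends only on the (normalised) business key,
    the same key computed over structured Data Warehouse records and
    over raw files in the archive links the two together.
    """
    # Normalise each part so trivial formatting differences don't
    # produce different hashes for the same business key
    normalised = delimiter.join(
        str(part).strip().upper() for part in business_key_parts
    )
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()

# The same business key yields the same hash regardless of formatting
a = hash_key("cust-001")
b = hash_key("  cust-001  ")
```

The delimiter matters: without it, the parts (“ab”, “c”) and (“a”, “bc”) would collide.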
I have tried to capture this in a diagram below:
The Historical Staging Area effectively acts as a Data Lake, but in a better-defined form, as data deltas and event date/times are taken into account.
In my view there are two core messages here:
- Don’t just dump data in a Data Lake; make sure you adhere to tried and tested interfacing concepts to receive proper data deltas, and archive them
- You will always need the same layered data architecture on the raw source data delta to properly manage it (make sense of it), but you don’t need to do it all at once and it doesn’t need to be physically instantiated across various layers upfront
I can’t stress enough how easy it is to set up such an interface – in the vast majority of scenarios the effort is measured in hours, not days. I fully realise this is a little bit of upfront work before you can do your Data Visualisation, if you’re querying from a Historical Staging Area instead of querying the source directly. But the benefits of that little extra effort are huge.
One last thing about the Historical Staging Area is worth mentioning here. In Data Warehouse projects this concept has proven invaluable for correcting modelling issues at a later stage. Even with modern data modelling techniques such as Data Vault you can still get things wrong. A Historical Staging Area allows you to rebuild / re-roll your core Data Warehouse layers in the same way that, for instance, a Data Vault allows you to re-roll your Information Mart / Presentation Layer / Dimensional Model. This is a very interesting thought, as there is nothing stopping you (short of performance) from not only virtualising your Information Marts, but also your Data Warehouse…
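The rebuild / re-roll idea follows directly from the archive holding every delta with its real event date/time: the downstream layer can be dropped and reconstructed by replaying the history through (possibly corrected) loading logic. A minimal Python sketch, with illustrative record and function names of my own choosing:

```python
def replay(history, apply_fn):
    """Rebuild a downstream layer by replaying archived deltas in event order.

    `history` is the content of the Historical Staging Area; `apply_fn`
    is the (possibly corrected) loading logic for the target layer.
    """
    target = {}
    for row in sorted(history, key=lambda r: r["event_dt"]):
        apply_fn(target, row)
    return target

def apply_row(target, row):
    # Illustrative loading logic: keep the latest version of each key
    if row["change"] == "delete":
        target.pop(row["key"], None)
    else:
        target[row["key"]] = row["record"]

# Hypothetical archive content (event_dt simplified to integers)
history = [
    {"key": "42", "change": "insert", "record": {"status": "open"},   "event_dt": 1},
    {"key": "42", "change": "update", "record": {"status": "closed"}, "event_dt": 2},
    {"key": "43", "change": "insert", "record": {"status": "open"},   "event_dt": 3},
    {"key": "43", "change": "delete", "record": {"status": "open"},   "event_dt": 4},
]
rebuilt = replay(history, apply_row)
```

If a modelling mistake is found, only `apply_fn` changes; the archive itself is never touched, which is what preserves traceability / auditability across re-rolls.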