NoETL (Not Only ETL) – virtualization revisited
For the last couple of weeks I have been working on a simple tool to support Data Warehouse virtualisation concepts in practice. The idea is that if you can generate the ETL you need, you can also virtualise these processes if performance requirements and/or other relevant constraints allow for it. This is why I was looking for a way to virtualise where possible (performance-wise), and instantiate (generate ETL) where it's not.
I label this as NoETL (Not Only ETL), as I try to adopt a virtualised (metadata-driven) representation, but also use INSERT INTO statements based on the same virtualisation views to act as ETL if required. This 'insert' use of the view is supplemented with a bit of logic to load the data delta, effectively using the view as an ETL process. This way you can mix-and-match virtual and physical components using the same data / model objects.
Due to the simplicity of the templates and the model-driven design you don't (necessarily) need to use ETL software, except perhaps for scheduling / orchestration as outlined in this post.
To demonstrate this I have developed a little bit of software, which I'll upload to SourceForge once it is tested as 'version 1.0'. Hopefully I can find some interested people wanting to spar a bit more, or even co-develop this further. The software allows you to enter and modify the relevant model metadata and generate either views or corresponding INSERT INTO statements, with added logic to make sure that:
- If you query the view, it represents the table as it would appear / behave in its physical form (e.g. loaded via ETL). For instance the Hub view displays the same Business Keys as would have been inserted by its ETL equivalent
- The INSERT INTO use of the view adds logic to only process the data delta into the physical table
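The two outputs for a Hub might look as follows. This is a sketch only; the 'CUSTOMER' object and column names are made up for illustration and the actual generated code depends on the configured conventions:

```sql
-- Hypothetical Hub view: presents the Business Keys exactly as the
-- physical Hub would contain them after an ETL load.
CREATE VIEW VW_HUB_CUSTOMER AS
SELECT
    stg.CUSTOMER_ID,
    MIN(stg.LOAD_DATETIME) AS LOAD_DATETIME, -- first time the key was seen
    MIN(stg.RECORD_SOURCE) AS RECORD_SOURCE
FROM STG_PROFILER_CUSTOMER stg
GROUP BY stg.CUSTOMER_ID;

-- The corresponding INSERT INTO statement reuses the same view, with
-- added logic so only the data delta (new keys) reaches the physical table.
INSERT INTO HUB_CUSTOMER (CUSTOMER_ID, LOAD_DATETIME, RECORD_SOURCE)
SELECT vw.CUSTOMER_ID, vw.LOAD_DATETIME, vw.RECORD_SOURCE
FROM VW_HUB_CUSTOMER vw
WHERE NOT EXISTS (
    SELECT 1
    FROM HUB_CUSTOMER hub
    WHERE hub.CUSTOMER_ID = vw.CUSTOMER_ID
);
```

Either way the same 'data / model object' is queried; only the decision to persist differs.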
In the broader solution design as outlined on this site, this particularly applies to the Integration Layer (Data Vault or similar) and the Presentation Layer (whatever you want to expose). The exceptions are the Staging Area and the Persistent Staging Area (PSA), the History Area; both are part of the Staging Layer of the solution design. As outlined in this and other posts, the virtual Data Warehouse runs on top of a PSA, as this area contains the full history of all changes that have ever been presented by the interfaces. Because of this, there can be no view that represents the physical table in exactly the same way. However, we can still use a view to populate the PSA and show the data delta. In this and upcoming posts I'll demonstrate how this works in the Integration Layer (Data Vault) in particular, but first we need to ingest data into the Data Warehouse environment (Staging Area).
In almost all cases the Staging Area is populated using ETL software, as connectivity is still one of its strong points. The interface arranges for the data delta to be defined and loaded into the Staging Area table, and therefore into the Data Warehouse environment. For demonstration purposes I have created a full compare (Full Outer Join) mechanism to derive the delta from a sample set of source data. But, as mentioned, this is typically replaced by a (generated) process in an ETL tool.
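The full-compare mechanism can be sketched as follows. The object and column names are assumptions for illustration, and the PSA side is assumed to be filtered to the most recent record per key:

```sql
-- Illustrative full-compare delta derivation: the source snapshot is
-- compared against the current rows in the PSA, and each difference
-- is classified as a change data capture (CDC) operation.
SELECT
    COALESCE(src.CUSTOMER_ID, psa.CUSTOMER_ID) AS CUSTOMER_ID,
    src.CUSTOMER_NAME,
    CASE
        WHEN psa.CUSTOMER_ID IS NULL THEN 'Insert'  -- new in the source
        WHEN src.CUSTOMER_ID IS NULL THEN 'Delete'  -- gone from the source
        ELSE 'Update'                               -- attribute change
    END AS CDC_OPERATION
FROM SRC_CUSTOMER src
FULL OUTER JOIN PSA_PROFILER_CUSTOMER psa
    ON src.CUSTOMER_ID = psa.CUSTOMER_ID
WHERE psa.CUSTOMER_ID IS NULL
   OR src.CUSTOMER_ID IS NULL
   OR src.CUSTOMER_NAME <> psa.CUSTOMER_NAME;
```

A generated ETL process would typically replace this with the tool's native change detection, but the comparison logic is the same.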
Let's test the concept by populating some initial data delta into the PSA so we have some sample data to virtualise a Data Warehouse with. To do this I have created a sample model that provides some arbitrary information to source from:
This model is used to create the corresponding Staging Area tables (a couple are displayed below; there are 7 in total). I have labelled this system 'profiler', and following the naming conventions this is visible in the Staging Area table name.
And PSA tables (one example shown – there are 7 in total):
As mentioned many times, the structure is identical to the source table structure: the transformation is a straightforward one-to-one copy of the source table, and the tables themselves are generated by scripts or macros. For testing purposes some sample data is entered, which we will query for data delta.
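A sketch of what the generated tables could look like; the exact metadata columns and types are assumptions, not the tool's actual output:

```sql
-- Staging Area table: a one-to-one copy of the source attributes,
-- plus standard Data Warehouse metadata columns.
CREATE TABLE STG_PROFILER_CUSTOMER (
    CUSTOMER_ID   INTEGER      NOT NULL,  -- copied from the source
    CUSTOMER_NAME VARCHAR(100),           -- copied from the source
    LOAD_DATETIME DATETIME     NOT NULL,  -- added metadata
    RECORD_SOURCE VARCHAR(100) NOT NULL,
    CDC_OPERATION VARCHAR(10)  NOT NULL   -- Insert / Update / Delete
);

-- The PSA equivalent adds the load date/time to the key,
-- so the full history of changes is retained.
CREATE TABLE PSA_PROFILER_CUSTOMER (
    CUSTOMER_ID   INTEGER      NOT NULL,
    LOAD_DATETIME DATETIME     NOT NULL,
    CUSTOMER_NAME VARCHAR(100),
    RECORD_SOURCE VARCHAR(100) NOT NULL,
    CDC_OPERATION VARCHAR(10)  NOT NULL,
    PRIMARY KEY (CUSTOMER_ID, LOAD_DATETIME)
);
```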
Now we are at the stage where the environment is ready to be used. I have dubbed the software I developed for this 'Virtual EDW'; it is specifically meant as a quick development tool to assist in playing around with different modelling structures without the requirement to develop ETL upfront. The first thing I need to do is make sure the configuration is set up properly, especially concerning the naming conventions, as a portion of the automation logic depends on this. For this purpose the settings tab can be used to define generic Data Warehouse and Data Vault specific elements.
For all intents and purposes the required Staging Area logic can now be generated, which is done from the tab corresponding to the area the code needs to be generated for. The SQL generation options can be used to write directly to the database if required (checked in this example). As the output type a View is selected, which creates a view of the data delta.
When pressing ‘generate’ a series of views will be created:
The views here use a simple Full Outer Join process that compares the source against the PSA in order to calculate the data delta. Data Vault 2.0 hash keys for Hubs, Satellites and Links are not added yet, as we haven't explored the upstream model. We will add them later, partly because we can, but also because we can already start capturing data delta while we think about this / do the modelling. An example is available for download here (a bit much code to paste).
Now that we can access the delta by querying the view, we can also generate what is effectively ETL using the INSERT INTO statement:
Because 'generate in database' is checked, the delta is directly loaded into the Staging Area, with the view again effectively acting as ETL by serving as the source. At this point the view and table are 'in sync', which opens up the interesting option of not necessarily staging data at all.
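Conceptually the generated statement is no more than the delta view wrapped in an insert; the names below are again hypothetical:

```sql
-- Hypothetical generated INSERT INTO variant of the delta view:
-- running it loads the calculated delta into the physical Staging
-- Area table, so the same view doubles as the ETL process.
INSERT INTO STG_PROFILER_CUSTOMER
    (CUSTOMER_ID, CUSTOMER_NAME, LOAD_DATETIME, RECORD_SOURCE, CDC_OPERATION)
SELECT CUSTOMER_ID, CUSTOMER_NAME, LOAD_DATETIME, RECORD_SOURCE, CDC_OPERATION
FROM VW_STG_PROFILER_CUSTOMER;
```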
This is a very easy first step towards using views. Because the Staging Area is not persistent, any real virtualisation is not possible yet. In the next post I'll 'up the ante' a little by generating the PSA logic and including joins; here the logic will get a bit more advanced. And as soon as there is history in the PSA, we can virtualise any upstream Data Vault and Dimensional models.
As it is, at this first stage our virtual EDW looks as follows. More to be added soon.