NoETL – Data Vault Hub tables
In the previous posts we have loaded a proper data delta (Staging Area) and archived this in the Persistent Staging Area (PSA). In my designs, the PSA is the foundation for any form of upstream virtualisation – both for the Integration Layer (Data Vault) and subsequently the Presentation Layer (Dimensional Model, or anything fit-for-purpose). The Presentation Layer sits ‘on top of’ the Data Vault the same as it would in a physical implementation, so you will effectively have multiple layers of virtualisation.
So, now that the foundations are present we can start using the (automation) metadata not only to generate ETL, but also to virtualise the EDW: the Data Vault based Integration Layer.
In a recent real-life scenario I actually had to follow this approach because, although volumes were too big to virtualise in the longer term, we ran out of disk space on our borrowed / temporary server. Because of this we couldn’t load data anymore and were at risk of missing our planning. However, we could still use virtualisation to ‘test’ our metadata and models and effectively continue development. All that was needed once the proper hardware was in place was to generate the ETL using the exact same metadata. By then all the testing had already been done and any modelling tweaks had been applied. This is a very appropriate example of the power of model-driven design and development.
In any case, the first entities to generate are the Hubs, and the virtualisation pattern as described here is followed for the examples in this post. The first thing to do is to configure the required metadata: in this case the mapping between the source and target (Hub), the interpretation of the Business Key and any filtering criteria. To load the corresponding Hub I have set up the following metadata:
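As an illustration, this mapping metadata might be captured along the following lines. Note that the table and column names here are hypothetical, not the exact ones from my metadata repository:

```sql
-- Hypothetical metadata table capturing the source-to-Hub mapping,
-- the Business Key interpretation and optional filter criteria.
INSERT INTO MD_SOURCE_TO_HUB (SOURCE_TABLE, TARGET_HUB, BUSINESS_KEY_DEFINITION, FILTER_CRITERIA)
VALUES
  ('STG_CRM_CUSTOMER', 'HUB_CUSTOMER', 'CUSTOMER_ID', NULL),
  ('STG_ERP_CUSTOMER', 'HUB_CUSTOMER',
   'COMPANY_CODE + ''|'' + CUSTOMER_NUMBER',  -- composite key test case
   'RECORD_TYPE = ''C''');                    -- filter criteria test case
```

The generator only needs these few attributes per source-to-target mapping to produce both the view definitions and the INSERT INTO statements.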
As you can see, some test cases have been added, including a composite business key and a concatenation with some arbitrary text. But, as documented here, the essential metadata is captured and this is sufficient to generate the Hub views (and corresponding INSERT INTO statements).
The generated logic is visible below, and is not particularly complex. Adding more source-to-target mappings to the Hub metadata, for instance another source that points to HUB_CUSTOMER, will add another part to the view’s UNION statement – effectively ensuring that the output in view form is exactly the same as you would have if you were to run two independent ETLs against the same Hub table. In other words, the number of ETLs that load a Hub from different sources corresponds with the number of SELECT statements in the subquery, separated by UNION. This is because the views need to represent the Hub table as populated by the various ETLs.
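As a sketch of this pattern (again using hypothetical PSA table and column names, not my actual generated output), a two-source Hub view could look like this:

```sql
-- Illustrative generated Hub view over two sources; each SELECT in the
-- subquery corresponds to what a separate ETL against the Hub would load.
CREATE VIEW vw_HUB_CUSTOMER AS
SELECT
  sub.CUSTOMER_BUSINESS_KEY,
  MIN(sub.INSERT_DATETIME) AS LOAD_DATETIME,  -- earliest sighting of the key
  MIN(sub.RECORD_SOURCE)   AS RECORD_SOURCE   -- simple tie-break for illustration
FROM (
  SELECT CUSTOMER_ID AS CUSTOMER_BUSINESS_KEY, INSERT_DATETIME,
         'CRM' AS RECORD_SOURCE
  FROM PSA_CRM_CUSTOMER
  UNION
  SELECT COMPANY_CODE + '|' + CUSTOMER_NUMBER, INSERT_DATETIME, 'ERP'
  FROM PSA_ERP_CUSTOMER
  WHERE RECORD_TYPE = 'C'   -- filter criteria from the metadata
) sub
GROUP BY sub.CUSTOMER_BUSINESS_KEY;
```

The outer GROUP BY is what keeps the combined output identical to two independent ETLs inserting into the same physical table: each distinct business key appears only once, with its earliest load date/time.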
In its simplest form, though, the generated view does not contain any UNION statements, as only one source is mapped to the particular Hub. Well, almost the simplest form, as the displayed example still contains a composite key. When queried, the result is the same as for the physical table, both in contents and structure:
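For comparison, the single-source case (again with illustrative names) reduces to a view without any UNION, with the composite key expressed as a concatenation:

```sql
-- Simplest generated form: one source, no UNION, composite business key
CREATE VIEW vw_HUB_SALES_ORDER AS
SELECT
  COMPANY_CODE + '|' + ORDER_NUMBER AS SALES_ORDER_BUSINESS_KEY,
  MIN(INSERT_DATETIME) AS LOAD_DATETIME,
  'PSA_ERP_ORDER'      AS RECORD_SOURCE
FROM PSA_ERP_ORDER
GROUP BY COMPANY_CODE + '|' + ORDER_NUMBER;
```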
At this stage we can query and use the Hub view just the same as we could query the Hub table and have thus achieved our virtualisation goals. But we don’t need to stop here as it’s just as easy to reuse the view to ‘act’ as an ETL process if this is required. To do this, we can generate an INSERT INTO statement that uses the same view.
It is important to note the following: the view should always represent the complete entity both in terms of structure and contents and therefore should not be modified. Instead, the INSERT INTO statement is amended to support ETL-like behaviour:
The above example highlights that the same view is used, but a left outer join has been added to prevent the insertion of records that have already been added to the physical table – simple as that. This statement can be executed directly against the database, again effectively executing an ETL process, and shows how easy it is to switch between a virtualised and a physical representation of the table.
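As a sketch of this pattern (with illustrative object names, not my exact generated statement), the ETL-like load could look as follows:

```sql
-- ETL-like load: insert only business keys not yet present in the
-- physical Hub table; the view itself is left unmodified.
INSERT INTO HUB_CUSTOMER (CUSTOMER_BUSINESS_KEY, LOAD_DATETIME, RECORD_SOURCE)
SELECT vw.CUSTOMER_BUSINESS_KEY, vw.LOAD_DATETIME, vw.RECORD_SOURCE
FROM vw_HUB_CUSTOMER vw
LEFT OUTER JOIN HUB_CUSTOMER hub
  ON hub.CUSTOMER_BUSINESS_KEY = vw.CUSTOMER_BUSINESS_KEY
WHERE hub.CUSTOMER_BUSINESS_KEY IS NULL;  -- only keys that are new
```

Because the view always represents the complete Hub, the statement is idempotent: running it repeatedly only ever inserts keys that have not yet landed in the physical table.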