Tagged: ETL

0

Some Q&A on Data Warehouse Virtualisation

I receive a fair bit of questions on the Data Warehouse Virtualisation ideas and wanted to respond and discuss this via this post. I don’t have all the answer but can share my views and expectations. When it comes to DWH Virtualisation and the Persistent Staging Area (PSA), the questions generally fall into two categories: Isn’t it too slow? How about performance? Surely users don’t want to wait for hours to see results? Why bother...

 
2

Beyond ETL Generation & DWH Virtualisation – what’s next?

Bridging the gap At the recent Data Modelling Zone (DMZ) in Germany I presented an overview of the ideas around Data Warehouse Virtualisation and the thought processes leading up to this. In this post I wanted to elaborate on some of these ideas a bit further, as together they can be combined to allow something even more powerful. This post provides an overview of how various (technical) concepts together can help faster delivery of meaningful...

 
4

Using a Natural Business Key – the end of hash keys?

Do we still need Hash Keys? Now there is a controversial topic! I have been thinking about the need for hash keys for almost a year now, ever since I went to the Data Vault Day in Germany (Hamburg) end of 2016. During this one-day community event, the topic of stepping away from hash keys was raised in one of the discussions after a case study. Both the presentation and following discussion were in German,...

 
0

When a full history of changes is too much: implementing abstraction for Point-In-Time (PIT) and Dimension tables

When changes are just too many When you construct a Point-In-Time (PIT) table or Dimension from your Data Vault model, do you sometimes find yourself in the situation where there are too many change records present? This is because, in the standard Data Vault design, tiny variations when loading data may result in the creation of very small time slices when the various historised data sets (e.g. Satellites) are combined. There is such a thing as too...

 
2

When is a change a ‘change’?

This is a post that touches on what I think is one the essential best-practices for ETL design: the ability to process multiple changes for the same key in a single pass. This is specifically relevant for typical ETL processes that load data to a time-variant target (PSA, Satellite, Dimension etc.). For non-time variant targets (Hubs, Links etc.) the process is a bit easier as this is essentially built-in the patterns already :-). In a given...

 
0

Creating Data Vault Point-In-Time and Dimension tables: merging historical data sources

Beyond creating Hubs, Links and Satellites and current-state (Type 1) views off the Data Vault, one of the most common requirements is the ability to represent a complete history of changes for a specific business entity (Hub, Link or groups of those). If a given Hub has on average 3 or 4 Satellites, is it useful at the very least to see the full history of changes for that specific Hub across all Satellites. How to...

 
1

Denormalise (flatten) XML using Xquery

ETL software have ways to interpret XML files or structures that vary in maturity, but recently I had to find another way to import XML structures that are received as blobs using (transact) SQL only. Because the upstream patterns (Staging to Integration) can easily handle denormalised source data sets the easiest way was to cast each blob as the XML file it is anyway and completely denormalise it (‘flatten’, more rows…). This is opposed to normalising...

 
1

Do we still want to automate against ETL tools?

In the various Data Warehouse initiatives I have been involved with recently I tend to use ETL software less and less. Over the last few years I spent a lot of time figuring out how to automate/generate ETL in various tools – most notably in SSIS, Pentaho, Powercenter and recently SAS (but various others as well). To the point that most of the intended solution can be generated from metadata. But as recently outlined in this...

 
1

Using an ETL platform for your Data Warehouse, is it still relevant?

When I started my career in Data Warehousing and Business Intelligence 15 years ago there was a massive push towards adopting ETL software. Traditionally, specialised ETL software such as Informatica Powercenter, Oracle Warehouse Builder (later superseded by Oracle Data Integration), Microsoft DTS (later superseded by SSIS) and many similar platforms were very successful because of two main reason: ETL software provided a way to ‘explain’ or document what was happening in a way that made...

 
2

Data Vault ETL Implementation using SSIS: Step 2 – Historical Staging ETL

The archiving process such as the Historical Staging ETL are also not conceptually part of Data Vault but can play an important factor in the complete Enterprise Data Warehouse (EDW). While the Staging Area directly supports the Data Vault message the Historical Staging is really optional and does not impact Data Vault processes. For the sake of completion the ETL overview for the Historical Staging is provided regardless. As with any ETL process within the...