Tagged: Data Vault

0

When is a change a ‘change’?

This is a post that touches on what I think is one the essential best-practices for ETL design: the ability to process multiple changes for the same key in a single pass. This is specifically relevant for typical ETL processes that load data to a time-variant target (PSA, Satellite, Dimension etc.). For non-time variant targets (Hubs, Links etc.) the process is a bit easier as this is essentially built-in the patterns already :-). In a given...

 
0

Creating Data Vault Point-In-Time and Dimension tables: merging historical data sources

Beyond creating Hubs, Links and Satellites and current-state (Type 1) views off the Data Vault, one of the most common requirements is the ability to represent a complete history of changes for a specific business entity (Hub, Link or groups of those). If a given Hub has on average 3 or 4 Satellites, is it useful at the very least to see the full history of changes for that specific Hub across all Satellites. How to...

 
1

Advanced row condensing for Satellites

When it comes to record condensing, DISTINCT just doesn’t cut it. I’ve been meaning to post about this for ages as the earliest templates (as also posted on this site) were not flexible enough to work in all cases (which is what we strive for). Record condensing Record condensing is making sure that the data delta (differential) you process is a true delta for the specific scope. It’s about making sure no redundant records are processed into...

 
0

Why you really want a Persistent Staging Area in your Data Vault architecture

Recently at the Worldwide Data Vault Conference in Vermont USA (WWDVC) I had many conversations about the Persistent Staging Area (PSA) concept, also known as Historical Staging Area. I have been using this idea for years and really can’t do without it. I would even go as far as saying you really want a PSA in your architecture. However there is a common opinion that having a PSA isn’t the best idea as it introduces a ‘2nd...

 
0

Unknown keys (zero keys or ghost keys) in Hubs for DV2.0

I am still working towards capturing the generation (using BIML in SSIS) and virtualisation (using views / SQL) of the Presentation Layer (in a Dimensional Model). But before we get there, some topics need to be addressed first. One of these is the requirement to have ‘unknown’ keys available in the Hubs. Thankfully, this is one of the easiest concepts to implement. The basic idea is that you create a dummy record in the Hub which...

 
0

Data Vault ETL Implementation using SSIS: Step 7 – Link Satellite ETL – part 3 – End Dating

I’m catching up on old drafts within WordPress, and in the spirit of being complete on the older SSIS series felt I should pick this one up and complete it. While most of my focus is on developing the virtualisation concepts I still work a lot with more traditional ETL tools, one of which is Microsoft SSIS. Recently I merged the metadata models that underpin the virtualisation and SSIS automation and I am retesting everything...

 
1

Zero / ghost records in Data Vault Satellites versus Point In Time (PIT) tables

As posted earlier recent evolution of the Data Vault 2.0 conventions aim to remove the creation of zero records (or ‘ghost records’) in Satellites. Zero records have the sole aim of making sure that every business key in a Satellite has a complete timeline (e.g. 1900-01-01 to 9999-12-31) so that records are always returned when you query the state of the world at any given date. For instance if a certain record is created in...

 
0

Data Vault 2.0 Staging Area learnings & suggestions

With Data Vault 2.0 the Data Vault methodology introduces (amongst other things)  a more formalised solution architecture which includes a Staging Area. In the designs as advocated in this blog this Staging Area is part of a conceptual Staging Layer that also cover the Persistent Staging Area (PSA). While updating the documentation in general I updated various sections in the Staging Layer definition and this prompted me to highlight some experiences specifically with implementing the...

 
1

Do we still want to automate against ETL tools?

In the various Data Warehouse initiatives I have been involved with recently I tend to use ETL software less and less. Over the last few years I spent a lot of time figuring out how to automate/generate ETL in various tools – most notably in SSIS, Pentaho, Powercenter and recently SAS (but various others as well). To the point that most of the intended solution can be generated from metadata. But as recently outlined in this...

 
1

Driving Keys and relationship history, one or more tables?

Handling Driving Key type mechanisms is one of the more challenging elements of Data Vault modelling. It’s not necessarily difficult to implement this concept, but more how to interpret this and get the right information out of the Data Vault again. In this post I’ll explore two ways of storing relationship history over time: Using a separate table to store Driving Key information and a separate table to store normal relationship history (type 2) and,...