Roelant Vos – An expert view on Agile Data Warehousing


NoETL – Data Vault Hub tables

 

 In the previous posts we have loaded a proper data delta (Staging Area) and archived this in the Persistent Staging Area (PSA). In my designs, the PSA is the foundation for any form of upstream virtualisation – both for the Integration Layer (Data Vault) and subsequently the Presentation Layer (Dimensional Model, or anything fit-for-purpose). The Presentation Layer sits ‘on top of’ the Data Vault the same way it would in a physical implementation, so you...
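As an illustration of what such a virtual Hub could look like, here is a minimal sketch over the PSA. All table and column names (PSA_CUSTOMER, HUB_CUSTOMER, CUSTOMER_ID) are hypothetical, and the hash key derivation follows the SQL Server HASHBYTES convention used elsewhere on this blog.

-- Minimal sketch of a virtual Hub over the PSA (hypothetical names).
-- Each distinct business key becomes a Hub row; the hash key is
-- derived from the (sanitised) business key, DV2.0-style.
CREATE VIEW HUB_CUSTOMER AS
SELECT
    CONVERT(CHAR(32), HASHBYTES('MD5', UPPER(LTRIM(RTRIM(CUSTOMER_ID)))), 2) AS CUSTOMER_HSH,
    MIN(INSERT_DATETIME) AS LOAD_DATETIME,  -- earliest time the key was seen
    MIN(RECORD_SOURCE)   AS RECORD_SOURCE,  -- simplification: first source wins
    CUSTOMER_ID
FROM PSA_CUSTOMER
GROUP BY CUSTOMER_ID;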


NoETL – Persistent (History) Staging Area (PSA)

 

 After setting up the initial data staging in the previous post, we can load the detected data delta into the historical archive: the Persistent Staging Area (PSA). The PSA is the foundation of the Virtual Enterprise Data Warehouse, because all upstream modelling and representation essentially read from this ‘archive of (data) changes’. This is because the PSA contains all the information that was ever presented to the Data Warehouse, in structured or unstructured format...
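A minimal sketch of what such an insert-only PSA load could look like, assuming hypothetical table names (STG_CUSTOMER, PSA_CUSTOMER) and delta-detection attributes:

-- Minimal sketch of an insert-only PSA load (hypothetical names).
-- The Staging Area (STG_CUSTOMER) already contains only the data delta,
-- so archiving it is a straight append; the PSA is never updated.
INSERT INTO PSA_CUSTOMER (CUSTOMER_ID, CUSTOMER_NAME, INSERT_DATETIME, CDC_OPERATION, RECORD_SOURCE)
SELECT
    stg.CUSTOMER_ID,
    stg.CUSTOMER_NAME,
    stg.INSERT_DATETIME,
    stg.CDC_OPERATION,  -- Insert / Update / Delete flag from delta detection
    stg.RECORD_SOURCE
FROM STG_CUSTOMER stg
WHERE NOT EXISTS (      -- guard against re-runs loading the same delta twice
    SELECT 1
    FROM PSA_CUSTOMER psa
    WHERE psa.CUSTOMER_ID = stg.CUSTOMER_ID
      AND psa.INSERT_DATETIME = stg.INSERT_DATETIME);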


NoETL (Not Only ETL) – virtualization revisited

 

 For the last couple of weeks I have been working on a simple tool to support the Data Warehouse virtualisation concepts in practice. This is based on the idea that if you can generate the ETL you need, you can also virtualise these processes if performance requirements and/or relevant constraints allow for it. This is why I was looking for a way to virtualise where possible (performance-wise), and instantiate (generate ETL) where...
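To illustrate the idea with a hedged sketch (all names and columns hypothetical), the same generated SELECT logic can either be deployed as a view or executed as generated ETL:

-- 1) Virtualise: expose the generated logic as a view (no data movement).
CREATE VIEW SAT_CUSTOMER AS
SELECT CUSTOMER_ID, CUSTOMER_NAME, INSERT_DATETIME
FROM PSA_CUSTOMER;

-- 2) Instantiate: persist the identical logic as generated ETL when
--    performance requirements rule out a view.
INSERT INTO SAT_CUSTOMER_TBL (CUSTOMER_ID, CUSTOMER_NAME, INSERT_DATETIME)
SELECT CUSTOMER_ID, CUSTOMER_NAME, INSERT_DATETIME
FROM PSA_CUSTOMER;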


Denormalise (flatten) XML using XQuery

 

 ETL tools vary in the maturity of their support for interpreting XML files or structures, but recently I had to find another way to import XML structures that are received as blobs, using (Transact-)SQL only. Because the upstream patterns (Staging to Integration) can easily handle denormalised source data sets, the easiest way was to cast each blob as the XML file it really is and completely denormalise it (‘flatten’, more rows…). This is opposed to normalising...
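A minimal Transact-SQL sketch of the approach, with hypothetical table and element names (the real post may use different structures): each repeating child element becomes its own row, denormalising the parent attributes.

-- Cast each stored blob to XML and flatten it with .nodes() / .value().
SELECT
    x.doc.value('(/Order/OrderId)[1]', 'INT')          AS ORDER_ID,
    line.node.value('(ProductCode)[1]', 'VARCHAR(20)') AS PRODUCT_CODE,
    line.node.value('(Quantity)[1]', 'INT')            AS QUANTITY
FROM (
    SELECT CAST(XML_BLOB AS XML) AS doc  -- cast the stored blob to XML
    FROM SOURCE_XML_FILES
) x
CROSS APPLY x.doc.nodes('/Order/OrderLine') AS line(node);  -- one row per <OrderLine>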


Hashing in SQL Server and Oracle for the same output

 

 Thanks go out to my colleagues for testing this further: we were looking for a way to get Oracle to generate hash values with the same output as the default SQL Server MD5 HASHBYTES function. In previous projects I used to rely on the DBMS_CRYPTO package in Oracle, but this requires additional privileges. So if there is some red tape to cut through, the following code using the DBMS_OBFUSCATION_TOOLKIT works as well, and you...
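A sketch of the comparison (not necessarily the exact code from the post): both statements should return the same 32-character MD5 hex string for the same input value.

-- SQL Server: MD5 hash of a value as a 32-character uppercase hex string.
SELECT CONVERT(CHAR(32), HASHBYTES('MD5', 'example value'), 2);

-- Oracle: the same output without DBMS_CRYPTO privileges, using the
-- (older) DBMS_OBFUSCATION_TOOLKIT package.
SELECT UPPER(RAWTOHEX(UTL_RAW.CAST_TO_RAW(
         DBMS_OBFUSCATION_TOOLKIT.MD5(input_string => 'example value'))))
FROM DUAL;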


Data Vault 2.0 Staging Area learnings & suggestions

 

 With Data Vault 2.0 the Data Vault methodology introduces (amongst other things) a more formalised solution architecture, which includes a Staging Area. In the designs advocated in this blog, this Staging Area is part of a conceptual Staging Layer that also covers the Persistent Staging Area (PSA). While updating the documentation in general I revised various sections in the Staging Layer definition, and this prompted me to highlight some experiences specifically with implementing the...
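By way of illustration, a hypothetical Staging Area table carrying the typical Data Vault 2.0 process attributes alongside the source attributes might look as follows (names and data types are assumptions, not a prescribed standard):

CREATE TABLE STG_CUSTOMER (
    CUSTOMER_ID       VARCHAR(50)  NOT NULL,  -- source business key
    CUSTOMER_NAME     VARCHAR(100) NULL,      -- source attribute(s)
    INSERT_DATETIME   DATETIME2(7) NOT NULL,  -- load date/time stamp
    RECORD_SOURCE     VARCHAR(100) NOT NULL,  -- where the record came from
    CDC_OPERATION     VARCHAR(10)  NOT NULL,  -- Insert / Update / Delete
    HASH_FULL_RECORD  CHAR(32)     NOT NULL,  -- hash diff for delta detection
    CUSTOMER_HSH      CHAR(32)     NOT NULL   -- pre-computed Hub hash key
);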


Do we still want to automate against ETL tools?

 

 In the various Data Warehouse initiatives I have been involved with recently, I tend to use ETL software less and less. Over the last few years I spent a lot of time figuring out how to automate/generate ETL in various tools – most notably in SSIS, Pentaho, PowerCenter and recently SAS (but various others as well) – to the point that most of the intended solution can be generated from metadata. But as recently outlined in this...


Driving Keys and relationship history, one or more tables?

 

 Handling Driving Key type mechanisms is one of the more challenging elements of Data Vault modelling. It’s not necessarily difficult to implement the concept; the challenge lies more in how to interpret it and get the right information out of the Data Vault again. In this post I’ll explore two ways of storing relationship history over time: using a separate table to store Driving Key information and a separate table to store normal relationship history (type 2), and,...
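A minimal sketch of deriving effectivity along a Driving Key (all names hypothetical): by partitioning on the Driving Key rather than the full relationship, the arrival of a new relationship for the same key closes off the previous one.

SELECT
    LNK_CUSTOMER_ADDRESS_HSH,
    CUSTOMER_HSH,                      -- the Driving Key
    ADDRESS_HSH,
    LOAD_DATETIME AS EFFECTIVE_DATETIME,
    LEAD(LOAD_DATETIME, 1, '9999-12-31')
        OVER (PARTITION BY CUSTOMER_HSH ORDER BY LOAD_DATETIME) AS EXPIRY_DATETIME
FROM LNK_CUSTOMER_ADDRESS;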


Zero records, time-variance and point-in-time selection

 

 While finalising the posts for the overall Data Vault ETL implementation, I have also started thinking about how to document the next steps: the loading patterns for the Presentation Layer. From here on I will refer to the Presentation Layer, Data Marts and Information Marts simply as ‘Information Marts’. This reminded me that I haven’t yet properly covered the ‘zero record’ concept. This is a timely consideration: the whole reason zero records exist is to make the...
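As a hedged illustration (hypothetical names): because zero records guarantee that every key has Satellite coverage across the full timeline, a point-in-time selection can use a plain inner join and still return exactly one row per key.

DECLARE @PointInTime DATETIME2(7) = '2015-06-30';

SELECT hub.CUSTOMER_ID, sat.CUSTOMER_NAME
FROM HUB_CUSTOMER hub
JOIN SAT_CUSTOMER sat
  ON  sat.CUSTOMER_HSH = hub.CUSTOMER_HSH
  AND @PointInTime >= sat.EFFECTIVE_DATETIME  -- row valid at the point in time
  AND @PointInTime <  sat.EXPIRY_DATETIME;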


Data Vault implementation overview

 

 I’m almost at the end of the basic outlines for ETL implementation in the context of Data Vault. For a (hopefully) tidy overview I created a page that lists all relevant posts for Data Vault implementation here. I’m working towards writing up the last few topics now, including Zero Records and Link Satellites before focusing on Data Mart automation from a core Data Vault model. I think this covers the essential elements for implementation, but...