Virtualising your Data Vault – Satellites

Once you have nailed down the fundamental metadata requirements and prerequisites for Data Vault ETL automation, changing the automation output in terms of target platform or (ETL) tool is relatively easy. This is especially true for Satellites, as their implementation in a virtualised setting is usually 1-to-1 with their ETL-instantiated counterparts.

To make the distinction: Hubs are handled differently for virtualisation, as you are essentially combining various ETL processes into a single Hub view. For example, an ‘Employee’ Hub that can receive its business keys from 10 different source tables would typically result in 10 separate ETL processes if you use an ETL tool.

But with a virtual approach, using a view, you end up with a single Hub view that contains the distinct union of the 10 (sub)queries. Technically speaking, the iteration through your metadata to achieve this for a Hub view is different. A Satellite is easier in a way, if you adapt from existing ETL automation efforts.
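
To illustrate, a minimal sketch of what such a Hub view could look like is shown below. The table names (HSTG_HR_EMPLOYEE, HSTG_PAYROLL_STAFF), the business key column and the MD5 hash via HASHBYTES are hypothetical choices; in a real solution the subqueries are generated from metadata, and attributes such as the Load Date/Time and Record Source are added with a rule to keep one row per business key.

CREATE VIEW HUB_EMPLOYEE AS
SELECT
   HASHBYTES('MD5', EMPLOYEE_NUMBER) AS EMPLOYEE_SK, -- hash of the business key (DV 2.0 style), assuming a character-typed key
   EMPLOYEE_NUMBER
FROM HSTG_HR_EMPLOYEE
UNION -- UNION (not UNION ALL) makes the combined key set distinct
SELECT
   HASHBYTES('MD5', STAFF_NUMBER) AS EMPLOYEE_SK,
   STAFF_NUMBER AS EMPLOYEE_NUMBER
FROM HSTG_PAYROLL_STAFF;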

The interesting point about virtualising a Satellite is the handling of the ‘zero record’. In typical ETL tool development this is implemented as a post-process or separate step. This step adds a record to initialise the timelines for the business entity context (the history of a specific Hub key in the Satellite). But of course using views, this has to be implemented differently. Zero records, although derived, are invaluable in simplifying the upstream processing / querying of time-variant information.

Some minor disclaimers:

  • You need a source that has full history: a Historical Staging Area. Only this, combined with proper interfacing, guarantees the correct timelines. Querying a source system directly using its available effective dates has the same drawbacks it always had. A sketch of such a table follows after this list
  • The Event Date/Time is usually synonymous with a Load Date/Time. However, I define the Event Date/Time as the closest you can get to the actual change. It is ideally generated by a CDC mechanism, as I feel the moment of ETL execution should be decoupled from the effective dates. It doesn’t matter for the SQL below though, as this date/time is set by the processes loading data into the Historical Staging Area
  • The metadata is the same as originally posted for tool-based automation; the only real change is the UNION in the subquery for the zero records
  • This example is geared towards Data Vault 2.0: using hashes instead of sequences. It’s a small but important tweak!
  • The Expiry Date/Time and Current Record Indicator are derived; thanks to Kent Graziano for sending the snippet. I did change it to SQL Server style 🙂
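
As referenced in the first disclaimer, a minimal sketch of what a Historical Staging table could look like is shown below (SQL Server syntax; the table and its columns are hypothetical). The key point is the insert-only structure: the source attributes plus an Event Date/Time, keyed on the natural key and that Event Date/Time.

CREATE TABLE HSTG_HR_EMPLOYEE
(
   EMPLOYEE_NUMBER   NVARCHAR(20)  NOT NULL, -- natural (business) key as received from the source
   EVENT_DATETIME    DATETIME      NOT NULL, -- as close to the actual change as possible (e.g. set by CDC)
   RECORD_SOURCE     NVARCHAR(100) NOT NULL,
   EMPLOYEE_NAME     NVARCHAR(100) NULL,     -- context attributes
   DEPARTMENT_CODE   NVARCHAR(10)  NULL,
   CONSTRAINT PK_HSTG_HR_EMPLOYEE PRIMARY KEY (EMPLOYEE_NUMBER, EVENT_DATETIME)
);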

I have posted example code below. Let me know what you think!

SELECT 
   <Hash algorithm covering the Business Key> AS <Hub>_SK,
   <Event Date/Time> AS EFFECTIVE_DATETIME,
   COALESCE ( LEAD ( <Event Date/Time> ) OVER
   		     (PARTITION BY <Business Key, including any multi-active attributes if required>
   		      ORDER BY <Event Date/Time>),
   		    CAST( '9999-12-31' AS DATETIME)
   ) AS EXPIRY_DATETIME,
   CASE
      WHEN ( RANK() OVER 
       (PARTITION BY <Business Key, including any multi-active attributes if required>  
          ORDER BY <Event Date/Time> DESC)) = 1
      THEN 'Y'
      ELSE 'N'
   END AS CURRENT_RECORD_INDICATOR,
   <Business Key>,
   <Record Source>,
   <Attributes> AS <Satellite attribute names>, -- I use AS statements to map the source naming to the target (Satellite) naming
   CAST(
      ROW_NUMBER() OVER (PARTITION BY 
         <Business Key, including any multi-active attributes if required>
      ORDER BY 
         <Business Key, including any multi-active attributes if required>,
         <Event Date/Time> ) AS INT)
   AS ROW_NUMBER, -- Not necessarily required, I use it to instantly see the number of changes per business key in time
   <Hash algorithm covering the context attributes> -- Added to be future proof in case we need to 'physicalise', we might need row comparisons later
FROM 
   (
      SELECT DISTINCT -- No changes here from normal selection, you always need a DISTINCT subquery
        <Event Date/Time>,
        <Record Source>,
        <Business Key>,
        <Attributes>
      FROM <Historical Staging table>
      UNION
      SELECT DISTINCT -- The zero records!
         CAST('1900-01-01' AS DATETIME) AS <Event Date/Time>,
         'Data Warehouse' AS <Record Source>,
         <Business Key>,
         NULL AS <Attributes> -- The number of attributes in the UNION must match
      FROM <Historical Staging table>
   ) sub
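
For readability, below is what the template might look like when instantiated as a SQL Server view against the hypothetical HSTG_HR_EMPLOYEE table sketched earlier; all object and column names, and the MD5 hashing via HASHBYTES, are illustrative only.

CREATE VIEW SAT_EMPLOYEE AS
SELECT
   HASHBYTES('MD5', sub.EMPLOYEE_NUMBER) AS EMPLOYEE_SK,
   sub.EVENT_DATETIME AS EFFECTIVE_DATETIME,
   COALESCE(LEAD(sub.EVENT_DATETIME) OVER
      (PARTITION BY sub.EMPLOYEE_NUMBER ORDER BY sub.EVENT_DATETIME),
      CAST('9999-12-31' AS DATETIME)) AS EXPIRY_DATETIME,
   CASE
      WHEN (RANK() OVER
         (PARTITION BY sub.EMPLOYEE_NUMBER ORDER BY sub.EVENT_DATETIME DESC)) = 1
      THEN 'Y'
      ELSE 'N'
   END AS CURRENT_RECORD_INDICATOR,
   sub.EMPLOYEE_NUMBER,
   sub.RECORD_SOURCE,
   sub.EMPLOYEE_NAME,
   sub.DEPARTMENT_CODE,
   CAST(ROW_NUMBER() OVER
      (PARTITION BY sub.EMPLOYEE_NUMBER
       ORDER BY sub.EMPLOYEE_NUMBER, sub.EVENT_DATETIME) AS INT) AS ROW_NUMBER,
   HASHBYTES('MD5', CONCAT(sub.EMPLOYEE_NAME, '|', sub.DEPARTMENT_CODE)) AS HASH_FULL_RECORD -- row comparison hash in case we need to physicalise later
FROM
(
   SELECT DISTINCT
      EVENT_DATETIME, RECORD_SOURCE, EMPLOYEE_NUMBER, EMPLOYEE_NAME, DEPARTMENT_CODE
   FROM HSTG_HR_EMPLOYEE
   UNION
   SELECT DISTINCT -- the zero records
      CAST('1900-01-01' AS DATETIME), 'Data Warehouse', EMPLOYEE_NUMBER, NULL, NULL
   FROM HSTG_HR_EMPLOYEE
) sub;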
 
Roelant Vos

3 Responses

  1. Mael says:

    Hi Roelant,

    I am trying to apply virtualisation in a DV DWH I implemented because it solves many problems, as does DV 2.0, but there are many things I don’t understand, and I’m quite certain it could help me greatly if I understood it completely.

    Please,
    1- What do you mean by a Historical Staging Area? When should we use it?
    2- I seriously don’t understand this part:
    CASE
       WHEN ( RANK() OVER
        (PARTITION BY <Business Key, including any multi-active attributes if required>
           ORDER BY <Event Date/Time> DESC)) = 1
       THEN 'Y'
       ELSE 'N'
    END AS CURRENT_RECORD_INDICATOR,

    Could you explain it simply to a beginner, please?

    Same for this part, please:

    CAST(
       ROW_NUMBER() OVER (PARTITION BY
          <Business Key, including any multi-active attributes if required>
       ORDER BY
          <Business Key, including any multi-active attributes if required>,
          <Event Date/Time> ) AS INT)

    Thanks.

     
    • Roelant Vos says:

      Hi Mael,

      The History Staging Area is an option in your solution design/architecture. It basically is an insert-only ‘archive’ of data (delta) that has passed through the system. It can be a database but not necessarily. A file archive is an option too. Technically it’s a table/file with the same structure as the source but with a Primary Key and Effective Date. The ETL loads this by checking if there’s a change and inserting a new row when there is a change. In other words it shows all changes over time for a natural (source) key as they were presented to the DWH. I outlined this here: http://roelantvos.com/blog/?p=1129
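
      A minimal sketch of such a change-detection load could look like this (STG_HR_EMPLOYEE and HSTG_HR_EMPLOYEE are hypothetical tables; a new row is only inserted when the key is new or the attributes differ from the most recent stored row):

      INSERT INTO HSTG_HR_EMPLOYEE (EMPLOYEE_NUMBER, EVENT_DATETIME, RECORD_SOURCE, EMPLOYEE_NAME, DEPARTMENT_CODE)
      SELECT stg.EMPLOYEE_NUMBER, stg.EVENT_DATETIME, stg.RECORD_SOURCE, stg.EMPLOYEE_NAME, stg.DEPARTMENT_CODE
      FROM STG_HR_EMPLOYEE stg -- the regular (delta) Staging table
      WHERE NOT EXISTS
      (
         SELECT 1
         FROM HSTG_HR_EMPLOYEE hstg
         WHERE hstg.EMPLOYEE_NUMBER = stg.EMPLOYEE_NUMBER
           AND hstg.EVENT_DATETIME = (SELECT MAX(h.EVENT_DATETIME)
                                      FROM HSTG_HR_EMPLOYEE h
                                      WHERE h.EMPLOYEE_NUMBER = stg.EMPLOYEE_NUMBER)
           AND ISNULL(hstg.EMPLOYEE_NAME, '') = ISNULL(stg.EMPLOYEE_NAME, '')
           AND ISNULL(hstg.DEPARTMENT_CODE, '') = ISNULL(stg.DEPARTMENT_CODE, '')
      );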

      Regarding the End Date and Current Record Indicator; these are or can be derived from the data and can be considered additional/redundant. In typical DWH implementations you would write the End Date and Current Record Indicator to disk as part of the ETL (standard SCD Type 2 mechanism, which updates the row to set the End Date / Current Row Indicator). But because these values can be derived from the Effective Date you can ‘virtualize them’ so you save on ETL processing. It requires more horsepower from the database of course but then again, disk I/O is always a slow process anyway relatively speaking.

      In terms of the code used have a look at the RANK and ROW_NUMBER analytical SQL functions. You can use these to order information across different attributes. We use this to sort the records by Key (‘business key’) and Effective Date so we can ‘compare’ a key value with its ‘previous’ row counterpart. The idea here is that the End Date for a certain key is the Effective Date of the next record (sorted by Effective Date) for that key.

      The Current Row Indicator also follows from the Effective Date: the most recent record for a certain key, from the perspective of the Effective Date, is flagged ‘Y’.
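
      A tiny self-contained example (SQL Server syntax, made-up values) may make this clearer:

      WITH SampleRows AS
      (
         SELECT * FROM (VALUES
            ('CUST-1', CAST('2015-01-01' AS DATETIME)),
            ('CUST-1', CAST('2015-03-15' AS DATETIME)),
            ('CUST-1', CAST('2015-06-30' AS DATETIME))
         ) AS v (CUSTOMER_KEY, EFFECTIVE_DATETIME)
      )
      SELECT
         CUSTOMER_KEY,
         EFFECTIVE_DATETIME,
         -- the End Date is the Effective Date of the next record for the same key
         COALESCE(LEAD(EFFECTIVE_DATETIME) OVER
            (PARTITION BY CUSTOMER_KEY ORDER BY EFFECTIVE_DATETIME),
            CAST('9999-12-31' AS DATETIME)) AS EXPIRY_DATETIME,
         -- the most recent record per key is flagged 'Y'
         CASE WHEN RANK() OVER
            (PARTITION BY CUSTOMER_KEY ORDER BY EFFECTIVE_DATETIME DESC) = 1
            THEN 'Y' ELSE 'N' END AS CURRENT_RECORD_INDICATOR
      FROM SampleRows;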

      Hope this helps,
      Roelant

       
  1. June 3, 2015

    […] to the NoETL Hub post from a while ago. The pattern used is the virtual Satellite as explained here. As with the virtual Hubs the really cool concept is that you can use the view to act as a normal […]

     
