Zero records, time-variance and point-in-time selection

While finalising the posts for the overall Data Vault ETL implementation I have also started thinking about how to document the next steps: the loading patterns for the Presentation Layer. From here on I will refer to the Presentation Layer, Data Marts and Information Marts simply as ‘Information Marts’. This reminded me that I haven’t yet properly covered the ‘zero record’ concept. This is a timely consideration: the whole reason that zero records exist is to make the (ETL/query) step towards the Information Marts easier!

The zero record is not considered ‘standard’ in Data Vault and isn’t often mentioned in the available materials (at least not at the time of writing). However, it does appear in many implementations so there must be a good reason for this…

Zero records essentially make sure that every key whose information you track over time (i.e. is 'effective dated') has a starting point on its timeline. The concept therefore only applies to entities that contain time-variant context information, such as the Satellite and Link-Satellite entities in Data Vault. That said, it applies equally to pretty much every historical data set, including 3NF history tables and others.

When a new key/record is created, this is driven by a change in the operational (source) systems. Regardless of which effective date/time you select (read this for more information), the timeline for this key's context information starts at that effective date: a certain point in time, onwards. The available timeline for that context therefore runs from 'now' (the effective date/time) to 'eternity' (a high date such as 9999-12-31).

When changes occur for this key/context after this point in time, the record is (conceptually) end-dated and new records are added to reflect this change over time. The key’s timeline will become more fragmented.
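To make the examples below concrete, here is a minimal sketch of the kind of effective-dated structure this applies to. The table and column names (SAT_CUSTOMER, CUSTOMER_SK, FAVOURITE_COLOUR) are assumptions for illustration only, not a prescribed layout:

```sql
-- Illustrative effective-dated Satellite; names are examples only.
CREATE TABLE SAT_CUSTOMER (
    CUSTOMER_SK         INTEGER      NOT NULL, -- Hub key (e.g. 25 for 'John')
    EFFECTIVE_DATETIME  DATETIME     NOT NULL, -- start of this record's validity
    END_DATETIME        DATETIME     NOT NULL, -- end of validity; high date for the open record
    FAVOURITE_COLOUR    VARCHAR(50)  NULL,     -- context attribute tracked over time
    PRIMARY KEY (CUSTOMER_SK, EFFECTIVE_DATETIME)
);
```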

For example, if a person record 'John' with favourite colour 'Red' is created on the 5th of June 2013, a timeline is created as follows. In this example key 25 refers to 'John'.

[Figure: Zero_Record_Timeline_1 – the timeline for key 25 ('John'): favourite colour 'Red', effective from 2013-06-05 until the high date]

With subsequent changes being picked up by the system, this might result in the following recordset:

[Figure: Zero_Record_Timeline_2 – the timeline for key 25 after a change: 'Red' end-dated at the change, 'Blue' effective from that point until the high date]

Please note:

  • Which values are used for the low date and high date is subject to discretion, and sometimes even to the platform that is in place. I favour '1900-01-01' and '9999-12-31', but different dates can be used to implement the timeline concept. Bear in mind that different date/time data types (per system) may have different ranges of allowed values!
  • End-dating can be derived, as a record's end date is defined as the effective date of the next record for the same key. This can be virtualised, as is covered in this post; see also the sketch after this list
  • This is all very standard, but long-running discussions exist about the exact nature of the end date/time (is it exactly the same as the next record's effective date/time, or that date/time minus 1 second, etc.?). As with many of these concepts there are many options and considerations, and I will leave this conversation out of scope
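As an aside on the end-dating bullet above, deriving (virtualising) the end date can be as simple as a window function over the effective dates. A minimal sketch, assuming the example structure and '9999-12-31' high date from earlier:

```sql
-- Virtual end-dating: derive END_DATETIME as the next record's effective date,
-- defaulting to the high date for the most recent (open) record per key.
SELECT
    CUSTOMER_SK,
    EFFECTIVE_DATETIME,
    LEAD(EFFECTIVE_DATETIME, 1, '9999-12-31')
        OVER (PARTITION BY CUSTOMER_SK ORDER BY EFFECTIVE_DATETIME) AS END_DATETIME,
    FAVOURITE_COLOUR
FROM SAT_CUSTOMER;
```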

So far so good; with this information you can query the Data Warehouse to represent information 'as it was' at a point in time. For instance, in the first example we can query the favourite colour on, say, the 7th of December 2013 as being 'Red', while in the second example the colour at the same date is 'Blue' (it used to be 'Red').
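Such a point-in-time selection could look like the sketch below, again assuming the illustrative Satellite structure from earlier (and treating the end date as exclusive, per one of the options in the list above):

```sql
-- Point-in-time selection: return the context 'as it was' on a given date.
DECLARE @PointInTime DATETIME = '2013-12-07';

SELECT CUSTOMER_SK, FAVOURITE_COLOUR
FROM SAT_CUSTOMER
WHERE EFFECTIVE_DATETIME <= @PointInTime
  AND END_DATETIME       >  @PointInTime;  -- end date treated as exclusive here
```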

But if you use, say, the 1st of February 2012 as your date of measurement you won't see the record: it will not be returned by the query. This is where zero records come in; they make sure that there is always 'something' to make the row return in the query. This looks as follows:

[Figure: Zero_Record_Timeline_3 – the first example with a zero record covering 1900-01-01 up to the first real effective date (2013-06-05), with empty context]

And for the second example:

[Figure: Zero_Record_Timeline_4 – the second example with the same zero record completing the timeline ahead of the 'Red' and 'Blue' records]
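Generating these records could look like the sketch below: one extra row per key whose timeline does not yet start at the low date. This again assumes the example structure, and uses NULL context values (more on that choice further down):

```sql
-- Create a zero record for every key whose timeline starts after the low date.
-- Safe to re-run: keys that already start at '1900-01-01' are skipped.
INSERT INTO SAT_CUSTOMER (CUSTOMER_SK, EFFECTIVE_DATETIME, END_DATETIME, FAVOURITE_COLOUR)
SELECT
    s.CUSTOMER_SK,
    '1900-01-01',               -- low date: the timeline now starts here
    MIN(s.EFFECTIVE_DATETIME),  -- zero record ends where the first real record begins
    NULL                        -- no context is known for this period
FROM SAT_CUSTOMER s
GROUP BY s.CUSTOMER_SK
HAVING MIN(s.EFFECTIVE_DATETIME) > '1900-01-01';
```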

To conclude, the zero record concept ensures that a point-in-time query always returns all records. This is especially important when merging two time-variant data sets, which will be covered in the next post as the final step towards virtualising the Information Marts. You only need a single zero record per key to complete each individual timeline. This means you add one extra record for every key, which in itself can amount to a relatively large overhead in storage/performance depending on the volumes and rate of change.

Zero records are essentially redundant, and can be omitted at the cost of (significant) query complexity. This is important to realise, as a solution designer can opt for this query overhead and save on the storage of redundant rows if sufficient horsepower is available (cloud MPP platforms come to mind). In terms of additional complexity, think of a factor 4-5 increase in query lines due to the multiple self outer-joins required when merging multiple time-variant data sets. As before, this is material for the next post and an interesting discussion in itself; the simplest flavour of the query-side alternative is sketched below.
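For a single Satellite, the query-side alternative amounts to an outer join that keeps keys whose context does not yet exist at the point in time. A hedged sketch, where HUB_CUSTOMER is an assumed Hub table; the real complexity only appears once several time-variant sets are merged this way:

```sql
-- Point-in-time selection without zero records: the outer join (with the
-- time filter in the join condition) keeps keys that have no context yet.
DECLARE @PointInTime DATETIME = '2012-02-01';

SELECT
    h.CUSTOMER_SK,
    s.FAVOURITE_COLOUR  -- NULL when no context exists yet at this date
FROM HUB_CUSTOMER h
LEFT OUTER JOIN SAT_CUSTOMER s
    ON  s.CUSTOMER_SK        =  h.CUSTOMER_SK
    AND s.EFFECTIVE_DATETIME <= @PointInTime
    AND s.END_DATETIME       >  @PointInTime;
```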

When choosing the context values for the zero record I actually tend to leave them empty (NULL). The other option would be to evaluate the data type and add a default value ('Unknown' for varchars, -1 for numerics). Regardless, you can derive this any which way for different presentation requirements in the Information Marts; the key thing is to make sure the rows are returned. With this in mind I favour using NULL values in the Integration Layer, as I can be a purist sometimes and you really don't know anything about the key for that point in time. If the business requirement is to represent this early state as 'unknown', I can evaluate this accordingly towards the Information Mart.
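That evaluation towards the Information Mart could be as simple as the sketch below; 'Unknown' is just an example default, and whatever the business requires can be substituted:

```sql
-- Evaluate NULL zero-record context into a business-facing default value.
SELECT
    CUSTOMER_SK,
    EFFECTIVE_DATETIME,
    END_DATETIME,
    COALESCE(FAVOURITE_COLOUR, 'Unknown') AS FAVOURITE_COLOUR
FROM SAT_CUSTOMER;
```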

On a final note: I find it incorrect to refer to zero records as mere dummy records. Zero records appear similar to dummy placeholder values (-1, -2, -3 etc.) and the whole taxonomy of 'unknowns', but I usually refer to them as 'initial timeline records' to distinguish them from true dummy values, which act as placeholders and can't really be omitted.

 
Roelant Vos

2 Responses

  1. Oscar says:

    Roelant:
    I am assuming that one of the benefits of this approach is that you can be sure you can always do INNER JOINs between Hubs and Satellites and always get records, rather than having to worry that in some cases there will be no Satellite records and LEFT JOINs would be needed.
    Can you tell me what your technique is to indicate that a particular instance of a Hub doesn't exist at all for a particular period in the entire business? Do you use a specific Satellite for that (i.e. something like an 'Effectivity Satellite' to track only the dates when they exist, deletes and undeletes)?
    I'm thinking that one way to achieve that is not to insert any record for those Effectivity Satellites for the period of non-existence, but obviously this is against your technique above.
    I guess I may need to wait for your future post to understand what you mean by '…merging 2 time-variant data sets…'
    Best regards. –Oscar–

     
  2. Roelant Vos says:

    Hi Oscar,

    You usually still need to outer-join from Hubs to Satellites to cater for the business keys without any context information, but you can evaluate/coalesce those for a point in time in the same selection (they would initially be NULL). The zero record as I implement it is only there to make sure that time-variant information that does exist has a complete timeline.

    Another option is to make sure that a record is created for a Hub key in all outlying Satellite tables, but it's a balancing act and I prefer to only do this for context records that already exist. I found that trying to keep this consistent comes with a big overhead when you have multiple Satellites.

    With this query you can select a point-in-time view of the world, including the information that context for a Hub doesn't exist at a particular time. It's all available in the same Satellite(s), including the deletes and undeletes; the latter is essentially just another type 2 attribute, provided by the interface. So if you query a point in time and the hit is a dummy, you know that nothing about that key was known at that point in time. If you want to query this for the Hub key itself (not the context) you can include the LOAD_DTS or equivalent to find out when the key was inserted for the first time.

    The technique mainly applies to the merge as you indicate; working on that now. Things really come to life there!

    Kind regards,
    Roelant

     
