Zero records, time-variance and point-in-time selection
While finalising the posts for the overall Data Vault ETL implementation I have also started thinking about how to document the next steps: the loading patterns for the Presentation Layer. From here on I will refer to the Presentation Layer, Data Marts and Information Marts simply as ‘Information Marts’. This reminded me that I haven’t yet properly covered the ‘zero record’ concept. This is a timely consideration: the whole reason that zero records exist is to make the (ETL/query) step towards the Information Marts easier!
The zero record is not considered ‘standard’ in Data Vault and isn’t often mentioned in the available materials (at least not at the time of writing). However, it does appear in many implementations, so there must be a good reason for this…
Zero records essentially make sure that every key you track over time (i.e. every key that is ‘effective dated’) has a starting point in time for its context information. This concept therefore only applies to entities that contain time-variant context information, such as the Satellite and Link-Satellite entities in Data Vault. That said, the same idea applies to pretty much every historical data set, including 3NF history tables and others.
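To make this a bit more tangible, a minimal sketch of such a time-variant (Satellite-style) table is shown below; the table and column names are purely illustrative, not taken from any particular standard, and the later examples reuse them.

```sql
-- Illustrative Satellite-style structure; names are assumptions for this post only.
CREATE TABLE sat_person_detail (
    person_key          INTEGER     NOT NULL, -- key whose context is tracked over time
    effective_datetime  DATE        NOT NULL, -- start of the validity period
    end_datetime        DATE        NOT NULL, -- end of the validity period; can be stored or derived (virtualised)
    favourite_colour    VARCHAR(50) NULL,     -- time-variant context attribute
    PRIMARY KEY (person_key, effective_datetime)
);
```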
When a new key/record is created, this is driven by a change in the operational (source) systems. Regardless of which effective date/time you select (read this for more information), the timeline for this key’s context information starts at that effective date: a certain point in time, onwards. The available timeline for that context is therefore from ‘now’ (the effective date/time) to ‘eternity’ (a high date such as 9999-12-31).
When changes occur for this key/context after that point in time, the existing record is (conceptually) end-dated and a new record is added to reflect the change; over time the key’s timeline becomes more fragmented.
For example, if a person record ‘John’ with favourite colour ‘Red’ is created on the 5th of June 2013, a timeline is created as follows. In this example key 25 refers to ‘John’.
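Using the illustrative structure above:

| Key | Effective Date | End Date   | Favourite Colour |
|-----|----------------|------------|------------------|
| 25  | 2013-06-05     | 9999-12-31 | Red              |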
With subsequent changes being picked up by the system, this might result in the following recordset:
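For illustration, assuming the favourite colour changed from ‘Red’ to ‘Blue’ on, say, the 20th of September 2013 (the exact change date is an assumption here):

| Key | Effective Date | End Date   | Favourite Colour |
|-----|----------------|------------|------------------|
| 25  | 2013-06-05     | 2013-09-20 | Red              |
| 25  | 2013-09-20     | 9999-12-31 | Blue             |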
- The values used for the low date and high date are a matter of discretion, and sometimes even of the platform that is in place. I favour ‘1900-01-01’ and ‘9999-12-31’, but different dates can be used to implement the timeline concept. Bear in mind that different date/time data types (per system) may have different ranges of allowed values!
- End dating can be derived, as a record’s end date is simply the effective date of the next record for the same key. This can be virtualised, as is covered in this post (a minimal sketch follows after this list).
- This is all very standard, but there are long-running discussions about the exact nature of the end date/time (is it exactly the same as the next record’s effective date/time, or that date/time minus 1 second, etc.). As with many of these concepts there are many options and considerations, and I will leave this conversation out of scope.
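A minimal sketch of deriving (virtualising) the end dates, assuming a platform that supports window functions and using the illustrative names from earlier:

```sql
-- A record's end date is the effective date of the next record for the same key;
-- the most recent record gets the high date (here 9999-12-31).
SELECT
    person_key,
    effective_datetime,
    COALESCE(
        LEAD(effective_datetime) OVER (PARTITION BY person_key ORDER BY effective_datetime),
        CAST('9999-12-31' AS DATE)
    ) AS end_datetime,
    favourite_colour
FROM sat_person_detail;
```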
So far so good, and with this information you can query the Data Warehouse to represent information ‘as it was’ at a point in time. For instance, in the first example we can query the favourite colour on, say, the 7th of December 2013 as being ‘Red’, while in the second example the colour at the same date is ‘Blue’ (it used to be ‘Red’).
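A point-in-time selection along these lines could look as follows (illustrative names again; the exact boundary semantics of the end date are a matter of choice, as mentioned above):

```sql
-- Return the context as it was on 2013-12-07: the record whose validity period covers that date.
SELECT person_key, favourite_colour
FROM   sat_person_detail
WHERE  person_key = 25
  AND  effective_datetime <= CAST('2013-12-07' AS DATE)
  AND  end_datetime       >  CAST('2013-12-07' AS DATE);
```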
But if you use, say, the 1st of February 2012 as your date of measurement you won’t see the record: it will not be returned by the query. This is where the zero records come in; they make sure that there is always ‘something’ for the query to return for the key. This looks as follows:
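| Key | Effective Date | End Date   | Favourite Colour |
|-----|----------------|------------|------------------|
| 25  | 1900-01-01     | 2013-06-05 | NULL             |
| 25  | 2013-06-05     | 9999-12-31 | Red              |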
And for the second example:
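| Key | Effective Date | End Date   | Favourite Colour |
|-----|----------------|------------|------------------|
| 25  | 1900-01-01     | 2013-06-05 | NULL             |
| 25  | 2013-06-05     | 2013-09-20 | Red              |
| 25  | 2013-09-20     | 9999-12-31 | Blue             |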
To conclude, the zero record concept ensures that a point-in-time query always returns a record for every key. This is especially important when merging two time-variant datasets; this will be covered in the next post, as it is the final step towards virtualising the Information Marts. You only need a single zero record per key to complete each individual timeline. This does mean that you add an extra record for every key, which in itself can amount to a relatively large overhead in storage/performance depending on the volumes and rate of change.
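A minimal sketch of the corresponding loading step, under the same illustrative names and assuming the end dates are physically stored:

```sql
-- Add a single zero record per key: it runs from the low date up to the key's
-- earliest real effective date, with the context attribute left empty (NULL).
INSERT INTO sat_person_detail (person_key, effective_datetime, end_datetime, favourite_colour)
SELECT
    person_key,
    CAST('1900-01-01' AS DATE) AS effective_datetime,
    MIN(effective_datetime)    AS end_datetime,
    NULL                       AS favourite_colour
FROM   sat_person_detail
GROUP  BY person_key
HAVING MIN(effective_datetime) > CAST('1900-01-01' AS DATE); -- skip keys that already start at the low date
```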
Zero records are essentially redundant, and can be omitted at the cost of (significant) query complexity. This is important to realise, as a solution designer can opt for this query overhead and save on the storage of redundant rows if sufficient horsepower is available (cloud MPPs come to mind). In terms of additional complexity, think of a factor 4-5 increase in the number of query lines due to the multiple self outer-joins needed when merging multiple time-variant datasets. As before, this is material for the next post and an interesting discussion in itself.
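As one possible sketch of that trade-off (not necessarily the approach covered in the next post), the zero records can also be generated on the fly in a view, so they take no storage but are still visible to downstream point-in-time queries:

```sql
-- Virtual zero records: the stored rows plus one generated zero record per key.
CREATE VIEW vw_sat_person_detail AS
SELECT person_key, effective_datetime, end_datetime, favourite_colour
FROM   sat_person_detail
UNION ALL
SELECT person_key,
       CAST('1900-01-01' AS DATE)  AS effective_datetime,
       MIN(effective_datetime)     AS end_datetime,
       CAST(NULL AS VARCHAR(50))   AS favourite_colour
FROM   sat_person_detail
GROUP  BY person_key;
```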
When choosing the context values for the zero record I actually tend to leave them empty (NULL). The other option would be to evaluate the data type and add a default value (‘Unknown’ for varchars, -1 for numerics). Regardless, you can derive this any which way for different presentation requirements in the Information Marts; the key thing is to make sure the rows are returned. With this in mind I favour using NULL values in the Integration Layer, as I can be a purist sometimes and you really don’t know anything about the key for that point in time. If the business requirement is to represent this early state as ‘unknown’, I can evaluate this accordingly towards the Information Mart.
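A sketch of that evaluation towards the Information Mart, assuming the same illustrative names:

```sql
-- Keep NULLs in the Integration Layer, but present the zero record's empty context
-- as 'Unknown' when deriving the Information Mart.
SELECT
    person_key,
    effective_datetime,
    end_datetime,
    COALESCE(favourite_colour, 'Unknown') AS favourite_colour
FROM   sat_person_detail;
```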
On a final note: I find that it is not correct to refer to zero records as mere dummy records. Zero records appear similar to dummy placeholder values (-1, -2, -3, etc.) and the whole taxonomy of ‘unknowns’, but they serve a different purpose. I usually refer to these records as ‘initial timeline records’ to distinguish them from true dummy values used as placeholders, which themselves can’t really be omitted.