Zero / ghost records in Data Vault Satellites versus Point In Time (PIT) tables
As posted earlier recent evolution of the Data Vault 2.0 conventions aim to remove the creation of zero records (or ‘ghost records’) in Satellites.
Zero records have the sole aim of making sure that every business key in a Satellite has a complete timeline (e.g. 1900-01-01 to 9999-12-31) so that records are always returned when you query the state of the world at any given date. For instance if a certain record is created in 1980-01-01, a snapshot of 1970-01-01 would not directly return a ‘hit’ for that record unless you add some query complexity.
But with the existence of a 1900-01-01 zero record for that specific key at least the information is returned that nothing is known at this point in time (e.g. you get a hit). Of course this is redundant information (e.g. can be derived) but it makes upstream processing in time-variant tables (e.g. Type2 dimensions) a lot easier. However, this means that for every individual key in the Satellite table at least an additional zero record is written to disk and end-dated. If you have 50 million keys in your Satellite, you will have 50 million additional corresponding zero records.
One of the alternative approaches (which some of you have been doing for a while) that has recently been incorporated in Data Vault is to only create these zero records / complete timelines when really necessary. This means far less zero records are required saving storage and improving performance (ETL and database) at the same time. This changes the Satellite templates slightly, in the following two flavours:
- Normal history tracking in Satellites (e.g. with end-dating and potentially current record indicators) but without the zero records
- Insert-only Satellites (e.g. without end-dating etc.) without zero records
When adopting the insert-only approach another potential saving is the ‘condensing’ of timelines, when you are only interested in a few attributes and therefore have fewer ‘changes over time’. This approach effectively means you only have to calculate history (in the form of creating a zero record and end-dating) once. Zero records can be inserted in this set to simplify upstream processing again here, supporting the idea to only create them when necessary.
Another advantage is that in most cases a current state view of the information (the traditional ‘Type 1’ view) is sufficient to meet information requirements, and this information is immediately available.
This doesn’t mean storing the zero records in the original Satellite is ‘wrong’ – it’s just another design decision you need to make. Thankfully, since all zero records are derived it will always be possible to swap between approaches at any given moment.
In any case, history is therefore now calculated somewhere else and this can be done using a dedicated Satellite or Point In Time (PIT) table. PIT tables essentially are a joined set of a Hub and its Satellites for a period of time.
It is worth mentioning that PIT tables can now be part of the ‘business’ Data Vault which means you can store all kinds of additional information in here to make life easier at the corresponding granularity. This is also something that some of you have been doing for a while – but it’s now recognised by the broader community.
In this updated DV2.0 solution the Satellites do not have zero records, instead PIT tables (or dedicated Business Data Vault Satellites) are used to create the timelines for the scope (in terms of attributes required and timelines required) that fits the purpose.