Loading too fast for unique date / time stamps – what to do?
Let’s start by clarifying that this concerns the RDBMS world, not the Hadoop world 😉
It’s a good problem to have – loading data too quickly. So quickly that, even at high precision, multiple changes for the same key end up being inserted with the same Load Date/Time Stamp (LDTS). What happens here?
A quick recap: in Data Vault the Load Date/Time Stamp (LDTS, LOAD_DTS, or INSERT_DATETIME) is defined as the moment data is recorded in the Data Warehouse (more information here). Typically this is defined at database level as a default value, trigger or equivalent to support accurate timestamping of the data. Alternatives being implemented in ETL usually end up becoming more expensive from a performance perspective.
So what happens? Data is ingested so quickly by the database that thousands of records are inserted with identical timestamps (bear in mind that the datetime attribute has high precision e.g. sysdatetimestamp, datetime2(7)). Across the bigger sets you see a gradual increase in the time across records so the database does it’s job.
This itself is not an issue, but problems arise when you handle multiple changes for the same (natural) key in a single data delta. When this occurs, these changes (that end up having the same Load Date / Time Stamp) result in Primary Key violations in the Satellites or the Persistent Staging Area because the LDTS is part of the Primary Key (acting as ‘effective date/time’). You basically try to store more than one version of context for a Business Key at the same point in time. As mentioned, this only occurs when you actually detect multiple changes for a natural key – for example when a customer changes address a couple of times and you pick up and process these changes in one go.
I have posted many times about this requirement for ETL to handle multiple changes in a delta to support flexibility and scalability. A simple Full Outer Join (FOJ) delta mechanism will only give you a single change for each key, but (transactional) CDC mechanisms will buffer changes for ETL to be picked up at any given time or frequency.
To be able to handle processing genuine changes with the exact same LDTS I made the following design decisions:
- In the Persistent Staging Area, the Source Row ID is added to the Primary Key. The PK now consists of the Natural Key (source key), the LDTS and the Source Row ID. The Source Row ID is traditionally defined as a sequence value generated starting a 1 within each data delta
- In the Satellite I use the Source Row ID to add this as a factional second to the date/time
Blasphemy! I’m changing the values going into the Data Vault (and arguably is a bit dodgy).
The reason I think this is acceptable is that the original values are still available in the PSA so we can refactor / rerun from here if needed. Also, the auditability is maintained (as long as the PSA is there). Because the additions are done at the lowest time level the date/time change is very small. If, say, you have three changes in a row with Source Row ID 4,5 and 6 you basically offset the date/time respectively with 4, 5 and 6 nanoseconds.
This also explains why you should add and not subtract the time, as the Source Row ID is sequential and subtraction will change the order in time (whereas addition doesn’t). Will subtraction will certainly cause problems, addition only has a theoretical risk to overtake future deltas but in case that happens I’ll retain my PSA for the time being.
Dan Linstedt has some advanced views on applying these concepts when adopting Hadoop to maintain the raw Data Vault – in these scenarios you can let the platform handle the Load Date / Time Stamp.
As part of ‘options and considerations’ an alternative is to replace the Load Date / Time Stamp in the key with the Row Number (Source Row ID in my examples). This would work as well, but I didn’t opt for this approach as in my designs the Source Row ID identifies rows within each delta – and to be able to be a PK candidate the row number needs to be incremental across deltas (e.g. continuing at every run).