Handling logical deletes in the Data Warehouse
While working on a recent project I had a brief discussion on the implementation of logical deletes. This prompted me to define for once and for all how ETL should handle these occurences. This is, of course, assuming that some CDC mechanism provides the required details or you have the capacity to compare full sets of data to derive the logical deletes yourself. For this purpose I drafted a Design Pattern which should explain this process in detail. Working on this document has in itself led to an upgrade of basically this entire website. This includes an overhaul of the met(ETL) details of loading to the main table types. I’ve added these documents to the Design Decisions section of the site.
Back on topic of the logical deletes: what it comes down to is to treat every logical delete as a new insert in the SCD2 table, and closing the previous record. In this process the latest record (the one which comes through as a logical delete) becomes the current record (current record indicator is set to ‘Y’) and a deleted record indicator is set to yes (‘Y’). What this means is that the record at that time is closed (the end date is set) and a new phase in life as deleted record starts. This is now the most current state of the record with a new effective date and a 9999-12-31 expiry date.This provides the Data Warehouse with an image of the latest state of the record when it was deleted. In other words, the deleted row indicator is treated as any other SCD2 attribute.
There can be situation where a record is ‘undeleted’, if (source) applications allow the reuse of the same key (with the same meaning). Assuming this is still the same record (if a key is redistributed it’s another discussion) the process still stands: the then most recent record (at that point that’s the logical deleted one) will be expired and a new record will be inserted. This bridges the gap in history in a natural way.