Ideas for general error handling, why need error tables?
The possible scenarios regarding error and exception handling are limited. You can either:
– Detect an error, stop the process and present the error code.
– Detect an error and write the record in an error table with the corresponding code.
– Detect an error and write the record in both the target and the error table with the error code.
– Detect an error, flag the record but write it to the DWH table anyway including the error code.
This type of error handling is determined in the general (project) architecture and the functional design. The information requirement is very dependant of the situation. For instance, some financial systems require completeness in records so that the total sums show a number that matches the reality. Even if the details / reference data is dodgy! On the other hand some systems require data with 100% quality so no (detected) errors may pass to the target table. This is why the type of error handling (= business requirement) should be determined in the general (data) architecture and the functional design.
Using error tables has its impact on ETL architecture and this is why the general concept of error handling should be examined early on. Error records should be updated if the errors in the record change and should be deleted (or flagged) if the record does not contain errors anymore.
I would always try to allow an error-flagged record to be written to the DWH tables, regardless of using separate error tables or not. The advantage is (especially in a Data Vault like model) is that ETL processes keep on loading data to the main DWH tables without interruption and that you will have the flexibility to select which quality of data you forward for further processing. By using this method you can even indicate data quality for your datamarts using weighed error codes (more on this later on). Additionally, if you look at the error code as a type 2 attribute you can even track the history of error solving! And as a bonus, you don’t really need error tables anymore. Well, at least in this part of the architecture.
If you use your central model (Data Vault) to store any data even if it is flagged for errors you can also create a derived table with ‘clean’ records to select data from. This works perfectly, because the surrogate keys are centrally distributed earlier on anyway (in the hubs / surrogate key tables). You will have a ‘raw’ and a ‘cleaned’ satellite table, or more if it suits you. I’ll post a few examples later on.