Ideas on general error handling – do you really need error tables?

The possible scenarios for error and exception handling are limited. You can:

– Detect an error, stop the process and present the error code.
– Detect an error and write the record to an error table with the corresponding error code.
– Detect an error and write the record to both the target table and the error table, with the error code.
– Detect an error, flag the record, but write it to the DWH table anyway, including the error code.
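
To make these scenarios concrete, here is a minimal Python sketch of the routing logic. The record structure, the error detection and the table objects are hypothetical; real ETL tooling would implement this differently, but the four options differ exactly as shown:

```python
from enum import Enum

class ErrorStrategy(Enum):
    ABORT = 1            # stop the process and present the error code
    ERROR_TABLE = 2      # divert the record to an error table
    BOTH = 3             # write to the target table and the error table
    FLAG_IN_TARGET = 4   # load into the DWH table anyway, error code included

def route_record(record, error_code, strategy, target, error_table):
    """Apply one of the four error handling scenarios to a single record."""
    if error_code is None:
        target.append(record)            # clean record, normal load
        return
    if strategy is ErrorStrategy.ABORT:
        raise RuntimeError(f"Load stopped, error code: {error_code}")
    if strategy in (ErrorStrategy.ERROR_TABLE, ErrorStrategy.BOTH):
        error_table.append({**record, "error_code": error_code})
    if strategy in (ErrorStrategy.BOTH, ErrorStrategy.FLAG_IN_TARGET):
        target.append({**record, "error_code": error_code})
```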

Which of these applies is determined in the general (project) architecture and the functional design, because the information requirement is very dependent on the situation. For instance, some financial systems require completeness of records, so that the total sums match reality even if the details / reference data are dodgy! On the other hand, some systems require data of 100% quality, so no (detected) errors may pass to the target table. This is why the type of error handling (a business requirement) should be settled in the general (data) architecture and the functional design.

Using error tables has an impact on the ETL architecture, which is why the general concept of error handling should be examined early on. Error records should be updated if the errors in a record change, and should be deleted (or flagged) once the record no longer contains errors.
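
That lifecycle can be sketched as a small synchronisation step; the dictionary-based error table and field names below are assumptions for illustration only:

```python
def sync_error_table(error_table, business_key, new_error_code):
    """Insert, update or close an error record as the record's status changes."""
    existing = error_table.get(business_key)
    if new_error_code is None:
        if existing is not None:
            # The record no longer contains errors: delete the error record,
            # or flag it as resolved if you want to keep the audit trail.
            existing["resolved"] = True
    elif existing is None:
        error_table[business_key] = {"error_code": new_error_code, "resolved": False}
    elif existing["error_code"] != new_error_code:
        existing["error_code"] = new_error_code   # the errors in the record changed
```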

I would always try to allow an error-flagged record to be written to the DWH tables, regardless of whether separate error tables are used. The advantage, especially in a Data Vault style model, is that ETL processes keep loading data into the main DWH tables without interruption, and that you retain the flexibility to select which quality of data you forward for further processing. Using this method you can even indicate data quality for your data marts using weighted error codes (more on this later on). Additionally, if you treat the error code as a type 2 attribute you can even track the history of error solving! And as a bonus, you don’t really need error tables anymore. Well, at least in this part of the architecture.
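
Tracking the error code as a type 2 attribute can look roughly like this; the satellite-style history structure and column names are mine, purely to illustrate the idea:

```python
from datetime import datetime, timezone

def record_error_status(history, hub_key, error_code):
    """Append a new history row only when the error code actually changes."""
    rows = history.setdefault(hub_key, [])
    if rows and rows[-1]["error_code"] == error_code:
        return  # no change, so no new version
    rows.append({
        "error_code": error_code,                    # None once the error is solved
        "load_datetime": datetime.now(timezone.utc),
    })

# history[hub_key] then shows when an error appeared and when it was solved:
# [{'error_code': 'E13', ...}, {'error_code': None, ...}]
```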

If you use your central model (Data Vault) to store all data, even data that is flagged for errors, you can also create a derived table with ‘clean’ records to select from. This works perfectly, because the surrogate keys are distributed centrally earlier on anyway (in the hubs / surrogate key tables). You will have a ‘raw’ and a ‘cleaned’ satellite table, or more if it suits you. I’ll post a few examples later on.
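
Assuming the error code is carried as a column on the raw satellite (the names here are illustrative, not a fixed Data Vault standard), the ‘cleaned’ set is essentially a filter over the ‘raw’ one:

```python
# In a database this would typically be a view, for example:
#   CREATE VIEW sat_customer_clean AS
#   SELECT * FROM sat_customer_raw WHERE error_code IS NULL
def derive_clean_satellite(raw_rows):
    """Keep only error-free rows; both satellites share the same hub keys."""
    return [row for row in raw_rows if row.get("error_code") is None]
```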

 
Roelant Vos

3 Responses

  1. Apex says:

    Are you talking about errors caused by business rules here? Because there is no way to flag a record if it hasn’t been inserted at all, due to a data conversion error or something like that.

     
  2. Roelant Vos says:

    Hi,
    Yeah, I’m basically talking about not filtering records out based on business rules. As I see it, you can only do so much against environmental errors (power failures and so on), and the only place a real error table is needed is when converting data types from the source systems to the staging area. That’s where a real error (32 January, for instance) can occur, and a recycling process should be in place there. I don’t add recycling for technical errors like missing surrogate keys, since I feel the modelling approach already caters for that. If a surrogate key (Hub ID) is missing, I’ll just add it there, since it is just another instance of a certain entity. No need to fail the ETL for that! I have seen situations where facts arrived before dimensions, and in those cases I say: just add the new surrogate key and link the fact to it. That leaves you with minimal error handling in the entire system: only in the staging area.
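
    In pseudo-code the idea is roughly this (a minimal sketch; the hub structure, key generation and names are simplified for illustration):

    ```python
    # If the business key is unknown, create the surrogate key on the fly
    # instead of rejecting the record or failing the load.
    import itertools

    key_sequence = itertools.count(1)   # stand-in for a key distribution mechanism
    hub = {}                            # stand-in for the hub / surrogate key table

    def lookup_or_create_hub_key(business_key):
        """Return the surrogate key for a business key, adding it if missing."""
        if business_key not in hub:
            hub[business_key] = next(key_sequence)  # just another instance of the entity
        return hub[business_key]

    # A fact arriving before its dimension still gets a key to link to:
    fact_hub_key = lookup_or_create_hub_key("CUST-123")
    ```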

    Regards,
    Roelant Vos

     
  3. Apex says:

    Hello Roelant Vos!

    >No need to fail the ETL for that!

    Absolutely. Thanks for the explanation of your point!

     
