Guidelines for Data Integration Processes

The following are core principles I use for implementing data integration / logistics (‘ETL’) processes. They are captured as a component (design pattern) in the Data Solution Framework repository.

The purpose of this pattern is to define a set of minimal requirements that every single data logistics process (e.g. procedure, mapping, module, package, data pipeline) should conform to. These fundamental guidelines direct how all data integration processes should behave under the architecture of the Data Solution Framework.

These principles are also covered in great detail in Data Engine Thinking, as part of the architecture sections.

Regardless of its place or purpose in the architecture, every data logistics process should follow a concise set of base rules. For instance, data integration processes should not cause problems when they are run twice, or at the same time. Concepts like these simplify testing, maintenance, development, and troubleshooting, making the overall system much easier to administer.

The overarching motivation is to develop data integration processes that don’t cause errors due to unplanned or unwanted execution. Essentially, data integration processes must be able to run independently at any point in time to support fully flexible scheduling and implementation.

Fundamental data integration principles

  1. Atomic; data integration processes contain transformation logic for a single specific function or purpose. This design decision focuses on creating many processes that each address a specific function, as opposed to a few processes that perform a wide range of activities. Every data integration process implements a single atomic piece of functionality. Examples of atomic Data Warehouse processes are key distribution, change detection, and record insertion.
  2. Target-centric; a data integration process can read from one or more sources, but only writes to a single target. This further supports the atomicity principle.
  3. Forward-only; the direction of data logistics is always towards the target area or layer. It is always ‘downstream’. No data integration process should write data back to an earlier layer, or access information from an earlier layer using a (key) lookup. This implies that the required details are always available in the same layer of the architecture.
  4. Idempotent; each process should execute in such a way that it ‘picks up where it left off’. This manifests itself in different ways throughout the data solution architecture. For instance, data integration processes should be able to be rerun without the need to manually change settings. All processes must be able to execute without requiring manual intervention, and without causing data loss. A minimal sketch of this behaviour is shown directly after this list.
  5. Deterministic; data integration processes must always produce the same outcome based on the same input values. Deterministic processes greatly simplify maintenance and improve the reliability of the solution.
  6. Independent; processes must be able to be executed at any time, and in any order, without any dependencies on other processes. This principle enables flexible orchestration, and addresses any scheduling or frequency constraints.
  7. Auditable; each unique process execution is numbered, identifiable, and can be related to the data that was involved. Using the control framework, and supported by the Persistent Staging Area, the flow of data can always be tracked.
  8. Scalable; data integration processes must be able to process multiple intervals (changes) in one run. Every time a data integration job runs, it needs to process all available data. This is a critical requirement for flexibility in orchestrating data integration processes, and to support near real-time processing. For instance, if the address of a person changes multiple times during the day and the process is run daily, all changes are still captured and correctly processed in a single execution of the process (see the second sketch after this list).
  9. Fault-tolerant; data integration processes must be robust and implement a certain degree of resilience. For instance, they should detect whether data can be inserted and not fail on constraints. If a process does fail, it should automatically recover by rolling back its changes. This is tracked by the control framework and helps to maintain the internal consistency of the solution, an important feature in delivering eventual consistency.
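
To make principles 4 (idempotent) and 5 (deterministic) a little more concrete, the sketch below shows a delta-detection step that only inserts rows which are not already present in its target. This is a minimal illustration only; the names used in it (load_delta, row_hash, the sample columns) are hypothetical and not part of the Data Solution Framework.

```python
import hashlib
from typing import Dict, List


def row_hash(row: Dict[str, str], columns: List[str]) -> str:
    """Calculate a deterministic checksum over the selected columns of a row."""
    payload = "|".join(str(row[c]) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_delta(source_rows: List[Dict[str, str]],
               target_rows: List[Dict[str, str]],
               compare_columns: List[str]) -> int:
    """Insert only the rows that do not yet exist in the target.

    The comparison is based on a deterministic hash of the row content, so
    running this step again with the same input leaves the target unchanged;
    the process can be rerun at any time without manual intervention.
    """
    existing = {row_hash(r, compare_columns) for r in target_rows}
    inserted = 0
    for row in source_rows:
        key = row_hash(row, compare_columns)
        if key not in existing:
            target_rows.append(dict(row))
            existing.add(key)
            inserted += 1
    return inserted


# Rerunning the same process does not duplicate data.
source = [{"customer_id": "1", "city": "Brisbane"},
          {"customer_id": "2", "city": "Utrecht"}]
target: List[Dict[str, str]] = []
print(load_delta(source, target, ["customer_id", "city"]))  # 2 rows inserted
print(load_delta(source, target, ["customer_id", "city"]))  # 0 rows inserted
```

In a real solution the comparison would usually be pushed down to the database platform, but the behaviour is the same: the same input always produces the same target state, no matter how often the process runs.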

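In the same spirit, for principle 8 (scalable) the sketch below processes every change that has accumulated since the last run in a single execution, ordering the changes deterministically before turning them into time-sliced records. Again, the names (AddressChange, process_all_changes) are hypothetical and only serve to illustrate the multi-interval idea.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class AddressChange:
    customer_id: str
    address: str
    changed_at: datetime


def process_all_changes(changes: List[AddressChange]) -> List[Dict[str, object]]:
    """Turn every available change into an effective/expiry-dated record.

    All intervals that have accumulated since the last run are handled in one
    execution, ordered by customer and change time, so the outcome does not
    depend on how often the process happens to be scheduled.
    """
    ordered = sorted(changes, key=lambda c: (c.customer_id, c.changed_at))
    records = []
    for index, current in enumerate(ordered):
        following = ordered[index + 1] if index + 1 < len(ordered) else None
        expiry = (following.changed_at
                  if following and following.customer_id == current.customer_id
                  else datetime.max)
        records.append({
            "customer_id": current.customer_id,
            "address": current.address,
            "effective_from": current.changed_at,
            "effective_to": expiry,
        })
    return records


# Three address changes for the same person on the same day are all captured
# and correctly ordered in a single execution of the process.
backlog = [
    AddressChange("1", "1 Main St", datetime(2024, 5, 1, 9, 0)),
    AddressChange("1", "2 High St", datetime(2024, 5, 1, 13, 0)),
    AddressChange("1", "3 River Rd", datetime(2024, 5, 1, 17, 30)),
]
for record in process_all_changes(backlog):
    print(record)
```

Whether this process runs every minute or once a day, the resulting time slices are identical, which is what makes flexible scheduling and near real-time processing possible.
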
Further reading

For additional information on these principles, and on how they are implemented, please visit the Data Solution Framework or consider getting a copy of Data Engine Thinking.

Roelant Vos
