This is a very high level overview of the content that is managed in the ‘Data Integration Framework’ Github repository. Please have a look at the collaboration page for more information and updates, or send me an email. The information below serves to provide context on the terminology used throughout this blog, but the depth and detail is maintained in the Github to ensure continuous improvement, extension and validation.
Overview and terminology
When defining the architecture and its components I use the the following terms:
- Layer. A Layer is a high level container for functionality that adds to the same (high level) purpose. Process flows (implemented as ETL) always work from one Layer to the next.
- Area. An Area is a defined part of a Layer that has a dedicated role or function. Areas within a Layer split functionality which logically belongs together; they share the same modelling approach or techniques. This is typically recognisable by the type of data modelling. An example is a Raw Data Vault area and a Business Data Vault area. These are both part of the Integration Layer, and share the same modelling characteristics (they both form the model) but the behaviour may be different.
- Design Patterns. A Design Pattern or ‘design decision’ is a detailed description for a concept as mentioned in the architecture: a ‘how-to’. Design Patterns are unrelated to technology: they explain how to actually implement the concept and provide the best-practice approaches. The Design Patterns also show the implications a decision has on the implementation and where this occurs.
- Implementation Pattern. An Implementation Pattern is a technology specific implementation of a Design Pattern or part of the architecture. Due to software limitations or options a Design Pattern may be implemented in different ways while always achieving the same result.
The Enterprise Data Warehouse (EDW) architecture contains three Layers, each housing two Areas. The reference architecture describes each Layer and its Areas and refers to Design Patterns for details how to handle the concepts and meet the requirements. The Implementation Pattern finally creates the actual code or ETL mapping and is therefore by definition tool and platform specific. Each Layer in the architecture will be described in detail in separate pages on this site.
The Staging Layer
The first layer in the reference architecture is the Staging Layer, which is split into the Staging Area and the Persistent Staging Area (PSA). The basic purpose of the Staging Layer is to collect the data from the various source systems into the Data Warehouse environment and prepare them for further processing, but the role this layer plays is more important than just this. For instance, the PSA is an important component in virtualisation, Disaster Recovery (DR) and re-initialisations if parts of the Data Warehouse have to be reloaded. Similarly, the Staging Area prevents a wild variety of data types to be present in the core Data Warehouse layer (the Integration Layer) and aligns timing issues (date/times) without losing information. This is very important to meet essential ETL requirements later on, such as the ability to rerun ETL at any given point in time without corrupting information.
The Integration Layer
The second layer in the EDW reference architecture is the Integration Layer (also known as System of Records, Integration layer, Data Vault, Central Transaction Repository or Core model). This layer is the core of the Data Warehouse and is made up of the Raw Data Vault area and the Business Data Vault area. The purpose of this layer is to collect all data in an integral data model. This layer used the Data Vault modelling technique with some extra additions to support ETL metadata. The Integration Layer is not accessible to end users or reporting and analysis software. In this layer the data from the different source systems is remodelled into a Data Vault and the core Data Warehouse concepts are implemented:
- Data Warehouse (surrogate) key distribution
- Managing history
- Managing relationships
- Handling changes in source structures
The combination of the Raw and Business Data Vault areas provides the EDW with a flexible solution to apply common business logic without losing the ability to change the interpretation; the ‘raw’ data will always be available in the Raw Data Vault.
The Business Data Vault area is closely linked to the Raw Data Vault area and in most cases provides separate cleaned or changed instances of tables that already exist in the Operational Data Vault. The Business Data Vault is not a full copy of the Raw Data Vault. Typically the Business Data Vault tables will refer to Raw Data Vault area Hub tables and provide an alternative perspective to Operational Data Vault Satellite or Link tables.
Enterprise wide business rules are, of course, specific for each implementation but usually provide a solution for deduplication, determining a single customer view or data quality updates. This also includes detecting data quality issues, errors and/or exceptions. These are treated as interpretations, and are flagged (tagged) for further processing using advanced ETL concepts such as Error Bitmaps.
The Presentation Layer
The third layer is the Presentation Layer. This layer will use the available relationships and structures from the Data Vault and aggregate the data into a format that is most suitable for front-end reporting and analysis software. This can be any type of data model, but typically will be some form of Dimensional Model such as a Star- or Snowflake schema (as defined by Ralph Kimball in the Lifecycle Toolkit). This means the the essential purpose of the Presentation Layer from a Data Warehouse and ETL point of view is to create structure (hierarchies) and implement datamart specific business logic. This is another example of how Data Warehouse concepts are split over the different areas in the EDW reference architecture, providing more flexiblity while streamlining ETL and reducing complexity at the same time.
The Presentation Layer is meant to present the information to the front-end tooling with as seamlessly and easily as possible. This layer is modelled towards the end user’s functional, technical (performance) and software requirements. The front-end structuring area is the final part of the architecture. The (data)model within the front-end structuring area is modelled for a specific purpose, audience and technical requirement. The complete Data Warehouse can contain very different structures with different models and different ‘versions of the truth’ depending on the business needs. These models can have many forms such as traditional Data Marts or fully normalised models. In the process from loading the data from the Integration Layer to the Presentation Layer most of the business logic is implemented.