Overview

When defining the architecture and its components I use the the following terms:

  • Layer. A Layer is a high level container for functionality that adds to the same (high level) purpose. Process flows (implemented as ETL) always work from one Layer to the next.
  • Area. An Area is a defined part of a Layer that has a dedicated role or function. Areas within a Layer split functionality which logically belongs together; they share the same modelling approach or techniques. This is typically recognisable by the type of data modelling. An example is a Raw Data Vault area and a Business Data Vault area. These are both part of the Integration Layer, and share the same modelling characteristics (they both form the model) but the behaviour may be different.
  • Design Patterns. A Design Pattern or ‘design decision’ is a detailed description for a concept as mentioned in the architecture: a ‘how-to’. Design Patterns are unrelated to technology: they explain how to actually implement the concept and provide the best-practice approaches. The Design Patterns also show the implications a decision has on the implementation and where this occurs.
  • Implementation Pattern. An Implementation Pattern is a technology specific implementation of a Design Pattern or part of the architecture. Due to software limitations or options a Design Pattern may be implemented in different ways while always achieving the same result.

The Enterprise Data Warehouse (EDW) architecture contains three Layers, each housing two Areas. The reference architecture describes each Layer and its Areas and refers to Design Patterns for details how to handle the concepts and meet the requirements. The Implementation Pattern finally creates the actual code or ETL mapping and is therefore by definition tool and platform specific. Each Layer in the architecture will be described in detail in separate pages on this site.

Architecture_Outline

The Staging Layer

The first layer in the reference architecture is the Staging Layer, which is split into the Staging Area and the Persistent Staging Area (PSA). The basic purpose of the Staging Layer is to collect the data from the various source systems into the Data Warehouse environment and prepare them for further processing, but the role this layer plays is more important than just this. For instance, the PSA is an important component in virtualisation, Disaster Recovery (DA) and re-initialisations if parts of the Data Warehouse have to be reloaded. Similarly, the Staging Area prevents a wild variety of data types to be present in the core Data Warehouse layer (the Integration Layer) and aligns timing issues (date/times) without losing information. This is very important to meet essential ETL requirements later on, such as the ability to rerun ETL at any given point in time without corrupting information.

The Integration Layer

The second layer in the EDW reference architecture is the Integration Layer (also known as System of Records, Integration layer, Data Vault, Central Transaction Repository or Core model). This layer is the core of the Data Warehouse and is made up of the Operational (Raw) Data Vault area and the Business Data Vault area. The purpose of this layer is to collect all data in an integral data model. This layer used the Data Vault modelling technique with some extra additions to support ETL metadata. The Integration Layer is not accessible to end users or reporting and analysis software. In this layer the data from the different source systems is remodelled into a Data Vault and the core Data Warehouse concepts are implemented:

  • Data Warehouse (surrogate) key distribution
  • Managing history
  • Managing relationships
  • Handling changes in source structures

The combination of the Operational and Business Data Vault areas provides the EDW with a flexible solution to apply common business logic without losing the ability to change the interpretation; the ‘raw’ data will always be available in the Operational Data Vault.

The Business Data Vault area is closely linked to the Operational Data Vault area and in most cases provides separate cleaned or changed instances of tables that already exist in the Operational Data Vault. The Business Data Vault is not a full copy of the Operational Data Vault. Typically the Business Data Vault tables will refer to Operational Data Vault area Hub tables and provide an alternative perspective to Operational Data Vault Satellite or Link tables.

Enterprise wide business rules are, of course, specific for each implementation but usually provide a solution for deduplication, determining a single customer view or data quality updates. This also includes detecting data quality issues, errors and/or exceptions. These are treated as interpretations, and are flagged (tagged) for further processing using advanced ETL concepts such as Error Bitmaps.

 The Presentation Layer

The third layer is the Presentation Layer. This layer will use the available relationships and structures from the Data Vault and aggregate the data into a format that is most suitable for front-end reporting and analysis software. This can be any type of data model, but typically will be some form of Dimensional Model such as a Star- or Snowflake schema (as defined by Ralph Kimball in the Lifecycle Toolkit). This means the the essential purpose of the Presentation Layer from a Data Warehouse and ETL point of view is to create structure (hierarchies) and implement datamart specific business logic. This is another example of how Data Warehouse concepts are split over the different areas in the EDW reference architecture, providing more flexiblity while streamlining ETL and reducing complexity at the same time.

The Presentation Layer is meant to present the information to the front-end tooling with as seamlessly and easily as possible and for this purpose consists of two areas: the Supporting Area and the Presentation Area .

The Supporting Area is an optional area where semi-aggregates or any useful tables can be stored. These types of tables are usually added for either performance reasons or the wish to implement the same business logic in as few places as possible. Supporting tables tables can be modelled in any way as long as they benefit the presentation area. They are not accessible by users or front-end reporting and analysis tools. By thoughtfully creating aggregate tables which can be shared by the different schemas one could for instance create a fact table on a certain aggregate level and have different datamarts aggregate this table further depending on their needs. The business logic and performance demanding calculations only have to be done once this way.

The front-end structuring area on the other hand, is fully modelled towards the end user’s and software requirements. The front-end structuring area is the final part of the architecture. The (data)model within the front-end structuring area is modelled for a specific purpose, audience and technical requirement. The complete datawarehouse can contain very different structures with different models and different ‘versions of the truth’ depending on the business needs. These models can have many forms such as traditional datamarts or fully normalised models. In the process from loading the data from the integration layer to the presentation layer most of the business logic is implemented. The following properties are defined:

  • Star- and snowflaked model
  • Contains 1-on-1 views on the database tables. These views are the dividing line between datawarehouse (back-end) and the reporting environment
  • Datamarts are organized for end user’s needs: per subject area, project or any way the end user needs to see it organized
  • Dimensions can be shared between datamarts
  • Contains semi-aggregated / helper summary tables which are not accessible to the front-end tool but function as precalculation for different dimensions
  • Error handling and selection of data quality