Enterprise Data Warehouse Architecture

The ‘architecture’ section of this weblog serves as an introduction for the Data Integration Framework Github repository (see the collaboration section). This Github repository contains the various levels of documentation that constitutes the Enterprise Data Warehouse (EDW) architecture, pattern documentation and supporting metadata and control frameworks.

Introduction

To date, there are (still) many Data Warehouse specialists who argue that an EDW is always expensive, complex and monolithic and therefore only ‘for the big companies’. We are on a mission to change this by improving our delivery so that the EDW becomes driven by business cases, adaptive and quick to deliver. One of the goals of this site is to define the EDW in a way that is true to this statement and to prove that it is effective in any situation regardless of the platform or technology that provides the data.

The key element of the solution is the adoption of an architecture that differs fundamentally from the common approach to Data Warehousing, and what has been used to design Data Warehouses in the past. This architecture delivers -out of the box- on flexibility in both design and implementation, provides true resilience, is ‘future-proof’ (changes are a given for any Data Warehouse) and offers complete traceability of data. The designs will start on a conceptual level and will gradually be documented in more detail in order to provide the specifications required for ETL generation.

These pages will provide a brief overview on the various components that define the EDW architecture and describe how they meet the ambitions. Please consider looking into the Framework Github for more detailed documentation. The components are mostly establised and well known Data Warehouse concepts, but the careful positioning makes a big difference. Combined, they define the EDW that is truly consistent, adaptable and future-proof while ensuring traceability and, equally important, works the same way in every scenario.

The approach provides a framework for the EDW which will be used as context for the future (more in-depth) documentation and code on this blog. It is based on various experiences in data modelling, ETL and best-practices, which are continuously updated on the Github.

At a high level, the solution defines a number of architectural layers each with an explanation of which functionality will be implemented.

Fundamental EDW architecture principles

No reference architecture can be conceived without certain basic principles (or ambitions):

  • Data Vault (2.0) approach. Traditionally EDW models were classified as either fully normalised (early Inmon) or fully denormalised (Kimball). A hybrid approach such as Data Vault can provide a best-of-both-worlds solution while still being able to support existing approaches. In a way, Data Vault separates concerns by splitting up larger tasks into smaller ones which makes them easier to manage. Please have a look at this introduction overview on Data Vault. In Data Vault, the normalisation part ensures that every meaningful entity has its own separate table for distributing surrogate keys and one where history is stored in a traditional Type-2 (denormalised) fashion. Data Vault doesn’t make complexity go away, but it provides handles to simplify managing what is inherently complex matter.
  • Separation of EDW concepts. The main functionality within an EDW is divided in separate steps. This is opposite to a typical Kimball / Dimensional Bus Architecture where keys, structure and history are combined in the same entity. Splitting these functions provides additional flexibility and maintainability in the future. This includes:
    • Surrogate key distribution
    • Storing and tracking history
    • Structure and hierarchy
    • Cleaning and integration
  • Design for flexibility. A common pitfall for EDW solutions is the complete focus of the design to model the information according to current requirements, or view of the world. Because requirements change over time this approach to data modelling limits the future and flexibility of the Data Warehouse. For this reason the new EDW shifts the focus from upfront information modelling to managing and the data as an asset using established principles such as surrogate keys, relationships and history tracking. But because no business logic is implemented during the loading processes into the EDW this structured and historical System of Record can be used for any type of future reporting. The Data Vault model provides a limited (3) number of entity archetypes which result in a flexible yet manageable model where relationships and structure can be easily changed. A flexible architecture also offers:
    • The ability to load data even when relationships between data change. Most hybrid modelling techniques use many-to-many relationships separate of the main entities, even when the data could currently be modelled as one-to-many without a separate relationship table.
    • To cater for different levels of completeness of data. This impacts the way errors are handles / rejected and the specification of failure.
    • To provide multiple versions of the truth (multiple marts based on the same data).
    • Handling changing business rules. To be able to always represent data in another way, this impacts the integration approach and the martconcept.
    • Separate data ‘as-is’ and converted data, while still retaining the relation to the original data.
    • Applying business rules / logic as late as possible: in the Presentation Layer.
    • Full scaleability, which manifests itself in many ways. Due to the modular design the Data Vault is extremely flexible (yet easy and consistent) when it comes to adding or changing data sources. It also enables the Data Warehouse team to change the table structure when database size becomes an issue to introduce a degree of normalisation without impacting the Data Vault concepts. A properly management Data Vault is indefinitely scaleable.
  • Future proof. As mentioned earlier, resilience when coping with changes is a major goals of the architecture
  • Real-time ready. All concepts are checked against the possible future of handling real-time data sources.

Ultimately, the goal of the EDW is to provide a meaningful reporting and analysis or data mining structure that Business Intelligence software can access. The main difference with previous EDW architectures is that this EDW can accommodate all types of structures (star schemas, snowflakes, full table, 3NF etc.) and even change its approach without losing data.