Many Data Warehouse specialists still argue that an Enterprise Data Warehouse (EDW) is necessarily expensive, complex and monolithic, and therefore only ‘for the big companies’. The real message should be that an EDW is driven by business cases, adaptive and quick to deliver. One of the goals of this site is to define the EDW in a way that is true to this statement, and to prove that it is effective in any situation regardless of the platform or technology that provides the data.
The key element of the solution is the adoption of an architecture that differs fundamentally from the common approach to Data Warehousing as it has been practised over the past 15 years. This architecture delivers flexibility in both design and implementation out of the box, provides true resilience, is ‘future-proof’ (change is a given for any Data Warehouse) and offers complete traceability of data. The designs start at a conceptual level and are gradually documented in more detail to provide the specifications required for ETL generation.
These pages focus on the various components that define the EDW architecture and describe how they meet these ambitions. The components are mostly established and well-known Data Warehouse concepts, but their careful positioning makes a big difference. Combined, they define an EDW that is truly consistent, adaptable and future-proof, ensures traceability and, equally important, works the same way in every scenario.
The approach provides a framework for the Enterprise Data Warehouse which will be used as context for the future (more in-depth) documentation and code on this blog. It is based on experience with data modelling, ETL and best practices.
The solution defines a number of architectural layers, each with an explanation of which functionality is implemented there and how it is best implemented. Each layer has its own page with detailed information, including the logic and attributes.
Fundamental EDW architecture requirements
No reference architecture can be conceived without certain basic principles (or ambitions):
- Data Vault 2.0 approach. Traditionally, Data Warehouse models were classified as either fully normalised (early Inmon) or fully denormalised (Kimball). A hybrid approach such as Data Vault provides a best-of-both-worlds solution while still being able to support existing approaches. In Data Vault, the normalised part ensures that every meaningful entity has its own separate table for distributing surrogate keys, alongside tables where history is stored in a traditional type-2 (denormalised) fashion.
- Separation of Data Warehouse concepts. The main functionality within a Data Warehouse is divided into separate steps. This contrasts with a typical Kimball / Dimensional Bus Architecture, where keys, structure and history are combined in the same entity. Splitting these functions provides additional flexibility and maintainability in the future. The separated concerns include:
- Surrogate key distribution
- Storing and tracking history
- Structure and hierarchy
- Cleaning and integration
- Design for flexibility. A major pitfall in Data Warehouse modelling is focusing the design entirely on current requirements or the current view of the world. Because requirements change over time, this approach limits the longevity and flexibility of the Data Warehouse. For this reason the new EDW shifts the focus from upfront information modelling to managing the data as an asset, using established principles such as surrogate keys, relationships and history tracking. Because no business logic is implemented during the loading processes into the Data Warehouse, this structured and historical System of Record can be used for any type of future reporting. The Data Vault model provides a limited number (three) of entity archetypes, which results in a flexible yet manageable model in which relationships and structure can be easily changed. A flexible architecture also offers:
- The ability to load data even when relationships between data change. Most hybrid modelling techniques store many-to-many relationships separately from the main entities, even when the data could currently be modelled as one-to-many without a separate relationship table.
- Catering for different levels of completeness of data. This impacts the way errors are handled / rejected and the specification of failure.
- Providing multiple versions of the truth (multiple datamarts based on the same data).
- Handling changing business rules. Being able to always represent data in another way impacts the integration approach and the datamart concept.
- Separating ‘as-is’ data from converted data, while still retaining the relation to the original data.
- Applying business rules / logic as late as possible: in the Presentation Layer.
- Full scalability, which manifests itself in many ways. Due to its modular design the Data Vault is extremely flexible (yet easy and consistent) when it comes to adding or changing data sources. It also enables the Data Warehouse team to change the table structure when database size becomes an issue, introducing a degree of normalisation without impacting the Data Vault concepts. A properly managed Data Vault is indefinitely scalable.
- Future-proof. As mentioned earlier, resilience when coping with change is a major goal of the architecture.
- Real-time ready. All concepts are checked against the possible future of handling real-time data sources.
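To make the separation of surrogate-key distribution, relationship storage and history tracking more concrete, here is a minimal, hypothetical sketch of the three Data Vault entity archetypes in Python. All table names, keys and values are illustrative assumptions, not part of any real implementation; a real Data Vault would of course live in database tables rather than in-memory structures.

```python
from datetime import date

# Illustrative sketch of the three Data Vault entity archetypes:
# a Hub distributes surrogate keys, a Link stores relationships as
# key pairs, and a Satellite tracks type-2 history. All names and
# values below are hypothetical.

hub_customer = {}            # business key -> surrogate key
sat_customer = []            # type-2 history rows for hub keys
link_customer_order = set()  # (customer_sk, order_sk) pairs

def hub_key(business_key):
    """Return the surrogate key for a business key, creating one if new."""
    if business_key not in hub_customer:
        hub_customer[business_key] = len(hub_customer) + 1
    return hub_customer[business_key]

def store_history(sk, attributes, load_date):
    """Append a new satellite row only when the attributes have changed."""
    current = [r for r in sat_customer
               if r["sk"] == sk and r["end_date"] is None]
    if current and current[0]["attributes"] == attributes:
        return  # no change, nothing to record
    if current:
        current[0]["end_date"] = load_date  # close the previous version
    sat_customer.append({"sk": sk, "attributes": attributes,
                         "load_date": load_date, "end_date": None})

# Loading the same customer twice with changed attributes keeps full history
# without touching the hub, and the relationship lives only in the link.
sk = hub_key("CUST-001")
store_history(sk, {"city": "Utrecht"}, date(2012, 1, 1))
store_history(sk, {"city": "Amsterdam"}, date(2012, 6, 1))
link_customer_order.add((sk, 501))  # 501: hypothetical order surrogate key
```

Because each concern sits in its own structure, a change in one (say, a new relationship cardinality) never forces a rebuild of the others; this is the maintainability argument made above.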
Ultimately, the goal of the EDW is to provide a meaningful structure for reporting, analysis or data mining that Business Intelligence software can access. The main difference from previous EDW architectures is that this EDW can accommodate all types of structures (star schemas, snowflakes, full tables, 3NF etc.) and even change its approach without losing data.
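The idea of applying business rules as late as possible, in the Presentation Layer, can be sketched as follows: the same unmodified historical rows feed two different presentation structures, so changing a rule (or adding a ‘version of the truth’) never requires reloading the warehouse. The status codes, mappings and mart names are purely illustrative assumptions.

```python
# Hypothetical sketch of late-binding business rules: raw_history stands
# in for unmodified System of Record data; each "mart" applies its own
# interpretation at presentation time only.

raw_history = [
    {"customer": "CUST-001", "status": "A", "load_date": "2012-01-01"},
    {"customer": "CUST-001", "status": "x", "load_date": "2012-06-01"},
]

def finance_mart(rows):
    # Finance rule: expose status codes exactly as delivered.
    return [{"customer": r["customer"], "status": r["status"]}
            for r in rows]

def marketing_mart(rows):
    # Marketing rule: translate codes; the interpretation lives in the
    # mart only, so it can change without touching the stored data.
    mapping = {"A": "Active", "x": "Churned"}
    return [{"customer": r["customer"],
             "status": mapping.get(r["status"], "Unknown")}
            for r in rows]
```

Both marts are derived views over identical data, which is exactly what allows the EDW to change its presentation approach, or serve multiple truths at once, without losing anything.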