Comparisons between Data Warehouse modelling techniques
This post provides an overview of the main pros and cons for various Data Modelling techniques:
- Third Normal Form (3NF) – The Corporate Data Model.
- Dimensional Modelling – Facts and Dimensions.
- Hybrids – Best of both worlds? Data Vault, Anchor Modelling and similar.
It has become a bit of a large post but then again, there is a lot of ground to cover.
Third Normal Form (3NF)
The pros for 3NF are:
- Most IT professionals are familiar with normalisation principles, even if they’re not DWH or Data Integration specialists. It is therefore relatively easier to get access to skilled resources.
- Highly structured.
- Reasonably suitable for near real-time loading.
- Easy to understand and relatively easy to extend.
- Information is stored in an efficient way (that’s why normalisation was invented).
The downsides for 3NF are:
- Long developing time (mainly due business changes and corresponding rework).
- Difficult to model out for an enterprise.
- Requires a Presentation Layer (not conductive to BI tools and OLAP).
- Coupling of key distribution introduces dependencies, and therefore reduces parallelism.
- ‘Subject Area Database’, modelled to current views.
- Definitions changing (‘customer’ means something else now).
- Growth of new relationships.
- Duplicate data sources require a priority / trust layer which effectively destroys the audit trail as it is handled inbound to the Data Warehouse.
- Cascading impact: changes ripple through to underlying tables (parent-child complexities).
- Increased number of ETL load depenencies due to referential integrity, making ETL harder to implement, test and manage.
- In many cases 3NF can be misleading as Referential Integrity is seen as Data Quality (Data Quality != Referential Integrity).
- Issues with time driven Primary Keys (new parent leads to key change).
- In 3NF relationships between Business Keys in the same table are ‘solidified’ to, for example, 1:M. When past or future data requires different relationships (e.g. new set of data requires M:M instead of 1:M) the whole structure has to be changed in 3NF. In comparison the Data Vault Link entity allows flexibility in handling relationships between the Business Keys (M:M, M:1, 1:M, 1:1, etc.) without changing the model structure.
The most significant problem related to the 3NF / Corporate Data Model / CIF approach is that first of all even the inventor of CIF (Bill Inmon) doesn’t think a normalised model is suitable for Data Warehousing. The reason is that the information is modelled to the way the company wants or thinks the data to be structure (the ‘ideal’ of our business) instead of the reality. This typically leads to very specific (unique and therefore hard to maintain) and complex ETL logic to map the data to this model. It also requires a big upfront modelling exercise before development can start and changes along the way are very difficult to implement as this without exception requires rework of related ETL.
The pros for Dimensional Modelling are:
- Perfectably suitable for multi dimensional analysis.
- Organised per subject area; easy to understand for business users.
- Minimise data replication (data landing).
- Suitable for historical storage.
- Easy to link in aggregation points (higher levels / aggregate awareness).
The downsides for Dimensional Modelling are:
- Expensive updates and deletes due to denormalisation.
- Dimensions’ over time (Type 1, 2 and 3) and uses set paths which are cemented in ETL.
- Architecture needs many different kinds / flavours of tables to implement a real application (helper, bridge, junk, mini etc.).
- Grain issues difficult to resolve.
- Real-time loading impractical.
- Issues with transactions appearing before dimension data.
- Complex loading and changing of history.
- Begins to fail under very heavy loads.
- Changes are invariably high-impact.
- Inflexible modelling of basic elements such as history, structure and key distribution.
- No data mining or exploration.
- Difficult to automate / no templates.
- Large data volumens / dimensions.
The most significant problem related to a Dimensional Model as advocated by Ralph Kimball (which means Data Staging + Star Schema) is the fact that is it is defined for a particular purpose or solution. Because the history of the system is encapsulated in the Facts and Dimensions as well this leads to major issues when the definition or business changes as it is not easy to change the particular purpose without losing or corrupting (part of) the embedded historical data. This is also true for additions to the model. While there are tried and tested methods of splitting dimensions it always is a cumbersome task and doesn’t take away the fact that history is only build up from the moment a particular attribute is defined as Type 2 or 6.
Nuances: Snowflake Models These are very hard to implement in a SCD2 environment (in a pure way) as changes in lower levels (outer snowflake tables) trigger changes in related (higher level) tables.
Currently Hybrid techniques compete mainly with 3NF in the realm of the Enterprise or Corporate Data Warehouses. In these situations there is already consensus that a three-layered solution is required. Hybrid techniques are not meant to compete with pure reporting environments, except in some cases when run on MPPs where views can replace the Dimensional Model. In most cases a Hybrid model serves as core DWH model and is extended with a Dimensional Model as Presentation Layer or Data Mart.
The pros for Hybrids (e.g. Data Vault) are:
- Completely auditable architecture.
- DWH model is aligned with the business model.
- Extremely adaptable to (business) changes.
- Designed and optimized for the EDW.
- Durable, consistent and predictable.
- Consistency pays back over time.
- Lends itself for real-time processing.
- Simple and consistent.
- Decoupling of key distribution enables a very high degree of parallelism (due to a reduction of ETL dependencies).
- Easy to load to a Dimensional Model.
- Layered isolation from change.
- Can be incrementally built, easily extended. When properly modelled no rework is required when adding additional information to the core Data Warehouse model. If new information is added there is always work involved in the greater picture such as development of an interface or update of the view, report or Data Mart but no existing components of the core model require rework.
- Scalability; provides for multi-terabyte storage.
- Easier detection of ‘dead data’.
- Handles combinations of datasets with different arrival speeds (loading frequencies).
- Isolated, flexible and incremental development (organic growth).
The downsides for Hybrids (e.g. Data Vault) are:
- Scaling versus performance: lots of outer joins and tables in queries. Because Hybrid techniques allow denormalisation to a large extent this typically leads to more tables compared to a Dimensional Model (though not that more compared to a 3NF model). When building a Report, Cube or Dimension typically a lot of tables need to be joined to achieve the result. The counter argument is that Hybrid techniques are extremely consistent many of these tasks can be automated, either by using clever scripting or ETL automation. In a sense it’s a lot of the same type of join but a very easy to understand one.
- Additional storage of data. As with 3NF an Integration Layer / Core Data Warehouse layer is essentially a copy of the data which can be regarded as redundant. Counter argument: storage is cheap. Not all storage requires a database, other types of archives can be used as well. There always is a balance between this, performance and complexity. In this case, when weighting costs of additional storage, memory of performance against ease of maintenance, flexiblity and the other benefits (i.e. audit trail, ability to regenerate, no single point of failure etc.) it is clear to see that adding some additional hardware is by far the better option.
- Not intended for ad-hoc end user access (including BI tools and OLAP). In other words: requires a Presentation Layer. A counter argument is that a Hub-and-Spoke architecture lends itself really well for Data Mining and in larger implementation Analytics / Data Scientist teams usually do have direct access. Another counter argument is that a Presentation Layer is implemented to remove the need for users to write complex queries and present data in an easily understandable structure.
- Aging relationships. The counter argument is that there is no reason to remove older relationships as they still show the state of the system at a point in time and they can easily be removed from use in the Presentation Layer.
- Currently not an ‘open’ platform as implementation guidelines are fairly hard to get hold off and/or get agreement on. There are no strict or formal freely available implementation guidelines. The conceptual side is fairly well documented. This may lead to different people having different views on the implementation side.
- Does not provide solutions for Data Marts. The counter argument is that a Hybrid core Data Warehouse model is a perfect solution for the Data Staging concept in Dimensional Modelling and together they reduce some of the downsides of having a Dimensional Model.
- Requires highly structured and experienced (data) architect role (more so in comparison to other techniques).
- A higher than average number of database and ETL objects. The number of objects in a DV vs. 3NF is approximately a 3:1 ratio at the beginning of the EDW build and approx. 2:1 for mature warehouses. Counter: Objects in DVs are generally smaller and easier to confederate.