Data Vault versus the persistent Staging Area
One of the questions I regularly get during presentations is what the benefits of Data Vault are over a persistent Staging Area. In other words: why go through the effort of defining a Data Vault model when you can achieve the same ‘regeneration / recreation’ capabilities with a persistent Staging Area that directly feeds a Dimensional Model or similar presentation model?
First off: in my reference architecture I use the Data Vault (in the Integration Layer), a non-persistent Staging Area (the Staging Area in the Staging Layer) and a persistent Staging Area (the History Area in the Staging Layer). From my point of view one concept does not exclude the other, and each can be used to complement the EDW. The purposes of the Areas are very different: the Staging Area focuses on retrieving the information and organising delivery to the EDW, and in some (real-time) cases it serves only as an archive needed for Disaster Recovery. In that context the Staging Area is a sideline in the ETL processing, with source systems feeding into the Data Vault directly.
The reason for having both models, in other words for decoupling the ‘unintegrated set of raw data’ (Staging Area) from the ‘managed raw data by business key’ (Data Vault), is that the Data Vault keys are used throughout the system to identify the unique entities upstream and to provide the audit trail. This also means that everything becomes generic and template driven, which is a requirement for ETL generation, and the ETL itself becomes more straightforward. The ETL from the source / Staging Area to the Data Vault also defines the entities and relationships, which simplifies the creation of the Data Marts (which can now also be made template driven).
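To make the ‘generic and template driven’ point concrete, here is a minimal sketch of a hub-load template in Python. All names (`hash_key`, `load_hub`, the `customer_id` column) are hypothetical illustrations, not part of any specific tool; the hashing convention (MD5 over normalised business key parts) is one common Data Vault choice, assumed here for the example.

```python
import hashlib

def hash_key(*business_key_parts):
    """Derive a deterministic hub key from the business key:
    MD5 over the trimmed, upper-cased parts joined with '||'."""
    normalised = "||".join(str(p).strip().upper() for p in business_key_parts)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()

def load_hub(hub, staging_rows, business_key_column):
    """Generic hub-load template: insert every business key the hub
    has not seen before. `hub` stands in for the target hub table,
    keyed on the hash key, so re-running the load is idempotent."""
    for row in staging_rows:
        key = hash_key(row[business_key_column])
        if key not in hub:
            hub[key] = {"business_key": row[business_key_column]}
    return hub

# The same template loads any hub; only the business key column differs.
staging = [{"customer_id": "C-001"}, {"customer_id": "C-001"},
           {"customer_id": "C-002"}]
hub_customer = load_hub({}, staging, "customer_id")
```

Because the key derivation and the load pattern are identical for every hub, the ETL for each new source entity is a parameterisation of the same template rather than hand-written logic, which is what makes generation feasible.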
ETL that loads Data Marts directly is typically complex, and it does not help that Staging Area tables are stand-alone and typically have no Foreign Keys (recommended!), whereas Data Vault has streamlined this process and has additional capabilities to manage parallelism and dependencies. A further argument is accessibility in terms of performance: the persistent Staging Area is optimised to store information as an archive as well as possible, whereas the Data Vault is normalised to a degree around information entities which can be managed independently (e.g. splitting of satellites, indexing).
In the end Data Vault is the only methodology that defines a separation of storage, history, structure, auditability and recovery from the presentation (data/information) models. All other approaches force a logical or conceptual data model of the (target) business directly onto the ‘storage model’, greatly reducing flexibility and increasing the complexity of both the data model and the ETL required to update it.