How different modelling approaches impact your Data Vault
Data Vault methodology, and by extension all Ensemble Logical Models, is designed to be flexible. Approaches such as these accept that the model, as an interpretation of reality, is always changing.
Change can be driven through technology, for example when operational systems within an organization are replaced by others. Or, change can occur because understanding of the subject matter is progressively increasing. What was initially thought to be the case may turn out to be slightly different than anticipated. Or, earlier assumptions may simply have been incorrect.
Additionally, requirements for how the data will be used may become more specific over time. This, in turn, can force a more detailed definition and clarification of certain parts of the model.
The model, and therefore the complete data solution, is always evolving. To support this, having access to reliable automation tooling is imperative.
Even when accepting that models do change over time, and investing in automation technology that supports this, the modeling process itself can be started from very different perspectives.
This starting point may greatly impact how the model is likely to evolve over time.
Structure does not mean compliance with methodology
Having a certain model structure does not necessarily mean that a given methodology is implemented correctly; the structure by itself does not demonstrate compliance with the underlying concepts.
That a model looks like a data warehouse model, swims like a data warehouse model and quacks like a data warehouse model doesn't mean the overall solution necessarily qualifies as a data warehouse.
I find this to be especially confusing in the Data Vault community, where terms such as 'Raw', 'Business', 'Interpreted', 'Operational', and 'Source' Data Vault can have different meanings to different people.
This also makes for a great way to show how the modelling starting point changes the outcome, and to what extent the data warehouse criteria are satisfied.
When it comes to defining what a data warehouse is, I'll revisit some of the key characteristics as originally provided by Bill Inmon. While there are various definitions, his seems to remain the most popular.
The main themes of this definition are that a data warehouse is:
- Subject-oriented; meaning that data is organised around a specific subject area or goal, as opposed to around how the operational systems that provide the data happen to organise it
- Integrated; covering consistency in conventions and, of course, the ability to bring together disparate data sets from various places
- Non-volatile; which means that the data is not modified once it is processed
- Time-variant; which means that data contains an element of time and can be analysed in time series
It is interesting to consider how these requirements apply to the model when different starting points for the design are chosen.
Starting points for modelling
When designing a Data Vault, the starting points for data modelling can, at a high level, be:
- Source-based – using the available (source) data
- Reference-based – using industry reference models and ontologies
- Business-driven – using the collective understanding of the business model and processes
Source-based approaches are popular with software vendors, because they allow immediate and recognisable results by deriving a Data Vault model from the structure of the data sources – for example, by using primary key information and foreign key references. This makes the data source look like a Data Vault, but the model is created from a technical perspective and biased by how the operational system organises its data. Generally speaking, the resulting model does not fully represent the business view of the data. It also tends to cause issues with managing model changes over time, as we will see later.
Alternatively, organisations sometimes opt to use industry reference models to kick-start the modelling efforts. Industry models tend to use more generic and abstract terms and definitions, which may not align with how your company operates or refers to things. Although different levels of abstraction can be considered, the resulting models tend to be more generic, using less familiar names. This can lead to issues with delivering the data, as we will see later.
Lastly, a common approach is to 'workshop' a model, drawing on the collective understanding of the business and its processes held by the people involved in the design. The resulting 'business model' is more likely to contain definitions and terms that are familiar and meaningful, at least to the team(s) involved. These models tend to be stable, because they are modelled after business processes, and these usually don't evolve as fast as operational systems do.
Depending on which starting point you choose, the Data Vault model may turn out very differently – at least in the beginning.
An example
Let’s say there are three operational systems involved in a new data solution initiative for a Fast-Moving Consumer Goods company.
- A Human Resources (HR) application that organises employee records for each store. The administration is done by employee ID against a store hierarchy. The store and the store hierarchy are the central constructs around which the HR department is organised
- An operations Asset Management System (AMS) application that manages the various brick-and-mortar assets including lease, maintenance, and general upkeep
- A finance system to manage the General Ledger. Expenses, budgets, and scenarios are tracked at account code level
In this example, the HR system organises personnel according to a store hierarchy. This is the process the HR team is familiar with and uses every day. The AMS focuses primarily on building maintenance. Operations is familiar with brick-and-mortar 'assets' that are uniquely numbered in the AMS and are referenced for delivery and maintenance purposes.
Lastly, there is the Finance system. Finance is focused more on the balance and ledger. The Finance team works with General Ledger ‘Account’ codes that identify costs and profits at store level.
When it is time for resource planning, HR provides Operations with an Excel sheet that maps the employees related to each asset. This mapping is not maintained in the Operations system directly; as a process, it is owned by the HR team and may not be widely known.
There is also a system interface between Finance and Operations, which runs on a daily basis. It was developed by IT and makes sure that each account can be mapped to an asset. This mapping is ultimately recorded in the Operations system in a free-format notes field that can be used for reference when certain cost attributions need to be discussed with the Finance team.
There is no way to relate accounts to stores directly (from the Finance system to the HR system), which makes it hard to manage expenses at a more detailed level.
At a high level, this setup looks like this:

The resulting Data Vault will vary greatly depending on the modeling starting point.
Modelling based on the available data
The first approach – source-based, or bottom-up – uses the available data, including primary keys and foreign key references between the data sets, to derive a model for a specific target methodology such as Data Vault. However, this model will be created from a technical perspective.
When directly relying on the original data and its metadata and structure, the model will represent a Data Vault version of the original data. The added value of such a model is limited, and it may be more effective to simply work with a historised version of the original data directly – for example, as provided by a Persistent Staging Area (PSA).
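To make this concrete, below is a minimal Python sketch of how such a bottom-up derivation could work. The table structures and naming conventions are illustrative assumptions, not taken from any specific automation tool: tables with a primary key become Hub candidates, foreign key references become Link candidates, and the remaining columns become Satellite context.

```python
# Hypothetical metadata for the three example systems.
source_metadata = {
    "HR.STORE":    {"pk": ["STORE_ID"],    "fks": {},                       "attrs": ["STORE_NAME", "REGION"]},
    "HR.EMPLOYEE": {"pk": ["EMPLOYEE_ID"], "fks": {"STORE_ID": "HR.STORE"}, "attrs": ["FULL_NAME", "ROLE"]},
    "AMS.ASSET":   {"pk": ["ASSET_NO"],    "fks": {},                       "attrs": ["ADDRESS", "LEASE_END"]},
    "FIN.ACCOUNT": {"pk": ["ACCOUNT_CD"],  "fks": {},                       "attrs": ["ACCOUNT_DESC"]},
}

def derive_raw_model(metadata):
    hubs, links, sats = [], [], []
    for table, meta in metadata.items():
        entity = table.split(".")[1]
        hubs.append(f"HUB_{entity}")   # one Hub per source table with a primary key
        sats.append(f"SAT_{entity}")   # remaining attributes become Satellite context
        for referenced in meta["fks"].values():
            links.append(f"LNK_{entity}_{referenced.split('.')[1]}")  # one Link per FK
    return hubs, links, sats

hubs, links, sats = derive_raw_model(source_metadata)
print(hubs)   # ['HUB_STORE', 'HUB_EMPLOYEE', 'HUB_ASSET', 'HUB_ACCOUNT']
print(links)  # ['LNK_EMPLOYEE_STORE'] – only relationships the sources declare
```

Note how only the Employee-Store relationship appears: the Excel sheet and the free-format notes field are invisible to this approach, which foreshadows the integration gaps discussed below.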
As covered in many books and articles by Dan Linstedt, Hans Hultgren, John Giles and others, the true value of the Data Vault model lies in being able to define the key elements in the business and use these to integrate the various disparate data sets against.
A Data Vault model derived from the original data in this way is sometimes referred to as a Raw Data Vault. Following Data Vault methodology, however, this is strictly speaking not correct.
In Data Vault, a Raw Data Vault contains data whose original meaning has not been fundamentally changed. Instead, the data is re-organised in such a way that it resembles the business view of the data.
Among other things, this makes adding new data later more straightforward, because the business naming will be more accommodating for the new data without requiring re-engineering.
If we consider the data that is received from the three example systems outlined earlier, the model may look like this:
Note that there is no systematic way to relate an 'Account' from the Finance system to a 'Store' in the HR system. With some effort, it may be possible to establish a relationship between the Store and Asset concepts by also incorporating the Excel sheet.
This model is also relatively more subject to changes in the operational systems. For example, if the organisation were to adopt a new HR system, what used to be the 'Store' concept could still be mapped to the Store Hub even though this concept no longer exists in the operational system. Alternatively, a new Hub could be created.
Either way, this approach is not really integrated: each concept in each system is modelled as its own Hub entity. For the same reason, it is not really subject-oriented either. Following standard Data Vault patterns, information is recorded and not subsequently modified, and it is recorded with a load date/time stamp (LDTS) – so the non-volatility and time-variance requirements would be satisfied.
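As a minimal sketch of what this insert-only pattern looks like – with illustrative structures rather than any specific tool's implementation – consider loading context for the Store Hub:

```python
from datetime import datetime, timezone

sat_store = []  # Satellite rows for HUB_STORE; the list is only ever appended to

def load_satellite(satellite, business_key, attrs):
    """Insert a new context row only when the attributes actually changed."""
    history = [r for r in satellite if r["store_id"] == business_key]
    latest = max(history, key=lambda r: r["ldts"], default=None)
    if latest is None or latest["attrs"] != attrs:
        satellite.append({
            "store_id": business_key,
            "ldts": datetime.now(timezone.utc),  # load date/time stamp: time-variance
            "attrs": attrs,                      # context exactly as delivered
        })

load_satellite(sat_store, "S-001", {"store_name": "Main Street", "region": "North"})
load_satellite(sat_store, "S-001", {"store_name": "Main Street", "region": "East"})
print(len(sat_store))  # 2 – the earlier row is preserved, never updated (non-volatility)
```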
Adopting (industry) reference models
If a top-down approach to modelling is desired, there are many reference models that can be adopted to kick-start the data modelling efforts.
Reference models are provided by some software vendors, described in data modelling books, or can be purchased off-the-shelf or as part of a consultancy package.
In many cases, these are defined in a more abstract form and typically used for a conceptual description of a given business. Using industry reference models may lead to a data solution model that is not necessarily meaningful to the team working with it, because more generic and abstract definitions are used.
Also, while reference models are carefully crafted to describe a certain type of company, the way they do so may not necessarily align with how all such companies operate or name things.
For Data Vault, these more abstract terms from a reference model are best changed to more specific concepts and relationships, because Data Vault methodology recommends avoiding sub- and supertyping of model entities. Using supertypes such as 'Party', 'Arrangement' or 'Service' is likely to lead to complex role-playing for relationships and difficulties in extracting the right information.
Is a relationship between a Party and an Arrangement a working agreement, or a building lease? And is a relationship between a Party and another Party one between an employee and a manager, or between a service agent and a customer?
A reference model is likely to evolve into more detailed and specific entities that are more familiar to the business, and more accurately describe the concept.
An initial model could look like this:

All concepts described for the systems in use can be categorised as a 'Party', and their relationships can be recorded in a self-referencing relationship where this applies. This hides important detail: for example, it is not immediately visible that the relationship between 'Account' and 'Store' does not exist in the data.
Because the relationship carries many different meanings, it is necessary to record exactly what the nature of each relationship is. An additional Satellite entity (Link-Satellite) is added for this purpose.
As mentioned earlier, there are many levels of abstraction that can be applied here, and the 'Party' level is arguably one of the highest – and therefore less useful as a physical model. It is used here to illustrate the way abstraction impacts the Data Vault model.
The Link entity would be heavily involved in role-playing, and this can make it hard to extract the correct data: queries become complex and involve many self-references.
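A small Python sketch, using made-up rows and names, shows the effect. Every question asked of the model first has to resolve the role-playing via the Link-Satellite, and chained questions (account to asset to store) repeat this against the same Link:

```python
# Illustrative rows for a generic, self-referencing Party-to-Party Link
lnk_party_party = [
    {"link_key": 1, "party_1": "EMP-17",   "party_2": "MGR-03"},
    {"link_key": 2, "party_1": "ACC-8100", "party_2": "AST-0042"},
]
# The Link-Satellite records what each relationship actually means
lsat_party_party = [
    {"link_key": 1, "relationship_type": "employee-manager"},
    {"link_key": 2, "relationship_type": "account-asset"},
]

def relationships_of(rel_type):
    # Resolve the role-playing: which Link rows carry this particular meaning?
    keys = {s["link_key"] for s in lsat_party_party
            if s["relationship_type"] == rel_type}
    return [l for l in lnk_party_party if l["link_key"] in keys]

print(relationships_of("account-asset"))
# [{'link_key': 2, 'party_1': 'ACC-8100', 'party_2': 'AST-0042'}]
```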
In all cases, the model itself doesn’t ‘read’ as well. Without looking at the data it is not immediately clear what kind of data is involved with these concepts.
Distilling the shared understanding of the business into a model
To avoid introducing a bias in the way model objects, relationships, and columns are named, many organisations choose to start the modelling with a series of workshops that attempt to capture the essence of the information that is used.
By initially discussing more generic definitions and terms and considering the business processes, a model is created that is elevated from the way the data is physically recorded by the operational systems in production.
Business processes are generally more stable than the systems that support them. New systems and applications are introduced and modified regularly, but the business process does not fundamentally change at the same pace.
Business modelling workshops identify what types of things are used in the business, and how these interact with each other. They also define ways to identify a certain thing that are not necessarily tied directly to the (technical) way a given application records it in data.
The result is a model that is specific to the business, containing definitions and terms that are familiar and meaningful to the people that work there.
Even though this model will also change over time, it is expected to evolve at a slower rate than the more technical starting point of using the original data to model the Data Vault.
If we apply this to the example systems used in this article, the model will look different yet again. Based on conversations and workshops, it is found that all three systems work with a concept that can best be described as a 'Branch', even though it is named differently in each system.
Functionally, they all mean the same thing, and acknowledging this results in the following model:

Note that the ‘Account’ and ‘Store’ are integrated into the ‘Branch’ concept. Both datasets load data into the Branch entity. By virtue of this, they are automatically integrated – a concept known as ‘passive integration’.
If desired, their descriptive information – the context – can still be separated in the Data Vault so that the original data is preserved for each system that provides the data.
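As a minimal sketch of passive integration – assuming both systems can be mapped to the same 'Branch' business key, and using an illustrative hashing convention – both loads simply target the same Hub:

```python
import hashlib

hub_branch = {}  # hash key -> Hub row; insert-only

def hub_key(business_key: str) -> str:
    # Deterministic key derivation, so every source computes the same hash
    return hashlib.md5(business_key.strip().upper().encode()).hexdigest()

def load_hub_branch(business_key: str, record_source: str):
    key = hub_key(business_key)
    if key not in hub_branch:  # the first system to deliver the key creates the row
        hub_branch[key] = {"branch_code": business_key, "rsrc": record_source}

load_hub_branch("B-0042", "FIN.ACCOUNT")  # Finance delivers it via an account mapping
load_hub_branch("B-0042", "HR.STORE")     # HR delivers it as a store
print(len(hub_branch))  # 1 – the two systems meet on the same Hub row
```

No explicit integration logic is needed: because both systems resolve to the same business key, they are integrated simply by being loaded.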
Raw, or Business?
Note that all these models can still be referred to as a 'Raw' Data Vault according to the original definition. The mapped data has not been fundamentally altered; it has been placed in the right context and given holistic, agreed, and purposeful names.
From here, interpretations can be applied to the raw data to complete the model and to ultimately be able to provide the requested information. This is where the 'Business' Data Vault concept comes in. In terms of structure, the Business Data Vault uses the same conventions as the Raw Data Vault. In the context of modelling, as in this article, the terms 'Raw' and 'Business' can be a bit unhelpful, so I wanted to clarify this again here.
The difference is that the Business Data Vault uses the Raw Data Vault information to manipulate data in ways that would otherwise change the meaning of the data. In Data Vault methodology, this is referred to as applying destructive transformations – often called 'business logic'.
An important consideration is the approach that will be taken with regard to implementing destructive transformations. Is the intent to create a source-based Data Vault first, and then add business logic? Or will the approach be to start with a business model, map the available data, and define a Raw Data Vault that way?
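As a minimal sketch of the difference – using a made-up business rule that nets ledger amounts per branch – the Raw Data Vault keeps every delivered row, while a derived Business Data Vault satellite holds the transformed result:

```python
# Raw Data Vault rows: unmodified, fully historised ledger amounts
raw_sat_branch_ledger = [
    {"branch_key": "B-0042", "ldts": "2024-01-01", "amount": -150.0},
    {"branch_key": "B-0042", "ldts": "2024-01-02", "amount":  200.0},
]

def derive_business_satellite(raw_rows):
    """Apply a destructive transformation: the per-row detail is gone in the output."""
    totals = {}
    for row in raw_rows:
        totals[row["branch_key"]] = totals.get(row["branch_key"], 0.0) + row["amount"]
    return [{"branch_key": k, "net_position": v} for k, v in totals.items()]

print(derive_business_satellite(raw_sat_branch_ledger))
# [{'branch_key': 'B-0042', 'net_position': 50.0}]
```

Because the raw rows remain available, the rule can always be revised and replayed later; the transformation is destructive only in its output, not in what is stored.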
Wrapping up
Using the same scenario, the three modelling starting points produce very different Data Vault models. The recommended approach is to start each data solution iteration with a business modelling exercise, to at least find the common ground on definitions and terms that exists across the involved stakeholders – and to get the best possible starting point, with a level of detail that fits the current state of the business.
But there is more to it than this. Taking the proverbial 'step back', it is important to consider that modelling and modelling workshops can be just 'words'. Definitions will change over time, becoming more precise or specific, or losing some relevance to the bigger picture as they are discussed in more detail. It is OK to change your mind – in fact, it is a good thing – to incorporate progressive understanding of the subject matter and revise the model. No one, nowhere, will ever have a full understanding of everything at a given moment, and on top of this, interpretations change.
There are advanced data modelling approaches that accurately capture the terms, context, and meaning so that the chosen business concepts are more than just ‘words’ – and allow collaboration on this. Fact-Oriented Modelling, for example, provides mechanisms to collect, represent, and reuse this knowledge and use it to generate information models and even technical artefacts. These approaches use the structure of natural language, as well as examples and semantics to accurately represent information models.
If this is of interest, please have a look at CaseTalk, FIML, or FCO-IM.
Automation approaches exist to translate the design into a physical implementation, and to allow easier refactoring of the solution when certain parts are defined differently – for example, when a Context property is remodelled into a Core Business Concept.
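As a rough illustration of that idea – with an entirely hypothetical model definition and naming convention – the physical structures can be generated from model metadata, so a remodelling exercise becomes a change of definition followed by regeneration, rather than hand-written re-engineering:

```python
# Hypothetical model definition: promoting a Context property to its own
# Core Business Concept means adding an entry here and regenerating.
model = {"hubs": {"BRANCH": "branch_code", "EMPLOYEE": "employee_id"}}

def generate_hub_ddl(model):
    for hub, business_key in model["hubs"].items():
        yield (
            f"CREATE TABLE HUB_{hub} (\n"
            f"  {hub.lower()}_hkey CHAR(32) NOT NULL PRIMARY KEY,\n"
            f"  {business_key} VARCHAR(100) NOT NULL,\n"
            f"  ldts TIMESTAMP NOT NULL,\n"
            f"  rsrc VARCHAR(50) NOT NULL\n"
            f");"
        )

for statement in generate_hub_ddl(model):
    print(statement)
```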