Modelling a hierarchy in Data Vault

This is an updated version of a post written way back in 2010 (!) because some of the content was lost. Interestingly, the concepts haven’t changed that much. But the imagery has hopefully improved!


Hierarchies are essentially relationships, and therefore implementing these concerns Natural Business Relationships (NBRs) and Data Vault’s version of this – the Link entity.

The purpose of a Link is to record relationships between Business Concepts, which are Hubs in Data Vault. As such, the Link contains contains the unique (list of) relationships between Hubs. And a hierarchy is just that; a series of relationships where each involved Hub record represents a higher- or lower level in a relationship.

Essentially, implementing hierarchies is one of the ways you can use the Link entity to drive a certain behaviour. However, it’s not the only use case and various variations of the same theme exist.

Different ways of using Links

Consider the image below, which depicts two of the most common Link use-cases beyond ‘just’ establishing a relationship between two or more Hubs:

Hierarchical and Same-As Links

Hierarchical and Same-As Links record a specific relationship behaviour between instances of the same Hub type. For example, the reporting line relationship between employee and manager (for a hierarchical relationship) and relationship that two customer records are the same for a ‘single view of customer’ (for a same-as relationship).

From a technical perspective, these Links are structurally the same as any other Link, but they relate to the same Hub – as opposed to two or more different Hubs. The fact that the same Hub is referenced at least twice means that the (Hub) keys in the Link must be aliased because you usually can’t have the same column/attribute name in the table object twice. The name must be unique, so an alternative name (alias) is required.

So, even though the reference Hub is called ‘Employee’ in the above example the referenced keys are ‘Manager_SK’ and ‘Employee_SK’. SK stands for ‘Surrogate Key’ in this example.

If we focus on the hierarchical relationship, this is essentially capturing a recursive structure in the Link. An employee may report to a manager, but the manager may also report to another manager. These hierarchies can be ragged, skip-level, circular – anything.

The Link can be supported by a Link-Satellite to capture effectivity information to record the validity period for a specific relationship. And each relationship can be described using a Link-Satellite also. Due to the (data) dynamics around storing effective dates on relationships and tracking changes on context it is often recommended to separate this into two separate Link-Satellites that both describe the Link.

Modelling out a relationship

Capturing a hierarchy can be implemented in a recursive structure as shown above. But it is also an option to ‘model out’ the hierarchy into separate objects, where each ‘level’ in the hierarchy is defined as its own Hub.

In this example, where both the hierarchical and same-as use-case are shown, the difference is that each level has been modelled as its own Hub. There are no double references to the same Hub, so this looks more like a regular relationship Link again.

If your hierarchy has multiple levels, you model separate Hubs and create relationships between those levels using Links. As before, each relationship can be described using Link-Satellites to capture any context or validity information that may apply.

What to choose?

Both approaches, the recursive approach and the modelling-out approach, are very similar and functionally achieve the same thing. The recursive structure is arguably easier to load, and adding levels doesn’t require structural changes such as adding new Hub and Link tables.

On the other hand, the modelled-out approach is usually easier to query and/or load into Point-In-Time or Dimension tables. Also, the model is a bit easier to read because it is clear from the objects what elements/levels exist for a given relationship. However, data logistics require a bit more effort because each separate Hub or Satellite requires one or more data logistics processes to load the data, and depending on the source of the data this may involves adding filters or predicates to ‘route’ the data into the right object (e.g. a WHERE clause).

Your decision on this is likely based on your own philosophy on this, and that works because both approaches achieve the same outcome. Personally, I prefer to model-out the relationship where possible because I prefer data models that are more specific on the purpose and data relationships. But this is really optional.

Roelant Vos

Ravos Business Intelligence admin

You may also like...

7 Responses

  1. Apex says:

    Wow, so many technical fields. Are they all really required?

    • Ravos says:

      Hi,
      It’s not so bad as it looks 🙂 For any table you’d want to logwhen a record was inserted and from which interface or source it originated. Surrogate keys are also mandatory. The rest I guess could be optional but they will certainly help to make life easier. For instance a record valid indicator will save you from querying date attribute to determine which record is the most recent one. Now you can just select on this indicator being true (it’s faster too).

      The valid time periods are always a source of discussion. In general you could derive a valid time period by checking the previous related record, but this is performance intensive and more hassle query wise. The inclusion of a source valid time period is the only one that is really optional! While a useful concept I’ll probably remove this from the definition later on since they are just regarded as ordinary attributes from a Data Vault point of view.

      Data Vault in its leanest form only requires the surrogate key, the datetime of insert and the record source indication.

      Regards,
      Roelant

  2. Apex says:

    Hello,

    OK, I got you. I am agreed that traceability very important, but drawback is space consumption… It’s always hard to decide for yourself, what is more evil wide big tables in DWH or lack of trace info within it:)

    Could you explain what meaning do you put into EVENT_DATETIME field? How does it differ from RECORD_INSERTED_DATETIME? And what difference between DWH_VALID_xxx axis and ORIGINAL_VALID_xxx axis?

  3. Ravos says:

    Hi,

    You’re right, storage space can be an issue and is definitely something to keep in mind. You can adapt your model to consume space relatively easy though. When you take Data Vault as a starting point you could split a big satellite into a ‘fast changing’ and ‘slow changing’ satellite (Kimball theory states the same regarding dimensions). This alone will save enough room to make up for the extra auditing columns. In theory you can also normalize the process information bit although I wouldn’t recommend it. Remember though, once you’ve stored your record there is no way to retrace its steps if questions do occur. This can be extremely important especially in the financial world with their compliance rules (Sox, Basel). I would say the extra bit of hardware is worth the cost 🙂

    About the event_datetime. This would be an additional process attribute to store the date/time the event in the source system which caused the change took place. In other words: when someone updates your file in the HR system (=event datetime) it will be done earlier than the record is inserted into the DWH (=record inserted datetime). Knowing when the change in the source took place can come in handy when backtracing information.

    This is especially important when your source is a message based platform where due to various reasons messages are not always guaranteed to arrive in the same order as they were sent. An event datetime can be used to process these in the correct order upon arrival.

    All process attributes are always added to be ready for future change!

    Regards,
    Roelant Vos

  4. Ashwini Bhoite says:

    Hi, really curious to look at the hierarchy model but not able to load the image, shows IP address could not be found.

  5. Ravos says:

    Hi Ashwini, apologies, somehow your reply fell through my notifications. I will have a look at this today and dig up / re-link the image.

Leave a Reply