Modelling a hierarchy in Data Vault

Lately I have been having interesting discussions regarding modelling hierarchies in Data Vault. It seems the easiest way is to create a new link table between the hubs which contain the hierarchy elements. In this case, the hub contains all the possible nodes and the link table contains all possible (parent/child) relations between these nodes. This can be expanded to add more information about the hierarchy relations, including level indication and of course validity periods. The hierarchy itself can also be modelled generically by creating a single hierarchy hub. But this is all optional.
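As a rough sketch of this pattern (all table and attribute names here are illustrative assumptions, not a fixed standard): the hub holds every possible node, and the link holds the parent/child pairs, optionally carrying a level indication and a validity period.

```python
from datetime import date

# Hypothetical hub: all possible hierarchy nodes, keyed by surrogate key.
hub_node = {
    1: {"node_code": "COMPANY"},
    2: {"node_code": "DIVISION_A"},
    3: {"node_code": "DEPT_A1"},
}

# Hypothetical link: all parent/child relations between the hub nodes,
# expanded with a level indication and a validity period as described above.
link_hierarchy = [
    {"parent_sk": 1, "child_sk": 2, "level": 1,
     "valid_from": date(2010, 1, 1), "valid_to": date(9999, 12, 31)},
    {"parent_sk": 2, "child_sk": 3, "level": 2,
     "valid_from": date(2010, 1, 1), "valid_to": date(9999, 12, 31)},
]

def children_of(parent_sk, as_of):
    """Return the node codes of the children valid on a given date."""
    return [hub_node[link["child_sk"]]["node_code"]
            for link in link_hierarchy
            if link["parent_sk"] == parent_sk
            and link["valid_from"] <= as_of <= link["valid_to"]]

print(children_of(1, date(2012, 6, 1)))  # ['DIVISION_A']
```

Because the link rows carry their own validity periods, the same structure can represent the hierarchy as it looked at any point in time.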

Modelling hierarchies in Data Vault.

Hierarchy example.

Roelant Vos



4 Responses

  1. Apex says:

    Wow, so many technical fields. Are they all really required?

    • Roelant Vos says:

      It’s not as bad as it looks 🙂 For any table you’d want to log when a record was inserted and from which interface or source it originated. Surrogate keys are also mandatory. The rest, I guess, could be optional, but they will certainly help to make life easier. For instance, a record valid indicator will save you from querying date attributes to determine which record is the most recent one. Now you can just select on this indicator being true (it’s faster too).

      The valid time periods are always a source of discussion. In general you could derive a valid time period by checking the previous related record, but this is performance-intensive and more hassle query-wise. The inclusion of a source valid time period is the only one that is really optional! While it is a useful concept, I’ll probably remove it from the definition later on, since these periods are just regarded as ordinary attributes from a Data Vault point of view.
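      A minimal sketch of what deriving the valid time period from the adjacent related record means in practice (the attribute names are assumptions): each record’s end date is taken from the start of the next record for the same key, and the record valid indicator marks the latest one so you never have to repeat this lookup at query time.

```python
from datetime import datetime

# Hypothetical satellite rows for a single key, ordered by insert datetime.
rows = [
    {"inserted": datetime(2011, 1, 1), "value": "A"},
    {"inserted": datetime(2011, 3, 1), "value": "B"},
    {"inserted": datetime(2011, 6, 1), "value": "C"},
]

HIGH_DATE = datetime(9999, 12, 31)

# Derive each record's valid period by looking at the next related record;
# this is the query-intensive work the indicator lets you skip later.
for current, nxt in zip(rows, rows[1:] + [None]):
    current["valid_from"] = current["inserted"]
    current["valid_to"] = nxt["inserted"] if nxt else HIGH_DATE
    current["record_valid"] = nxt is None

# With the indicator in place, finding the most recent record is a
# simple filter instead of a date comparison.
print([r["value"] for r in rows if r["record_valid"]])  # ['C']
```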

      Data Vault in its leanest form only requires the surrogate key, the datetime of insert and the record source indication.


  2. Apex says:


    OK, I got you. I agree that traceability is very important, but the drawback is space consumption… It’s always hard to decide for yourself which is the greater evil: big, wide tables in the DWH, or a lack of trace info within it :)

    Could you explain what meaning you put into the EVENT_DATETIME field? How does it differ from RECORD_INSERTED_DATETIME? And what is the difference between the DWH_VALID_xxx axis and the ORIGINAL_VALID_xxx axis?

  3. Roelant Vos says:


    You’re right, storage space can be an issue and is definitely something to keep in mind. You can adapt your model to conserve space relatively easily, though. When you take Data Vault as a starting point you could split a big satellite into a ‘fast changing’ and a ‘slow changing’ satellite (Kimball theory states the same regarding dimensions). This alone will save enough room to make up for the extra auditing columns. In theory you could also normalize the process information a bit, although I wouldn’t recommend it. Remember, though: once you’ve stored your record there is no way to retrace its steps if questions do occur. This can be extremely important, especially in the financial world with its compliance rules (SOX, Basel). I would say the extra bit of hardware is worth the cost 🙂

    About the event_datetime: this would be an additional process attribute to store the date/time at which the event in the source system that caused the change took place. In other words: when someone updates your file in the HR system (= event datetime), this happens earlier than the moment the record is inserted into the DWH (= record inserted datetime). Knowing when the change in the source took place can come in handy when backtracing information.

    This is especially important when your source is a message-based platform where, for various reasons, messages are not always guaranteed to arrive in the same order as they were sent. An event datetime can be used to process them in the correct order upon arrival.
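    To illustrate the point with a small sketch (the message contents are made up): messages may arrive in a different order than the events occurred, and sorting on the event datetime restores the source order before loading.

```python
from datetime import datetime

# Hypothetical messages that arrived out of order from a message platform.
messages = [
    {"event_datetime": datetime(2011, 5, 2, 10, 0), "payload": "update"},
    {"event_datetime": datetime(2011, 5, 1, 9, 0), "payload": "insert"},
    {"event_datetime": datetime(2011, 5, 3, 8, 0), "payload": "delete"},
]

# Process the messages in the order the events actually happened in the
# source system, not the order in which the messages arrived.
for msg in sorted(messages, key=lambda m: m["event_datetime"]):
    print(msg["payload"])
# prints: insert, update, delete
```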

    All process attributes are always added to be ready for future change!


