The ‘architecture’ section of this weblog serves as an introduction for the Data Integration Framework Github repository (see the collaboration section). This Github repository contains the various levels of documentation that constitutes the Enterprise Data Warehouse (EDW) architecture, pattern documentation and supporting metadata and control frameworks.
This page describes the entities in the Integration Layer. Also known as ‘Core’, ‘System of Records’ or ‘Data Warehouse’ area. As a layer the Integration Layer is made up of the Raw (Operational) Data Vault and the Business Data Vault areas. In this context (obviously) the Data Vault hybrid modelling technique is used for this part of the solution architecture but other approaches fit in this framework as well. The chosen modelling technique applies to both areas and there will a lot of overlap between them. Essentially both areas together form the Data Warehouse model and they should not be stored in separate environments / databases.
The reason to split this layer into two conceptual areas lies in the flexibility and auditability. Essentially you will store data in its raw form while issuing surrogate keys early on. At the same time you create alternate views (cleansed) for this data and you can relate this raw and cleansed data to each other by referring and using the surrogate keys. From an architectural point of view it is advised to split the entities into logical layers because different processes are involved. Technically these tables should be located in the same schema or environment because they are linked to each other by primary/foreign key relationships.
In general, new information in the Data Vault consist of three entities:
- A defining entity (Hub) with a clear name which covers the subject
- A separate entity containing the historical information (Satellite)
- Between the defining entities relationship tables (Link) exist which describe the relationship between different entities. This relationship or link table can have its own Satellite (‘Link Satellite’) containing descriptive attributes of the relationship
The following is a brief overview of this process. The domains and naming conventions are explained in the domains page.
The business key entity (Hub)
From a modelling perspective the the defining entity (Hub) is meant to capture the business concept / business entity. From a technical Data Warehouse perspective the role is to distribute the (meaningless) Data Warehouse key based on the business key. In Data Vault 2.0 the Data Warehouse key is a hashed version of the business key value. The hub entity (HUB) effectively contains the unique list of all the business keys and contains the following attributes:
- <Hub Key>. This is the meaningless Data Warehouse key (or hash value) and also the unique identifier / primary key
- <Business Key>. The business key value. This can be a composite key and is usually not a meaningless sequence ID. The ETL checks this key for existence in the hub table, if it does not exist already, it will be added
- <ETL process ID>. The pointer to the ETL control framework or metadata which uniquely identifies the ETL execution instance and may be related to other ETL metadata including the record source
- <Load Date/Time>. This value provides additional auditing metadata. More explanation about the nature of date/time attributes is detailed here. In default Data Vault 2.0 specification this is the date/time the record was loaded in the database
- <Record Source). The relation to the source interface (file, table, system) where the record originated from
Data Vault also specifies the optional Last Seen Date/Time but I feel business key tracking is better handled in a dedicated tracking Satellite so I have not incorporated the Last Seen Date/Time in the standards.
After some time the Bub entity will become a unique list of all the individual instances of that entity that have ever existed (as in presented to the Data Warehouse).
The database key structure is:
Primary Key: <Hub Key>
Unique Key / Alternate Key: <Business Key>
The main history entity (Satellite)
The Satellite entity containing the descriptive historical data (SAT) contains at least the Hub Key (from it’s parent Hub table) and a start Data Warehouse date/time identifier (the effective date/time concept). The Load Start Date/Time attribute is used for this purpose. The primary key of this entity is the combination of the Hub Key and the Load Start Date/Time attributes.
This entity has all the attributes which are extracted from the source system, so that history can be tracked accordingly. The Satellite entity contains at least:
- <Hub Key>. The key (integer or hash value) from the defining / parent Hub entity. This attribute is a part of the Primary Key of the table
- <Load Date/Time>. The date/time when this record became active. This attribute is also part of the Primary Key. This is usually the date/time the event took place in the source (if known) or the date/time when the ETL mapping has been run. Similar to the Hub explanation this is a highly contested initial key design decision
- <ETL process insert ID>. The pointer to the ETL control framework or metadata which uniquely identifies the ETL execution instance that inserted the record. This identifier and may be related to other ETL metadata including the record source
- <ETL process update ID>. The pointer to the ETL control framework or metadata which uniquely identifies the ETL execution instance that updated the record. This identifier and may be related to other ETL metadata including the record source
- <Record Source>. The relation to the source interface (file, table, system) where the record originated from
- <Change Data Capture operator>. Attribute that contains information for Inserts, Updates or Deletes. This is required to be able to track logical (soft) deletes
- <Source Row ID>. A row identifier to label each individual row within an ETL data set. This may is needed to add granularity or track near real-time data sources
- <Record hash value>. A hash across all attributes (including the Change Data Capture operator) which is used for record comparison (change tracking / Type 2 behaviour)
- All descriptive attributes from the source
- <Load End Date/Time>. The date/time when this record was closed. A record is closed based on changes in the history, the value of this attribute is the <Load Start Date/Time> of the previous related record minus 1 second. This is redundant, but can be convenient.
- <Current Record Indicator>. An indicator whether the record is active / most recent. This is redundant, but can be convenient.
The database key structure is:
Primary Key: <Hub Key> and <Load Start Date/Time>
Unique Key / Alternate Key: Not applicable
A source system can have multiple time periods in the source but these are treated just as any other regular attribute. Interpretation of these periods will be done during the loading processes from the integration layer to the Presentation Layer. Satellites can be split for many reasons and one of the possibilities splitting the satellites has to offer is creating a split between fast-changing and slow-changing attributes is also an option. In theory every attribute can be modelled into a separate satellite (but this is not very practical).
No ‘Last Seen Date/Time’ is needed here because the present date/times supply this information: a change in information means that the table is updated through a type 2 mechanism.
The relationship entity (Link)
The link entity (LINK) contains two or more Hub Keys including the date/time record has been created (Load Date/Time). It shows the relation or transaction between defining entities. The link table contains the unique list of the relationships between business concepts (technically: Hub Keys).
- DWH_LINK_ID_<hub>. A meaningless key added for future expansion purposes. Possibly a change in granularity (i.e. extra hub key added), or the need to store history on the relationship. In this case a new satellite based on the link table will be created. Possible uses for this satellite: defining if the relationship is active or has been deleted.
- RECORD_INSERTED_DATETIME. The datetime the record was inserted in the table by the ETL process.
- SOURCE_SYSTEM_ID. The relation to the source interface (file, table, system) of which this record originated from. Not the batch because the batch ID changes every run! This is in fact a shortcut to the MD_ORIGINATING_SYSTEM table in the metadata schema.
- MODULE_INSTANCE_ID. The relation to the operational metadata information.
- DWH_ID_<hub 1>.
- DWH_ID_<hub 2>.
Primary key: DWH_LINK_ID
Unique key: The combination of the migrated ID’s. All hub or link ID’s.
A link can connect to different types of satellites depending on their nature.
Raw (Operational) Data Vault
The operational Data Vault is the area where data from the staging layer is re-modelled into a Data Vault. The data in this area remains largely in its source form and nothing gets changed or altered. In doing so, it offers the flexibility of presenting the data in more than one way, while the original data remains unchanged. Auditing of data is also made easy as original image of data (obtained from source system) is retained and therefore traceable. In the integration area, data from the staging area is first processed by identifying new business keys in the Hubs and assigning surrogate keys for newly discovered business keys. The same data is then further processed by loading them into the integration (hybrid) model. Changes to the satellite tables are tracked by means of using Slowly Changing Dimension Type-2 (SCD2) approach for every attribute.
Business Data Vault
In the business Data Vault area, enterprise wide business rule is applied to the data residing in the integration area. This is done to accomplish any of the following objectives:
- Data cleansing
- Data enrichment
- Data standardisation
The intention of the business Data Vault is to reduce replication of the rules downstream. An example would be the formatting of addresses where abbreviations such as “st” or “rd” get expanded to “street” and “road”, and the validation of phone numbers. And in separating the state of the data into two physical forms (original and modified), it gives the flexibility of applying multiple sets of rule to a specific data set, in order to suit specific needs.
In this example the data set starts with client records which are split between an client surrogate key table (Hub) and a table in which the history of the client attributes are tracked (Satellite). In this case attributes include address information. Since this is the raw data coming from the staging area the contents will be stored in the operational Data Vault. After the data has been loaded certain enterprise wide business rules are effectuated; this is done in the business Data Vault. This results in the creation of a subset of clients which have certain qualities into an entity ‘integrated clients’. The relationship of which client is an updated client is stored in a link table. Also, address information is updated for the client. Technically this will result in a new satellite table with the updated values while still retaining the link to the original values from the operational Data Vault.
As written, the structure of the business Data Vault closely follows the operational Data Vault. The same type of tables, metadata attributes and rules apply. The only real difference is that for a certain table not every sibling has to be present in this area. Depending on the applied business rules the business Data Vault data set can be a subset of the operational Data Vault dataset, or a fully updated one. The business Data Vault contains derived data from the integration area and this can be done for specific tables or for complete or mixed entity sets. For instance, if only cleansing of a specific satellite table is necessary it will be sufficient to only store a new derived satellite table which still links to the operational Data Vault Hub / surrogate key table. But if advanced algorithms like deduplication / match and survive are used this might require the creation of a new main entity with both a Hub table, Satellite and perhaps a Link table.