{"id":57,"date":"2010-02-10T12:30:58","date_gmt":"2010-02-10T02:30:58","guid":{"rendered":"http:\/\/ravos.com.au\/wordpress\/?page_id=57"},"modified":"2018-11-20T14:30:22","modified_gmt":"2018-11-20T04:30:22","slug":"the-data-vault-layer","status":"publish","type":"page","link":"https:\/\/roelantvos.com\/blog\/enterprise_bi_architecture_overview\/the-data-vault-layer\/","title":{"rendered":"Integration Layer"},"content":{"rendered":"<p>The &#8216;architecture&#8217; section of this weblog serves as an introduction for the <strong>Data Integration Framework Github<\/strong> repository (see the <a href=\"https:\/\/roelantvos.com\/blog\/?page_id=1919\">collaboration section<\/a>). This Github repository contains the various levels of documentation that constitutes the Enterprise Data Warehouse (EDW) architecture, pattern documentation and supporting metadata and control frameworks.<\/p>\n<h3 style=\"text-align: justify;\">Introduction<\/h3>\n<p style=\"text-align: justify;\">This page describes the entities in the Integration Layer.\u00a0 Also known as &#8216;Core&#8217;, &#8216;System of Records&#8217; or &#8216;Data Warehouse&#8217; area. As a layer the Integration Layer is made up of the Raw (Operational) Data Vault and the Business Data Vault areas.\u00a0In this context (obviously) the Data Vault hybrid modelling technique is used for this part of the solution architecture but other approaches \u00a0fit in this framework as well. The chosen modelling technique applies to both areas and there will a lot of overlap between them. Essentially both areas together form the Data Warehouse model and they should not be stored in separate environments \/ databases.<\/p>\n<p style=\"text-align: justify;\">The reason to split this layer into two conceptual areas lies in the flexibility and auditability. Essentially you will store data in its raw form while issuing surrogate keys early on. At the same time you create alternate views (cleansed) for this data and you can relate this raw and cleansed data to each other by referring and using the surrogate keys. From an architectural point of view it is advised to split the entities into logical layers because different processes are involved. Technically these tables should be located in the same schema or environment because they are linked to each other by primary\/foreign key relationships.<\/p>\n<p style=\"text-align: justify;\">In general, new information in the\u00a0Data Vault\u00a0consist of three entities:<\/p>\n<ul style=\"text-align: justify;\">\n<li>A\u00a0defining entity (Hub) with\u00a0a clear name which covers the subject<\/li>\n<li>A separate entity containing the historical information (Satellite)<\/li>\n<li>Between the defining entities relationship tables (Link) exist which describe the relationship between different entities. This relationship or link table can have its own Satellite (&#8216;Link Satellite&#8217;) containing descriptive attributes of the <em>relationship<\/em><\/li>\n<\/ul>\n<h3 style=\"text-align: justify;\">Entities<\/h3>\n<p style=\"text-align: justify;\">The following is a brief overview of this process. The domains and naming conventions are explained in the domains page.<\/p>\n<h4 style=\"text-align: justify;\">The business key\u00a0entity (Hub)<\/h4>\n<p>From a modelling perspective the the defining entity\u00a0(Hub) is meant to capture the business concept \/ business entity. From a technical Data Warehouse perspective the role is to distribute the (meaningless) Data Warehouse key based on the business key. In Data Vault 2.0 the Data Warehouse key is a hashed version of the business key value. The hub entity (HUB) effectively contains the unique list of all the business keys and contains the following attributes:<\/p>\n<ul style=\"text-align: justify;\">\n<li>&lt;Hub Key&gt;. This is the meaningless Data Warehouse key (or hash value) and also the unique identifier \/ primary key<\/li>\n<li>&lt;Business Key&gt;. The business key value. This can be a composite key and is usually <span style=\"text-decoration: underline;\">not<\/span>\u00a0a meaningless sequence ID. The ETL checks this key for existence in the hub table, if it does not exist already, it will be added<\/li>\n<li>&lt;ETL process ID&gt;. The pointer to the ETL control framework or metadata which uniquely identifies the ETL execution instance and may be related to other ETL metadata including the record source<\/li>\n<li>&lt;Load Date\/Time&gt;. This value provides additional auditing metadata. More explanation about the nature of date\/time attributes is detailed <a title=\"A brief history of time in Data Vault\" href=\"https:\/\/roelantvos.com\/blog\/?p=1174\" target=\"_blank\" rel=\"noopener\">here<\/a>. In default Data Vault 2.0 specification this is the date\/time the record was loaded in the database<\/li>\n<li>&lt;Record Source). The relation to the source interface\u00a0(file, table, system) where the record originated from<\/li>\n<\/ul>\n<p>Data Vault also specifies the optional Last Seen Date\/Time but I feel business key tracking is better handled in a dedicated tracking Satellite so I have not incorporated the Last Seen Date\/Time in the standards.<\/p>\n<p style=\"text-align: justify;\">After some time the Bub entity will become a unique list of all the individual instances of that entity that have ever existed (as in presented to the Data Warehouse).<\/p>\n<p style=\"text-align: justify;\">The database key structure is:<br \/>\nPrimary Key:\u00a0&lt;Hub Key&gt;<br \/>\nUnique Key \/ Alternate Key:\u00a0&lt;Business Key&gt;<\/p>\n<h4 style=\"text-align: justify;\">The main\u00a0history entity (Satellite)<\/h4>\n<p style=\"text-align: justify;\">The Satellite entity containing the descriptive historical data (SAT) contains at least the Hub Key\u00a0(from it&#8217;s parent Hub table)\u00a0and a start Data Warehouse date\/time identifier (the effective date\/time concept). The Load Start Date\/Time attribute is used for this purpose. The primary key of this entity is the combination of the Hub Key and\u00a0the Load Start Date\/Time attributes.<\/p>\n<p style=\"text-align: justify;\">This entity has all the attributes which are extracted from the source system, so that history can be tracked accordingly. The Satellite entity contains at least:<\/p>\n<ul style=\"text-align: justify;\">\n<li>&lt;Hub Key&gt;.\u00a0The key (integer or hash value)\u00a0from the defining \/ parent Hub\u00a0entity. This attribute is a part of the Primary Key of the table<\/li>\n<li>&lt;Load Date\/Time&gt;. The date\/time when this record became active. This attribute is also part of the Primary Key.\u00a0This is usually the date\/time the event took place in the source (if known) or the date\/time when the ETL mapping has been run. Similar to the Hub explanation this is a highly contested initial key design decision<\/li>\n<li>&lt;ETL process insert ID&gt;. The pointer to the ETL control framework or metadata which uniquely identifies the ETL execution instance that inserted the record. This identifier and may be related to other ETL metadata including the record source<\/li>\n<li>&lt;ETL process update\u00a0ID&gt;. The pointer to the ETL control framework or metadata which uniquely identifies the ETL execution instance that updated the record. This identifier and may be related to other ETL metadata including the record source<\/li>\n<li>&lt;Record Source&gt;. The relation to the source interface\u00a0(file, table, system) where the record originated from<\/li>\n<li>&lt;Change Data Capture operator&gt;. Attribute that contains information for Inserts, Updates or Deletes. This is required to be able to track logical (soft) deletes<\/li>\n<li>&lt;Source Row ID&gt;. A row identifier to label each individual row within an ETL data set. This may is needed to add granularity or track near real-time data sources<\/li>\n<li>&lt;Record hash value&gt;. A hash across all attributes (including the Change Data Capture operator) which is used for record comparison (change tracking \/ Type 2 behaviour)<\/li>\n<li>All descriptive attributes from the source<\/li>\n<\/ul>\n<p>Optional:<\/p>\n<ul style=\"text-align: justify;\">\n<li>&lt;Load End\u00a0Date\/Time&gt;. The date\/time when this record was closed. A record is closed based on changes in the history, the value of this attribute is the &lt;Load Start\u00a0Date\/Time&gt;\u00a0of the previous related\u00a0record minus 1 second. This is redundant, but can be convenient.<\/li>\n<li>&lt;Current Record Indicator&gt;. An indicator whether the record is active \/ most recent. This is redundant, but can be convenient.<\/li>\n<\/ul>\n<p style=\"text-align: justify;\">The database key structure is:<br \/>\nPrimary Key:\u00a0&lt;Hub Key&gt; and &lt;Load Start Date\/Time&gt;<br \/>\nUnique Key \/ Alternate Key: Not applicable<\/p>\n<p style=\"text-align: justify;\">A source system can have multiple time periods in the source but these are treated just as any other regular attribute. Interpretation of these periods will be done during the loading processes from the integration layer to the Presentation Layer.\u00a0Satellites can be split for many reasons and one of the possibilities splitting the satellites has to offer is creating\u00a0a split between fast-changing and slow-changing attributes is also an option. In theory every attribute can be modelled into a separate satellite (but this is not very practical).<\/p>\n<p style=\"text-align: justify;\">No &#8216;Last Seen Date\/Time&#8217; is needed here because the present date\/times\u00a0supply this information: a change in information means that the table is updated through a type 2 mechanism.<\/p>\n<h4 style=\"text-align: justify;\">The\u00a0relationship entity\u00a0(Link)<\/h4>\n<p style=\"text-align: justify;\">The link entity (LINK) contains two or more Hub Keys including the date\/time\u00a0record has been created (Load Date\/Time). It shows the relation or transaction between defining entities.\u00a0The link table contains the unique list of the relationships between business concepts (technically: Hub Keys).<\/p>\n<ul style=\"text-align: justify;\">\n<li>DWH_LINK_ID_&lt;hub&gt;. A meaningless key added for future expansion purposes. Possibly a change in granularity (i.e. extra hub key added), or the need to store history on the relationship. In this case a new satellite based on the link table will be created. Possible uses for this satellite: defining if the relationship is active or has been deleted.<\/li>\n<li>RECORD_INSERTED_DATETIME. The datetime the record was inserted in the table by the ETL process.<\/li>\n<li>SOURCE_SYSTEM_ID. The relation to the source interface\u00a0(file, table, system) of which this record originated from. Not the batch because the batch ID changes every run! This is in fact a shortcut to the MD_ORIGINATING_SYSTEM table in the metadata schema.<\/li>\n<li>MODULE_INSTANCE_ID. The relation to the operational metadata information.<\/li>\n<li>DWH_ID_&lt;hub 1&gt;.<\/li>\n<li>DWH_ID_&lt;hub 2&gt;.<\/li>\n<\/ul>\n<p style=\"text-align: justify;\">Primary key: DWH_LINK_ID<br \/>\nUnique key: The combination of the migrated ID&#8217;s. All hub or link ID&#8217;s.<\/p>\n<p style=\"text-align: justify;\">A link can connect to different types of satellites depending on their nature.<\/p>\n<h3 class=\"mceTemp\" style=\"text-align: justify;\">Raw (Operational) Data Vault<\/h3>\n<p style=\"text-align: justify;\">The operational Data Vault \u00a0is the area where data from the staging layer is re-modelled into a Data Vault. The data in this area remains largely in its source form and nothing gets changed or altered. In doing so, it offers the flexibility of presenting the data in more than one way, while the original data remains unchanged. Auditing of data is also made easy as original image of data (obtained from source system) is retained and therefore traceable. In the integration area, data from the staging area\u00a0is first processed by identifying new business keys in the Hubs\u00a0and assigning surrogate keys for newly discovered business keys.\u00a0 The same data is then further processed by loading them into the integration (hybrid) model. Changes to the satellite tables\u00a0are tracked by means of using Slowly Changing Dimension Type-2 (SCD2) approach for every attribute.<\/p>\n<h3 style=\"text-align: justify;\">Business Data Vault<\/h3>\n<p style=\"text-align: justify;\">In the business Data Vault area, enterprise wide business rule is applied to the data residing in the integration area. This is done to accomplish any of the following objectives:<\/p>\n<ul style=\"text-align: justify;\">\n<li>Data cleansing<\/li>\n<li>Data enrichment<\/li>\n<li>Data standardisation<\/li>\n<li>Deduplication<\/li>\n<li>Validation<\/li>\n<\/ul>\n<p style=\"text-align: justify;\">The intention of the business Data Vault is to reduce replication of the rules downstream. An example would be the formatting of addresses where abbreviations such as \u201cst\u201d or \u201crd\u201d get expanded to \u201cstreet\u201d and \u201croad\u201d, and the validation of phone numbers. And in separating the state of the data into two physical forms (original and modified), it gives the flexibility of applying multiple sets of rule to a specific data set, in order to suit specific needs.<\/p>\n<p style=\"text-align: justify;\">In this example the data set starts with client records which are split between an client surrogate key table (Hub) and a table in which the history of the client attributes are tracked (Satellite). In this case attributes include address information. Since this is the raw data coming from the staging area the contents will be stored in the operational Data Vault. After the data has been loaded certain enterprise wide business rules are effectuated; this is done in the business Data Vault. This results in the creation of a subset of clients which have certain qualities into an entity \u2018integrated clients\u2019. The relationship of which client is an updated client is stored in a link table. Also, address information is updated for the client. Technically this will result in a new satellite table with the updated values while still retaining the link to the original values from the operational Data Vault.<\/p>\n<p style=\"text-align: justify;\">As written, the structure of the business Data Vault closely follows the operational Data Vault. The same type of tables, metadata attributes and rules apply. The only real difference is that for a certain table not every sibling <span style=\"text-decoration: underline;\">has to be present in this area<\/span>. Depending on the applied business rules the business Data Vault data set can be a subset of the operational Data Vault dataset, or a fully updated one. The business Data Vault contains derived data from the integration area and this can be done for specific tables or for complete or mixed entity sets. For instance, if only cleansing of a specific satellite table is necessary it will be sufficient to only store a new derived satellite table <span style=\"text-decoration: underline;\">which still links to the\u00a0operational Data Vault Hub \/\u00a0surrogate key table<\/span>. But if advanced algorithms like deduplication \/ match and survive are used this might require the creation of a new main entity with both a Hub table, Satellite and perhaps a Link table.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The &#8216;architecture&#8217; section of this weblog serves as an introduction for the Data Integration Framework Github repository (see the collaboration section). This Github repository contains the various levels of documentation that constitutes the Enterprise Data Warehouse (EDW) architecture, pattern documentation and supporting metadata and control frameworks. Introduction This page describes the entities in the Integration Layer.\u00a0 Also known as &#8216;Core&#8217;, &#8216;System of Records&#8217; or &#8216;Data Warehouse&#8217; area. As a layer the Integration Layer is made&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":49,"menu_order":3,"comment_status":"open","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-57","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/roelantvos.com\/blog\/wp-json\/wp\/v2\/pages\/57","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/roelantvos.com\/blog\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/roelantvos.com\/blog\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/roelantvos.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/roelantvos.com\/blog\/wp-json\/wp\/v2\/comments?post=57"}],"version-history":[{"count":0,"href":"https:\/\/roelantvos.com\/blog\/wp-json\/wp\/v2\/pages\/57\/revisions"}],"up":[{"embeddable":true,"href":"https:\/\/roelantvos.com\/blog\/wp-json\/wp\/v2\/pages\/49"}],"wp:attachment":[{"href":"https:\/\/roelantvos.com\/blog\/wp-json\/wp\/v2\/media?parent=57"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}