Software & Collaboration

Overview

The following diagram provides an overview of the areas that are under development, as well as some context on the vision and how to deliver on it.

The intent is to foster collaboration in the Business Intelligence / Data Warehouse community, and to work together on a toolchain using agreed APIs. The idea is that various people and teams can follow their passion while knowing their work will fit into the overall scope.


As an introduction to the concepts mentioned, please refer to the following white papers, which explain some of the foundational concepts in more detail:

  • The Engine, which provides an overview of how the major concepts work together to deliver a reliable solution.
  • Virtual Datawarehousing, a white paper explaining the mindset of designing for change in a data context.

Areas of collaboration

The image below displays the essential ‘building blocks’ towards delivering a scalable and flexible Data Warehouse solution, and their mappings to technology and repositories that are either available or under development.

Data Integration Framework

The Data Integration Framework contains a repository of architecture, design and solution patterns in Markdown format. It also provides the context for the collaboration, as well as an overview of the options and considerations across the various areas and layers.

Contents / functionality of the GitHub repository:

  • Overview and context for Enterprise Data Warehouse
  • Design Patterns (conceptual how-to’s)
  • Solution Patterns (implementation in a specific technical context, i.e. tooling and environments)

Data Integration Logistics & Control Framework (DIRECT)

The Data Integration Run-time Execution Control Tool (DIRECT) is a generic execution and control framework for data integration logistics. DIRECT orchestrates the execution of data integration processes. It provides various hooks into a data integration process to manage topics such as restartability, recovery from failure, logging, classification and event handling.
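The hooks described above can be illustrated with a minimal sketch. The function and state names below (`run_with_control`, the `registry` dictionary, the state labels) are assumptions for illustration, not DIRECT's actual API:

```python
import logging

logging.basicConfig(level=logging.INFO)

def run_with_control(process_name, process_fn, registry):
    """Wrap a data integration process with hypothetical control logic:
    skip runs that already succeeded (restartability), record state
    transitions, and capture failures so a rerun can recover."""
    if registry.get(process_name) == "Succeeded":
        logging.info("%s already succeeded; skipping.", process_name)
        return "Skipped"
    registry[process_name] = "Executing"
    try:
        process_fn()
        registry[process_name] = "Succeeded"
    except Exception:
        # Record the failure so a subsequent run knows where to restart.
        registry[process_name] = "Failed"
        logging.exception("Process %s failed; state recorded for recovery.", process_name)
    return registry[process_name]
```

In this sketch the registry stands in for the framework's control repository: a rerun after failure only re-executes processes that have not yet succeeded.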

DIRECT also covers features necessary for eventual consistency, including execution queue management concepts.

There are many data integration control frameworks. They should be considered a commodity, as one is needed in every project. Let's make this the best one! The GitHub repository is available here.

Metadata Management (TEAM)

The Taxonomy for ETL Automation Metadata (TEAM) is a management tool for data solution metadata, the so-called source-to-target mappings.

It offers metadata mapping validation, connection management, data entry and visualisation. The metadata within TEAM is used to generate output conforming to the generic schema for Data Warehouse Automation.
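The validation concept can be sketched as a set of checks over source-to-target mappings. The field names below (`sourceTable`, `targetTable`, `columnMappings`) are illustrative assumptions, not TEAM's actual metadata model:

```python
def validate_mappings(mappings):
    """Toy validation pass over a list of source-to-target mappings.
    Returns a list of human-readable error messages (empty if valid)."""
    errors = []
    for index, mapping in enumerate(mappings):
        if not mapping.get("sourceTable"):
            errors.append(f"mapping {index}: missing source table")
        if not mapping.get("targetTable"):
            errors.append(f"mapping {index}: missing target table")
        if not mapping.get("columnMappings"):
            errors.append(f"mapping {index}: no column mappings defined")
    return errors
```

Running checks like these before generation catches incomplete metadata early, before any structures or processes are produced from it.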

The GitHub repository is available here.

Virtual Data Warehouse (VDW)

VDW (formerly VEDW) is virtualisation and rapid prototyping software for code generation. VDW generates SQL-based ETL processes as well as test data, and can be used to quickly test out models and ideas using real data.

Generating a Virtual Data Warehouse is the equivalent of creating tables, generating ETLs and running these processes to see the data. VDW does this in one pass. More information is available here.
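The "one pass" idea can be sketched by rendering a SQL view directly from a mapping, so the target structure and its transformation logic exist as a single object. This is a minimal illustration of the virtualisation pattern, not VDW's actual output; the mapping field names are assumptions:

```python
def generate_view(mapping):
    """Render a CREATE VIEW statement from a simple source-to-target
    mapping: the 'virtual' equivalent of a table plus an ETL process."""
    select_list = ",\n  ".join(
        f"{col['source']} AS {col['target']}" for col in mapping["columnMappings"]
    )
    return (
        f"CREATE VIEW {mapping['targetTable']} AS\n"
        f"SELECT\n  {select_list}\n"
        f"FROM {mapping['sourceTable']};"
    )
```

Because the view queries the source directly, deploying it is enough to "see the data"; changing the model only means regenerating and redeploying the view.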

Data Warehouse Automation Generic Schema / Interface

The interface for Data Warehouse Automation intends to provide a generic API / information exchange format for source-to-target mapping information. This is the metadata that is needed to generate Data Warehouse structures and data integration processes, such as ETL jobs. The interface consists of a JSON schema definition with examples of how to use it, as well as libraries for validation and code snippets.
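An interchange payload along these lines might look as follows. The element names used here are assumptions chosen for illustration, not the published schema definition:

```python
import json

# Illustrative source-to-target mapping payload; the structure and
# element names are assumptions, not the published generic schema.
mapping_payload = {
    "dataObjectMappings": [
        {
            "mappingName": "Map_Customer",
            "sourceDataObjects": [{"name": "STG_Customer"}],
            "targetDataObject": {"name": "SAT_Customer"},
            "dataItemMappings": [
                {
                    "sourceDataItems": [{"name": "cust_name"}],
                    "targetDataItem": {"name": "CustomerName"},
                }
            ],
        }
    ]
}

# Any tool that can read and write JSON can participate in the exchange:
serialized = json.dumps(mapping_payload, indent=2)
restored = json.loads(serialized)
```

The point of a shared format like this is that a metadata tool, a code generator and an orchestration framework can each produce or consume the same document without knowing anything about each other's internals.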