Collaboration

Areas of collaboration

The following diagram provides an overview of areas that are being addressed on an ongoing basis. If you are interested, feel free to reach out at roelant@ravos.nl. For now, most collaborations are open source only in a limited form until we’re certain all IP has been sorted out properly.

The intent of sharing this code is to foster increased meritocracy in the BI/DWH community and generally work (together) on something that can be combined using agreed APIs. The idea is that various people / teams can chase their passion while knowing the work will fit in somewhere in the overall scope. This overview may drastically change over time, as will the composition and scope of the projects – but that’s the nature of the work.

Data Integration Framework

The Data Integration Framework provides the context of the collaboration, as well as an overview of the options & considerations across the various areas and layers.

Contents / functionality of the GitHub repository:

  • Overview and context for Enterprise Data Warehouse
  • Design Patterns (conceptual how-to’s)
  • Solution Patterns (implementation in a specific technical context, i.e. tooling and environments)

ETL Control Framework (DIRECT)

The Data Integration Run-time Execution Control Tool (DIRECT) is a generic execution and control framework that orchestrates the execution of ETL processes. It provides various hooks into an ETL process to manage topics such as restartability, recovery from failure, logging, ETL classification and event handling.
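
To make the idea of hooks around an ETL process more concrete, here is a minimal sketch in Python. It is not the DIRECT code or data model: the `RunRecord` and `ControlFramework` names, the status values and the in-memory list standing in for a control repository are all assumptions made for illustration. It only shows the general pattern of registering a run, logging the outcome and flagging a rerun after a failure.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class RunRecord:
    """One execution of a registered ETL process (hypothetical structure)."""
    process_name: str
    status: str = "Running"
    events: List[str] = field(default_factory=list)
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ControlFramework:
    """In-memory stand-in for a control repository (hypothetical)."""

    def __init__(self) -> None:
        self.runs: List[RunRecord] = []

    def last_status(self, process_name: str) -> str:
        """Return the status of the most recent run of a process."""
        for run in reversed(self.runs):
            if run.process_name == process_name:
                return run.status
        return "Never run"

    def execute(self, process_name: str, etl: Callable[[], None]) -> RunRecord:
        """Wrap an ETL callable with pre- and post-execution hooks."""
        run = RunRecord(process_name)
        # Pre-execution hook: detect a failed previous run before registering this one.
        if self.last_status(process_name) == "Failed":
            run.events.append("Previous run failed; treating this run as a recovery run")
        self.runs.append(run)
        try:
            etl()
            run.status = "Succeeded"
        except Exception as exc:
            # Post-execution hook on failure: log the error as an event.
            run.status = "Failed"
            run.events.append(f"Error: {exc}")
        return run


def failing_load() -> None:
    raise RuntimeError("source unavailable")


def working_load() -> None:
    pass


framework = ControlFramework()
first = framework.execute("load_customer", failing_load)
second = framework.execute("load_customer", working_load)
print(first.status, second.status)
```

In a real control framework the run records would live in a database and the recovery step would do actual work (for example rolling back partially loaded data), but the shape of the wrapper stays the same.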

There are many ETL control frameworks, as they are needed in every project. Let’s make this the best one! Ideally this becomes a commodity.

  • The data model and sample code can be found here in the SQL DBM model (online) or on GitHub. Note that this was previously on QuickDBD but moved to SQL DBM after their pricing model was changed
  • The corresponding DML is available on GitHub
  • The DIRECT code and content is managed via GitHub here. This is a private GitHub repository for the time being, but I am more than happy to expand the circle of collaborators. Send me an email if you are interested. The repository also contains the documentation for the ETL Control Framework as a generic process control framework

Contents / functionality of the tool:

  • Runtime execution monitoring & logging
  • Models (DDL and DML)
  • Disabling / enabling ETL in the control framework
  • Recovery, retries
  • Managing dependencies and parallelism
  • GUI
  • Supporting automation code (re-initialisation, zero key generation, generating process registration records)
  • Exception reports (SQL – currently integrated in Confluence)
  • SSIS, PowerCenter and Oracle wrappers (SSIS is fully up to date, the others are available). Probably have some Pebble as well.
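
As an illustration of managing dependencies and parallelism, the sketch below groups processes into batches that could run side by side. It does not reflect how DIRECT itself does this: the `plan_batches` function, the process names and the rule that a disabled process does not block its dependants are assumptions for the example. It relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def plan_batches(dependencies: Dict[str, Set[str]], disabled: Set[str]) -> List[List[str]]:
    """Group processes into batches; each batch can run in parallel.

    dependencies maps a process name to the processes it must wait for.
    Disabled processes are skipped but still count as resolved, an
    assumption made here so that downstream processes are not blocked.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        enabled = [name for name in ready if name not in disabled]
        if enabled:
            batches.append(enabled)
        sorter.done(*ready)
    return batches


# Hypothetical process names, loosely following a staging-to-Data-Vault flow.
dependencies = {
    "stg_customer": set(),
    "stg_order": set(),
    "hub_customer": {"stg_customer"},
    "hub_order": {"stg_order"},
    "lnk_customer_order": {"hub_customer", "hub_order"},
}
plan = plan_batches(dependencies, disabled={"stg_order"})
for number, batch in enumerate(plan, start=1):
    print(number, batch)
```

Whether a disabled process should block everything downstream is a design choice; the opposite rule is just as defensible and only changes the filtering step.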

Metadata Management (TEAM)

The Taxonomy for ETL Automation Metadata (TEAM) is a management tool for Data Vault metadata, and a component that is also integrated in the VEDW software. It offers metadata mapping validation, data entry and visualisation. The metadata within TEAM is used to generate ETL (e.g. using Biml or SQL) via the interface / APIs.
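
The step from mapping metadata to generated ETL can be sketched as follows. This is a deliberately simplified illustration and not the TEAM API or its output: the `Mapping` class, its fields and the generated statement are assumptions, and a real Data Vault Hub load would also handle things such as hash keys, load timestamps and record sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """One source-to-target mapping row (hypothetical shape)."""
    source_table: str
    target_table: str
    business_key: str


def generate_hub_sql(mapping: Mapping) -> str:
    """Render an illustrative insert statement for a Hub from one mapping row."""
    return (
        f"INSERT INTO {mapping.target_table} ({mapping.business_key})\n"
        f"SELECT DISTINCT src.{mapping.business_key}\n"
        f"FROM {mapping.source_table} src\n"
        f"LEFT JOIN {mapping.target_table} tgt\n"
        f"  ON tgt.{mapping.business_key} = src.{mapping.business_key}\n"
        f"WHERE tgt.{mapping.business_key} IS NULL;"
    )


mapping = Mapping("STG_CUSTOMER", "HUB_CUSTOMER", "CUSTOMER_ID")
sql = generate_hub_sql(mapping)
print(sql)
```

The point of the pattern is that the statement is derived entirely from the metadata row, so changing the mapping regenerates the ETL without hand-editing code.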

  • The latest version (v1.5) can be downloaded here as an executable / installer
  • The data model and sample code can be found here: http://bit.ly/2A609Nq
  • The API structure is available here: http://bit.ly/2kn4PqZ (WIP)
  • The TEAM code and content is managed via GitHub here. This is a private GitHub repository for the time being, but I am more than happy to expand the circle of collaborators. Send me an email if you are interested.

Contents / functionality of the tool:

  • Connectivity settings and configuration including opening and saving to file.
  • Metadata management (grid, Source-to-Target Mapping – STM), including exporting to a graph model etc.
  • Activation and validation of metadata (checking and pushing into target DV generation data model)
  • Switch between JSON and SQL Server repository
  • Repository creation, including standard interfaces for ETL automation
  • Physical model versioning and reverse-engineering, to enable virtualisation
  • Test data generator (also to support the referential integrity (RI) check, but that check is in the virtualisation tool)
  • Graph based interface
  • Source system register (NOT DEVELOPED YET)
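
To give a feel for what validating mapping metadata and keeping it in a JSON repository might involve, here is a small sketch. The rules, the `validate_mappings` function and the dictionary keys are assumptions chosen for the example; they are not the checks or the file format that TEAM actually uses.

```python
import json
from typing import Dict, List, Set


def validate_mappings(mappings: List[Dict[str, str]], known_tables: Set[str]) -> List[str]:
    """Return a list of human-readable validation issues (empty means valid)."""
    issues: List[str] = []
    seen = set()
    for index, row in enumerate(mappings):
        # Every mapping row needs all three attributes filled in.
        for column in ("source_table", "target_table", "business_key"):
            if not row.get(column):
                issues.append(f"Row {index}: missing {column}")
        # Tables must exist in the (hypothetical) physical model.
        for column in ("source_table", "target_table"):
            table = row.get(column)
            if table and table not in known_tables:
                issues.append(f"Row {index}: unknown table {table}")
        # The same source-to-target pair should only be mapped once.
        key = (row.get("source_table"), row.get("target_table"))
        if key in seen:
            issues.append(f"Row {index}: duplicate mapping {key[0]} -> {key[1]}")
        seen.add(key)
    return issues


mappings = [
    {"source_table": "STG_CUSTOMER", "target_table": "HUB_CUSTOMER", "business_key": "CUSTOMER_ID"},
    {"source_table": "STG_ORDER", "target_table": "HUB_ORDER", "business_key": ""},
    {"source_table": "STG_CUSTOMER", "target_table": "HUB_CUSTOMER", "business_key": "CUSTOMER_ID"},
]
known_tables = {"STG_CUSTOMER", "STG_ORDER", "HUB_CUSTOMER"}

issues = validate_mappings(mappings, known_tables)
for issue in issues:
    print(issue)

# Round trip through JSON text, as a stand-in for saving to and loading from a file.
restored = json.loads(json.dumps(mappings, indent=2))
```

Running the validation before the metadata is activated keeps incomplete or duplicate mappings out of whatever is generated from it.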

Virtual Enterprise Data Warehouse (VEDW)

VEDW is the virtualisation and rapid prototyping software for Data Vault that can be downloaded from this site. More information is available here.

As of version 1.4.0.1, VEDW is separated from the TEAM components, which means VEDW is only the generation engine. All metadata management is done in TEAM, so TEAM must be installed as well.