Schema definition for data solution code generation

Introduction

The generic schema for Data Solution Automation is an effort to define a standard interface for exchanging the metadata needed to forward-engineer (generate) code for all components of a data solution, such as data definitions and data logistics (‘ETL’) code.

The documentation can be found at https://data-solution-automation-engine.github.io/data-warehouse-automation-metadata-schema/.

The schema provides a generic API / information exchange format for source-to-target mapping information. The schema definition covers the metadata that is needed to generate data solution structures and processes.

The interface itself is a JSON schema definition (JSD), accompanied by examples of how to use it, libraries for validation and code snippets.

Most work takes place in the designated public GitHub repository: https://github.com/data-solution-automation-engine/data-warehouse-automation-metadata-schema.

At present this repository contains various documentation artefacts as well as a testing checklist, the JSON schema and sample code to consume this schema using C#.

The concepts are outlined in this post explaining the reasoning and intent. Also, please consider reading the README and the overview on GitHub.

Please join the discussion on GitHub; feedback is always welcome.

Concepts – an introduction

The below provides a quick overview; please refer to the documentation website for all the details.

Base objects

The interface is a JSON Schema Definition that has been designed following Draft 7 of JSON Schema. It contains a series of reusable defined objects (‘definitions’), of which the following two elements are the most fundamental:

Data Object. A Data Object is a data set that can be read from, or written to. It can be defined as either the source or target of a Data Object Mapping. A Data Object can optionally have a connection, defined as a string or token, as well as an optional set of Data Items. A Data Object can be a query, file or table.
Data Item. A Data Item belongs to a Data Object and represents an individual column or calculation (query) in a Data Object or Data Object Mapping. Data Items have a range of properties that can be read or set, including data type and ordinal position.
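As a rough illustration of the two base objects, the sketch below models a Data Object with an optional connection and an optional set of Data Items in Python. The class and field names (DataObject, DataItem, connection, data_type, ordinal_position) and the sample values are assumptions chosen for readability; they are not the property names of the published schema.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional
import json


@dataclass
class DataItem:
    # An individual column or calculation; field names are illustrative only.
    name: str
    data_type: Optional[str] = None
    ordinal_position: Optional[int] = None


@dataclass
class DataObject:
    # A query, file or table that can be read from, or written to.
    name: str
    connection: Optional[str] = None  # optional string or token
    data_items: List[DataItem] = field(default_factory=list)  # optional set


# Hypothetical example object; names and types are made up for this sketch.
customer = DataObject(
    name="STG_CUSTOMER",
    connection="{staging}",
    data_items=[
        DataItem("CUSTOMER_ID", "int", 1),
        DataItem("CUSTOMER_NAME", "nvarchar(100)", 2),
    ],
)

# Serialise to JSON, the exchange format the schema describes.
customer_json = json.dumps(asdict(customer), indent=2)
print(customer_json)
```

Only the name is required in this sketch; the connection and the Data Items default to empty, mirroring their optional status described above.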

Simply put, the Data Objects and Data Items together constitute the data model. Please note that not all properties are displayed here for the sake of brevity.

High-level overview of the Data Warehouse Automation schema.

Mapping objects

In the schema definition, both Data Objects and Data Items are mapped to define ‘source to target mappings’. These are called Data Object Mappings and Data Item Mappings.

The Data Object Mapping is literally a mapping between Data Objects. It is a unique ETL mapping / transformation that moves, or interprets, data from a given source to a given destination.

A Data Object Mapping reuses the definitions of the Data Object and Data Item. The Data Object is used twice: as the SourceDataObject and as the TargetDataObject – both instances of the DataObject class / type.
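This kind of reuse is what a Draft 7 definitions section with $ref pointers provides. The fragment below is a hand-written sketch rather than an excerpt from the published schema: the key names (dataObject, dataObjectMapping, sourceDataObject, targetDataObject) are assumptions, and the resolve helper is hypothetical and only follows local pointers.

```python
# Hand-written Draft 7 fragment; key names are assumptions, not the real schema.
schema_fragment = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "definitions": {
        "dataObject": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
        },
        "dataObjectMapping": {
            "type": "object",
            "properties": {
                # The same definition is referenced twice.
                "sourceDataObject": {"$ref": "#/definitions/dataObject"},
                "targetDataObject": {"$ref": "#/definitions/dataObject"},
            },
        },
    },
}


def resolve(schema: dict, ref: str) -> dict:
    """Follow a local pointer such as '#/definitions/dataObject'."""
    node = schema
    for part in ref.lstrip("#/").split("/"):
        node = node[part]
    return node


mapping_props = schema_fragment["definitions"]["dataObjectMapping"]["properties"]
source_def = resolve(schema_fragment, mapping_props["sourceDataObject"]["$ref"])
target_def = resolve(schema_fragment, mapping_props["targetDataObject"]["$ref"])

# Both sides of the mapping resolve to the very same definition.
print(source_def is target_def)
```

Because both references point at one definition, a change to the Data Object definition applies to the source and the target side alike.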

The other key component of a Data Object Mapping is the Data Item Mapping, which describes the column-to-column (or transformation-to-column) mapping and reuses the Data Item class.

The Source Data Object, Target Data Object and Data Item Mapping are the mandatory components of a Data Object Mapping.
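A minimal sketch of checking for these three mandatory components, using plain dictionaries. The key names and the missing_components helper are hypothetical; the published schema and its validation libraries remain the authoritative check.

```python
# Assumed key names for the three mandatory components.
REQUIRED_COMPONENTS = ("sourceDataObject", "targetDataObject", "dataItemMappings")


def missing_components(mapping: dict) -> list:
    """Return the mandatory components that are absent from a mapping."""
    return [key for key in REQUIRED_COMPONENTS if key not in mapping]


# Hypothetical mappings; object and item names are made up for this sketch.
complete = {
    "sourceDataObject": {"name": "STG_CUSTOMER"},
    "targetDataObject": {"name": "CUSTOMER"},
    "dataItemMappings": [
        {
            "sourceDataItem": {"name": "CUSTOMER_NAME"},
            "targetDataItem": {"name": "CUSTOMER_NAME"},
        },
    ],
}
incomplete = {"sourceDataObject": {"name": "STG_CUSTOMER"}}

print(missing_components(complete))    # []
print(missing_components(incomplete))  # ['targetDataObject', 'dataItemMappings']
```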

There are many other attributes that can be set, and there are mandatory items within the Data Objects and Data Items as well. These are described in the JSON schema, and the idea is that the validation functions make it easy to try out different uses of the schema.

One of the goals of defining this schema has been to find a good balance between being too generic and too specific (restrictive). For this reason there are only a few mandatory elements.

It is possible to add a specific class to a Data Object Mapping: the Business Key Definition. This construct again reuses the earlier definitions, but can optionally be added to the Data Object Mapping as a specially classified set of transformations.
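One way to picture this is a Data Object Mapping that carries its business key as a separate, optional entry built from the same item-mapping shape. The sketch below assumes hypothetical key names (businessKeyDefinition, businessKeyComponentMappings) and a hypothetical business_key_sources helper; the actual structure is defined in the schema documentation.

```python
# Hypothetical mapping; all key and object names are assumptions.
mapping = {
    "sourceDataObject": {"name": "STG_CUSTOMER"},
    "targetDataObject": {"name": "CUSTOMER"},
    "dataItemMappings": [
        {
            "sourceDataItem": {"name": "CUSTOMER_NAME"},
            "targetDataItem": {"name": "CUSTOMER_NAME"},
        },
    ],
}

# Optional: the business key reuses the item-mapping shape, but sits under
# its own key so that a code generator can treat it differently.
mapping["businessKeyDefinition"] = {
    "businessKeyComponentMappings": [
        {
            "sourceDataItem": {"name": "CUSTOMER_ID"},
            "targetDataItem": {"name": "CUSTOMER_KEY"},
        },
    ]
}


def business_key_sources(m: dict) -> list:
    """Return the source item names of the business key, or [] if none is set."""
    definition = m.get("businessKeyDefinition", {})
    components = definition.get("businessKeyComponentMappings", [])
    return [c["sourceDataItem"]["name"] for c in components]


print(business_key_sources(mapping))
```

A mapping without the optional entry simply yields an empty list, so consumers do not need a separate code path for mappings that define no business key.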
