New collaboration on a common ETL metadata model

Pretty much everyone I know in the Data Warehouse / Business Intelligence industry is working on, or has worked on, some kind of metadata management model. It may have been developed to explore smarter ways of developing or maintaining documentation, out of disappointment with vendor / software capabilities, from a desire to store certain designs or structures – and of course to enable code generation, including ETL.

This is innovative work, and it has led to some really inspiring ideas. It also means that there are many different ways in which these metadata models (and their supporting technology) have been put together. Solutions have been embedded in data modelling software (PowerDesigner, ERwin), built on different technology stacks (.NET, Java) or developed as extensions to off-the-shelf products (Informatica, WhereScape RED). Some vendors even have ‘marketplaces’ where these ideas, uploaded by entrepreneurial developers, can be purchased as modules.

The underlying metadata repositories often differ in the way they have been implemented (physical model, storage formats and so on). They also vary in maturity and support – an often-quoted reason to opt for an off-the-shelf product.

The TEAM, VEDW and DIRECT applications that I have developed are evolving along these lines as well. Even though they have been developed with collaboration in mind, there is still (always?) work to do to extend interoperability. This is also why the GitHub repositories where the code is managed are still private at the moment (although individuals are welcome to be added – just send me an email).

Florian Harder, Torsten Gluende, Michael Schaefers, Christian Haedrich and I talked about this topic again the other day. Since the idea of using templating engines for ETL generation is gaining a lot of interest, and since – as mentioned above – many people have designed proprietary approaches and models that do more or less the same thing, we decided to explore ways we could work together.

We feel it may be possible to agree on a common interchange format that we can align our approaches against. This has the advantage that everyone is ‘free’ to use their own favourite software, technology and / or approach while still allowing interoperability.

Perhaps agreeing on a standard underlying meta-model is too big a step to take; not everyone may be ready to let go of certain ideas and implementations, or see value in making code changes to conform to a new model. That is fine – as long as we accept that the underlying metadata models can be different, and each create an adaptor to a commonly agreed format.

For me and the many others who all have their own slightly different metadata models, this means no one has to go back, align to a new model and change their code. Rather, everyone can simply write an extract that conforms to whatever standard we agree on. It is also good decoupling practice: the interfaces will be simpler and will allow others to develop connecting tools in the ecosystem.
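To make the adaptor idea a little more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration – the field names, the classification and the interchange structure are placeholders, not the actual format under discussion in the repository:

```python
import json

# Hypothetical internal structure of one tool's metadata model; the field
# names are illustrative only and do not reflect TEAM, VEDW or any product.
internal_mapping = {
    "src": "STG_CUSTOMER",
    "tgt": "HUB_CUSTOMER",
    "bk": ["CUSTOMER_ID"],
    "loadType": "Hub",
}

def to_interchange(mapping: dict) -> dict:
    """Adapt the tool-specific structure to a (hypothetical) common interchange format."""
    return {
        "sourceObject": mapping["src"],
        "targetObject": mapping["tgt"],
        "businessKeys": mapping["bk"],
        "mappingClassification": mapping["loadType"],
    }

# The interchange document is what gets shared; each tool only needs its own
# adaptor, not knowledge of everyone else's internal metadata model.
print(json.dumps(to_interchange(internal_mapping), indent=2))
```

The point of the sketch is only that the adaptor is a thin, one-way translation – the internal model stays exactly as it is.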

For example, I still spend more time on making the administration of metadata easy to use than on improving the generation components. With a common format, others can pick this up more easily, and the better ideas will gain more followers – a meritocracy.

This is being explored in a public GitHub repository (https://github.com/RoelantVos/ETL_Generation_Metadata_Interface), and everyone is welcome to have a look and participate. We literally only started this week with an initial conversation, and only some starter content has been added so far.

The currently uploaded structure of the interface is a direct extract-to-JSON of the TEAM interface views as available in TEAM v1.5.5.0. This is no more than a starting point, and there are various reasons why this structure (format) needs to change for the purposes outlined in this post (see the issues section in the GitHub repository). Even though the format still needs to be standardised and improved, the content should already be complete. It really is about finding a better common information exchange format: a canonical model.
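One way to pin such a canonical model down would be to publish a schema that every adaptor validates its output against. The sketch below uses the Python jsonschema package purely as an illustration – the schema itself is invented for this example and is not the TEAM v1.5.5.0 structure:

```python
from jsonschema import validate  # pip install jsonschema

# Invented, heavily simplified schema for illustration only; the real
# canonical model is what is being worked out in the GitHub repository.
interchange_schema = {
    "type": "object",
    "required": ["sourceObject", "targetObject", "businessKeys"],
    "properties": {
        "sourceObject": {"type": "string"},
        "targetObject": {"type": "string"},
        "businessKeys": {"type": "array", "items": {"type": "string"}},
        "mappingClassification": {"type": "string"},
    },
}

document = {
    "sourceObject": "STG_CUSTOMER",
    "targetObject": "HUB_CUSTOMER",
    "businessKeys": ["CUSTOMER_ID"],
    "mappingClassification": "Hub",
}

# Raises a ValidationError if the document does not conform to the schema.
validate(instance=document, schema=interchange_schema)
print("Document conforms to the (hypothetical) interchange schema.")
```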

These concepts are nothing new, but with the current maturity in understanding and experience, and the momentum behind templating engines, it is worth giving this another go.
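For anyone new to the templating angle, the principle is simply: metadata in, ETL code out. A deliberately tiny sketch, here using Jinja2 in Python with a made-up template and made-up metadata (any templating engine would work along the same lines):

```python
from jinja2 import Template  # pip install Jinja2

# A trivial template that turns mapping metadata into a SQL statement.
# Real-world templates are far richer; this only shows the principle.
sql_template = Template(
    "INSERT INTO {{ targetObject }} ({{ businessKeys | join(', ') }})\n"
    "SELECT DISTINCT {{ businessKeys | join(', ') }}\n"
    "FROM {{ sourceObject }};"
)

metadata = {
    "sourceObject": "STG_CUSTOMER",
    "targetObject": "HUB_CUSTOMER",
    "businessKeys": ["CUSTOMER_ID"],
}

print(sql_template.render(**metadata))
```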

I will definitely be organising additional evening sessions on these approaches and generation techniques while in the Netherlands for the implementation & automation training. Hope to see you there – and if not, keep an eye on the GitHub repositories and this blog!

 
Roelant Vos
