The Engine – adapting to ever-changing data interpretation
This week I presented one of my favourite topics at the Knowledge Gap conference: a concept I refer to as ‘the engine’, the driving force behind the Virtual Data Warehouse way of thinking.
The engine represents an ecosystem of tooling and ideas that aim to make it easier to shape data into the desired delivery formats. It is the concept that drives much of my development in the data warehouse automation ecosystem.
Refactoring, it’s not so bad
My experience is that, while data modelling is important, it is often not realistic to expect that the resulting data models will be ‘right’ the first time. Defining models that really fit your needs is something that grows and evolves over time, and my philosophy is to make the delivery mechanisms that apply the data to the models as ‘smooth’ as they can be, to support this evolution.
Think of this as refactoring a solution.
Someone once said to me that refactoring just means that you ‘didn’t get it right the first time’. I can’t argue with that, of course, but I don’t think this is necessarily a bad thing either. I personally enjoy refactoring code a lot, making everything work a little easier and better.
The remark about refactoring highlights a fundamental difference in mindset and approach. Does the model need to be fully correct before we can map our data onto it? Or can we just define the agreed concepts and figure things out along the way?
I for one don’t believe that even with the best skills one can get the model truly right in one delivery, and so I have organised my work around this mindset.
The ability to refactor easily also helps to modify the solution when the business changes over a longer period of time. Businesses change often, and in different ways, for example through mergers and acquisitions, new products and business models, and of course new operational systems and processes. All of this is reflected in the data.
As a rule, I approach data as something that doesn’t necessarily mean that much to me. Data is just ‘stuff’, recorded evidence of something that has happened, generated by things doing something somewhere. I can come up with assumptions, supported by people that understand the subject matter, and create an interpretation. But I will treat this as an initial approximation of what the model probably needs to become over time. This naturally means the logic to interpret the data will change over time also.
Hence, my focus is on ensuring that the refactoring of the solution is made as easy as it can be, so that if we do change our minds we can do so without further consequences. I’m not saying data modelling is not important (it is), but there is no harm in hedging your bets.
To me, data solutions such as data warehouses, Business Intelligence and reporting applications are fluid platforms that morph and adjust over time. One of the challenges here is to match the speed of organisational change as closely as possible.
This is what I mean by having a virtual data warehouse mindset, and the engine is what powers this capability.
Assembling the engine
The engine is the collection of concepts that I believe you need to (automatically) refactor your data platforms. It can be viewed as an extension of Data Warehouse automation, and it helps to think about it in this context as many concepts will be similar.
We’re looking to combine the various components that you need to deliver a reliable and robust data solution into something capable of operating autonomously. This idea is founded on the realisation that many technical implementations share common ground, in which the exact physical implementation of the model is something of a byproduct – or rather a result driven by optimisation of the implementation.
In data modelling discussions the boundaries between design, concepts and physical implementation are not always clear, and I think the engine concept can help here. It forces a clear distinction between modelling decisions and implementation decisions at the physical level – by absorbing the implementation into the engine itself.

To assemble our engine, we need to define the various contributing components. But what is needed to build an engine such as this?
As a metaphor, consider the database management engine. A typical relational database management system (RDBMS) has built-in functionality such as:
- A transaction log, where all the fundamental transactions are stored as read-only (insert-only) records, allowing a deterministic representation of the same base data.
- Configuration and settings to manage transaction control and isolation, so you can guarantee consistency and durability.
- A query optimiser that uses statistics and parameters to predict the most efficient way to execute a query against a database and return the results to the user.
- Collection of statistics on usage patterns and the overall fitness of the environment.
- Automatic versioning of rows.
- Translation of metadata into code, for example when SQL is compiled and executed.
The engine concept performs similar functions, but at application level for the data platform overall. Through the engine, the physical model can be refactored depending on behaviour patterns and cost of performance – or whatever parameters make sense for the use-case.
The engine is an optimisation process that can automatically run behind the scenes and refactor the data solution into the best technical fit for the desired direction.
The engine idea was sparked a long time ago by discussions about Data Vault implementation, especially on concepts such as implementing transactions / events as either Link or Hub entities and performance related constructs such as the ‘Transactional Link’ and Point-in-Time tables. To me, these discussions mostly concern the physical implementation. At a slightly higher level of abstraction, the models essentially look the same. For example, all would have a notion of a business concept, relationships to other business concepts and contextual (descriptive) elements.
I often use the phrase ‘options and considerations’ when it comes to design choices such as these. Know what the pros and cons are and make an informed decision that best suits your scenario. But how amazing would it be to embed these decisions as parameters, and let them drive the most suitable physical structures?
Or better yet, let an application optimiser decide how to best deploy this?
If we elevate these data models, and modelling decisions, to this higher level of abstraction where they become the same, then these implementation decisions become a moot point. Do we really care how we model a relationship at physical level, when we agree at more abstract level it exists? Do we really need to physically instantiate key tables (such as Hubs)? We may never use them in the queries, or can guarantee integrity in other ways. Can we generate them automatically when we do need them?
All this can be done in principle using the engine concept. All we need to do is assemble the components we already need to develop a good solution anyway.
It’s actually not hard at all.
Components of the engine

One of the fundamental requirements of the engine is the capability to record design metadata in an interoperable way. These are the ‘source-to-target’ mappings that capture which data is mapped to which definition, and which transformations apply.
In the ecosystem outlined on this weblog this functionality is covered in the TEAM application.
The design metadata itself can be recorded in an agreed format such as the generic interface for Data Warehouse Automation – another project in the ecosystem.
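As an illustration, a heavily simplified sketch of the kind of information a source-to-target mapping captures is shown below. The table and column names are made up for this example only; TEAM and the generic interface use their own, much richer, formats.

```sql
-- Hypothetical, heavily simplified representation of a source-to-target mapping.
-- TEAM and the generic interface for Data Warehouse Automation use their own
-- (much richer) formats; this only illustrates the kind of information captured.
CREATE TABLE dbo.MD_SOURCE_TO_TARGET_MAPPING
(
    MAPPING_NAME     NVARCHAR(100)  NOT NULL PRIMARY KEY,
    SOURCE_OBJECT    NVARCHAR(100)  NOT NULL, -- e.g. a PSA table
    TARGET_OBJECT    NVARCHAR(100)  NOT NULL, -- e.g. a Hub, Link or Satellite
    BUSINESS_KEY     NVARCHAR(100)  NOT NULL, -- the business key definition
    FILTER_CRITERION NVARCHAR(4000) NULL,     -- optional filter condition
    LOAD_PATTERN     NVARCHAR(100)  NOT NULL  -- which data integration pattern applies
);

INSERT INTO dbo.MD_SOURCE_TO_TARGET_MAPPING
VALUES (N'Map_Customer', N'PSA_CRM_CUSTOMER', N'HUB_CUSTOMER', N'CUSTOMER_CODE', NULL, N'Hub');
```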

In many ways, the Persistent Staging Area (PSA) is at the heart of the data solution. In recent posts I refer to the PSA as the application-readable log of the data platform. A PSA captures all events / data as received from the feeding applications and stores these as-is, with added timestamps and sequence values (offsets).
A PSA can be implemented in many ways, including (but not limited to) file structures (AVRO, Parquet), relational databases or key-value stores. The PSA is fully focused on maintaining an accurate log of events and incorporates concepts such as schema evolution to manage changes in structure over time.
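For a relational implementation, a minimal PSA table could look something like the sketch below. The names and data types are illustrative only; the essence is that records are stored as received, with a load timestamp and an offset added by the platform.

```sql
-- Minimal sketch of a relational PSA table (illustrative names and types).
-- Records are stored exactly as received from the feeding application,
-- with a timestamp and a sequence value (offset) added on arrival.
CREATE TABLE psa.PSA_CRM_CUSTOMER
(
    INSCRIPTION_TIMESTAMP DATETIME2(7)         NOT NULL, -- when the event was received
    INSCRIPTION_OFFSET    BIGINT IDENTITY(1,1) NOT NULL, -- monotonically increasing offset
    CDC_OPERATION         NVARCHAR(10)         NOT NULL, -- e.g. 'Insert', 'Update' or 'Delete'
    CUSTOMER_CODE         NVARCHAR(50)         NOT NULL, -- source attributes, as-is
    CUSTOMER_NAME         NVARCHAR(200)        NULL,
    CONSTRAINT PK_PSA_CRM_CUSTOMER
        PRIMARY KEY (CUSTOMER_CODE, INSCRIPTION_TIMESTAMP, INSCRIPTION_OFFSET)
);
```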

Data integration patterns are applied to the design metadata and the recorded events in the PSA to transform the data into its desired target state (target model).
The Virtual Data Warehouse software (VDW) can act as the engine component that manages the data integration patterns. VDW uses Handlebars.js as its pattern engine and can write the output to file or execute it against a database.
The patterns need to be inherently deterministic, which in this context means they support loading in one pass or incrementally – always delivering the same result. This is critical to support refactoring, as target tables may need to be truncated and reloaded.
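As a sketch of what ‘deterministic’ means here, consider a simple insert-only load from the PSA into a (hypothetical) target table: only records that are not yet present are inserted, so running the statement over the full history or over a small increment always converges on the same end state.

```sql
-- Deterministic, insert-only load from the PSA into a hypothetical target table.
-- Because only missing records are inserted, the statement can be run over the
-- full history or incrementally and will always produce the same end result.
INSERT INTO dwh.SAT_CUSTOMER (CUSTOMER_CODE, LOAD_DATETIME, CUSTOMER_NAME)
SELECT src.CUSTOMER_CODE,
       src.INSCRIPTION_TIMESTAMP,
       src.CUSTOMER_NAME
FROM   psa.PSA_CRM_CUSTOMER src
WHERE  NOT EXISTS
       (
           SELECT 1
           FROM   dwh.SAT_CUSTOMER tgt
           WHERE  tgt.CUSTOMER_CODE = src.CUSTOMER_CODE
           AND    tgt.LOAD_DATETIME = src.INSCRIPTION_TIMESTAMP
       );
```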

Code generation combines the design metadata (the ‘what’) with the data integration patterns (the ‘how’), creating code or data logistics artefacts for the desired platform.
In the ecosystem of this weblog, the VDW software also covers this function. VDW can merge (apply) the patterns with the design metadata delivered by TEAM.
The VDW software is predominantly focused on generating SQL. However, since the design metadata is stored in the generic data warehouse automation schema, different compilers can also be used – for example Biml or third-party SDKs / APIs.
Many ETL tools have options that support code generation.
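To show the principle without a templating engine, the snippet below assembles a (very simplified) Hub statement directly from the hypothetical mapping table introduced earlier. VDW does the same thing far more elegantly with Handlebars templates.

```sql
-- Merge the design metadata (the 'what') with a simplified Hub pattern (the 'how')
-- to produce executable statements. Purely illustrative; VDW uses Handlebars
-- templates and the generic schema rather than string concatenation.
SELECT
      N'INSERT INTO dwh.' + m.TARGET_OBJECT
    + N' (' + m.BUSINESS_KEY + N')'
    + N' SELECT DISTINCT ' + m.BUSINESS_KEY
    + N' FROM psa.' + m.SOURCE_OBJECT + N' src'
    + N' WHERE NOT EXISTS (SELECT 1 FROM dwh.' + m.TARGET_OBJECT + N' tgt'
    + N' WHERE tgt.' + m.BUSINESS_KEY + N' = src.' + m.BUSINESS_KEY + N');'
      AS GENERATED_STATEMENT
FROM dbo.MD_SOURCE_TO_TARGET_MAPPING m
WHERE m.LOAD_PATTERN = N'Hub';
```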

At this stage, the data platform has the capability to capture design metadata and, via the data integration patterns, generate executable code. The source for this data is the PSA.
Combined, these components deliver the capability to refactor through a process I refer to as re-initialisation.
Re-initialisation is when you reload parts, or the entire scope, of the data solution from history (PSA) and are able to reproduce it completely as it was before you truncated it. It is meant to ensure all patterns are truly deterministic.
This capability means you can evolve your designs and extend the library of available patterns to apply. It also means that the model can change (and change back) without loss of information.
The delivery of data becomes a (flexible) schema-on-read on top of the PSA.
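In its simplest form, re-initialisation is nothing more than truncating a target and replaying the full history from the PSA with exactly the same pattern that is used for the incremental load, as in the sketch below (the wrapper procedure is hypothetical).

```sql
-- Re-initialisation sketch: truncate the target and replay the PSA history using
-- the same deterministic pattern as the incremental load. The wrapper procedure
-- name is hypothetical and stands in for the generated load statement.
TRUNCATE TABLE dwh.SAT_CUSTOMER;
EXEC dbo.usp_Load_SAT_CUSTOMER;
```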

Regardless of whether views or materialised objects (tables, indexed views) are implemented as the delivery structure, a data logistics process control framework is needed to track what happens inside the engine / data solution.
For example, all materialised data (i.e. in tables) needs to be ‘tagged’ with a pointer to the unique process (execution instance identifier) that inserted or modified the records. Similarly, each view call should be logged, so that it is known exactly who ran the view and when.
The data logistics control framework supports auditability in general, but it is also the implementation of transaction control at application level. This is because the control framework ‘knows’ what data is still being processed and what data is available for upstream selection, preventing ‘dirty reads’ against the solution and supporting the associated locking strategies for data integration.
The logistics framework also reports on data latency, a key metric for managing the overall solution.
In the ecosystem, the DIRECT framework covers this component.
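Conceptually, the tagging could be as simple as the sketch below. DIRECT has its own, much more complete, repository model – the names here are made up for illustration.

```sql
-- Register an execution instance before a data logistics process runs, and stamp
-- every record the process touches with the resulting identifier (illustrative only).
CREATE TABLE omd.MODULE_INSTANCE
(
    MODULE_INSTANCE_ID BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    MODULE_NAME        NVARCHAR(100) NOT NULL,
    START_DATETIME     DATETIME2(7)  NOT NULL DEFAULT SYSUTCDATETIME(),
    EXECUTION_STATUS   NVARCHAR(20)  NOT NULL DEFAULT N'Running' -- Running / Succeeded / Failed
);

DECLARE @InstanceId BIGINT;

INSERT INTO omd.MODULE_INSTANCE (MODULE_NAME)
VALUES (N'Load_SAT_CUSTOMER');

SET @InstanceId = SCOPE_IDENTITY();

-- The load itself then carries the identifier, e.g.:
-- INSERT INTO dwh.SAT_CUSTOMER (..., MODULE_INSTANCE_ID)
-- SELECT ..., @InstanceId FROM psa.PSA_CRM_CUSTOMER src WHERE ...;
```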

In addition to the control framework, a notification framework is needed to pro-actively inform the developers and support team about the integrity of the system.
A notification framework can consist of a number of exception checks on the solution, such as asserting referential integrity or the consistency of logical groups of data. It can also be used in conjunction with a testing framework such as NBi or tSQLt to identify more advanced information, such as data exceeding certain ranges or combinations of factors.
Tests can be run independently (via the data logistics control framework) to collect and report on the environment, and can also be used to collect system information such as CPU levels, memory use and disk use.
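A basic exception check could be as simple as the orphan check below – reporting descriptive records that have no corresponding business key (the table names are the hypothetical ones used earlier).

```sql
-- Exception check: Satellite records without a corresponding Hub record.
-- A non-empty result would trigger a notification to the support team.
SELECT sat.CUSTOMER_CODE,
       COUNT(*) AS ORPHANED_RECORDS
FROM   dwh.SAT_CUSTOMER sat
LEFT JOIN dwh.HUB_CUSTOMER hub
       ON hub.CUSTOMER_CODE = sat.CUSTOMER_CODE
WHERE  hub.CUSTOMER_CODE IS NULL
GROUP BY sat.CUSTOMER_CODE;
```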

The engine now has some of the main components in place to function well on its own, but in order to fully operate independently additional ‘DevOps’ components can be added – the first of which is versioning.
Versioning in this context is the management of snapshots across the design metadata and the model together. It goes beyond basic version control of the individual development artefacts: it is a version of the combination of components that together deliver the data solution – including data integration logic and data – at a point in time.
The premise is that if we can generate all our code, and rebuild the entire data delivery via the PSA concept, then if we snapshot our design metadata and model we can always (re)deploy the entire solution as it once was.
You can also host multiple versions of the solution at the same time and compare which one works best for you.
This component is a time machine for the entire solution, not just for data in a bi-temporal context.
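One (simplified, illustrative) way to support this is to snapshot the design metadata itself per version, alongside the usual source control of the code.

```sql
-- Illustrative only: snapshot the design metadata per release so that any version
-- of the solution can be regenerated later, or run side-by-side with another version.
CREATE TABLE dbo.MD_MAPPING_VERSION
(
    VERSION_ID         INT            NOT NULL,
    VERSION_DATETIME   DATETIME2(7)   NOT NULL DEFAULT SYSUTCDATETIME(),
    MAPPING_NAME       NVARCHAR(100)  NOT NULL,
    MAPPING_DEFINITION NVARCHAR(MAX)  NOT NULL, -- the full mapping, e.g. serialised as JSON
    CONSTRAINT PK_MD_MAPPING_VERSION PRIMARY KEY (VERSION_ID, MAPPING_NAME)
);
```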

The data solution can now deterministically deliver against any version of the design, which means that simulations can be run and compared to evolve on physical model decisions (finding the optimum). This is the technique used to test the pattern change to switch from hash keys to natural business keys.
The next step is to ensure that these ‘releases’ can be managed in a DevOps framework, specifically for Continuous Deployment. By incorporating a workflow that can operate the previous steps autonomously, the day-to-day management of the solution can be handed over to the engine, further eliminating human error.
For example, design metadata ‘knows’ which upstream tables are impacted when changes are made. The code for these tables can be automatically generated and associated target tables can be updated or even truncated.
The deployment workflow can do this automatically by detecting either changes in the metadata or commits in source control. Impacted areas of the solution can then be automatically refactored to display the data in sync with the latest (or selected) version of the design.
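Impact detection itself is a straightforward query against the design metadata – sketched below using the hypothetical mapping table from earlier.

```sql
-- Find every target that is built from a changed source, so the corresponding code
-- can be regenerated and the tables refactored (illustrative names).
DECLARE @ChangedSource NVARCHAR(100) = N'PSA_CRM_CUSTOMER';

SELECT m.TARGET_OBJECT,
       m.LOAD_PATTERN
FROM   dbo.MD_SOURCE_TO_TARGET_MAPPING m
WHERE  m.SOURCE_OBJECT = @ChangedSource;
```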

Not all transformations can easily be captured into standard patterns.
Specific, complex transformation logic can be added separately from the design metadata outlined earlier, but kept recognisable in a dedicated area of the engine.
There are many ways to do this, for example by having a directory available which is polled by the engine at runtime and which contains instructions that can be understood by the compiler (e.g. SQL views).
Another option is to store a separate set of design metadata artefacts that contain the complex logic itself. The generic interface for data warehouse automation supports this by allowing a function or transformation to be added as the ‘source’ part of a source-to-target mapping.
This can be done both at entity and attribute level.
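As a (made-up) relational illustration, an attribute-level mapping could carry an expression instead of a plain column name, which the code generator then emits as-is.

```sql
-- Illustrative attribute-level mapping where the 'source' is an expression rather
-- than a plain column, so complex logic travels with the design metadata.
CREATE TABLE dbo.MD_ATTRIBUTE_MAPPING
(
    MAPPING_NAME      NVARCHAR(100)  NOT NULL,
    TARGET_ATTRIBUTE  NVARCHAR(100)  NOT NULL,
    SOURCE_EXPRESSION NVARCHAR(4000) NOT NULL -- a column name, or a transformation
);

INSERT INTO dbo.MD_ATTRIBUTE_MAPPING
VALUES (N'Map_Customer', N'CUSTOMER_NAME', N'LTRIM(RTRIM(UPPER(src.CUSTOMER_NAME)))');
```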

To keep track of the overall design, the solution itself can be abstracted into a directed graph. The graph acts both as a visual confirmation of the solution and as input for further development.
The idea behind the graph is to truly separate the logical model from the physical model, while keeping the link between the two worlds (using the edges in the graph).
At the abstract level everything is either a concept, a relationship or a piece of descriptive information (context).
Moreover, new information can be defined as the mapping (edge) of a data set towards a new ‘thing’. The direction and (number of) definitions into this ‘thing’ give it its meaning.
In the ecosystem, the TEAM_Graph extension for TEAM is being developed using the amazing yWorks library. This is still a work in progress at the time of writing.
In the meantime, TEAM has the ability to generate a one-way export of the design as a Directed Graph Markup Language (DGML) file.
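The graph itself can be represented very simply – nodes for concepts, relationships and context, and edges for the mappings that link the two worlds. The sketch below is purely illustrative.

```sql
-- Illustrative representation of the design graph: nodes are concepts, relationships
-- or context; edges capture the mappings that link data sets to these 'things'.
CREATE TABLE dbo.MD_GRAPH_NODE
(
    NODE_ID   INT           NOT NULL PRIMARY KEY,
    NODE_NAME NVARCHAR(100) NOT NULL,
    NODE_TYPE NVARCHAR(20)  NOT NULL -- 'Concept', 'Relationship' or 'Context'
);

CREATE TABLE dbo.MD_GRAPH_EDGE
(
    FROM_NODE_ID INT NOT NULL REFERENCES dbo.MD_GRAPH_NODE (NODE_ID),
    TO_NODE_ID   INT NOT NULL REFERENCES dbo.MD_GRAPH_NODE (NODE_ID),
    EDGE_TYPE    NVARCHAR(20) NOT NULL -- e.g. 'Mapping' or 'Implements'
);
```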

The last component in the engine is the optimiser itself. This is a service that can record rules on how to operate the data solution. The optimiser receives its input from the alerting & notification framework and acts through the continuous deployment workflow.
The exact goal of optimisation can be set by the user, who interfaces with the graph model. The optimiser can refactor the solution over time towards these goals, changing the physical delivery while the design metadata stays the same.
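The rules the optimiser works with can themselves be recorded as plain metadata – for example as goals and thresholds per object, as in the (hypothetical) sketch below.

```sql
-- Hypothetical sketch of optimisation rules: goals and thresholds the optimiser
-- uses to decide when and how to refactor the physical delivery.
CREATE TABLE dbo.MD_OPTIMISATION_RULE
(
    RULE_NAME        NVARCHAR(100) NOT NULL PRIMARY KEY,
    TARGET_OBJECT    NVARCHAR(100) NULL,      -- NULL = applies to the whole solution
    METRIC           NVARCHAR(50)  NOT NULL,  -- e.g. 'LatencyMinutes', 'StorageGb', 'CostPerQuery'
    THRESHOLD_VALUE  DECIMAL(18,4) NOT NULL,  -- when exceeded, refactoring is triggered
    PREFERRED_ACTION NVARCHAR(100) NOT NULL   -- e.g. 'Materialise' or 'Virtualise'
);
```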
Listening to the environment
Having the above components configured so they can interact with each other as part of a framework is a huge milestone.
Even though we may still need to work on the data models, project planning and associated activities to deliver meaningful information, we can now do so knowing there is an asset available that ensures fast and reliable delivery.
As soon as the first data is mapped and is being processed (either near real-time or batch driven), we can start understanding the effects the patterns have on the data. The data that is collected from the host infrastructure is something I refer to as technical metadata – as opposed to the design metadata covered in the engine component overview.
Technical metadata can cover many things at the infrastructure level – for example disk space, memory use, I/O and CPU usage. If we need to optimise against a fixed infrastructure, we can start running tests to improve performance using these metrics.
But technical metadata can also be about query usage patterns and delivery. Refactoring can then be directed to optimise for latency in delivery (the data refresh lag compared with ‘now’). Another perspective can be cost, especially in a managed database and/or cloud context. Query use versus storage footprint can be used to calculate an optimum here as well.
The technical metadata can be inserted as environmental parameters into the engine and, via the alerting & notification process, trigger refactoring into a certain direction. In all cases though, the design metadata will stay the same.
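As an example of technical metadata on query usage, assuming a SQL Server environment, the query below aggregates read and write activity per object – input the optimiser can weigh against storage footprint, latency or cost.

```sql
-- Collect query usage patterns per object (SQL Server example): the balance of reads
-- versus writes is one of the environmental parameters that can drive refactoring.
SELECT OBJECT_NAME(s.object_id)                          AS OBJECT_NAME,
       SUM(s.user_seeks + s.user_scans + s.user_lookups) AS READ_OPERATIONS,
       SUM(s.user_updates)                               AS WRITE_OPERATIONS
FROM   sys.dm_db_index_usage_stats s
WHERE  s.database_id = DB_ID()
GROUP BY s.object_id;
```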

