Getting started with Confluent Kafka as a data professional – part 1

Overview

This article explains how to publish and consume your own events using the Confluent Platform for Kafka and the corresponding C# libraries. Because of its length, the article has been broken up into four parts: this part explains the context, the second and third parts cover the code itself, and the final part is a video demonstration.

Sample code is provided at https://github.com/RoelantVos/confluent-data-object-mapping.

Event sourcing for data solutions

Events are all the rage, and for good reason. Having an immutable, time-ordered, insert-only set of original transactions is exactly what you need to be able to deliver a deterministic representation of the state for a given object.

In the event context this is referred to as ‘event sourcing’, where the stream of events (transactions) associated with a given ‘thing’ or object can be used to describe the state at a given point in time. By following each successive change (the historical events) you can derive what the current state is, for example calculating a balance over a series of financial transactions.
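To make this concrete, below is a minimal sketch in C# (the language used for the rest of this series, assuming C# 9 or later for the record syntax) of deriving an account balance purely by replaying an ordered set of transaction events. The types, dates and amounts are hypothetical and only serve to illustrate the principle; this is not code from the sample repository.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A hypothetical, immutable transaction event (illustrative only).
public record AccountTransaction(DateTime OccurredAt, decimal Amount);

public static class EventSourcingExample
{
    // The balance is never stored; it is derived by replaying every
    // transaction that occurred up to the requested point in time.
    public static decimal BalanceAt(IEnumerable<AccountTransaction> events, DateTime pointInTime) =>
        events
            .Where(e => e.OccurredAt <= pointInTime)
            .Sum(e => e.Amount);

    public static void Main()
    {
        var history = new List<AccountTransaction>
        {
            new(new DateTime(2021, 1, 1), 100m),  // deposit
            new(new DateTime(2021, 1, 5), -25m),  // withdrawal
            new(new DateTime(2021, 1, 9), 50m)    // deposit
        };

        // The state 'as at' any point in time is a deterministic function of the history.
        Console.WriteLine(BalanceAt(history, new DateTime(2021, 1, 6)));  // 75
        Console.WriteLine(BalanceAt(history, new DateTime(2021, 1, 31))); // 125
    }
}
```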

In a Data Warehouse / Business Intelligence (BI) context we would call this replaying history.

Sound familiar? The ability to deterministically derive results from an immutable set of transactions, persisted somewhere that is accessible to various consumers? This is what I mean when I describe the Persistent Staging Area (PSA) as an application-readable log.

In my view the underlying concepts for the PSA and a distributed commit log (such as Kafka) are basically the same (the technology differs, of course). However, it seems the world of events and Data Warehousing / BI don’t come together all that often.

In Data Warehouse systems we often think of state differently. Within the Data Warehouse itself we go to great lengths to represent data correctly at a point in time. However, in many cases the Online Transaction Processing (OLTP, or ‘operational’) systems that provide the data don’t record (or make available) everything that happens.

Rather, what we often see is that a specific state (‘change’) is recorded in a database, but not necessarily the steps in between that led to that state. There may have been many intermediate changes, but we don’t always know about them.

The Data Warehouse reads these states from the operational systems and tries to piece together what happened as best it can.

This is why I often compare Data Warehousing and BI to palaeontologists and biologists constructing (and modifying!) the Tree of Life based on the fossil record and DNA tracing. We can only work with what we know, and must remain open-minded about discoveries that could significantly alter our current world view.

An exception is the transaction-log-based Change Data Capture (CDC) mechanism built into many Relational Database Management Systems (RDBMS). Since the transaction log of a database is itself an application-readable log, the history of transactions can be used to follow the changes in data. Indeed, there are various connectors in the Kafka ecosystem that can consume CDC tables directly into a Kafka topic.

CDC is a mature and useful technology, but it still may not give you the same representation of reality that applications built from the ground up on event sourcing concepts can provide. After all, between the events as they actually occur and the database (where changes in data are recorded) there can still be an application (logic) layer.

Event sourcing as a principle allows you to build up your complete system based on your log. It doesn’t lose information – by definition you have all the data this way.

Getting started with Confluent Kafka

Apache Kafka (open source) is currently the de facto standard for creating and managing an event-driven ecosystem. Confluent, through the Confluent Platform, packages the Kafka distribution with a number of supporting technologies and makes this available as a managed cloud service or as a download for on-premises use.

There are many guides on how to use Kafka in general and for the Confluent platform in particular. The following links are worth a look:

  • Getting started with a local Docker image for the full Confluent platform. There are also some good references here. One of the benefits is that the Docker images allow you to use all features, including ksqlDB (see below), but they require a bit of configuration and management to get (and keep) running. An additional benefit is that this allows you to deploy your own connectors, something I have done to test out the CDC connectors for SQL Server.
  • Register a personal-use account on the PaaS Confluent Cloud, which is free to use provided you are OK working with a limited number of nodes and partitions. This is the ‘basic’ version. It does not allow free use of ksqlDB (see below), but does support the Schema Registry.
  • Examples of using the .Net libraries to interface with the Confluent platform.

For the example code used to research this article, I am using Confluent Cloud and version 1.5 of the .Net (C#) libraries.
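As a preview of what using the library looks like (the detailed walk-through follows in the next parts), below is a minimal producer sketch using the Confluent.Kafka NuGet package. The bootstrap server, API key/secret and topic name are placeholders that you would replace with your own Confluent Cloud cluster details; the code in the sample repository is more elaborate.

```csharp
using System;
using System.Threading.Tasks;
using Confluent.Kafka; // Confluent.Kafka NuGet package (v1.5 at the time of writing)

public static class SimpleProducer
{
    public static async Task Main()
    {
        // Placeholder connection details for a Confluent Cloud cluster.
        var config = new ProducerConfig
        {
            BootstrapServers = "<your-cluster>.confluent.cloud:9092",
            SecurityProtocol = SecurityProtocol.SaslSsl,
            SaslMechanism = SaslMechanism.Plain,
            SaslUsername = "<api-key>",
            SaslPassword = "<api-secret>"
        };

        using var producer = new ProducerBuilder<string, string>(config).Build();

        // Publish a single event to a (hypothetical) topic.
        var result = await producer.ProduceAsync(
            "sample-topic",
            new Message<string, string> { Key = "customer-1", Value = "{\"name\":\"Example\"}" });

        Console.WriteLine($"Delivered to {result.TopicPartitionOffset}");
    }
}
```

Consuming works along the same lines with a ConsumerBuilder, a sketch of which follows further below.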

Event streaming, the future of the PSA?

When developing a data solution based on a PSA, one way to look at what we are doing is that we are transforming the history of captured transactions into our interpretation of reality – the target data model. We can always delete these interpretations, because they are calculated and can be redeployed again given the correct metadata.

This is why we can support (automated) refactoring as described in the engine concept.

The mechanism to present PSA data as a Data Warehouse model is something I refer to as the Virtual Data Warehouse. Even though some tables may be persisted for performance reasons, everything is in principle redundant. The whole system may as well be a set of views – which it sometimes is – or can be deleted and rebuilt completely.

While APIs and compilers exist for various Extract-Transform-Load (ETL) platforms, these concepts are more and more based on plain SQL, with the various freely available templates being a case in point. Once your data is in the PSA, many good, scalable database solutions exist to process everything there using SQL.

There are now some really interesting techniques becoming mainstream that may be able to replace the approach of running the Virtual Data Warehouse on more traditional database systems using SQL. If you consider a topic in Kafka to be the equivalent of a PSA table or file, it is now increasingly easy to query this topic directly – this is referred to as event streaming or event stream processing.

Event streaming means interrogating the event stream (both historical and incoming events). Until very recently, programming skills in languages such as Java or Scala were required to implement this. However, solutions are now available to query the history of events directly in the stream/topic (on Kafka), for example using ksqlDB. This of course assumes some retention of events in the Kafka topic is in place, perhaps even set to infinite retention.
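ksqlDB is not the only way to work with the retained history, though. Provided retention is in place, a plain .Net consumer can also replay a topic from the first retained offset, which is conceptually the same as reading a PSA table in full. Below is a minimal, hypothetical sketch; again, the connection settings and topic name are placeholders.

```csharp
using System;
using System.Threading;
using Confluent.Kafka;

public static class ReplayConsumer
{
    public static void Main()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "<your-cluster>.confluent.cloud:9092",
            SecurityProtocol = SecurityProtocol.SaslSsl,
            SaslMechanism = SaslMechanism.Plain,
            SaslUsername = "<api-key>",
            SaslPassword = "<api-secret>",
            GroupId = "replay-example",
            // Start from the first retained event so the full history is replayed.
            AutoOffsetReset = AutoOffsetReset.Earliest
        };

        using var consumer = new ConsumerBuilder<string, string>(config).Build();
        consumer.Subscribe("sample-topic");

        // Read events until the timeout expires; each message is one historical transaction.
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
        try
        {
            while (true)
            {
                var result = consumer.Consume(cts.Token);
                Console.WriteLine($"{result.Message.Key}: {result.Message.Value}");
            }
        }
        catch (OperationCanceledException)
        {
            consumer.Close();
        }
    }
}
```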

Event streaming solutions such as ksqlDB are something I will definitely look further into. In my tests, ksqlDB so far wasn’t able to support the full complexity that is required to create a Virtual Data Warehouse (see the various papers on ‘patterns for Data Mart delivery’), but it’s certainly worth following.

My thinking is that at some point it may become possible to apply the same logical model approach we are working on with the engine to an event streaming solution.

In the following article I’ll go through the code to publish and consume your own events in a simple way.

Roelant Vos
