An effective lightweight automation approach for Azure Data Factory

Last week I started at Varigence to work with the team on the BimlFlex solution for Data Warehouse Automation, so it is time to revisit some techniques in the Microsoft space.

While doing so, I re-created a lightweight approach for code generation and (data warehouse) automation that I really like and am sure can be useful in certain scenarios. It is one of many ways to get things done, and worth sharing as such.

Thanks go out to Komal Maurya for showing these capabilities.

This ETL/ELT generation & automation concept for Azure Data Factory (ADF) fits in the simple ETL generation series. This series contains a variety of similar-style approaches for getting started with generating data logistics for your data solution, using templating (engines) and metadata input.

The same can be done for ADF, and in a very easy and (in my opinion) elegant way.

A setup for ADF Data Warehouse Automation

The approach is simple: ADF offers a ForEach activity, which essentially iterates over a collection (list) of items. For each item, it can perform one or more specified (sub)activities, such as invoked pipelines.

The convenient feature is that an item can be a complex object, such as a JSON structure. The example below shows this setup, where the ForEach object iterates over the items as available in the @pipeline().parameters.metadata object.

In this simple example setup, a ‘batch’ pipeline is created (displayed above) which calls another pipeline – called ‘pipeline_src_stg’.
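For reference, a sketch of what the ForEach activity looks like in the underlying JSON definition of this batch pipeline is shown below. The activity names are placeholders for this example; only ‘pipeline_src_stg’ is taken from the setup described here, and the parameter mapping inside the Execute Pipeline activity is covered further down.

{
  "name": "ForEach_module",
  "type": "ForEach",
  "typeProperties": {
    "items": {
      "value": "@pipeline().parameters.metadata",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "Execute pipeline_src_stg",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": {
            "referenceName": "pipeline_src_stg",
            "type": "PipelineReference"
          },
          "waitOnCompletion": true
        }
      }
    ]
  }
}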

The obvious question is: where does this ‘metadata’ object come from? This is defined as a parameter (or variable) at pipeline level.

This ‘metadata’ parameter is defined as an array, so something that can contain several items. As mentioned before, an item can be a complex object, so it is possible to inject metadata in here that contains lots of information.
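In the pipeline’s JSON definition, this boils down to a parameter declaration along these lines (a sketch; the default value can be left empty or filled with the example array shown below):

"parameters": {
  "metadata": {
    "type": "array",
    "defaultValue": []
  }
}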

For this example, the array that is added here is as follows:

[
  {
    "runStgProcess": {
      "moduleCode": "exampleModule1",
      "sqlStatement": "SELECT Test FROM SourceTable",
      "targetTable": "TargetTable",
      "targetSchema": "dbo"
    }
  },
  {
    "runStgProcess": {
      "moduleCode": "exampleModule2",
      "sqlStatement": "SELECT SecondTest AS Test FROM SourceTableTwo",
      "targetTable": "TargetTable",
      "targetSchema": "dbo"
    }
  },
  {
    "runStgProcess": {
      "moduleCode": "exampleModule3",
      "sqlStatement": "SELECT Test FROM SourceTable",
      "targetTable": "TargetTableTwo",
      "targetSchema": "dbo"
    }
  },
  {
    "runStgProcess": {
      "moduleCode": "exampleModule4",
      "sqlStatement": "SELECT SecondTest AS Test FROM SourceTableTwo",
      "targetTable": "TargetTableTwo",
      "targetSchema": "dbo"
    }
  }
]

Basically, this JSON array (indicated by the square brackets) contains four separate data logistics processes (modules), each containing a SQL statement to retrieve the data and a target to load the results into. This is a simple example – but it can be extended to include additional properties, nested objects and basically anything that you have available in your metadata back-end.

If you are wondering where the SQL statement is coming from, please have a look at the corresponding paragraph near the end of this post for some background on this.

For now, this example loads records from two arbitrary sources into two arbitrary targets. Sources 1 and 2 both go into target 1, and both sources are also loaded into target 2.

Inside the ForEach container

The activity called for each iteration is the invoked pipeline. In this pipeline, again at pipeline level, a parameter can be configured to ‘catch’ the value coming down from the original array in each iteration.

This is an object, which can be given the default value of @item() – a built-in expression in ADF. The item() here is the current object being iterated over by the ForEach activity.

To make this work, a parameter must be created at pipeline level for the pipeline_src_stg as displayed in the screenshot below.

This parameter can then be linked to the pipeline as invoked by the ForEach activity.
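In JSON terms, and as a sketch only, the two sides of this link look roughly as follows. The invoked pipeline (‘pipeline_src_stg’) declares the parameter, and the Execute Pipeline activity inside the ForEach maps @item() onto it.

In pipeline_src_stg:

"parameters": {
  "metadata": {
    "type": "object"
  }
}

In the Execute Pipeline activity of the batch pipeline:

"parameters": {
  "metadata": {
    "value": "@item()",
    "type": "Expression"
  }
}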

Now, in each iteration, the ‘metadata’ object receives an individual JSON segment from the original array. This information is now available in the pipeline and can be used everywhere in the activity, so that data can be moved around driven by the metadata.

Inside the (invoked) pipeline, a standard Copy Data activity has been created. This activity can be configured to accept dynamic sources and targets, which we can take from the metadata object (the item) that has been made available to the pipeline.

A generic connection for the ‘source’ has been created (a dataset in ADF), which is configured to connect to the ‘source’ database. This source connection for the Copy Data activity can accept an object in addition to a straight-up SQL statement when we select the option to use a query as a source.

Objects can be added using the ‘add dynamic content’ option which becomes available if you click around in the query form.

In this case, because we want to make sure the sqlStatement property from the JSON input is used, we can add the following object reference here: @pipeline().parameters.metadata.runStgProcess.sqlStatement.

This syntax follows the structure of the JSON input.

This way, we select the sqlStatement property from the runStgProcess object. The ‘metadata’ refers to the parameter we defined earlier to catch each individual JSON segment (item) in the invoked pipeline.

In other words, we specify that the source is a query but that the details of this query are passed on from the metadata object.
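In the Copy Data activity definition, this roughly translates to the source configuration below. This is a sketch that assumes an Azure SQL Database source (‘AzureSqlSource’); other SQL connectors use an equivalent source type with a similar query property.

"source": {
  "type": "AzureSqlSource",
  "sqlReaderQuery": {
    "value": "@pipeline().parameters.metadata.runStgProcess.sqlStatement",
    "type": "Expression"
  }
}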

The configuration of the target is slightly more work because we have to configure additional parameters in the target data set for the table name and the schema name.

For the intended outcome, the table and schema names must be taken from the metadata object as in the screenshot below. But to make these options available in the first place, we need to define them in the sink (target) dataset.

The values used are as follows, and as before these correspond to the properties of the input JSON structure.

  • @pipeline().parameters.metadata.runStgProcess.targetSchema
  • @pipeline().parameters.metadata.runStgProcess.targetTable

To make these fields available, parameters can be defined for the dataset which is designated as the target. These are just string types as they will contain the corresponding string property from the JSON file.

And lastly, these local parameters can be assigned to the schema and table name of the dataset.
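Putting the sink side together, a sketch of the target dataset and the way the Copy Data activity passes the metadata values into it is shown below. The names ‘ds_target’ and ‘ls_target_database’ are placeholders, and the dataset type assumes an Azure SQL Database target.

The parameterised target dataset:

{
  "name": "ds_target",
  "properties": {
    "linkedServiceName": {
      "referenceName": "ls_target_database",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "targetSchema": { "type": "string" },
      "targetTable": { "type": "string" }
    },
    "type": "AzureSqlTable",
    "typeProperties": {
      "schema": {
        "value": "@dataset().targetSchema",
        "type": "Expression"
      },
      "table": {
        "value": "@dataset().targetTable",
        "type": "Expression"
      }
    }
  }
}

The dataset reference in the Copy Data activity, which passes the metadata values into these parameters:

"outputs": [
  {
    "referenceName": "ds_target",
    "type": "DatasetReference",
    "parameters": {
      "targetSchema": {
        "value": "@pipeline().parameters.metadata.runStgProcess.targetSchema",
        "type": "Expression"
      },
      "targetTable": {
        "value": "@pipeline().parameters.metadata.runStgProcess.targetTable",
        "type": "Expression"
      }
    }
  }
]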

Running the new process

At this stage, everything is in place to run the process with the input JSON content or file. For demonstration purposes, the ForEach activity is configured to run each process sequentially, but in ADF the iterations can also be run in parallel.
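This behaviour is controlled in the ForEach settings; as a sketch, sequential versus parallel execution looks like this (the batch count of 4 is just an example):

Sequential:

"typeProperties": {
  "isSequential": true
}

Parallel, with up to four concurrent iterations:

"typeProperties": {
  "isSequential": false,
  "batchCount": 4
}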

To run the process and input the metadata, the complete JSON can be inserted as input when the process is started.

This will open the monitoring console, where you can see that the information is passed to each iteration.

When the processes are finished (four iterations, as per the four JSON segments in the array), we can see the result in the database.

Success! Both sources are loaded to both targets as per the specified four modules.

Inputting the JSON metadata file

For this example we stored the JSON array as a pipeline parameter, and have the option to paste it into the parameter when we run the pipeline manually.

In ‘real’ deployments, the JSON input can be provided as a parameter or as a pointer to a JSON file in an accessible location such as an Azure Storage account (file storage).

Creating a pointer to a location where the metadata is maintained also helps as a workaround for form limitations in the Azure portal, because there is a limit on the number of characters that can be entered there. If you use a file location, all you need to do is update your JSON and continue running the processes – no updates to the ETL objects are needed.
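A common way to set this up (a sketch, assuming the metadata is stored as a JSON file in Blob Storage and exposed through a JSON-format dataset named ‘ds_metadata_json’) is a Lookup activity that reads the file, with the ForEach then iterating over the Lookup output instead of the pipeline parameter. Note that the Lookup activity has an output size limit, so very large metadata sets may need to be split.

{
  "name": "Lookup metadata file",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "JsonSource",
      "storeSettings": {
        "type": "AzureBlobStorageReadSettings"
      }
    },
    "dataset": {
      "referenceName": "ds_metadata_json",
      "type": "DatasetReference"
    },
    "firstRowOnly": false
  }
}

The ForEach items expression then becomes:

"items": {
  "value": "@activity('Lookup metadata file').output.value",
  "type": "Expression"
}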

Here you go, a fully generic SQL execution engine driven by input JSON metadata in ADF!

With this lightweight setup, you only configure ADF once. After this, everything can be metadata-driven.

Where does the SQL statement come from?

What has not been covered in this post is the creation of the JSON file itself. Generating a JSON file is not fundamentally different from generating SQL (or other) code, as has been explained throughout this blog.

Examples include the simple ETL automation series and template-based code generation related to, for example, the Virtual Data Warehouse concept. The generic schema for Data Warehouse Automation also has many examples of how to generate this kind of output, including Data Vault examples and complex transformations.

The same approaches can be used to export your metadata in a JSON format that is defined in a way that matches with your ADF configuration. All you need is a good mechanism for managing your metadata, and potentially exposing this using the generic schema for Data Warehouse automation.

From here, it is easy to generate JSON files that can deliver real-world data solutions in various ways including this ADF example approach.

Final thoughts

This approach works well for solutions that look to push down the compute into the database engine. Essentially, ADF is used as a scheduler and looks after the connectivity. All logic is executed against the selected database platform.

This is in many ways similar to approaches where templating engines are used to generate (complex) SQL, which is then executed against a given target platform. Data solutions using pattern-friendly methodologies such as Data Vault greatly benefit from this, and it is perfectly feasible to generate the entire solution this way. In fact, this has been done many times in real-world projects combining multiple and complex operational systems with large data sets.

The benefit is that ADF can be used to control the degree of parallelism and the connections. Many things can be parameterised in ADF, and wherever this is the case, metadata can be injected.

And in ADF, you can parameterise a lot.

If this makes sense to you, it may be worth checking out similar examples using DBT, the use of templating engines and of course the Virtual Data Warehouse.

The downside is that only generic configurations are used, and all logic is encapsulated in SQL. If you need a specific ADF object to do something different, you will need a separate component for this. Otherwise, all complex transformations are done in SQL, so your back-end must be able to generate this.

Another important note is that, for this to work, you need a mature metadata repository, catalog or framework so that you can manage your metadata in the first place – an important consideration when weighing a DIY approach against investing in vendor software.

This is where BimlFlex comes in. Not only does BimlFlex have a complete metadata repository, but the platform allows you to specify metadata in such a way that native data logistics components can be generated for a number of supported target technologies (including ADF). For this scenario, the metadata is then generated into ADF objects such as complete pipelines, activities, lookups and data flows using the Biml engine.

Roelant Vos
