ETL is Easy Again in Azure Data Factory

by Kate | Jun 20, 2020

Almost two years ago I was part of the BI team in an organization that decided to move to the Cloud. As a first step they followed the lift-and-shift approach, moving their Data Warehouse and all the existing processes into the Azure environment almost untouched. They were also planning to eventually transform their solution to be fully cloud-native.

Screenshot of the Mapping data flow from the Microsoft website

Thankfully, good old relational databases are still there: improved, automatically replicated, and easily scalable, which is perfect for data storage. The hard part is data transformation and, especially, integration, which had previously been done with SQL Server Integration Services (SSIS). At the time, Azure lacked a simple cloud equivalent; ETL options were limited to writing code in Big Data tools, which is not the most common way of treating relational data. Although my contract finished then, finding a simple solution for ETL in Azure remained my unfinished gestalt.

In late 2019, mapping data flows were added to the Azure Data Factory toolset. This is a visual tool for building ETL data flows, much like SSIS but in the Cloud. It covers all the basic ETL functions: getting data from almost any imaginable source, merging datasets, aggregating, pivoting, adding calculated columns, and finally landing the results in some data storage. All of these computations run on an Apache Spark cluster, which means you can manage the parallelism of your flow tasks.
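For intuition, here is what those visually configured steps amount to, sketched in plain Python. This is a hypothetical illustration with made-up sample data, not ADF's actual execution model (which is configured in the designer and runs on Spark): it joins two small datasets, adds a derived column, and aggregates.

```python
# Hypothetical sketch of typical mapping data flow steps
# (join, derived column, aggregate) in plain Python.
# In ADF these steps are configured visually and run on Spark.
from collections import defaultdict

# Stand-ins for two source transformations
orders = [
    {"order_id": 1, "customer_id": "C1", "amount": 120.0},
    {"order_id": 2, "customer_id": "C2", "amount": 75.5},
    {"order_id": 3, "customer_id": "C1", "amount": 30.0},
]
customers = {"C1": "Alice", "C2": "Bob"}

# Join step: enrich each order with the customer name
joined = [{**o, "customer_name": customers[o["customer_id"]]} for o in orders]

# Derived-column step: add a calculated column
for row in joined:
    row["amount_with_tax"] = round(row["amount"] * 1.2, 2)

# Aggregate step: total amount per customer (the "sink" of this sketch)
totals = defaultdict(float)
for row in joined:
    totals[row["customer_name"]] += row["amount_with_tax"]

print(dict(totals))  # {'Alice': 180.0, 'Bob': 90.6}
```

In a real data flow, each of these steps would be a separate transformation box on the canvas, and Spark would parallelize them across the cluster.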

Data processing for Business Intelligence consists of simple, repetitive tasks, so it is very good news that you don't have to write (or, more likely, endlessly copy and paste) code yourself, but can instead build flows visually from a set of elements. With the Script Task available in SSIS, it is a fully fledged programming tool for many ETL developers, and its possibilities are endless. ADF mapping data flows don't include a script task, but plenty of other Azure tools are available to solve any extra challenges.

While it is still possible to keep your old SSIS packages running in Azure, you can also start processing your data in a cloud-native way. ADF supports Git (a feature we also lacked back then on the project), so all the good practices can stay almost as they were on-premises.

Kate
Data masseuse

Kate writes technical blogs about data warehouses, and is a Data Vault convert who works mostly with MS SQL Server. You can connect with Kate on LinkedIn, or read her other blogs here.
