How to eliminate Azure Data Factory’s public Internet exposure using Private Link

An illustration depicting Azure Data Factory, next to a picture of Bit the Raccoon.

Azure Private Link became generally available in February 2020. Since then, it has made numerous Azure PaaS services more secure. By eliminating data transfer over the public Internet, Azure Private Link significantly reduces your exposure to cyber security attacks.

Of all the PaaS services, Azure Data Factory (ADF) is one of the most important to secure. An ADF instance can connect to numerous data sources that might contain sensitive customer information, so the impact of data exposure can be far more serious than with other PaaS services. Securing ADF is therefore critical to building a secure data solution on Azure.

In the following sections, we will:

  1. Explain how Private Link makes Azure Data Factory more secure;
  2. Provide sample ARM templates for you to provision an ADF environment that makes use of Private Link; and
  3. Provide reference links to Azure certification learning paths, in case you want to learn more about Azure Data Factory or Azure Private Link.

 

What is Azure Private Link?

Azure Private Link enables you to access Azure PaaS services (for example, Azure Storage and SQL Database) and Azure-hosted customer-owned or partner services over a private endpoint in your virtual network. Traffic between your virtual network and the service travels over the Microsoft backbone network, so exposing your service to the public Internet is no longer necessary.

Refer to this link for more details about Private Link:

 

What is Azure Data Factory?

Azure Data Factory is a cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data flows or by using compute services such as Azure HDInsight, Azure Databricks, and Azure SQL Database.

Azure Data Factory consists of two planes: the “control plane” and “data plane”.

  • The “control plane” stores metadata such as pipeline definitions and schedules, and provides Data Factory pipelines with authoring and monitoring capabilities.
  • The “data plane” is a compute infrastructure called the Integration Runtime (IR), which provides the data integration capabilities. It connects to “linked services”, which are data stores or compute services, to perform “activities” such as copying data between data stores, running Data Flows, or dispatching transform activities to other Azure services such as HDInsight, Databricks and Azure Machine Learning. There are three types of integration runtime: Azure, Self-hosted and Azure-SSIS. At the time of writing, Private Link is generally available only for the self-hosted integration runtime, so this blog uses the self-hosted integration runtime for illustration.

The logical diagram below illustrates the various components of an Azure Data Factory pipeline:

A logical diagram illustrating the various components for an Azure Data Factory pipeline

Refer to the link below for more details about Azure Data Factory:

 

How does Private Link make Data Factory more secure?

To illustrate the idea, we will look at a simplified Data Factory infrastructure setup below, where there is:

  1. One instance of Data Factory that stores the metadata of the pipeline;
  2. One storage account that represents the “source” linked service;
  3. One storage account that represents the “destination” linked service; and
  4. One integration runtime that performs the actual data movement from “source” to “destination”.

Network diagram – before Private Link for ADF is implemented

An Azure Data Factory network diagram, without private link

Before Private Link was available:

  • the communication between the ADF IR and the ADF control plane had to traverse the public Internet; and
  • the following communication channels between Azure Data Factory and the virtual network had to be opened up:
    • adf.azure.com, port 443
    • *.{region}.datafactory.azure.net, port 443
    • *.servicebus.windows.net, port 443
    • download.microsoft.com

All of these together added unnecessary risk exposure for Azure Data Factory.
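
To make that list easier to review, here is a minimal Python sketch that expands it for a given region. The function and class names are my own, and the region value in the usage note is only an example. The post does not state a port for download.microsoft.com, so the sketch leaves it as None rather than guessing.

```python
from typing import List, NamedTuple, Optional


class OutboundRule(NamedTuple):
    """One domain the self-hosted IR had to reach without Private Link."""
    host: str
    port: Optional[int]  # None where this post does not state a port


def outbound_rules_without_private_link(region: str) -> List[OutboundRule]:
    """Expand the domain list from this post for one Azure region.

    The host patterns are copied from the list above; only the {region}
    placeholder is substituted.
    """
    return [
        OutboundRule("adf.azure.com", 443),
        OutboundRule(f"*.{region}.datafactory.azure.net", 443),
        OutboundRule("*.servicebus.windows.net", 443),
        OutboundRule("download.microsoft.com", None),
    ]
```

Calling outbound_rules_without_private_link("uksouth"), for instance, gives you the four entries that a firewall review would otherwise need to cover.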

I have exported the ARM templates for the above setup to the GitHub link here for your reference.

Network diagram – after Private Link for ADF is implemented

With Private Link, you can secure the communication between the ADF IR and the ADF control plane. The diagram below illustrates how it works:

An Azure Data Factory network diagram, with private link

What’s more, you no longer need to open the preceding domains and ports in the virtual network, which further reduces your risk exposure. I have exported the ARM templates for the above setup to the GitHub link here.

If you compare the list of resources before and after implementing Private Link, you should notice that three additional resources are created to support Private Link.

A screenshot of the resource list - after adding Azure private link.
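
For a rough idea of what one of the added resources can look like in an ARM template, here is a sketch that builds a private endpoint resource in Python and prints it as JSON. This is not the exported template from the repo: the function name, the connection name, and the apiVersion are my assumptions, and the "dataFactory" group ID is my understanding of the sub-resource name for Data Factory, so check it against the exported template before relying on it.

```python
import json
from typing import Any, Dict


def private_endpoint_resource(
    name: str, location: str, subnet_id: str, factory_id: str
) -> Dict[str, Any]:
    """Build an ARM resource for a private endpoint that targets a data factory.

    subnet_id and factory_id are full Azure resource IDs supplied by the caller.
    """
    return {
        "type": "Microsoft.Network/privateEndpoints",
        "apiVersion": "2020-05-01",  # assumption: any supported version works
        "name": name,
        "location": location,
        "properties": {
            "subnet": {"id": subnet_id},
            "privateLinkServiceConnections": [
                {
                    "name": f"{name}-connection",  # hypothetical naming choice
                    "properties": {
                        "privateLinkServiceId": factory_id,
                        "groupIds": ["dataFactory"],  # assumed sub-resource name
                    },
                }
            ],
        },
    }


def to_template_json(resource: Dict[str, Any]) -> str:
    """Serialise the resource as indented JSON for pasting into a template."""
    return json.dumps(resource, indent=2)
```

Building the fragment in code keeps the resource IDs as parameters, so the same sketch can be reused across environments.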

 

How do I set up Private Link for Azure Data Factory, and ensure no public Internet is allowed?

You can configure Private Link for Azure Data Factory via the portal UI. Below is a screen capture for quick reference:

The Private Endpoint screen in Azure Data Factory

From here, you can disable public network access to the Data Factory:

The Network Access screen in Azure Data Factory
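
If you prefer to capture the same setting in an ARM template rather than the portal, the sketch below builds a minimal Data Factory resource with public network access disabled. It assumes the publicNetworkAccess property on the Microsoft.DataFactory/factories resource with apiVersion 2018-06-01. The function name is my own, and a real template would carry more properties than shown here.

```python
import json
from typing import Any, Dict


def data_factory_resource(name: str, location: str) -> Dict[str, Any]:
    """Build a minimal ARM resource for a data factory with public access off."""
    return {
        "type": "Microsoft.DataFactory/factories",
        "apiVersion": "2018-06-01",
        "name": name,
        "location": location,
        "properties": {
            # Assumed template equivalent of the portal's network access setting.
            "publicNetworkAccess": "Disabled",
        },
    }


def to_template_json(resource: Dict[str, Any]) -> str:
    """Serialise the resource as indented JSON for pasting into a template."""
    return json.dumps(resource, indent=2)
```

Printing to_template_json(data_factory_resource("my-adf", "uksouth")) gives a fragment you can compare with the exported template; both argument values are placeholders.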

 

How can I tell if my self-hosted integration runtime is connecting to the private endpoint?

You can verify this by checking the DNS resolution of the service endpoint hostname on the integration runtime machine. You can get the hostname from the authentication key string in the Azure Data Factory portal, as shown below:

A screenshot of the Integration runtime Authentication Key dialog, in Azure Data Factory.

If a private endpoint is set up, the DNS resolution on your self-hosted integration runtime should return an internal IP address. In the example below, the hostname resolves to 192.168.168.5, which is the intranet address assigned by the internal DNS.

A screenshot of command prompt showing DNS resolution in Azure Data Factory
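
The same check can be scripted. Here is a minimal Python sketch, to be run on the integration runtime machine, that resolves a hostname and reports whether the answer is a private address. The function names are my own, and you supply the hostname yourself from the authentication key string; the sketch does not try to parse the key.

```python
import ipaddress
import socket
from typing import Callable


def is_private_ip(address: str) -> bool:
    """Return True if the address is in a range the ipaddress module treats
    as private (this includes RFC 1918 ranges such as 192.168.0.0/16)."""
    return ipaddress.ip_address(address).is_private


def resolves_privately(
    hostname: str, resolve: Callable[[str], str] = socket.gethostbyname
) -> bool:
    """Resolve hostname with the machine's DNS settings and report whether
    the returned IPv4 address is private.

    The resolver is injectable so the logic can be exercised without a
    network connection.
    """
    return is_private_ip(resolve(hostname))
```

A result of True from resolves_privately with your service endpoint hostname suggests the runtime is reaching the private endpoint. A result of False means the name still resolves to a public address, so the private DNS configuration is worth checking.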

 

I’d like to know more – where may I get more information?

Azure Data Factory is just one of the data services offered on Azure. To learn more about the management, monitoring, security and privacy of data on Azure, the learning path below for “Azure Data Engineer Associate” will help:

Private Link is one of many security features offered by Azure. To learn more about the other security features, the learning path below for “Azure Security Engineer Associate” gives a more comprehensive overview of Azure security:

If you are looking for information on the security considerations specific to Data Factory, refer to the link below:

And finally, the examples given in this blog can be found in the GitHub repo here. We hope you find this blog post useful!