There are a number of considerations when configuring access to Azure Data Lake Storage Gen2 (ADLS) from Azure Databricks (ADB). How will Databricks users connect to the lake securely, and how does one configure access control based on identity? In a previous article we covered six access control patterns, the advantages and disadvantages of each, and the scenarios in which they would be most appropriate. This article aims to complete the security discussion by providing an overview of network security between these two services, and how to connect securely to ADLS from ADB using Azure Private Link.
Secure access to Storage/ADLS Gen2
In Azure there are two types of PaaS service – those which are built using dedicated architecture, known as dedicated services, and those which are built using a shared architecture, known as shared services. Dedicated services use a mix of cloud resources (compute, storage, network) allocated from a pool, and are assigned to a dedicated instance of that service for a particular customer. These can be deployed within a customer virtual network, for example, a virtual machine. Shared services use a set of cloud resources which are assigned to more than one instance of a service, utilised by more than one customer, and therefore cannot be deployed within a single customer network e.g. storage. Depending on the type of service, a different VNet integration pattern is applied to make it accessible only from clients deployed within Azure VNets and not accessible from the internet.
Azure Storage / ADLS Gen2 is a shared service built using a shared architecture, so there are two options for accessing it securely from Azure Databricks: service endpoints and Azure Private Link. This Databricks blog summarises both approaches.
Customers may use either approach to secure access between ADB and ADLS Gen2, but both require the ADB workspace to be VNet injected.
The documentation explains how to configure service endpoints, and how to limit access to the storage account by configuring the storage firewall. The storage account can be further secured against data exfiltration using a service endpoint policy.
The setup for storage service endpoints is less complicated than for Private Link; however, Private Link is widely regarded as the most secure approach and is the recommended mechanism for securely connecting to ADLS Gen2 from Azure Databricks. It exposes the shared PaaS service (storage) via a private IP, which overcomes the limitations of service endpoints and protects against data exfiltration by default. Setting up Private Link requires a number of configurations at the network and DNS level, and most of the complexity lies in DNS resolution to the service. The following article goes into greater detail on DNS considerations and integration scenarios. The approach discussed below is to use Azure Private DNS Zones to host the “privatelink” zone.
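To make the DNS side more concrete: with Private Link, the public storage name `<account>.dfs.core.windows.net` is resolved via a CNAME to `<account>.privatelink.dfs.core.windows.net`, and the private DNS zone `privatelink.dfs.core.windows.net` answers with the private endpoint's IP. The sketch below only derives the names involved. The helper name and the account name `mystorageacct` are hypothetical and used purely for illustration.

```python
def privatelink_names(account: str, service: str = "dfs") -> dict:
    """Return the DNS names involved when a storage account is resolved via Private Link."""
    public_suffix = f"{service}.core.windows.net"
    zone = f"privatelink.{public_suffix}"
    return {
        # Name that clients (e.g. ADB notebooks) keep using
        "public_fqdn": f"{account}.{public_suffix}",
        # Private DNS zone that must be linked to the client VNets
        "private_zone": zone,
        # CNAME target whose A record holds the private endpoint IP
        "privatelink_fqdn": f"{account}.{zone}",
    }


names = privatelink_names("mystorageacct")  # hypothetical account name
print(names["public_fqdn"])
print(names["privatelink_fqdn"])
```

Clients continue to use the public FQDN; only the resolution path changes, which is why the private DNS zone has to be reachable from every VNet that needs access.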
Connecting securely to ADLS from ADB
The following steps will enable Azure Databricks to connect privately and securely to Azure Storage via a private endpoint using a hub and spoke configuration, i.e. ADB and the private endpoints are in their respective spoke VNets:
- Deploy Azure Databricks into a VNet using the Portal or ARM template.
- Create a private storage account with a private endpoint and deploy it into a different VNet (i.e. create a new VNet named spokevnet-storage-pl beforehand).
- Ensure the private endpoint is integrated with a private DNS zone to host the privatelink DNS zone of the respective service, in this case dfs.core.windows.net. When creating the Private Endpoint, there is an option to integrate it with Private DNS as shown below:
- When ADB and Storage private endpoints are deployed in their respective VNets, there are some additional steps that need to be performed:
- a. The VNets should be linked with the private DNS zone, as shown below (databricks-vnetpl and spokevnet-storage-pl):
- b. Also make sure both ADB and storage endpoint VNETs are peered:
- The network configuration should now be as follows:
- c. Make sure the storage firewall is enabled. As an optional step you can also allow the ADB VNet (databricks-vnet) to communicate with this storage account. When you enable this, storage service endpoints will also be enabled on the ADB VNet (databricks-vnet).
- In an ADB notebook you can double-check that the FQDN of the storage account now resolves to a private IP:
- A mount can be created as normal using the same FQDN and it will connect privately to ADLS using private endpoints.
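The last two steps can be sketched in a notebook cell as follows. This is a minimal sketch, not the only way to do it: it assumes the mount authenticates with a service principal over OAuth (one of several possible access patterns), and the account, container, mount point and credential names are hypothetical placeholders. The `dbutils.fs.mount` call is shown as a comment because `dbutils` only exists inside a Databricks notebook.

```python
import ipaddress
import socket


def resolves_to_private_ip(fqdn, resolver=socket.gethostbyname):
    """Resolve fqdn and report whether the returned address is in a private range."""
    address = ipaddress.ip_address(resolver(fqdn))
    return address.is_private, str(address)


def oauth_mount_configs(client_id, client_secret, tenant_id):
    """Build ABFS configs for a service principal (OAuth) mount; assumed auth pattern."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


# Run on a cluster in the VNet-injected workspace (all names are placeholders):
# is_private, ip = resolves_to_private_ip("mystorageacct.dfs.core.windows.net")
# print(ip, "private" if is_private else "NOT private")
#
# dbutils.fs.mount(
#     source="abfss://mycontainer@mystorageacct.dfs.core.windows.net/",
#     mount_point="/mnt/lake",
#     extra_configs=oauth_mount_configs(client_id, client_secret, tenant_id),
# )
```

If the check reports a public address, the usual suspects are a missing virtual network link on the private DNS zone or missing VNet peering, as covered in steps a and b.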
Note: You can deploy the private endpoint for storage within the same VNet where ADB is injected but it should be a different subnet i.e. it must not be deployed in the ADB private or public subnets.
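When using the single-VNet layout described in the note, a quick sanity check is to confirm that the private endpoint's IP does not fall inside either ADB subnet. The sketch below uses only the Python standard library. The function name and the CIDR ranges are hypothetical examples, so substitute the address ranges of your own private and public ADB subnets.

```python
import ipaddress


def endpoint_outside_adb_subnets(endpoint_ip, adb_subnets):
    """Return True if endpoint_ip is in none of the given ADB subnet CIDR ranges."""
    ip = ipaddress.ip_address(endpoint_ip)
    return all(ip not in ipaddress.ip_network(cidr) for cidr in adb_subnets)


# Hypothetical address ranges for the ADB private and public subnets
adb_subnets = ["10.0.0.0/24", "10.0.1.0/24"]

print(endpoint_outside_adb_subnets("10.0.2.4", adb_subnets))  # endpoint in a separate subnet
print(endpoint_outside_adb_subnets("10.0.1.7", adb_subnets))  # endpoint inside an ADB subnet
```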
There are additional steps one can take to harden the Databricks control plane using an Azure Firewall if required.
Securing vital corporate data from a network and identity management perspective is of paramount importance. Azure Databricks is commonly used to process data in ADLS and we hope this article has provided you with the resources and an understanding of how to begin protecting your data assets when using these two data lake technologies.