Secure Access to Storage: Azure Databricks and Azure Data Lake Storage Gen2 Patterns


There are a number of considerations when configuring access to Azure Data Lake Storage Gen2 (ADLS) from Azure Databricks (ADB). How will Databricks users connect to the lake securely, and how does one configure access control based on identity? In a previous article we covered six access control patterns, the advantages and disadvantages of each, and the scenarios in which each would be most appropriate. This article completes the security discussion with an overview of network security between these two services, and shows how to connect securely to ADLS from ADB using Azure Private Link.

 

Secure access to Storage/ADLS Gen2

In Azure there are two types of PaaS service: those built on a dedicated architecture, known as dedicated services, and those built on a shared architecture, known as shared services. Dedicated services use a mix of cloud resources (compute, storage, network) that are allocated from a pool and assigned to a dedicated instance of that service for a particular customer. These can be deployed within a customer virtual network; a virtual machine is one example. Shared services use a set of cloud resources that are assigned to more than one instance of a service and used by more than one customer, and therefore cannot be deployed within a single customer network; storage is one example. Depending on the type of service, a different VNet integration pattern is applied to make it accessible only from clients deployed within Azure VNets and not from the internet.

Azure Storage / ADLS Gen2 is a shared service, so there are two options for accessing it securely from Azure Databricks. This Databricks blog summarises the following approaches:

  1. Service Endpoints
  2. Azure Private Link

Customers may use either approach to secure access between ADB and ADLS Gen2, but both require the ADB workspace to be VNet injected.

Service Endpoints

The documentation explains how to configure service endpoints, and how to limit access to the storage account by configuring the storage firewall. You can further protect the storage account against data exfiltration by using a service endpoint policy.
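As a rough illustration of the storage firewall side of this, the sketch below builds the networkAcls fragment that a storage account ARM template could carry: deny by default, then allow a single subnet. The subscription ID, resource group and subnet name are hypothetical placeholders, and keeping the AzureServices bypass is an assumption rather than a recommendation from this article. The Microsoft.Storage service endpoint still has to be enabled on the subnet itself.

```python
import json


def storage_network_acls(allowed_subnet_ids):
    """Build a networkAcls fragment that denies all traffic except the given subnets."""
    return {
        "defaultAction": "Deny",      # block anything not explicitly allowed
        "bypass": "AzureServices",    # assumption: keep the trusted-services bypass
        "ipRules": [],                # no public IP ranges allowed in this sketch
        "virtualNetworkRules": [
            {"id": subnet_id, "action": "Allow"} for subnet_id in allowed_subnet_ids
        ],
    }


# Hypothetical resource ID of one of the ADB subnets.
subnet_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/example-rg/providers/Microsoft.Network"
    "/virtualNetworks/databricks-vnet/subnets/private-subnet"
)

acls = storage_network_acls([subnet_id])
print(json.dumps(acls, indent=2))
```

The resulting dictionary would sit under properties.networkAcls of the storage account resource.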

Private Link

The setup for storage service endpoints is less complicated than for Private Link. However, Private Link is widely regarded as the most secure approach, and is the recommended mechanism for securely connecting to ADLS Gen2 from Azure Databricks. It exposes the shared PaaS service (storage) via a private IP, which overcomes the limitations of service endpoints and protects against data exfiltration by default. Private Link requires a number of configurations at the network and DNS level, and most of the complexity lies in DNS resolution of the service name. The following article goes into greater detail on DNS considerations and integration scenarios. The approach discussed below uses Azure Private DNS Zones to host the “privatelink” zone.
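To make the DNS part more concrete: once a private endpoint exists, the public storage name is redirected to a matching name in the “privatelink” zone, and the private DNS zone answers that name with the endpoint's private IP. The small sketch below only derives those two names from a storage FQDN; the account name examplelake is a hypothetical placeholder.

```python
def privatelink_names(storage_fqdn):
    """Return (private DNS zone, privatelink record) for <account>.<service suffix>."""
    account, _, service_suffix = storage_fqdn.partition(".")
    if not account or not service_suffix:
        raise ValueError(f"expected <account>.<suffix>, got {storage_fqdn!r}")
    zone = f"privatelink.{service_suffix}"
    return zone, f"{account}.{zone}"


# Hypothetical ADLS Gen2 account name.
zone, record = privatelink_names("examplelake.dfs.core.windows.net")
print(zone)    # name of the private DNS zone to create and link
print(record)  # name that must resolve to the private endpoint IP
```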

Connecting securely to ADLS from ADB

The following steps enable Azure Databricks to connect privately and securely to Azure Storage via a private endpoint in a hub and spoke configuration, i.e. ADB and the private endpoint are in their own spoke VNets:

  1. Deploy Azure Databricks into a VNet using the Portal or ARM template.
  2. Create a private storage account with a private endpoint and deploy it into a different VNet (for example, create a new VNet named spokevnet-storage-pl beforehand).
  3. Ensure the private endpoint is integrated with a private DNS zone that hosts the privatelink zone of the respective service, in this case privatelink.dfs.core.windows.net. When creating the private endpoint, there is an option to integrate it with a private DNS zone.
  4. When ADB and Storage private endpoints are deployed in their respective VNets, there are some additional steps that need to be performed:
    • a. Both VNets should be linked with the private DNS zone (databricks-vnetpl and spokevnet-storage-pl).
    • b. Make sure the ADB VNet and the storage endpoint VNet are peered.
    • At this point the two spoke VNets are peered and both are linked to the private DNS zone.
    • c. Make sure the storage firewall is enabled. As an optional step, you can also allow the ADB VNet (databricks-vnet) to communicate with this storage account. Enabling this also enables the storage service endpoint on the ADB VNet (databricks-vnet).
  5. In an ADB notebook, you can double-check that the FQDN of the storage account now resolves to a private IP.
  6. A mount can be created as normal using the same FQDN, and it will connect privately to ADLS through the private endpoint.
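Steps 5 and 6 might look roughly like the following in a Python notebook. This is a sketch under several assumptions: the account, container, mount point and credential values are hypothetical, the mount authenticates with a service principal (OAuth), which is only one of the access patterns available, and "private" is taken to mean an address that Python's ipaddress module classifies as private, which holds for the usual RFC 1918 VNet ranges.

```python
import ipaddress
import socket


def is_private_ip(address):
    """True if the address falls in a private (non-internet-routable) range."""
    return ipaddress.ip_address(address).is_private


def resolves_privately(fqdn):
    """Resolve fqdn with the cluster's DNS and report whether all IPv4 answers are private."""
    infos = socket.getaddrinfo(fqdn, 443, family=socket.AF_INET, proto=socket.IPPROTO_TCP)
    addresses = sorted({info[4][0] for info in infos})
    return addresses, all(is_private_ip(a) for a in addresses)


def build_mount_args(account, container, mount_point, client_id, client_secret, tenant_id):
    """Assemble keyword arguments for dbutils.fs.mount using service principal auth."""
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
    return {
        # Same public FQDN as before; DNS decides whether it goes via the private endpoint.
        "source": f"abfss://{container}@{account}.dfs.core.windows.net/",
        "mount_point": mount_point,
        "extra_configs": configs,
    }


# Hypothetical values; in a real notebook the secret would come from a secret scope.
mount_args = build_mount_args(
    "examplelake", "raw", "/mnt/raw", "app-id", "app-secret", "tenant-id"
)
print(mount_args["source"])

# Inside a Databricks notebook (dbutils is only defined there):
# addresses, ok = resolves_privately("examplelake.dfs.core.windows.net")
# print(addresses, "private" if ok else "NOT private")
# dbutils.fs.mount(**mount_args)
```

If the check reports a public address, the usual suspects are a missing virtual network link on the private DNS zone or a cluster using custom DNS servers that do not forward to Azure DNS.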

Note: You can deploy the private endpoint for storage within the same VNet where ADB is injected, but it must be in a different subnet, i.e. it must not be deployed in the ADB private or public subnets.
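When planning address space for this single-VNet variant, a quick check like the one below can confirm that the subnet chosen for the private endpoint sits inside the VNet but does not overlap either ADB subnet. All CIDR ranges here are hypothetical examples, not values from this article.

```python
import ipaddress


def endpoint_subnet_is_valid(endpoint_cidr, vnet_cidr, adb_subnet_cidrs):
    """True if the endpoint subnet is inside the VNet and clear of the ADB subnets."""
    endpoint = ipaddress.ip_network(endpoint_cidr)
    vnet = ipaddress.ip_network(vnet_cidr)
    inside_vnet = endpoint.subnet_of(vnet)
    clashes = any(
        endpoint.overlaps(ipaddress.ip_network(cidr)) for cidr in adb_subnet_cidrs
    )
    return inside_vnet and not clashes


# Hypothetical layout: one VNet with the ADB public and private subnets.
vnet = "10.179.0.0/16"
adb_subnets = ["10.179.0.0/18", "10.179.64.0/18"]

print(endpoint_subnet_is_valid("10.179.128.0/24", vnet, adb_subnets))  # separate subnet
print(endpoint_subnet_is_valid("10.179.64.0/24", vnet, adb_subnets))   # inside an ADB subnet
```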


There are additional steps one can take to harden the Databricks control plane using an Azure Firewall if required.

 

Conclusion

Securing vital corporate data from a network and identity management perspective is of paramount importance. Azure Databricks is commonly used to process data in ADLS and we hope this article has provided you with the resources and an understanding of how to begin protecting your data assets when using these two data lake technologies.