Just-in-time Azure Databricks access tokens and instance pools for Azure Data Factory pipelines using workspace automation



Introduction

Using AAD tokens, it is now possible to generate an Azure Databricks personal access token programmatically and to provision an instance pool using the Instance Pools API. The token can be generated and utilised at run time to provide “just-in-time” access to the Databricks workspace. Using the same AAD token, an instance pool can also be provisioned and used to run a series of Databricks activities in the same ADF pipeline.

For those orchestrating Databricks activities via Azure Data Factory, this can offer a number of potential advantages:

  • Increases agility, reduces the potential for human error and decreases dependency on platform teams.
  • Reduces spin-up time in scenarios where a series of Databricks activities is run in a pipeline or a set of chained pipelines.
  • Enables ADF activity-based workflows as an alternative to notebook workflows.
  • Allows guard rails, business logic and validation to be established during the provisioning process.
  • Improves governance of tokens and instance pools, and implements best practice by reducing the exposure time of privileged tokens.

 

The Just-in-time Solution

The following diagram depicts the architecture and flow of events:

An illustration of an example Azure Data Factory setup

  1. A pipeline invokes an Azure Function.
  2. The Function App uses the client credential flow to obtain an AAD access token with the Azure Databricks login application as the resource.
  3. Using that access token, the Function App generates a Databricks personal access token (PAT) via the Token API and creates an instance pool via the Instance Pools API (a sketch of this flow follows the list).
  4. The Function App stores the Databricks access token and pool ID in Azure Key Vault.
  5. The Databricks activities run using the access token and instance pool created above, retrieving both from Key Vault at run time.
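
The core of steps 2–4 can be sketched in a few lines of Python. Note that this is not the Function App code from the repo (which is JavaScript); it is an illustrative outline in which the environment variable names, workspace URL, Key Vault URL, secret names and pool settings are all assumptions to be replaced with your own values. The only fixed value is the well-known resource ID of the Azure Databricks login application.

# Minimal sketch of steps 2-4: client credential flow -> AAD token ->
# Databricks PAT and instance pool -> secrets stored in Key Vault.
import os
import requests
from azure.identity import ClientSecretCredential
from azure.keyvault.secrets import SecretClient

# Hypothetical settings - substitute your own values.
TENANT_ID = os.environ["TENANT_ID"]
CLIENT_ID = os.environ["CLIENT_ID"]            # service principal application ID
CLIENT_SECRET = os.environ["CLIENT_SECRET"]
WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
KEY_VAULT_URL = "https://my-key-vault.vault.azure.net"                # placeholder

# Well-known resource ID of the Azure Databricks login application.
DATABRICKS_RESOURCE = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"

credential = ClientSecretCredential(TENANT_ID, CLIENT_ID, CLIENT_SECRET)

# 2. Client credential flow: AAD access token with Azure Databricks as the resource.
aad_token = credential.get_token(f"{DATABRICKS_RESOURCE}/.default").token
headers = {"Authorization": f"Bearer {aad_token}"}

# 3a. Generate a short-lived Databricks PAT via the Token API.
pat = requests.post(
    f"{WORKSPACE_URL}/api/2.0/token/create",
    headers=headers,
    json={"lifetime_seconds": 3600, "comment": "just-in-time ADF token"},
).json()["token_value"]

# 3b. Create an instance pool via the Instance Pools API.
pool_id = requests.post(
    f"{WORKSPACE_URL}/api/2.0/instance-pools/create",
    headers=headers,
    json={
        "instance_pool_name": "adf-jit-pool",
        "node_type_id": "Standard_DS3_v2",
        "min_idle_instances": 0,  # no idle VMs (and no VM cost) until the pipeline runs
        "idle_instance_autotermination_minutes": 15,
    },
).json()["instance_pool_id"]

# 4. Store both values in Key Vault for the ADF linked service to retrieve at run time.
secrets = SecretClient(vault_url=KEY_VAULT_URL, credential=credential)
secrets.set_secret("patsecretname", pat)
secrets.set_secret("poolsecretname", pool_id)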

Extending this approach a little further can provide excellent separation of concerns between the platform team responsible for provisioning the infrastructure (i.e. the Databricks runtime environment) and the data team depending on that environment to run their data pipelines. Using the technique described in this blog, the platform team could manage an “initialisation” pipeline which takes care of provisioning the environment, as well as any validation and repeatable business logic. This “environment initialisation” pipeline may run in the same or another Data Factory, and either invokes or is invoked by the engineering pipeline managed by the data team running the Databricks workloads.

An illustration of an example Azure Data Factory setup

 

Demonstration

The following demo provides a step-by-step tutorial: setting up a Function App to generate the token and provision the pool, and an ADF pipeline which is granted just-in-time access to the workspace at run time, leveraging instance pools to run a series of Databricks activities.

Note: Any code provided should not be regarded as production ready but is simply functional for demonstration purposes.

 

Prerequisites

If you wish to complete this demonstration you will need to provision the following services:

  • Azure Data Factory
  • Azure Key Vault
  • Azure Databricks
  • Azure Function App (see additional steps)

Additional steps:

  • Review the readme in the GitHub repo, which includes steps to create the service principal and to provision and deploy the Function App. Whilst the code referenced in the repo is written in JavaScript, an example Python script can be found here.
  • As a once-off activity, the service principal will need to be added to the admin group of the workspace using the admin login, as shown in this sample code (a rough sketch also follows this list). The service principal must also be granted the Contributor role on the workspace.
  • The Databricks workspace can be premium or standard tier. To simulate some workload, create at least one Python notebook in the workspace which runs a simple command such as:
print("Workload goes here")

 

Key Vault Configuration

As per the steps in the repo, ensure that the service principal has been granted access to the secrets in Key Vault via an access policy.

In a production scenario you would need at least two Key Vaults: one for the platform team to store the secrets used by the Function App, and another for the data team to store Databricks tokens and instance pool IDs. For the purposes of this demo, only one Key Vault is required.

 

Function App Configuration

After running the setup scripts and publishing the code to the Function App, no further configuration is required. The setup script creates the app configuration settings which store the Key Vault name and Databricks workspace ID, and the app is also configured to use the service principal for authentication and authorisation. In the auth token input binding, found in the function.json file, the correct resources are specified, namely Key Vault and Databricks, and the identity is set to clientcredentials, which means the identity of the Function App, i.e. the service principal, is used.

Azure Active Directory settings

This means that when the AAD token is generated or when the resource API is invoked, the service principal’s credentials will be used.

Once the app has been published, use the portal development experience to “Code+Test”. Starting with the function which generates the Databricks access token, use the Test functionality, enter a query parameter named “patsecretname” along with a value, and click Run.

Editing the input on createDBPAT

One should receive a 200 OK response and find that a new secret has been stored in Key Vault with the specified name.

Looking at the output on createDBPAT

Next, test the function which creates the Databricks pool: enter a poolsecretname query parameter and ensure that a new pool has been created with the name of the query parameter specified.

The Clusters section on Azure Databricks

This pool will not incur any cost until it is used by the ADF pipeline, provided the min_idle_instances parameter in the request payload of the Instance Pools API was set to 0. With any other value, the pool will provision VMs at the minimum threshold specified, incurring standard VM costs.

There will be an associated Key Vault secret which stores the pool ID, to be used by ADF in the linked service configuration.

Showing the PAT test is enabled
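
As an alternative to the Code+Test blade, the deployed functions can be invoked directly over HTTP, which is effectively what the ADF pipeline will do later. The snippet below is purely illustrative: the function app URL, function names, secret names and function key are placeholders to be replaced with your own values.

# Hypothetical direct invocation of the deployed functions over HTTP.
import requests

FUNCTION_APP = "https://my-func-app.azurewebsites.net"  # placeholder
FUNCTION_KEY = "<function key>"                         # placeholder

# Generate the Databricks PAT and store it under the named Key Vault secret.
requests.get(
    f"{FUNCTION_APP}/api/createADBPAT",
    params={"patsecretname": "my-adb-pat", "code": FUNCTION_KEY},
)

# Create the instance pool and store its ID under the named Key Vault secret.
requests.get(
    f"{FUNCTION_APP}/api/createADBPool",
    params={"poolsecretname": "my-adb-pool", "code": FUNCTION_KEY},
)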

Note: Whilst a key is required to invoke the Function App, it is still accessible over the public internet. If security is a concern, consider access restrictions, such as whitelisting the IP of the ADF self-hosted integration runtime, and/or private site access, deploying the integration runtime into a managed VNet.

 

Data Factory Configuration

Using a combination of Key Vault, parameters and the dynamic content setting (in the advanced section of the linked service), it is possible to create a more dynamic linked service into which the configuration details can be “injected” at runtime.

1. To begin, grant the managed identity of ADF access to your Azure Key Vault.

2. Then configure a Key Vault linked service as described in this tutorial.

3. Next, create a new linked service for Azure Databricks, define a name, then scroll down to the advanced section and tick the box to specify dynamic contents in JSON format. Enter the following JSON, substituting the capitalised placeholders with your values for the Databricks workspace URL and the Key Vault linked service created above. Note the new per-workspace URL format, adb-<workspaceId>.<random>.azuredatabricks.net – it is unique to each workspace. This workspace URL could also have been stored in Key Vault, which is better practice.

{ 
    "properties": { 
        "type": "AzureDatabricks", 
        "parameters": { 
            "myadbpatsecretname": { 
                "type": "string" 
            }, 
            "myadbpoolsecretname": { 
                "type": "string" 
            }
        },
        "annotations": [], 
        "typeProperties": { 
            "domain": "WORKSPACE URL", 
            "accessToken": { 
                "type": "AzureKeyVaultSecret", 
                "store": { 
                    "referenceName": "KEY VAULT LINKED SERVICE NAME", 
                    "type": "LinkedServiceReference" 
                }, 
                "secretName": "@{linkedService().myadbpatsecretname}" 
            }, 
            "instancePoolId": { 
                "type": "AzureKeyVaultSecret", 
                "store": { 
                    "referenceName": "KEY VAULT LINKED SERVICE NAME", 
                    "type": "LinkedServiceReference" 
                }, 
                "secretName": "@{linkedService().myadbpoolsecretname}" 
            }, 
            "newClusterNodeType": "Standard_DS3_v2", 
            "newClusterNumOfWorker": "2", 
            "newClusterVersion": "6.4.x-scala2.11" 
        } 
    } 
}

Note that two parameters are created to represent the Key Vault secrets which contain the PAT and the pool ID. These will be the parameters passed into the pipeline at trigger time.

After the linked service is created it should look as follows:

The linked service after being created

Note: Personal access tokens created by a service principal via the API are not displayed in the workspace UI when you log in, because they have been generated by a different security principal. They are, however, visible via the Token List API using the AAD token generated with the service principal created above.
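
As a quick check, a call along the following lines lists the tokens owned by the service principal (the workspace URL and AAD token shown are placeholders):

# List the PATs owned by the service principal via the Token API.
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
SP_AAD_TOKEN = "<AAD token acquired with the service principal>"      # placeholder

response = requests.get(
    f"{WORKSPACE_URL}/api/2.0/token/list",
    headers={"Authorization": f"Bearer {SP_AAD_TOKEN}"},
)
for info in response.json().get("token_infos", []):
    print(info["token_id"], info["comment"], info["expiry_time"])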

4. Create another linked service to authenticate to the Azure Function app as shown in the documentation.

5. Next, create a pipeline and add two parameters representing the names of the Key Vault secrets which will contain the access token and pool ID.

The parameters section in the new pipeline

6. Drop two Function activities on to the canvas.

Creating two new Azure Functions

7. Specify the Function linked service, and use the function name to specify each function to be invoked, along with its associated query parameter. For generating the access token, use the following expression, substituting the function name if necessary:

@concat('createADBPAT?patsecretname=',pipeline().parameters.patsecretname)

8. In the next function activity specify a function name which will create the instance pool, for example:

@concat('createADBPool?poolsecretname=',pipeline().parameters.poolsecretname)

9. Connect these two activities and Publish the changes.

10. Next, create another pipeline and add two parameters which will be passed to the pipeline which executes the Function App.

Editing the parameters of a new pipeline

11. On the canvas add an Execute Pipeline activity. Specify the pipeline created above as the invoked pipeline in the Execute Pipeline activity. In the parameters section click on the value section and add the associated pipeline parameters to pass to the invoked pipeline.

Creating an Execute Pipeline activity

12. Add a Databricks notebook activity and specify the Databricks linked service which requires the Key Vault secrets to retrieve the access token and pool ID at run time.

13. Add these pipeline parameters to the linked service properties so that they are passed through to the linked service at trigger time.

Adding the pipeline parameters to the linked service

14. Under the Settings tab, enter the path of the notebook created in the prerequisites. The path will be similar to the following:

/Users/[USERNAME]/Workload1

15. Copy and paste the Databricks activity three times and connect all the activities.

Connecting the other activities in Databricks

16. Optionally, create another function and activity which will revoke the access token and delete the instance pool at the end of the pipeline, as sketched below.
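
A minimal sketch of such a clean-up function's logic follows, assuming the token ID and pool ID are available (for example, retrieved from Key Vault or via the Token List API); all values shown are placeholders.

# Revoke the just-in-time PAT and delete the instance pool once the
# pipeline has finished. Illustrative clean-up only.
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
AAD_TOKEN = "<AAD token acquired with the service principal>"         # placeholder
headers = {"Authorization": f"Bearer {AAD_TOKEN}"}

# Revoke the PAT via the Token API (token_id comes from token/create or token/list).
requests.post(
    f"{WORKSPACE_URL}/api/2.0/token/delete",
    headers=headers,
    json={"token_id": "<token_id>"},  # placeholder
)

# Delete the instance pool via the Instance Pools API.
requests.post(
    f"{WORKSPACE_URL}/api/2.0/instance-pools/delete",
    headers=headers,
    json={"instance_pool_id": "<instance_pool_id>"},  # placeholder
)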

17. Publish the changes and trigger this pipeline, monitoring the results.

 

Results

With job clusters

Using only job clusters, which spin up with each Databricks activity, the total time for the same workload is around 18 and a half minutes.

The estimated duration of the Databricks workload

Notice how each cluster takes between 4 and 5 minutes per activity.

A breakdown of the time per activity

 

With Instance Pools

Using instance pools the total time dramatically reduces to under 10 minutes.

The estimated duration with instance pools

Notice that, apart from the first Databricks activity, which took the usual 4–5 minutes, the remaining activities complete in around a minute and a half; most of that time is the time taken to initialise the Spark cluster.

The ADBWorkloads dialogue window

A breakdown of the duration each workload takes to run

Conclusion

Automation and security improvements to the Databricks workspace are being released constantly, such as AAD tokens, the Permissions API, the Token Management API and cluster policies. Instance pools are another example of these improvements and can make a dramatic difference to the completion times of your ADF-based Databricks workloads – particularly when running a series of chained Databricks activities. Managing access to the workspace and provisioning instance pools no longer require manual intervention when AAD tokens are used for workspace automation. Granting just-in-time access to these resources reduces the chance of manual error, promotes better governance and reduces the risk of improper access token practices.

 
