What’s new with SQL Server Big Data Clusters

SQL Server Big Data Clusters (BDC) is a new capability brought to market as part of the SQL Server 2019 release. BDC extends SQL Server’s analytical capabilities beyond in-database processing of transactional and analytical workloads by uniting the SQL engine with Apache Spark and Apache Hadoop to create a single, secure, and unified data platform. BDC runs exclusively on Linux containers orchestrated by Kubernetes, and can be deployed on multiple cloud providers or on-premises.

Today, we’re announcing the release of the latest cumulative update (CU9) for SQL Server Big Data Clusters, which includes important capabilities:

  • Support for configuring BDC after deployment.
  • Improved experience for encryption at rest.
  • Ability to install Python packages at Spark job submission time.
  • Upgraded software versions for most of our OSS components (Grafana, Kibana, FluentBit, etc.) to ensure Big Data Clusters images are up to date with the latest enhancements and fixes.
  • Miscellaneous improvements and bug fixes.

This announcement highlights some of the major improvements, provides additional context to better understand the design behind these capabilities, and points you to relevant resources to learn more and get started.

Configuring SQL Server Big Data Clusters to meet your business needs

SQL Server Big Data Clusters is a data platform for operational and analytical workloads, and the requirements of those workloads change constantly. As part of today’s CU9 release, we are announcing new configuration management functionality that helps customers keep their Big Data Cluster tuned to their current needs.

Configuration management is the ability to alter or tune various parts of the Big Data Cluster after deployment and to give users clear visibility into the cluster’s configuration. This allows administrators to adjust the Big Data Cluster’s settings to meet their workload’s needs. Whether an administrator wants to turn on SQL Agent, define the baseline resources for their organization’s Spark jobs, or see which settings are configurable at each scope, configuration management is the one-stop solution.

To enable this functionality, we are adding new commands to the azdata command line interface (CLI). Azdata, an interface to manage a BDC, now includes post-deployment configuration functionality to set, diff, and apply configuration settings. To start, customers can configure settings at the cluster, service, and resource scope; these changes stay pending until they are applied. After applying pending configuration changes, customers can monitor the update through azdata or Azure Data Studio. Once the update is completed, the Big Data Cluster is ready for the next workload.
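The stage-then-apply flow described above can be sketched as a small Python helper that only builds the command lines and does not run them. The subcommand names (`azdata bdc settings set` and `azdata bdc settings apply`), the `--settings` flag, and the example setting name are assumptions recalled from the azdata reference rather than taken from this announcement, so check them against `azdata bdc settings --help` on your installed version. The helper names `stage_setting` and `apply_pending` are hypothetical.

```python
import shlex
from typing import List


def stage_setting(name: str, value: str) -> List[str]:
    """Build the argv that stages one cluster-scope setting as a pending change.

    Assumption: `azdata bdc settings set --settings name=value` is the
    command shape; confirm with `azdata bdc settings --help`.
    """
    if not name or "=" in name:
        raise ValueError("setting name must be non-empty and contain no '='")
    return ["azdata", "bdc", "settings", "set", "--settings", f"{name}={value}"]


def apply_pending() -> List[str]:
    """Build the argv that applies all pending changes (assumed subcommand)."""
    return ["azdata", "bdc", "settings", "apply"]


if __name__ == "__main__":
    # The setting name below is an illustrative assumption, not from this post.
    plan = [
        stage_setting("bdc.telemetry.customerFeedback", "false"),
        apply_pending(),
    ]
    for argv in plan:
        # Print instead of executing so the plan can be reviewed first.
        print(shlex.join(argv))
```

Printing the plan rather than executing it keeps the sketch safe to run anywhere; to execute, each list could be passed to `subprocess.run(argv, check=True)` on a machine where azdata is installed and logged in to the cluster.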

Learn more and get started with configuration management.

Spark job library management

Data engineers and data scientists often want to experiment with and use a variety of different libraries and packages as part of their workflows. Each language has its own way to do this, including importing from Maven, installing from the Python Package Index (PyPI) or conda, or installing from the Microsoft R Application Network (MRAN). Before today, customers could import jars from Maven or reference custom packages stored in Hadoop Distributed File System (HDFS) through Spark job configurations.

Starting in CU9, data engineers and data scientists have added flexibility for their PySpark jobs through job-level virtual environments. They can configure a conda virtual environment for a job and get to work with their favorite Python libraries.
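As a rough illustration of what a job-level environment request might look like, the Python sketch below assembles a Spark configuration dictionary of the kind passed to a session or job. The `spark.pyspark.virtualenv.*` key names are assumptions recalled from the Big Data Clusters Spark library management documentation, not stated in this announcement. The HDFS path and the helper name `virtualenv_conf` are hypothetical.

```python
import json
from typing import Dict, Optional


def virtualenv_conf(python_version: str,
                    requirements_path: Optional[str] = None) -> Dict[str, str]:
    """Return Spark conf entries that request a job-level conda environment.

    Assumption: these key names match the BDC documentation; verify them
    before use.
    """
    conf = {
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.python_version": python_version,
    }
    if requirements_path is not None:
        # Optional: point at a requirements file stored in HDFS.
        conf["spark.pyspark.virtualenv.requirements"] = requirements_path
    return conf


if __name__ == "__main__":
    # Hypothetical path; replace with a file that exists in your cluster.
    conf = virtualenv_conf("3.8", "hdfs:///user/project-a/requirements.txt")
    # A body of this shape is what a notebook configure cell or a job
    # submission would carry under its "conf" entry.
    print(json.dumps({"conf": conf}, indent=2))
```

Keeping the configuration in a plain dictionary makes it easy to reuse the same environment definition across notebook sessions and batch job submissions.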

Learn how to configure a job-level Spark environment.

Improving the experience on encryption at rest

In SQL Server Big Data Clusters CU8, we introduced a comprehensive encryption at rest feature set that focused on system-managed keys. This brought application-level encryption to all data stored in the platform, on both SQL Server and HDFS. At that time, the HDFS experience for administrators was centered on using Azure Data Studio Notebooks to control all aspects of the feature. Starting with CU9, in addition to expanding the Notebook experience, we are enabling management of HDFS encryption zones and HDFS keys through azdata. This lets HDFS administrators automate encryption at rest administrative tasks, a much-desired capability that is consistent with the rest of the SQL Server Big Data Clusters platform.

To learn more about the new notebooks and the new azdata commands, visit the release notes.

Ready to learn more?

Check out the SQL Server CU9 release notes for Big Data Clusters to learn more about all of the improvements available with the latest update. For a technical deep-dive on Big Data Clusters, read the documentation and visit our GitHub repository.

Follow the instructions on our documentation page to get started and deploy Big Data Clusters.