Expanding SQL Server Big Data Clusters capabilities, now on Red Hat OpenShift

SQL Server Big Data Clusters (BDC) is a new capability brought to market as part of the SQL Server 2019 release. BDC extends SQL Server’s analytical capabilities beyond in-database processing of transactional and analytical workloads by uniting the SQL engine with Apache Spark and Apache Hadoop to create a single, secure, and unified data platform. BDC runs exclusively on Linux containers orchestrated by Kubernetes, and can be deployed on multiple cloud providers or on-premises.

Today, we’re announcing the availability of the latest cumulative update (CU5) for SQL Server 2019, which includes important capabilities for SQL Server and BDC:

  • Support for deploying BDC on Red Hat OpenShift Kubernetes platform.
  • Support for running applications within BDC as non-root users.
  • Support for deploying multiple BDCs against the same Active Directory domain.
  • Enriched data virtualization experiences.
  • Enhanced and open sourced Spark SQL connector.
  • Miscellaneous improvements and bug fixes.

This announcement blog highlights some of the major improvements, provides additional context to better understand the design behind these capabilities, and points you to relevant resources to learn more and get you started.

Deploy Big Data Clusters on Red Hat OpenShift Kubernetes platform

Red Hat OpenShift provides an enterprise-grade, commercially supported distribution of Kubernetes as the foundation of its container platform across hybrid and multi-cloud environments. Through a close partnership with the Red Hat team, today we’re announcing support for SQL Server BDC deployments on OpenShift, for version 4.3 and up, on-premises or in public cloud environments with Azure Red Hat OpenShift (ARO). You can now use a fully supported stack to operationalize your next unified analytics platform using BDC, following the design and development best practices and enterprise-grade security guidelines that are core to OpenShift.
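The "version 4.3 and up" requirement is easy to get wrong with a plain string comparison, because "4.10" sorts before "4.3" as text. The following is a minimal sketch of a pre-flight check that compares versions numerically. The function names are hypothetical and are not part of any BDC or OpenShift tooling; how you obtain the cluster version string is left to your environment.

```python
import re
from typing import Tuple

# Minimum OpenShift version stated in this announcement.
MIN_OPENSHIFT_VERSION = (4, 3)


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '4.3.1' or 'v4.10.0'."""
    match = re.match(r"^v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version: str,
                  minimum: Tuple[int, int] = MIN_OPENSHIFT_VERSION) -> bool:
    """Return True when the version is at or above the minimum (hypothetical helper)."""
    # Tuples of ints compare element by element, so (4, 10) >= (4, 3) holds.
    return parse_major_minor(version) >= minimum
```

For example, meets_minimum("4.10.0") is true while meets_minimum("4.2.36") is false.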

We have enhanced the security design of BDC to take full advantage of the OpenShift Container Platform. Privileged containers are no longer required, and containers now run as a non-root user by default, with enhanced process isolation within each container. The white paper produced jointly by the SQL Server and Red Hat security teams describes the design in detail, explaining which security policies we require when deploying BDC on OpenShift and why.

The BDC deployment model and experiences were enhanced so that you can follow the prescribed guidance in an integrated manner, either with built-in deployment profiles that target OpenShift environments or with UX enhancements in Azure Data Studio that include OpenShift as a target platform. With containers and Kubernetes-powered Red Hat OpenShift, organizations can achieve the desired agility, scalability, flexibility, security, and portability for Big Data Clusters.

Bringing SQL Server and Big Data Clusters to the OpenShift Container Platform has been a real team effort. Red Hat provided our team with valuable help, bootstrapping our initial efforts, as well as providing best practice guidance during implementation. Security and trust are critical for both companies and so we appreciate the valuable input and contributions of Dan Walsh, Senior Distinguished Engineer at Red Hat, and Michael Nelson, Principal Software Engineering Manager at Microsoft, who collaborated on the security design for Big Data Clusters on OpenShift.

For more information on the BDC deployment process on OpenShift, follow the instructions on our documentation page.

Secure by default containers, running as non-root users

As a modern data platform, BDC ensures enterprise-grade secure data access by enabling Active Directory authentication through innovative implementations for applications running in containers. In addition, we are now making the platform implementation safer by ensuring that all container applications running within BDC are started as non-root users by default, on all supported platforms. These capabilities are available for all new deployments that use the image tag corresponding to SQL Server 2019 CU5. Existing pre-CU5 BDC deployments will not be impacted, and applications in these clusters will continue to run as the root user. Support for migrating these clusters to a non-root configuration will be added in a future cumulative update.

Deploy multiple BDCs against the same Active Directory domain

To complement the above platform enhancements regarding secure big data clusters, we are pleased to announce that we added support for deploying multiple BDCs against a single Active Directory domain. You can now leverage multiple BDC deployments in your secure enterprise environment, to accommodate multiple use cases like development/test, pre-production or production, CI/CD pipelines or HADR.

To learn more about Active Directory integration for BDC and deploying multiple BDCs against the same domain, see the security related topics on our documentation page.

Announcing new data virtualization enhancements

In addition to the improvements above, we have also improved our data virtualization capabilities. Namely, we’ve introduced two new stored procedures, sp_data_source_objects and sp_data_source_table_columns, to support introspection of certain External Data Sources. Customers can call them directly via T-SQL for schema discovery and to see which tables are available to be virtualized. We use these procedures in the External Table Wizard of the Data Virtualization Extension for Azure Data Studio, which allows you to create external tables from SQL Server, Oracle, MongoDB, and Teradata.
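As a rough illustration of calling the two stored procedures from client code, here is a minimal Python sketch that builds parameterized T-SQL statements. The parameter names (@data_source, @object_root_name, @table_location) are assumptions based on our reading of the product documentation and should be verified there; the helper function names and the data source and table names in the usage note are hypothetical.

```python
from typing import List, Optional, Tuple


def list_objects_query(data_source: str,
                       object_root_name: Optional[str] = None
                       ) -> Tuple[str, List[str]]:
    """Build a parameterized call to sp_data_source_objects.

    Parameter names are assumed; check the documentation before use.
    """
    sql = "EXEC sp_data_source_objects @data_source = ?"
    params = [data_source]
    if object_root_name is not None:
        # Optionally narrow discovery to objects under a given root name.
        sql += ", @object_root_name = ?"
        params.append(object_root_name)
    return sql, params


def table_columns_query(data_source: str,
                        table_location: str) -> Tuple[str, List[str]]:
    """Build a parameterized call to sp_data_source_table_columns."""
    sql = ("EXEC sp_data_source_table_columns "
           "@data_source = ?, @table_location = ?")
    return sql, [data_source, table_location]
```

With a DB-API driver that uses "?" placeholders, the result can be passed along as cursor.execute(sql, params). For instance, list_objects_query("OracleSales") targets a hypothetical external data source named OracleSales, and table_columns_query("OracleSales", "SALES.ORDERS") would describe one of its tables.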

For more information on the external table wizard, visit the documentation page.

Open sourcing the SQL Server and Azure SQL Connector for Apache Spark

BDC includes the SQL Server and Azure SQL Connector for Apache Spark. Based on the Apache Spark DataSource V1 APIs and SQL Server Bulk APIs, this connector enables you to read from and write to any SQL Server using Apache Spark. As part of Microsoft’s commitment to open-source technology, we will be releasing this connector under the Apache License 2.0 for anyone to use and contribute to. Stay tuned for more updates once the connector is live!
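To give a feel for how a Spark job might write through the connector, here is a minimal PySpark-style sketch. It assumes the connector is registered under the data source name com.microsoft.sqlserver.jdbc.spark and accepts JDBC-style options (url, dbtable, user, password); verify both against the connector documentation. The helper names and any server, database, and table values you pass in are hypothetical.

```python
from typing import Dict

# Assumed data source name for the connector; confirm in the connector docs.
CONNECTOR_FORMAT = "com.microsoft.sqlserver.jdbc.spark"


def connector_options(server: str, database: str, table: str,
                      user: str, password: str) -> Dict[str, str]:
    """Assemble JDBC-style options for the connector (hypothetical helper)."""
    url = f"jdbc:sqlserver://{server};databaseName={database};"
    return {"url": url, "dbtable": table, "user": user, "password": password}


def write_dataframe(df, options: Dict[str, str], mode: str = "append") -> None:
    """Write a Spark DataFrame to SQL Server through the connector.

    `df` is expected to be a pyspark.sql.DataFrame; only the standard
    DataFrameWriter methods format, mode, option, and save are used.
    """
    writer = df.write.format(CONNECTOR_FORMAT).mode(mode)
    for key, value in options.items():
        writer = writer.option(key, value)
    writer.save()
```

In a real job, avoid hard-coding the password and read credentials from a secret store instead. Reading works the same way through spark.read with the same format name and options.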

SQL Server BDC team hears your feedback

If you would like to help make BDC an even better analytics platform, please share any recommendations or report issues through our feedback page. The SQL Server engineering team thoroughly reviews the reported suggestions, which are valuable input when we plan and prioritize the next set of improvements. We are committed to ensuring that SQL Server enhancements are based on customer experiences, so we build robust solutions that meet real production requirements in terms of functionality, security, scalability, and performance.

Ready to learn more?

With the SQL Server 2019 CU5 updates, BDC continues to simplify the security, deployment, and management of your key data workloads. Industry-leading security and compliance features, together with support for market-leading Kubernetes-based platforms like Red Hat OpenShift, will help our mutual customers achieve the expected agility, scalability, flexibility, and portability to develop and operationalize intelligent applications.

Check out the SQL Server CU5 release notes for BDC to learn more about all the improvements available with the latest update. For a technical deep-dive on Big Data Clusters, read the documentation and visit our GitHub repository.

To get started with deploying BDC on OpenShift, follow the instructions on our documentation page. Make sure to read the Security Best Practices whitepaper to better understand the security requirements.