Build an intelligent analytics platform with SQL Server 2019 Big Data Clusters

In the most recent releases, SQL Server went beyond relational data and enabled support for graph data, R, and Python machine learning, while making SQL Server available on Linux and containers in addition to Windows. At the same time, organizations are challenged with the amount of data stored in different formats, in silos, and the expertise required to extract value out of the data. Through enhancements in data virtualization and platform management, Microsoft SQL Server 2019 Big Data Clusters provides an innovative and integrated solution to overcome these difficulties. It incorporates Apache Spark™ and HDFS in addition to SQL Server, on a platform built exclusively using containerized applications, designed to derive new intelligent insights out of data.

Modernize your data estate with a scalable data virtualization and analytics platform

Data integration strategies based on extract, transform, and load (ETL) result in data duplication and transformations that diminish data quality, and they bring higher maintenance costs and security risks. SQL Server 2019 takes a new approach to data integration called data virtualization, which lets you query disparate and diverse data sources without moving the data. Out-of-the-box connectors for data sources like Oracle, Teradata, or MongoDB help you keep the data in place and secure, with less maintenance and storage cost. You can now uncover previously unconsidered perspectives by easily combining all your data, which ultimately leads to better data-driven decisions.
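As a rough illustration of what data virtualization looks like in practice, the sketch below uses Python only to assemble the two T-SQL statements typically involved in exposing a remote table through PolyBase: CREATE EXTERNAL DATA SOURCE and CREATE EXTERNAL TABLE. The host, credential, table, and column names are all hypothetical, and the statements assume that a database scoped credential already exists. Treat it as a sketch under those assumptions and check the exact options against the SQL Server documentation.

```python
def external_data_source_sql(name, location, credential):
    """Render a CREATE EXTERNAL DATA SOURCE statement (PolyBase)."""
    return (
        f"CREATE EXTERNAL DATA SOURCE {name}\n"
        f"WITH (LOCATION = '{location}', CREDENTIAL = {credential});"
    )


def external_table_sql(table, columns, remote_location, data_source):
    """Render a CREATE EXTERNAL TABLE statement over a remote table."""
    column_list = ",\n    ".join(f"{col} {sql_type}" for col, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n    {column_list}\n)\n"
        f"WITH (LOCATION = '{remote_location}', DATA_SOURCE = {data_source});"
    )


# Hypothetical names throughout: an Oracle host, a credential, and a table.
source_sql = external_data_source_sql(
    "OracleSales", "oracle://oracle-host.example.com:1521", "OracleCredential"
)
table_sql = external_table_sql(
    "dbo.RemoteOrders",
    [("order_id", "INT"), ("amount", "DECIMAL(10, 2)")],
    "XE.SALES.ORDERS",
    "OracleSales",
)

print(source_sql)
print(table_sql)
```

Once an external table like this exists, it can be queried and joined like a local table while the rows stay in the source system.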

Systems Imagination is using these capabilities in SQL Server Big Data Clusters, eliminating the need to shift or replicate data to gain insights.

“With SQL Server 2019 Big Data Clusters, we can analyze cancer research data coming from dozens of different data sources, mine interesting graph features, and carry out analysis at scale” – Pieter Derdeyn, Knowledge Engineer, Systems Imagination.

In addition, SQL Server 2019 Big Data Clusters provides a comprehensive machine learning and AI platform with all the tools and services required to ingest, store, prepare, and analyze data. With previous versions of SQL Server, you could already execute Python and R scripts within a database to clean and prepare data and to train, evaluate, or deploy machine learning models. Within Big Data Clusters, you can use the data analysis tools and frameworks of your choice on the same platform where the data resides.

In Azure Data Studio, you can submit Apache Spark™ jobs and use the built-in compute context in your preferred language, including R, Python, or Scala. Your AI and machine learning lifecycle can benefit from SQL Server’s mission-critical features like performance, security, availability, and scalability. You can also operationalize these models and deploy them as containerized applications running within the platform, side by side with the data. Models are exposed as a REST API for easy integration with your business applications. This comprehensive set of analytics tools is what Dr. Foster, one of the early adopter customers of Big Data Clusters, leveraged for its analytics platform:

“Our analysts need access to cutting edge data science technologies and techniques that adhere to strict industry-regulated guidelines. With SQL Server 2019 Big data clusters, we are able to analyze our relational data in the unified data platform, leveraging Apache Spark™, HDFS, and enhanced machine learning capabilities, all while remaining compliant.” – George Bayliffe, Head of Data, Dr. Foster

Data science process in Big Data Clusters: ingest using Spark streaming and SSIS; store data in data pools, SQL Server master instance, and HDFS; prep and train models using SQL Server or Spark ML; use machine learning models using SQL Server master and application pools.

Fig 1. Intelligence over all your data with SQL Server 2019 Big Data Clusters.

Built on Kubernetes, Big Data Clusters provide a built-in management system on any infrastructure

Managing all the services that enable you to run relational and big data workloads in a secure, efficient, and scalable way is challenging. With Big Data Clusters, you can operationalize management and data engineering tasks in an integrated and consistent way with a modern, container-based architecture built on top of Kubernetes. At the center of this platform is the SQL Server master instance, which stores relational data and serves as an entry point to other data sources within or outside the cluster. With additional SQL Server instances in the data pool, you can build a scale-out data mart that ingests and automatically distributes data, which improves query performance. Multiple parallel-processing SQL Server instances in the compute pool and elastically scalable shared storage with SQL Server and HDFS are also included by default in a big data cluster. To further expand your data lake, you can unify your HDFS stores using HDFS tiering, Microsoft’s latest contribution to the Apache HDFS open source project, now available with SQL Server 2019 Big Data Clusters. Along with HDFS, we include Apache Spark™, ideal for data ingestion, preparation, training, and analysis of high data volumes in a scalable and performant way.
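To make the idea of a data pool spreading ingested rows across several SQL Server instances more concrete, here is a small simulation. It assumes a round-robin policy and is purely illustrative: the function name and the sample rows are hypothetical, and this is not how the engine itself is implemented.

```python
from collections import defaultdict


def distribute_round_robin(rows, instance_count):
    """Assign each row to a data pool instance in turn (round-robin)."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    shards = defaultdict(list)
    for index, row in enumerate(rows):
        # Row 0 goes to instance 0, row 1 to instance 1, and so on, wrapping around.
        shards[index % instance_count].append(row)
    return dict(shards)


# Ten hypothetical rows spread over a two-instance data pool.
shards = distribute_round_robin(list(range(10)), instance_count=2)
print({instance: len(rows) for instance, rows in shards.items()})
```

Because each instance ends up holding roughly the same number of rows, a scan can be split evenly across the instances and run in parallel.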

Fig 2. The combination of SQL Server database engine, Apache Spark™, and HDFS in SQL Server 2019 enables diverse big data scenarios.

The choice of infrastructure is fundamental when it comes to deploying and managing all these components at scale. Kubernetes enables application portability, elastic scalability, and consistency across platforms, allowing SQL Server 2019 Big Data Clusters to ensure a predictable, self-contained, and fast deployment workflow. Balzano recognizes the value of a self-managed, autonomous and flexible platform that allows you to focus on getting valuable insights out of data.

“SQL Server 2019 Big Data Clusters allowed us to accommodate and integrate all aspects from one shared platform for our data scientists and for our software engineers who wire up workflows, security, and scalability. At runtime, our healthcare customers benefit from simple containerized deployment and maintenance while being able to move our solution between on-premises and the cloud easily.” – René Balzano, Founder and CEO, Balzano.

SQL Server has a years-long commitment to supporting mission-critical applications. In Big Data Clusters, management services embedded within the platform provide fast scaling and automated upgrade operations, automatic log and metric collection, enterprise-grade secure access, and high availability. Active Directory authentication is available through innovative implementations for applications running in containers, providing an integrated security model that spans all services, including SQL Server, Apache Spark™, and HDFS. Maintenance tasks like secure container deployment and the storage and rotation of certificates and secrets are provided by the platform through tight integration with Active Directory and Kubernetes. Applications running on top of a Kubernetes orchestrator benefit from the platform’s built-in health monitoring, failure detection, and failover mechanisms. In addition, for critical components like the SQL Server master instance, you can enable flagship features like Always On availability groups for additional reliability and read scale-out capabilities.

Cost effective big data and AI platform

You can start with the Developer Edition at no cost and try the complete set of capabilities of a full-featured deployment. The SQL Server 2019 licensing model was updated to incorporate a new subscription model for Big Data Clusters, and you have the option to use your existing SQL Server software licenses for Big Data Clusters deployments. A new Software Assurance benefit gives you eight Big Data Cluster node core licenses for each of the Enterprise Edition SQL Server master instance cores for free.
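The Software Assurance ratio described above is simple arithmetic, which the following sketch spells out. The function name and the 16-core example are hypothetical, and the calculation covers only the eight-to-one ratio stated here; actual licensing terms should be confirmed with Microsoft.

```python
# Ratio stated above: eight node core licenses per Enterprise Edition master core.
NODE_CORES_PER_MASTER_CORE = 8


def included_node_core_licenses(master_instance_cores):
    """Return the node core licenses included for the given master instance cores."""
    if master_instance_cores < 0:
        raise ValueError("core count cannot be negative")
    return master_instance_cores * NODE_CORES_PER_MASTER_CORE


# Hypothetical deployment: a 16-core Enterprise Edition master instance.
print(included_node_core_licenses(16))  # 128
```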

Get started

With a unified set of data integration, management, and data analysis tools, Big Data Clusters makes it not just easy, but also affordable for you to build on this platform. SQL Server 2019 Big Data Clusters provides an analytics-at-scale platform that you can count on for enterprise-grade performance, high availability, security, and manageability. We are excited to see how you use the broad range of scenarios that help bridge the gap between relational data and big data deployments. You can store and analyze data from multiple sources at scale, in various data formats, with scale-out compute for data processing and machine learning, together with the industry-leading experience of SQL Server.

Start building your new analytics platform today. Here are a few pointers to help you get started: