This post was authored by Casey Karst, Program Manager, SQL Server.
Changing data landscape
A lot has changed in the world of data over the last 10 years. The rise of connected devices, unstructured event data and ever-decreasing hardware prices has caused a Big Data boom. Solutions built on commodity hardware, such as Hadoop and HDFS (Hadoop Distributed File System), were developed to land machine-born, semi-structured data and deliver insights. This created new opportunities for generating value, but it has come at the cost of added complexity to enterprise data solutions. With the additional data also came the problem of having two or more disjoint datasets, some relational in SQL Server and some non-relational in HDFS. If a data analyst wanted to combine relational data with semi-structured data, they had to spend time and resources copying the data from one environment into the other, ultimately slowing the time to insight.
With PolyBase in SQL Server 2016, the days of disjoint relational and semi-structured data are over. With the combination of PolyBase and T-SQL, users can query data stored in HDFS as if it is local to the SQL Server, enabling a wide variety of new insights and scenarios.
Because PolyBase allows you to interact with both SQL Server and Hadoop, three new scenarios are possible:
- Ad hoc query of data stored in Hadoop and SQL Server 2016 using T-SQL
- Import data from Hadoop or Azure blob storage into SQL Server 2016
- Export “cold” relational data to Hadoop or Azure blob storage while keeping it query-able
When you run ad hoc queries over Hadoop data, PolyBase can intelligently leverage the compute resources of Hadoop. PolyBase can make a cost-based decision to push-down computation to Hadoop by generating MapReduce jobs on-the-fly, delivering the most optimal performance for your queries.
PolyBase: Bringing it all together
PolyBase extends the data footprint available for T-SQL queries by creating a structured definition in SQL Server of the semi-structured data stored in Hadoop and Azure blob storage. Once the structure is set up, users can interact with the data in nearly* the same way as data stored in SQL Server.
The three new data definition language (DDL) statements needed to structure external data are:
- Create External Data Source: Hadoop or Azure blob storage? Where is the data source located?
- Create External File Format: What is the layout of the data?
- Create External Table: What is the schema?
*A notable difference between external tables and local tables is that when an external table is deleted, the data remains intact on the external data source.
External data source
PolyBase supports the latest versions of Hortonworks and Cloudera distributions with SQL Server 2016. PolyBase will also support newer Hadoop versions within 90 days of their general availability. In addition to Hadoop distributions, PolyBase supports Azure storage blobs as external data sources.
For a full list of supported data sources, click here.
External file format
PolyBase supports Text, ORC, RC and Parquet file types in both compressed and uncompressed formats.
For more details on supported file types, see this how-to.
PolyBase External Tables reference the data stored in a Hadoop cluster or Azure blob storage. For more details on how to create external tables, reference this site.
PolyBase scale-out groups for breakthrough query performance
For workloads that require querying large datasets in HDFS, it is suggested that you increase the resources on the SQL side for optimal performance. Rather than increasing the size of a single node, SQL Server 2016 introduces PolyBase scale-out groups as a method of distributing a query across multiple SQL Server instances, enabling parallelized computation and data ingestion. A scale-out group is composed of a head node, which users submit queries to and that holds the relational data for the query, and compute nodes, which parallelize the ingestion and computation of the data. Compute nodes are free to process other workloads while also being part of a PolyBase scale-out group.
Interested in getting started with all of the capabilities PolyBase for SQL Server 2016 offers? Learn more with the following links and watch the video below to learn how PolyBase enables you to query HDFS using T-SQL directly from SQL Server.
For general knowledge on some of the great enhancements provided by SQL Server 2016, be sure to check out the other posts in the SQL Server 2016 blog series.
- Getting started with PolyBase
- PolyBase groups for scale-out computation
- PolyBase troubleshooting with dynamic management views