This post was co-authored by Dotan Elharrar, Principal Program Manager, Machine Learning, and Umachandar Jayachandran, Senior Program Manager, SQL Server.
R has emerged as the most popular software for mining data, uncovering new insights and making predictions. It’s a perfect starting point because it is free, open-source and backed by an ever-growing, vibrant community. You can leverage thousands of packages in a variety of domains. You gain instant access to standard statistical and data manipulation functions, graphics and cutting-edge machine learning algorithms. All these make R the tool of choice for academics and companies seeking to innovate through advanced analytics.
Despite its great promise, enterprises seeking to adopt R in production are hitting several considerable challenges:
- Scale and performance: The most common open-source R packages run single-threaded and can therefore only consume datasets that fit into available memory on the local machine. This memory constraint becomes a critical drawback as data sizes grow or when you need to speed up execution.
- Integrating with applications (operationalization): Enabling applications in production to call R code is anything but easy. Instead, many companies end up translating the R code into different languages in order to integrate it. This process is typically cumbersome, costly and error-prone, and doesn’t support agile innovation. Enterprises need a simple way for applications to retrieve predictions, scores and plots, schedule R jobs and integrate them with existing application logic.
- Data movement: Moving data out of the database is best avoided when possible. Having to transfer data for analysis on a separate machine running R can present security headaches and latency issues when data volumes grow. It also tends to create data fragmentation because different people in the organization find themselves working on different versions of the data.
While R is open-source and free, addressing the challenges above introduces time and cost considerations. Innovation and agility suffer if you’re limited in the volumes of data you can process in production, if processing data takes way too long, or if you need to set up workflows to move data around. Similarly, costs can rise significantly if you need a development team to recode the R scripts or spend extra money on special hardware.
SQL Server R Services is a new feature in SQL Server 2016 that brings together the best of both worlds. You still enjoy all the benefits of R, the ever-growing set of packages and the large user community, but now you also have the power of SQL Server and the familiar Transact-SQL interface.
- R Services supports in-database analytics, allowing you to keep the data in SQL Server 2016 and have applications call R scripts via a T-SQL system stored procedure. This makes application integration with R easier than ever. R scripts execute in a secure fashion on the SQL Server box, so you are protected from R code trying to reach over to the Internet or to other jobs running on the SQL Server.
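As a minimal sketch of the stored procedure interface, the call below runs an R script in-database via `sp_execute_external_script` and returns its result as a T-SQL result set. The table and column names (`dbo.Sales`, `Amount`) are hypothetical, and external script execution must first be enabled on the instance:

```sql
-- One-time setup (requires a service restart on SQL Server 2016):
-- EXEC sp_configure 'external scripts enabled', 1;
-- RECONFIGURE;

-- Hypothetical table/column names, for illustration only.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        -- InputDataSet and OutputDataSet are the default data frame names
        OutputDataSet <- data.frame(avg_amount = mean(InputDataSet$Amount))',
    @input_data_1 = N'SELECT Amount FROM dbo.Sales'
WITH RESULT SETS ((avg_amount FLOAT));
```

Because the call is plain T-SQL, any application that can execute a stored procedure can consume R output without embedding an R runtime of its own.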
- R Services customers can benefit from the included ScaleR algorithm library, a set of functions that provide equivalents for routine tasks performed in R. However, unlike the CRAN-R equivalents, ScaleR functions can scale to handle large datasets that span hundreds of millions or even billions of rows by performing parallel computations, thus providing superior performance over common packages. The ScaleR APIs were developed by Revolution Analytics and are now integrated into SQL Server following Microsoft’s acquisition. They are also a cross-platform API, supporting a variety of platforms beyond SQL Server.
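To illustrate the difference, here is a hedged sketch using the RevoScaleR package that ships with R Services: `rxLinMod` is the ScaleR counterpart of base R’s `lm()`, but it processes the data in chunks rather than loading it all into memory. The connection string, table and formula are hypothetical:

```r
# Sketch assuming the RevoScaleR package bundled with R Services;
# server, database, table and column names are illustrative.
library(RevoScaleR)

txData <- RxSqlServerData(
  connectionString = "Driver=SQL Server;Server=.;Database=SalesDB;Trusted_Connection=yes",
  table = "dbo.Transactions")

# rxLinMod streams the rows in chunks and can parallelize the computation,
# so it is not bounded by local memory the way lm() is.
model <- rxLinMod(Amount ~ Region + Month, data = txData)
summary(model)
```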
- SQL Server also provides existing functions and mechanisms to speed-up performance and integration. For instance, customers can use ColumnStore indexes in conjunction with R for faster queries. They can use the built-in resource governance to control the resources allocated to the R runtime. They enjoy a smooth integration with SQL Server Integration Services (SSIS) through the stored procedure interface, allowing integration with common ETL and job scheduling.
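The resource governance mentioned above is configured with ordinary T-SQL. As a sketch (the pool name and percentages are illustrative), an external resource pool caps what the R runtime may consume on the box:

```sql
-- Illustrative values: limit external (R) runtimes to at most
-- 50% of CPU and 30% of memory on this instance.
CREATE EXTERNAL RESOURCE POOL rpool
WITH (MAX_CPU_PERCENT = 50, MAX_MEMORY_PERCENT = 30);

ALTER RESOURCE GOVERNOR RECONFIGURE;
```

This keeps heavy R workloads from starving the database engine’s own queries.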
- One other important element in R Services is that everyone gets to work with their familiar tools:
- Data Scientists can keep working with their familiar R IDE. They can use R Studio or work with the recently announced R Tools for Visual Studio, or any other IDE of their choice. And rather than having to pull data to a local machine, they can use the ScaleR remote execution directly from their IDE to accelerate analysis of large datasets by pushing the computation to SQL Server 2016 for in-database analytics.
- Server Administrators can use the familiar SQL Server Management Studio to manage R Services, to control resources and to assign permissions.
- Application Developers can leverage the T-SQL API. For instance, they can create SQL Server Reporting Services reports or Power BI dashboards with scores, predictions and visuals from R, without needing to learn R themselves.
- Data Engineers can combine R with existing ETL flows and schedule jobs using SQL Server Integration Services.
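The remote execution mentioned for data scientists can be sketched as follows: from a local R IDE, a ScaleR compute context redirects subsequent `rx*` calls to run inside SQL Server 2016 instead of on the laptop. The connection details are hypothetical:

```r
# Sketch: push computation to SQL Server from a local R session
# via a ScaleR compute context (connection string is illustrative).
library(RevoScaleR)

cc <- RxInSqlServer(
  connectionString = "Driver=SQL Server;Server=myserver;Database=SalesDB;Trusted_Connection=yes")

rxSetComputeContext(cc)  # subsequent rx* functions now execute in-database

# ...run ScaleR analyses here; switch back to local execution when done:
rxSetComputeContext("local")
```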
All the above make SQL Server R Services an amazing offering for analyzing your data with R while being able to scale, integrate with applications in production and keep your costs down.
How to get started
Get started with the following set of resources and learn more about R Services in SQL Server 2016 in the videos linked below.
- Product documentation on MSDN
- This introductory video from Data Driven discusses how SQL Server R Services allows you to build smart, predictive applications and extract insights from your data. You can find additional R Services videos within the Data Driven video playlist and on Channel9.
- Tutorials on MSDN
- Solution Templates in the Cortana Analytics Gallery, with sample data and code hosted in GitHub.
- If you are completely new to R, check out the online courses available on sites like Datacamp, which also offers deep dive courses covering ScaleR APIs.
See the other posts in the SQL Server 2016 blogging series.