Visual Studio Code: Develop PySpark jobs for SQL Server 2019 Big Data Clusters

Today we’re announcing support in Visual Studio Code for PySpark development and query submission against SQL Server 2019 Big Data Clusters. It provides capabilities complementary to Azure Data Studio, enabling data engineers to author and productionize PySpark jobs after data scientists have finished their data exploration and experimentation. The Apache Spark and Hive extension for Visual Studio Code gives you a cross-platform, lightweight Python editing experience, covering Python authoring, debugging, Jupyter Notebook integration, and notebook-like interactive queries.

With the Visual Studio Code extension, you get native Python programming features such as linting, debugging support, and language services. You can run the current line, run selected lines of code, or run all of the code in your PY file. You can import and export .ipynb notebooks and perform notebook-like queries, including Run Cell, Run Above, and Run Below. You can also enjoy a notebook-like interactive experience that combines your source code and markdown comments with the run results and output. In the interactive results window, you can remove unneeded sections, enter comments, or type additional code. Moreover, you can visualize your results graphically through matplotlib, as in a Jupyter Notebook. The integration with SQL Server 2019 Big Data Clusters lets you quickly submit a PySpark batch job to the big data cluster and monitor its progress.
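For example, the following is a minimal sketch of a script you might run cell by cell in the interactive window and visualize with matplotlib; the HDFS path and the region and amount columns are hypothetical placeholders, not something the extension provides.

    from pyspark.sql import SparkSession
    import matplotlib.pyplot as plt

    spark = SparkSession.builder.appName("InteractiveDemo").getOrCreate()

    # Load the data and aggregate totals per region (run this cell first).
    df = spark.read.csv("/data/sales.csv", header=True, inferSchema=True)
    totals = df.groupBy("region").sum("amount").toPandas()

    # Plot the aggregated results inline (run this cell next).
    totals.plot(kind="bar", x="region", y="sum(amount)", legend=False)
    plt.ylabel("Total amount")
    plt.show()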

Highlights of key features

  • Link to SQL Server: Connect to SQL Server 2019 Big Data Clusters and submit PySpark jobs.
  • Python editing: Develop PySpark applications with native Python authoring support (e.g., IntelliSense, auto-format, and error checking).
  • Jupyter Notebook integration: Import and export .ipynb files.
  • PySpark interactive: Run selected lines of code, execute notebook-like cells, and create interactive visualizations.
  • PySpark batch: Submit PySpark applications to SQL Server 2019 Big Data Clusters (see the sketch after this list).
  • PySpark monitoring: Integrate with the Apache Spark history server to view job history, debug, and diagnose Spark jobs.
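
To give a sense of what a submitted batch job looks like, here is a minimal, self-contained sketch of a PySpark application; the input and output HDFS paths and the word-count logic are hypothetical, chosen only for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("WordCountBatch").getOrCreate()

        # Read raw text from HDFS and split it into one word per row.
        lines = spark.read.text("/data/input.txt")
        counts = (lines.selectExpr("explode(split(value, ' ')) AS word")
                       .groupBy("word")
                       .count()
                       .orderBy(col("count").desc()))

        # Write the word counts back to HDFS as CSV.
        counts.write.mode("overwrite").csv("/data/wordcounts")
        spark.stop()

Once saved as a PY file, an application of this shape can be submitted to the cluster through the extension and then tracked in the Apache Spark history server.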

How to install or update

First, install Visual Studio Code and download Mono 4.2.x if you are on Linux or macOS. Then get the latest Apache Spark and Hive tools from the Visual Studio Code extension repository or the Visual Studio Code Marketplace by searching for Spark.

For more information about the Apache Spark and Hive tools for Visual Studio Code, please use the following resources:

If you have questions, feedback, comments, or bug reports, please use the comments below or send a note to hdivstool@microsoft.com.