Analytics - Microsoft Open Source Blog

Cloudera Data Platform’s integration with Azure delivers enterprise security and governance

November 12, 2020 2 min read

By Chris Van Dyke, Principal Solution Engineer at Cloudera

Modern analytics and the resulting business insights unlock new opportunities to optimize company performance and open new revenue streams. Since these initiatives also heighten the need for greater security and governance of company data, Identity and Access Management (IAM) needs to be a foundational component of any corporate security plan that covers company data. Critical …

Hyperspace, an indexing subsystem for Apache Spark™, is now open source

June 30, 2020 1 min read

By Rahul Potharaju, Principal Engineering Manager, Microsoft Azure Data Group

For Microsoft’s internal teams and external customers, we store datasets that span from a few GBs to 100s of PBs in our data lake. The scope of analytics on these datasets ranges from traditional batch-style queries (e.g., OLAP) to explorative “finding the needle in a haystack” type of queries (e.g., point-lookups, summarization). Resorting to linear …

What’s new in SandDance 3

June 23, 2020 2 min read

By Dan Marshall, Principal Research Software Development Engineer

SandDance, the open source data visualization tool from Microsoft Research, is launching several new features in version 3. Facets on all chart types: We’ve added much more control to faceted data. All chart types now have the Facet By column feature. When a Facet By column contains quantitative data, you can specify the number of …

Microsoft open sources SandDance, a visual data exploration tool

October 10, 2019 2 min read

By Dan Marshall, Principal Research Software Development Engineer

SandDance, the beloved data visualization tool from Microsoft Research, has been re-released as an open source project on GitHub. This new version of SandDance has been rewritten from the ground up as an embeddable component that works with modern JavaScript toolchains. The release comprises several components that work in native JavaScript or React …

Trill 103: Ingress, Egress, and Trill’s notion of time

August 13, 2019 8 min read

By James Terwilliger, Principal Software Engineer

Congratulations! You’ve made it to the next installment of our overview of Trill, Microsoft’s open source streaming data engine. As noted in our previous posts about basic queries and joins, Trill is a temporal query processor. Trill works with data that has some intrinsic notion of time. However, Trill doesn’t assign any semantics to that …

AzureR now available: Create, manage, and monitor Azure services with R

July 1, 2019 4 min read

By Hong Ooi, Senior Data Scientist
David Smith, Cloud Developer Advocate

AzureR, a family of packages that provides tools to manage Azure resources from the open source R language, is now available. If you code in Python, C#, Java, or JavaScript, you already have a rich selection of SDKs to choose from to interact with Azure. AzureR extends SDK support to the R language by providing …

Trill 102: Temporal Joins

May 1, 2019 5 min read

By James Terwilliger, Principal Software Engineer

This post is the second in a sequence intended to introduce developers to the Trill streaming query engine, its programming model, and its capabilities. In the previous post, we introduced the concept of snapshot semantics for temporal query processing. Here, we go deeper into the mechanics of snapshot semantics by showing its impact on one …

Microsoft open sources Data Accelerator for Apache Spark

April 16, 2019 4 min read

By Geoff Staneff, Principal Program Manager
Dinesh Chandnani, Principal Group Engineering Manager

Welcome to Data Accelerator! Data Accelerator for Apache Spark simplifies streaming big data using Spark. Data Accelerator has been used for two years within Microsoft for processing streamed data across many internal deployments handling data volumes at Microsoft scale. Offering an easy-to-use platform to learn and evaluate your streaming needs and requirements, we …

Trill 101: how to add temporal queries to your applications

March 28, 2019 6 min read

By James Terwilliger, Principal Software Engineer

Last December, we released Trill, an open source .NET library designed to process one trillion events a day. Trill provides a temporal query language enabling you to embed real-time analytics in your own application. In this blog post, we spend some time introducing how to get started using Trill. Trill’s query and data model: A …

Microsoft open sources Trill, a powerful query processor for analytics at incredible speeds

December 17, 2018 1 min read

By James Terwilliger, Principal Software Engineer

In today’s demanding business environment, processing massive amounts of data each millisecond is becoming a common business requirement. We are excited to announce that an internal Microsoft project known as Trill (for processing “a trillion events per day”) is now being open sourced. Trill started as a research project at Microsoft Research in 2012, and has …

How to process streams of data with Apache Kafka and Spark

July 9, 2018 23 min read

By Lena Hall

Data is produced every second; it comes from millions of sources and is constantly growing. Have you ever thought about how much data you personally generate every day? Data: direct result of our actions. There’s data generated as a direct result of our actions and activities: browsing Twitter, using mobile apps, performing financial transactions, using …
