The move to the cloud is changing the security landscape. As a result, there is surging interest in applying data-driven methods to security. In fact, there is a growing community of talented people focused on security data science. We’ve been shedding our respective “badges” and meeting informally for years, but recently decided to see how much progress we might make against some of our bigger challenges with a more structured and formal exchange of ideas in Redmond. The results far exceeded our expectations. Here’s a bit of what we learned.
The first thing to understand is that academia and industry both focus largely on security detection, but the emphasis is almost always on the algorithmic machinery powering the systems. We at Microsoft are transparent with our algorithm research and in fact are the only cloud provider to openly share the machine learning algorithms securing our cloud service. In order to build on that research and learn more about best practices for putting security data science solutions in production, we reached out to our peers in the industry.
We started by meeting with some friends at Google to swap ideas for keeping our cloud services and mutual customers secure. That one-time exercise proved so valuable that it soon turned into a recurring meeting, where we learned that despite our different approaches to data modeling, we face similar challenges. Last week, we opened the doors at Microsoft to the broader community. At first, we weren’t sure whether companies would take us up on the offer to discuss security data science issues in the open. We need not have worried. We quickly had delegates from Facebook, Salesforce, Crowdstrike, Google, LinkedIn, Endgame, Sqrrl, and the Federal Reserve, as well as researchers from the University of Washington. What was supposed to be an hour-long meetup morphed into a full-blown conference – so much so that we had to give it a name: “Security Data Science Colloquium”.
The goal of the colloquium was simple: share learnings of how different cloud providers/services secure their systems using machine learning. No NDAs, no complicated back and forth paperwork. Our only constraint: keep it technical and be honest. This way, we could ensure that the 300+ applied machine learning (ML) engineers, security analysts, and incident responders who signed up had a collaborative environment in which to discuss freely!
Security Data Science > Security + Data Science
Operationalizing security and machine learning solutions is tricky, not only because security data science solutions inherit the complexity of both fields, but also because their intersection poses new challenges. For instance, compliance restrictions that dictate data cannot be exported from specific geographic locations (a security constraint) have a downstream effect on model design, deployment, evaluation, and management strategies (a data science constraint). As Adam Fuchs, CTO of Sqrrl, pointed out in his lecture, this complicated machinery requires a variety of actors to land an operational solution: threat hunters, data scientists, computer scientists and security analysts, in addition to the standard development crew of program managers, developers and service engineers.
Security Data Scientists ❤ Rules
To quote Sven Krasser (@SvenKrasser), Chief Scientist at Crowdstrike, “Rules are awesome”. This may come as a surprise to machine learning purists who have long berated rules as futile tools. But as Sven noted in his talk, rules are very good at finding known maliciousness and we as a community must not shy away from them. During our smaller brainstorming sessions, we explored various ways to combine rules and machine learning. For instance, at Microsoft, we have had success using Markov Logic Networks to encode the domain knowledge of our security analysts as probabilistic graphical models.
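One simple way to picture combining rules with a learned model is to treat each rule as a weighted piece of evidence that shifts the model's log-odds. The sketch below is only an illustration of that idea and is not the Markov Logic Network approach mentioned above; the rule names, the weights, and the `combined_score` helper are all hypothetical.

```python
import math

# Hypothetical analyst rules: each maps an event dict to True/False.
RULES = {
    "powershell_encoded_command": lambda e: "-enc" in e.get("cmdline", "").lower(),
    "login_from_new_country": lambda e: bool(e.get("new_country", False)),
}

# Assumed weights (in log-odds units) for how strongly each rule suggests malice.
RULE_WEIGHTS = {
    "powershell_encoded_command": 2.0,
    "login_from_new_country": 1.0,
}


def logit(p):
    """Convert a probability to log-odds, clamping to avoid division by zero."""
    p = min(max(p, 1e-6), 1 - 1e-6)
    return math.log(p / (1 - p))


def combined_score(event, model_probability):
    """Add the weights of fired rules to the model's log-odds, then map back to a probability."""
    fired = [name for name, rule in RULES.items() if rule(event)]
    score = logit(model_probability) + sum(RULE_WEIGHTS[name] for name in fired)
    return 1 / (1 + math.exp(-score)), fired


if __name__ == "__main__":
    event = {"cmdline": "powershell -enc SQBFAFgA", "new_country": False}
    probability, fired = combined_score(event, model_probability=0.5)
    print(fired, round(probability, 3))
```

Keeping the fired rule names alongside the score means an analyst can see which known-bad pattern contributed to an alert, which is part of the appeal of rules in the first place.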
Adversarial Machine Learning is Mainstream and We Don’t Know How to Solve It
Hyrum Anderson (@drhyrum) and Robert Filar’s (@filar) riveting talk on how adversaries can subvert machine learning solutions made defenders in the room uncomfortable (in a good way!). They showed different ways that attackers can successfully manipulate machine learning models, with anything from partial access to no access to the system at all. Such attacks have been known ever since spammers first tried to evade detection and adversaries first attempted to dodge antivirus systems, but the biggest takeaway here is that current machine learning systems, like any system, are susceptible to attack. For instance, attackers can use the alert outputs, or the decision label (such as malware or not), to work around these defenses. While this has been happening for some time, the game changer is that this feedback is instantaneous: the data that was designed as a way for defenders to act swiftly is now exploited by attackers. Research in this area is nascent, and we still don’t know how to bridge this gap.
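To make the feedback problem concrete, here is a toy sketch of an attacker who sees nothing but the decision label and still finds an evading input. It is not taken from the talk: the `detector` function, its weights, and the `evade` search are hypothetical stand-ins chosen only to show how instantaneous labels can guide an attacker.

```python
import random


def detector(features):
    """Toy stand-in for a deployed model: it exposes only a decision label."""
    weights = [0.9, 0.6, 0.4, 0.2]  # assumed, and unknown to the attacker
    score = sum(w * x for w, x in zip(weights, features))
    return "malicious" if score > 1.0 else "benign"


def evade(sample, query, max_queries=200, step=0.1, seed=0):
    """Randomly lower one feature at a time until the returned label flips to benign.

    The attacker never sees weights or scores, only the label from `query`.
    Returns (evading_sample, queries_used), or (None, max_queries) on failure.
    """
    rng = random.Random(seed)
    candidate = list(sample)
    for queries in range(1, max_queries + 1):
        if query(candidate) == "benign":
            return candidate, queries
        i = rng.randrange(len(candidate))
        candidate[i] = max(0.0, candidate[i] - step)
    return None, max_queries


if __name__ == "__main__":
    original = [1.0, 1.0, 1.0, 1.0]
    evading, used = evade(original, detector)
    print(detector(original), "->", detector(evading), "after", used, "queries")
```

Even this blind search succeeds against the toy detector, because every query returns an immediate answer that tells the attacker whether to keep going.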
Call for standardization and benchmarks
At our breakout sessions, we heard the need for a standardized benchmark dataset à la ImageNet. For instance, how do we know how the newest detection for anomalous process creation performs under various test cases? An interesting observation made by the “Security Platform” discussion group was the need for something along the lines of “GitHub for feature engineering”. They reckoned that many teams waste time managing feature pipelines and sometimes re-computing the same feature, and wanted an effective management system that would make teams more efficient and code more maintainable.
The colloquium, thanks to the enthusiastic participation of our peers, ended up as a marketplace of security data science ideas – we discussed, agreed, and challenged one another with the intention of learning. My favorite quote about the conference comes from a Salesforce participant, who remarked “we are all batting for the same team”. It particularly resonated with me, because despite our organizational boundaries, we all have a common goal: protect our customers from adversaries.
This is our commitment to share what we have learned – successes and failures – so that you don’t have to waste time going down the wrong path. Given the overwhelming support from the security analytics community, my colleagues have already started planning the next edition of the colloquium. If you are interested in participating, have ideas to make it better, or want to lend a helping hand in organizing, drop a note at firstname.lastname@example.org or reach out to me on Twitter – @ram_ssk.
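As a rough picture of what such a shared feature system might look like, the sketch below registers each feature once under a unique name and caches computed values so that two teams asking for the same feature do not compute it twice. The `FeatureRegistry` class and the `proc_count` feature are hypothetical; nothing here describes a system any participant actually runs.

```python
from typing import Callable, Dict, Tuple


class FeatureRegistry:
    """Hypothetical shared registry: features are defined once and cached per entity."""

    def __init__(self) -> None:
        self._features: Dict[str, Callable[[dict], float]] = {}
        self._cache: Dict[Tuple[str, str], float] = {}
        self.compute_count = 0  # how many times a feature was actually computed

    def register(self, name: str, fn: Callable[[dict], float]) -> None:
        """Register a feature definition; duplicate names are rejected."""
        if name in self._features:
            raise ValueError(f"feature {name!r} is already registered")
        self._features[name] = fn

    def get(self, name: str, entity_id: str, raw: dict) -> float:
        """Return the feature value for an entity, computing it only on a cache miss."""
        key = (name, entity_id)
        if key not in self._cache:
            self._cache[key] = self._features[name](raw)
            self.compute_count += 1
        return self._cache[key]


if __name__ == "__main__":
    registry = FeatureRegistry()
    registry.register("proc_count", lambda raw: float(len(raw["processes"])))
    host = {"processes": ["cmd.exe", "powershell.exe"]}
    print(registry.get("proc_count", "host-1", host))  # computed
    print(registry.get("proc_count", "host-1", host))  # served from cache
    print("computations:", registry.compute_count)
```

Rejecting duplicate names is a deliberate choice in this sketch: it nudges a second team to discover and reuse the existing definition instead of quietly maintaining its own copy.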