Introducing chaos engineering to Machine Learning deployments

By Vivek Raja, Data Scientist, NexStem

A data scientist’s worst nightmare is seeing their Machine Learning model deployment fail in production. But there are ways to ensure that deployment targets are resilient enough to handle incoming prediction traffic. This blog charts out a way to build a well-architected Azure solution by applying chaos engineering to Machine Learning Operations (MLOps).

The blog covers three main sections:

  • Introduction to chaos engineering
  • Deploying ML Model to AKS Cluster in Azure
  • Creating and running chaos experiment on AKS in Azure

Let’s get started!

Introduction to chaos engineering

Remember when, as kids, we used to take wooden sticks and try to snap them in two by bending them, or sometimes shatter them with a hammer? The experiment let us see “the point” at which a stick breaks. Observing that moment matters because it tells us the maximum stress and pressure the stick can withstand.

This scenario is a handy way to explain what chaos engineering means. Let’s map the analogy onto the practice.

  • Start with a hypothesis – measure the strength of the wooden stick
  • Measure baseline behavior – the typical strength of the stick
  • Inject a fault or faults – break or hammer the stick
  • Monitor the resulting behavior – observe the point at which the stick breaks

The goal is to observe, monitor, respond to, and improve our system’s reliability under adverse circumstances. For our use case, we will observe our Machine Learning deployment target – an Azure Kubernetes Service (AKS) cluster – by injecting faults and validating that the service can handle them gracefully.

Deploying ML Model to AKS Cluster in Azure

To keep things simple for this blog, we use the existing Azure Quickstart Jupyter Notebook for deploying a Machine Learning model to an AKS cluster.

The steps involved in deploying the ML model to the AKS cluster are detailed below:

Create a Machine Learning Workspace:
In the Azure portal, search for the Machine Learning resource and create a workspace.

Image showing creation of ML resource
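
If you prefer code over portal clicks, the workspace can also be created with the azureml-core (v1) SDK. A minimal sketch; the workspace name, subscription ID, resource group, and region below are placeholders:

```python
from azureml.core import Workspace

# Create (or reuse) an Azure ML workspace programmatically.
# Subscription ID, resource group, and region are placeholders -- substitute your own.
ws = Workspace.create(
    name="chaos-mlops-ws",
    subscription_id="<subscription-id>",
    resource_group="chaos-mlops-rg",
    location="eastus",
    exist_ok=True,        # reuse the workspace if it already exists
)

# Save a config.json locally so later scripts can simply call Workspace.from_config()
ws.write_config()
```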

Create a compute instance: Launch Machine Learning Studio and create a compute instance from the Compute tab, selecting a suitable VM size.

Clone the Quickstart: Once the compute instance is up and running, click Jupyter. When Jupyter launches, navigate to the Samples tab and search for the “Production Deploy to AKS” Quickstart. Follow the code in production-deploy-to-aks.ipynb to deploy your model to AKS.
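
The heart of that notebook is provisioning an AKS cluster as a compute target and deploying a registered model to it as a web service. Below is a condensed sketch of those steps using the azureml-core (v1) SDK; the model, service, and environment names are illustrative, and it assumes a model is already registered in the workspace and that a score.py entry script plus environment.yml exist alongside it:

```python
from azureml.core import Workspace, Environment
from azureml.core.compute import AksCompute, ComputeTarget
from azureml.core.model import Model, InferenceConfig
from azureml.core.webservice import AksWebservice

ws = Workspace.from_config()

# Provision an AKS cluster as the deployment target (default VM size and node count).
prov_config = AksCompute.provisioning_configuration()
aks_target = ComputeTarget.create(ws, name="aks-ml-target",
                                  provisioning_configuration=prov_config)
aks_target.wait_for_completion(show_output=True)

# Package the scoring script and environment, then deploy the registered model.
model = Model(ws, name="sklearn_regression_model")            # illustrative model name
env = Environment.from_conda_specification(name="inference-env",
                                           file_path="environment.yml")
inference_config = InferenceConfig(entry_script="score.py", environment=env)
deployment_config = AksWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(ws, "aks-ml-service", [model], inference_config,
                       deployment_config, deployment_target=aks_target)
service.wait_for_deployment(show_output=True)
print(service.state)
```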

Once the model is successfully deployed as a web service on AKS, it can serve predictions through the SDK’s run method or plain HTTP calls.
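
To illustrate both options, here is a small sketch that calls the deployed service, reusing the service object from the deployment step above; the sample payload shape depends on the model the Quickstart registers:

```python
import json
import requests

# Option 1: call the web service through the SDK's run method.
sample_input = json.dumps({"data": [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]})  # payload shape depends on your model
print(service.run(input_data=sample_input))

# Option 2: plain HTTP POST against the scoring endpoint.
# AKS web services are key-authenticated by default, so pass a key as a Bearer token.
scoring_uri = service.scoring_uri
api_key = service.get_keys()[0]
headers = {"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"}
response = requests.post(scoring_uri, data=sample_input, headers=headers)
print(response.json())
```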

Creating and running chaos experiment on AKS in Azure

The main objective of performing a chaos experiment on our Machine Learning deployment target is to observe and monitor the resilience of AKS and ensure it can handle service faults in production.

Let’s head over to Chaos Studio in the Azure portal. There are two key terms to understand before proceeding.

Target – the service that needs to be tested.
Experiment – the design of the fault or faults to be applied to the target service.

Now, we will create a chaos experiment that uses a Chaos Mesh fault to fail and kill AKS pods, using the Azure portal.

Let’s follow the Microsoft documentation to fail the pods of our AKS machine learning deployment. Link here
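
For orientation, the Chaos Studio pod fault for AKS takes a Chaos Mesh PodChaos spec serialized as JSON (the fault’s jsonSpec parameter). A rough sketch of such a spec follows; the namespace and label selector are placeholders, so point them at wherever your scoring pods actually run (kubectl get pods --all-namespaces will tell you):

```python
import json

# Chaos Mesh PodChaos spec for a pod-failure fault, expressed as a Python dict.
# Namespace and label selector are placeholders -- adjust them to match the pods
# that back your Machine Learning web service.
pod_chaos_spec = {
    "action": "pod-failure",          # make matching pods unavailable for the duration
    "mode": "one",                    # target a single matching pod at a time
    "duration": "60s",
    "selector": {
        "namespaces": ["default"],
        "labelSelectors": {"app": "aks-ml-service"},
    },
}

# Paste this JSON string into the fault's jsonSpec field when designing the experiment.
print(json.dumps(pod_chaos_spec))
```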

We can see the experiment details in the Chaos Studio experiment view:

Once the experiment starts, we can observe live pod health status from the CLI or from the AKS cluster overview in the portal.
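
If you would rather watch pod health programmatically than from the portal, the official Kubernetes Python client can do the same job as kubectl get pods. A small sketch, assuming az aks get-credentials has already populated your kubeconfig and that the scoring pods run in the default namespace (adjust as needed):

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (populated by `az aks get-credentials`).
config.load_kube_config()
v1 = client.CoreV1Api()

# Print each pod's phase and restart count in the namespace hosting the scoring pods.
for pod in v1.list_namespaced_pod(namespace="default").items:
    restarts = sum(cs.restart_count for cs in (pod.status.container_statuses or []))
    print(f"{pod.metadata.name}: {pod.status.phase}, restarts={restarts}")
```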

We can observe that the pod restarted after being killed by the chaos experiment, while the prediction service remained uninterrupted.
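
A simple way to back up that “uninterrupted” claim is to keep polling the scoring endpoint while the fault is active and count how many predictions succeed. A rough sketch, reusing scoring_uri, headers, and sample_input from the earlier scoring snippet:

```python
import time
import requests

# Poll the scoring endpoint once per second during the fault window and tally results.
# Reuses scoring_uri, headers, and sample_input from the earlier scoring example.
successes, failures = 0, 0
for _ in range(120):                    # roughly covers a two-minute fault window
    try:
        r = requests.post(scoring_uri, data=sample_input, headers=headers, timeout=5)
        successes += 1 if r.ok else 0
        failures += 0 if r.ok else 1
    except requests.RequestException:
        failures += 1
    time.sleep(1)

print(f"successful predictions: {successes}, failed predictions: {failures}")
```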

Observations and inferences from the chaos experiment:

Observation:

We were able to successfully inject a pod failure into our AKS Machine Learning deployment. The failed pod restarted immediately.

Inference:

Kubernetes self-heals at the pod level: because the deployment on AKS maintains a desired replica count, the killed pod is recreated automatically. AKS adds a similar safety net at the node level, where unhealthy nodes found during a health check are repaired one at a time. Thus, the model prediction service remains uninterrupted.

To summarize, we deployed a Machine Learning model to an AKS cluster and then performed a chaos engineering experiment, injecting faults into the deployment target to test the resilience of the service.

By conducting fault-injection experiments, you can confirm that monitoring is in place and alerts are set up, the directly responsible individual (DRI) process is effective, and your documentation and investigation processes are up to date.

Vivek Raja is a data scientist and founding member of NexStem, a company specializing in biotechnology.