Measuring your way to failure: Thinking beyond your model metrics before deployment

By Jose Medina Gomez, Data Scientist, Customer Success in Financial Services, Microsoft

May 13, 2020

Companies across the Financial Sector are using AI and Machine Learning to model customer behaviors, avoid risk, and streamline critical business processes. Consequently, building and testing Machine Learning models has become an important discipline for many financial institutions.

Traditionally, the success of Machine Learning models, and often AI algorithms, is measured by key metrics that evaluate how consistent, accurate, and reliable these tools are. In an enterprise setting, these metrics alone are rarely sufficient. However, before we dig deeper into this discussion, let's level-set some basic concepts:

What is a Machine Learning model? Loosely speaking, a Machine Learning model is an algorithm that has been trained on a subset of data to recognize and reproduce certain patterns.

What are the metrics used for? Metrics are used to evaluate the performance of these models; the metrics used depend on the desired business outcome (R^2 or RMSE for a regression problem) or the distribution of the data (F1-Score vs. Accuracy for imbalanced classes). This is where many validation processes stop. Choosing the right metric is important, but usually not enough. In fact, successful machine learning models share four common characteristics:

Successful models adhere to security and privacy regulations

Security and privacy guidelines are crucial for financial institutions. Not only do businesses need to protect their data, but they are federally mandated to safeguard the Personally Identifiable Information (PII) of their clients and customers. Likewise, AI applications need to be compliant with security regulations.

Because of those security demands, whether a model complies with security and privacy regulations isn't necessarily quantifiable through metrics. Instead, compliant models:

Protect PII regardless of how accurate they are. Exposing PII can introduce undesired effects that skew models in ways that metrics can’t predict.
Account for the security of the data they work on and produce.
Work with input and output channels that control who has access and when.
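
As one illustration of the first point, PII can be replaced with keyed hashes before the data ever reaches a training pipeline. The sketch below is a minimal example using only the Python standard library; the field names, the sample record, and the key handling are hypothetical, and pseudonymization of this kind is only one building block, not a substitute for a full compliance review.

```python
import hashlib
import hmac

# Hypothetical field names; adjust to your own schema.
PII_FIELDS = {"name", "ssn", "email"}


def pseudonymize(record, secret_key, pii_fields=PII_FIELDS):
    """Return a copy of `record` with PII values replaced by keyed hashes."""
    safe = {}
    for field, value in record.items():
        if field in pii_fields:
            # HMAC-SHA256 keeps the mapping stable (same input, same token)
            # while making it hard to reverse without the secret key.
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            safe[field] = digest.hexdigest()
        else:
            safe[field] = value
    return safe


# Made-up record for illustration only.
customer = {"name": "Jane Doe", "ssn": "123-45-6789", "balance": 1520.75}

# Assumption: in practice the key would come from a secret store, not source code.
masked = pseudonymize(customer, secret_key=b"load-me-from-a-secret-store")
```

Because the hash is deterministic for a given key, records belonging to the same customer can still be joined after masking, which is usually what a model needs from an identifier.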

They apply to a clear business need

Accurate performance metrics are critical to a model’s success, but you can only get an idea of what those metrics are when you have clearly defined business goals.

More specifically, defining business goals for a Machine Learning model means formulating the right goals and ensuring that they are measurable and achievable, so that you can define what accuracy means for your models.

To clarify business needs along those lines, successful projects usually ask the following questions:

What processes are being improved?
What do those processes look like today?
What is a reasonable improvement to those processes based on past performance and future goals?

After answering those questions, the performance metrics and the level of accuracy in a given model can serve those goals, rather than serving some abstract conception of effectiveness.
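
A small example shows why the metric has to follow the goal. The sketch below uses made-up labels for a hypothetical fraud-detection task in which only 1% of transactions are fraudulent; the helper functions are written out by hand for illustration rather than taken from any library. A model that never flags fraud looks excellent on accuracy and useless on F1-Score.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    if tp == 0:
        # No fraud case was caught, so precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up data: 990 legitimate transactions (0) and 10 fraudulent ones (1).
y_true = [0] * 990 + [1] * 10
# A "model" that labels every transaction as legitimate.
always_legit = [0] * 1000

acc = accuracy(y_true, always_legit)  # 0.99
f1 = f1_score(y_true, always_legit)   # 0.0
```

If the business goal is to catch fraud, the 99% accuracy figure says nothing about whether that goal is being met.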

They are fundamentally robust

Machine Learning models are a simplified version of reality. Consequently, metrics calculated for ML models do not measure that reality; they measure how well the models perform on a very specific set of data and a very limited subset of variables, which might not be the ones that drive the outcome in real life.

Because of that, high-performing Data Science teams focus on building robust models. This typically involves:

Ensuring the quality of training data and how representative it is of a given population. Machine Learning models are great at predicting what they have seen before.
Developing explainable and auditable models.
Identifying bias within the assumptions of the model, which includes understanding how different model biases will necessarily provide certain results.
Avoiding over-engineering. Most processes do not require Deep Learning algorithms to make accurate predictions.
Sticking with proper validation results and avoiding cheating on validation to get the desired results.
Understanding that Machine Learning is about learning, not immediate correctness. Allow models to mature.
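
One common way validation results end up flattering a model is leakage, for example when the same customer appears in both the training and the test set. The sketch below shows one possible safeguard, under the assumption that each row carries a customer identifier: the split is decided by hashing that identifier, so all rows for a customer land on the same side. The function name, field names, and sample rows are hypothetical.

```python
import hashlib


def assign_split(customer_id, test_fraction=0.2):
    """Deterministically assign a customer to 'train' or 'test'."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"


# Made-up rows; several customers have more than one transaction.
rows = [
    {"customer_id": "c001", "amount": 120.0},
    {"customer_id": "c001", "amount": 75.5},
    {"customer_id": "c002", "amount": 300.0},
    {"customer_id": "c003", "amount": 42.0},
    {"customer_id": "c003", "amount": 18.9},
    {"customer_id": "c004", "amount": 510.0},
]

train = [r for r in rows if assign_split(r["customer_id"]) == "train"]
test = [r for r in rows if assign_split(r["customer_id"]) == "test"]

train_ids = {r["customer_id"] for r in train}
test_ids = {r["customer_id"] for r in test}
# No customer contributes rows to both sets.
no_overlap = train_ids.isdisjoint(test_ids)
```

Because the assignment depends only on the identifier, rerunning the split on new data keeps existing customers on the same side, which makes results comparable across experiments.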

They include a process to guarantee robustness after deployment

There is no one-size-fits-all algorithm or model. Even a robust model won’t last forever. Models need to evolve as data sets, populations, and business needs change. To keep a model robust after deployment:

Ensure that the model is applied to the right data and population. Do not train your model on a certain population or process and expect it to work on a population that behaves differently.
Track and audit performance of the model over time.
Deploy models securely to avoid revealing customer data, credentials, and/or access to production environments.
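
Checking whether production data still resembles the training population can be partly automated. The sketch below computes the Population Stability Index (PSI), a drift measure often used in credit risk, between training-time scores and recent scores. It is a minimal standard-library illustration: the bin edges, the sample scores, and the `eps` floor are assumptions chosen for the example, and any alert threshold should be validated against your own data.

```python
import bisect
import math


def bin_proportions(values, bin_edges, eps=1e-6):
    """Share of values per bin; out-of-range values fall into the end bins."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        idx = bisect.bisect_right(bin_edges, v) - 1
        idx = min(max(idx, 0), len(counts) - 1)
        counts[idx] += 1
    # Floor each share at eps so the logarithm below is always defined.
    return [max(c / len(values), eps) for c in counts]


def population_stability_index(expected, actual, bin_edges):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    e = bin_proportions(expected, bin_edges)
    a = bin_proportions(actual, bin_edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Made-up model scores for illustration only.
training_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
recent_scores = [0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9, 0.9, 0.95]
edges = [0.0, 0.25, 0.5, 0.75, 1.0]

psi_same = population_stability_index(training_scores, training_scores, edges)
psi_shifted = population_stability_index(training_scores, recent_scores, edges)
```

A PSI of zero means the two distributions fall into the bins identically, and larger values indicate a larger shift. Logging this value on a schedule gives the audit trail a concrete number to track alongside the model's performance metrics.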

Conclusion

These are some of the common factors that play into successfully deploying models to production. Security, clarity of goals, robustness, and maintenance are all going to inform how successful a model is, even if the metrics show it making fairly accurate predictions.

In the coming weeks, we will explore a combination of the tools and best practices commonly seen in the community and used in our projects. We’ll go deeper into each one of these topics to give more specific recommendations and share the code on how to execute them. Stay tuned to learn more, but in the meantime, if you have any questions, email me at jose.medina@microsoft.com.