How to ensure the business impact of ML models?

Subbiah Sethuraman
5 min read · Oct 15, 2020

In the previous post, we looked at the importance of building business partnerships to succeed in AI initiatives.

Once the data science team has a clear business roadmap to work on, data scientists can build models to increase user engagement, fuel growth, or optimize costs.

In this post, let's look at how to measure and maximize the business impact of an ML model.


Business Metrics

The key to getting the maximum business impact from any model is to first agree upon the business metric we want to improve.

The following are the different types of business metrics:

Growth metrics are typically revenue metrics indicating how well a company is able to market and sell its products.

Optimization metrics include profitability metrics, which are operational measures of the efficiency of logistics, production, and operations, and risk metrics, which indicate how well an organization is able to track and control its risks.

User engagement metrics indicate how relevant the user finds the product. They typically denote the frequency, intensity, or depth of interaction between user and product.

So we should discuss with the business and select the most relevant metric for our use case.

Model Evaluation Metrics

During model training, it might not be possible to measure the actual business metric. So based on the business problem, we have to pick the right technical evaluation metric to measure the effectiveness of our model.

First, we need to understand the key business drivers. For example, in the case of classification, is the business goal to find classes with a high degree of confidence? Or is the goal to have high coverage of classification across all data? Is the focus on finding as many true positives as possible, or on reducing false positives? Many of these requirements are trade-offs, and we need to pick the right metric to represent them.

For classification models, we can start off with the standard metrics: confusion matrix, precision, recall, and F1 score. If our goal is to measure and tune the confidence of our predictions (the threshold probability), then we can use the ROC/AUC and precision-recall curves. We can also use metrics like log loss to compare models.
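As a rough sketch of what this looks like in practice, here are those standard metrics computed with scikit-learn on a handful of made-up labels and predicted probabilities (the data is purely illustrative):

```python
# Illustrative classification metrics with scikit-learn on toy data.
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score, log_loss)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                   # ground-truth classes
y_prob = [0.1, 0.4, 0.8, 0.6, 0.9, 0.3, 0.2, 0.7]  # model confidence for class 1
y_pred = [int(p >= 0.5) for p in y_prob]            # apply a 0.5 decision threshold

print(confusion_matrix(y_true, y_pred))              # [[TN, FP], [FN, TP]]
print("precision:", precision_score(y_true, y_pred))  # of predicted positives, how many are right
print("recall:   ", recall_score(y_true, y_pred))     # of actual positives, how many we found
print("f1:       ", f1_score(y_true, y_pred))         # harmonic mean of the two
print("roc auc:  ", roc_auc_score(y_true, y_prob))    # threshold-independent ranking quality
print("log loss: ", log_loss(y_true, y_prob))         # penalizes confident wrong predictions
```

Note how precision and recall depend on the 0.5 threshold we chose, while ROC AUC and log loss score the raw probabilities; changing the threshold is exactly the precision/recall trade-off discussed above.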

Similarly, for regression models, we can use the mean squared error or the mean absolute error to measure the deviation between actual and predicted values.
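For completeness, a minimal sketch of both regression metrics on invented actual/predicted values:

```python
# Illustrative regression error metrics on made-up values.
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]

mse = mean_squared_error(y_true, y_pred)   # squares errors, punishing large misses
mae = mean_absolute_error(y_true, y_pred)  # average absolute deviation, more robust to outliers
print("mse:", mse)  # 0.375
print("mae:", mae)  # 0.5
```

MSE amplifies the single 1.0-unit miss more than MAE does, which is why MSE is the better choice when large errors are disproportionately costly to the business.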

Model Selection — Other Factors

A cautionary note: we should NOT always pick the model architecture with the best evaluation metrics.

There are multiple other factors which can influence model selection:

Model complexity is determined by whether the model is linear and by the number of features it uses. High model complexity can lead to overfitting, preventing the model from generalizing to new, unseen data. It also adversely impacts model serving and prediction SLAs when there are low-latency requirements.

Model explainability is about helping the business understand the intuition behind a particular prediction. It is important for business cases like loan application review, where the reasoning behind a prediction is as critical as the prediction itself.

There is a wide variety of techniques, from interpretable linear models to methods like feature importance, partial dependence plots, and surrogate models.
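As one example of these techniques, here is a hedged sketch of model-agnostic feature importance using scikit-learn's permutation importance on synthetic data (the dataset and model are stand-ins, not a real business case):

```python
# Permutation importance: shuffle one feature at a time and measure how
# much the model's score drops — a big drop means the model relies on it.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=5,
                           n_informative=2, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature_{i}: {imp:.3f}")
```

The output ranks features by how much the model depends on each, which is often enough to give business stakeholders an intuition for what drives predictions.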

For interested readers, I highly recommend the book Interpretable Machine Learning by Christoph Molnar.

Measuring Business Impact

So we have to take a nuanced call on which model to deploy to production, based on model evaluation metrics along with model complexity and interpretability.

But before deploying this model at scale, online validation of the business metric is crucial. There could also be cases where we want to try multiple models directly in production. There are multiple ways we could measure and exploit model variants.

A/B Testing

We can use Randomized Controlled Trials (RCTs), or A/B testing, to evaluate model impact on business metrics like conversions and overall engagement. Here we distribute model requests across multiple variants of the model and measure their effectiveness.

A/B tests work well when we want to explore and prove the statistical significance of the model's impact, so that we know the model has performed well not due to random chance, and will continue to perform well over time. This works well in most business cases.
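To make the significance check concrete, here is a hedged sketch of a two-proportion z-test on A/B conversion counts (the conversion numbers are invented for illustration):

```python
# Two-sided z-test for the difference between two conversion rates.
from math import sqrt
from scipy.stats import norm

conv_a, n_a = 200, 5000   # control: current model, 200 conversions in 5000 requests
conv_b, n_b = 260, 5000   # variant: new model, 260 conversions in 5000 requests

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)             # pooled rate under "no difference"
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error of the difference
z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))                 # two-sided p-value

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Variant B's lift is statistically significant")
```

With these numbers the lift from 4.0% to 5.2% is significant at the 5% level; with smaller samples the same lift could easily be noise, which is why A/B tests need enough traffic before we trust them.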

Canary Testing

A similar approach to A/B testing is canary testing. Here we identify a small subset of users, deploy the new model version to them, and measure the business metrics.
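One common way to pick that subset is deterministic hash-based routing, so each user consistently sees the same model version. A minimal sketch (the model names and 5% rollout are illustrative assumptions):

```python
# Canary routing: send ~5% of users to the new model version,
# deterministically, by hashing the user id into 100 buckets.
import hashlib

CANARY_PERCENT = 5  # roll the new model out to roughly 5% of users

def route_model(user_id: str) -> str:
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "model_v2_canary" if bucket < CANARY_PERCENT else "model_v1_stable"

counts = {"model_v1_stable": 0, "model_v2_canary": 0}
for i in range(10_000):
    counts[route_model(f"user_{i}")] += 1
print(counts)  # roughly a 95/5 split
```

Because routing is a pure function of the user id, the canary cohort is stable across sessions, and widening the rollout is just a matter of raising `CANARY_PERCENT`.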

Multi Armed Bandit Selection

But let’s assume we have developed a recommender model for a short-duration sale in an e-commerce app. In this case, we might not have the time to run randomized trials and collect the learnings from our model. We also might not want to lose conversions early on to low-performing model variants.

So the goal here is to maximize potential by exploiting the best variant of the model. In this case we can use the multi-armed bandit technique, a reinforcement-learning-based approach, to favor the winning variant.

The main implementation difference between an A/B test and the bandit selection approach is how traffic is allocated to the model variants.

In an A/B test, we use a fixed distribution of incoming traffic. With the multi-armed bandit approach, model requests are allocated dynamically, skewing towards the winning model variant as illustrated below.
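One popular bandit strategy is Thompson sampling. The sketch below simulates it over two model variants with invented "true" conversion rates (unknown to the algorithm in real life); traffic gradually skews toward the better variant:

```python
# Thompson sampling over two model variants: keep a Beta posterior of the
# conversion rate per variant, sample from each, and route the request to
# the variant with the highest sampled rate.
import random

true_rates = {"model_a": 0.04, "model_b": 0.08}  # simulated, unknown in practice
stats = {m: {"success": 0, "failure": 0} for m in true_rates}

random.seed(0)
for _ in range(10_000):  # each iteration = one incoming request
    chosen = max(stats, key=lambda m: random.betavariate(
        stats[m]["success"] + 1, stats[m]["failure"] + 1))
    converted = random.random() < true_rates[chosen]
    stats[chosen]["success" if converted else "failure"] += 1

for m, s in stats.items():
    print(m, "served", s["success"] + s["failure"], "requests")
```

Early on the posteriors are wide, so both variants get traffic (exploration); as evidence accumulates, most requests flow to the higher-converting variant (exploitation), which is exactly the dynamic allocation described above.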

A/B Testing vs Multi-Armed Bandit — In Action

Conclusion

In this post, we looked at ways of measuring and maximizing the business impact of an ML model:

Optimize model training by focusing on the final business metric that will be used to measure the impact on the business

Choose evaluation metrics during model training that will later help optimize the business metrics

Use techniques like A/B testing and canary testing to explore and select high-performing model variants in production

Use the multi-armed bandit technique to exploit high-performing model variants for immediate business impact

References

https://www.coursera.org/learn/analytics-business-metrics/home/welcome

https://christophm.github.io/interpretable-ml-book/
