Why do ML Models Fail in Production: 3 Common Causes

Introduction

Getting an ML model into production is a hard nut to crack. According to Chris Chapo, SVP of Data and Analytics at Gap, 87% of models fail before they are deployed. Even if you are in the lucky 13%, this is when the real hard work begins.

The cycle of constant monitoring and maintaining the model is called post-deployment data science. It's a crucial step since our model is live and embedded in business processes, and every mistake can cost us a lot.

But don't worry. In the following sections, we will break down the three most common failure modes and show how to spot them with NannyML. So grab a cup of coffee, sit back, and let's dive in.

What are the typical failure causes?

Technical performance deterioration

💡

The moment you put a model in production, it starts degrading.

When evaluating the performance of an ML model, there are many factors to consider - from bias and fairness to business impact. But the most critical dimension to evaluate is the technical performance. After all, this is what the ML model is specifically optimized for.

Technical performance is often measured with metrics like ROC AUC, accuracy, and mean absolute error. A steady, persistent decrease in these numbers is called technical performance deterioration.
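To make this concrete, here is a minimal sketch of tracking one of these metrics month by month. It is an illustration only, not NannyML's implementation: the helper names `roc_auc` and `auc_per_chunk` and the toy records are assumptions made for this example, and it relies on realized labels, which are often not available right away in production.

```python
from collections import defaultdict


def roc_auc(y_true, y_score):
    """ROC AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def auc_per_chunk(records):
    """Group (chunk, label, score) records by chunk and compute ROC AUC for each."""
    grouped = defaultdict(lambda: ([], []))
    for chunk, label, score in records:
        grouped[chunk][0].append(label)
        grouped[chunk][1].append(score)
    return {
        chunk: roc_auc(labels, scores)
        for chunk, (labels, scores) in sorted(grouped.items())
    }


# Hypothetical toy data: (month, true label, predicted probability).
records = [
    ("2023-01", 1, 0.9), ("2023-01", 0, 0.2), ("2023-01", 1, 0.8), ("2023-01", 0, 0.1),
    ("2023-02", 1, 0.4), ("2023-02", 0, 0.6), ("2023-02", 1, 0.7), ("2023-02", 0, 0.3),
]
monthly_auc = auc_per_chunk(records)
print(monthly_auc)  # {'2023-01': 1.0, '2023-02': 0.75}
```

The pairwise comparison is quadratic in the chunk size, which is fine for a sketch but too slow for large chunks; a rank-based implementation from an established library would be the practical choice.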

The main reason for it is simple: the world is constantly evolving, and so are the patterns in our data. Since ML models are trained on historical data, they tend to become outdated after some time. The patterns observed and the assumptions made in the past often no longer hold in production.

The dynamics of this deterioration can differ. To formalize it, we split them into three types:

Sudden - an unforeseen drop in performance.

For example, to attract more people in a new country, Netflix offers a free first month of subscription. This brings in a large number of new customers whose characteristics differ from the historical ones, and as a result, the input data drifts.

This can lead to a sudden drop in performance, as the model is no longer able to accurately predict which customers are likely to churn.

Sudden Degradation of ROC AUC estimation metric in NannyML.

The graph shows technical performance before (reference period) and after (analysis period) the deployment. The performance is estimated because it is calculated in the absence of the ground truth.

The ROC AUC remained stable from January to July and then dropped suddenly below the threshold over the next four months. This change could reflect Netflix’s entry into the new country mentioned earlier.
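A threshold like this can be derived from the reference period itself. The sketch below assumes a lower bound of the reference mean minus three standard deviations. That is a common convention, but it is an assumption here, and the metric values and the helper names `lower_threshold` and `flag_alerts` are hypothetical.

```python
from statistics import mean, stdev


def lower_threshold(reference_values, n_std=3.0):
    """Lower alert bound: reference mean minus n_std sample standard deviations."""
    return mean(reference_values) - n_std * stdev(reference_values)


def flag_alerts(analysis_values, threshold):
    """Return the chunks whose metric falls below the threshold."""
    return [chunk for chunk, value in analysis_values.items() if value < threshold]


# Hypothetical monthly ROC AUC values, for illustration only.
reference = [0.91, 0.90, 0.92, 0.91, 0.90, 0.92]
analysis = {"Jul": 0.91, "Aug": 0.84, "Sep": 0.80, "Oct": 0.79}

threshold = lower_threshold(reference)
alerts = flag_alerts(analysis, threshold)
print(f"threshold={threshold:.3f}, alerts={alerts}")
```

Because the bound comes from the natural variation of the reference period, a noisy metric gets a wider band and a stable one gets a tighter band, which helps avoid alerting on ordinary fluctuations.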

Gradual - the performance gradually decreases over time.

Imagine a spam detection model trained on emails from a few months ago. The model performs well, so it is deployed in production.

As time goes on, the underlying patterns that the model observed about spam emails change. For example, spammers begin using new tactics, such as sending emails from different domains or using different types of content. These changes would cause the model's performance to decrease over time because it's unable to adapt to the new patterns in the input data. This phenomenon is also known as concept drift.

Gradual Degradation of ROC AUC estimation metric in NannyML.

The estimated ROC AUC gradually decreases each month, as expected. From June to October, the spammers are way ahead of the model, and we can see alerts as the performance drops below the threshold. It is a good time to take a step back and possibly retrain the model.
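A gradual decline can also be caught before it crosses the threshold by looking at the trend rather than at single values. The following is a minimal sketch under stated assumptions: the monthly values are made up, `trend_slope` is a hypothetical helper, and the tolerance of 0.005 per month is an arbitrary choice for illustration.

```python
def trend_slope(values):
    """Least-squares slope of values against their index (change per chunk)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two chunks to estimate a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical monthly ROC AUC estimates that erode slowly.
monthly_auc = [0.92, 0.91, 0.90, 0.89, 0.88, 0.87]
slope = trend_slope(monthly_auc)

# Assumed tolerance: flag when the metric loses more than 0.005 per month.
is_degrading = slope < -0.005
print(f"slope per month: {slope:.4f}, degrading: {is_degrading}")
```

A persistent negative slope is a hint to schedule retraining early, instead of waiting for the first alert.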

Others - e.g., recurring or cyclical.

Performance may fluctuate because of the dynamic nature of the problem, as in stock price prediction. A stock price is affected by factors such as news, economic events, or company earnings, and these can change rapidly. As a result, the model’s performance may go up or down depending on how well it adapts to these changes.

Downstream business failure

💡

A successful ML model is only as good as the system in which it operates.

Machine learning models are an integral component of business processes. Their predictions help improve decision-making, but they are never the end point. Therefore, the success of the business depends on all parts of the system working together seamlessly.

Sadly, that’s not always the case. Even if our model is working well, some of the downstream processes can fail.

To give you a bit more context, let’s take customer churn as an example. In this case, the technical performance is on point. At the same time, a new manager comes in and changes the retention method from calling to emailing customers with a high probability of leaving.

Later, it turns out that emailing is not as persuasive and effective as a personal call. As a result, customers end up canceling their subscriptions, which directly affects the company's key performance indicators (KPIs) like revenue or customer lifetime value. Unaware of the cause, the retention department may bang on the door of the data science team and blame their model for the failure.

Estimated ROC AUC and Campaign success rate metric in NannyML.

Constant monitoring helps make sure that our model is performing well in relation to our business goals. This way, if things go wrong, we as a data science team can help identify the root cause of the problem instead of blaming other departments.
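The reasoning behind watching both signals side by side can be sketched as a small decision rule. Everything in this example is hypothetical: the `locate_failure` helper, the thresholds, and the numbers for the calling and emailing periods are assumptions chosen to mirror the churn story above.

```python
def locate_failure(model_ok, kpi_ok):
    """Map the two monitoring signals to the most likely place to investigate."""
    if model_ok and kpi_ok:
        return "healthy"
    if model_ok and not kpi_ok:
        return "downstream process"
    if not model_ok and kpi_ok:
        return "model (business impact not visible yet)"
    return "model, possibly downstream as well"


# Hypothetical thresholds and values for a churn use case.
AUC_THRESHOLD = 0.85
SUCCESS_RATE_THRESHOLD = 0.30

periods = {
    "calls": {"auc": 0.90, "campaign_success_rate": 0.42},
    "emails": {"auc": 0.90, "campaign_success_rate": 0.18},
}

diagnosis = {
    name: locate_failure(
        values["auc"] >= AUC_THRESHOLD,
        values["campaign_success_rate"] >= SUCCESS_RATE_THRESHOLD,
    )
    for name, values in periods.items()
}
print(diagnosis)  # {'calls': 'healthy', 'emails': 'downstream process'}
```

When the model metric holds steady while the business metric drops, the evidence points away from the model and toward whatever changed after the prediction was made.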

Training-serving skew

Imagine putting in a lot of work to carefully select, prepare, and deploy a machine learning model for production, only to find out that it is not performing as well as it did during testing. This frustrating experience is known as training-serving skew, which refers to the discrepancy between a model's performance during training and its performance in production.

The main reasons for the skew are:

Data leakage - information about the target variable leaking into the training set.

It’s a cardinal sin in data science. The most illustrative example is when dealing with time-series data collected over a few years. The problem arises when the dataset is randomly shuffled before the test set is created. This can lead to the unfortunate situation of trying to predict the past from the future. As a result, we overestimate the performance of our model.
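One way to avoid this particular leak is to split by time instead of shuffling. Below is a minimal sketch, assuming each row carries a timestamp; the `time_ordered_split` helper and the toy rows are hypothetical.

```python
from datetime import date


def time_ordered_split(rows, test_fraction=0.2):
    """Split rows so that no test row is earlier than any train row.

    Each row is a (timestamp, features, target) tuple. Rows are sorted by
    timestamp first, so a shuffled input cannot leak the future into training.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    ordered = sorted(rows, key=lambda row: row[0])
    n_test = max(1, round(len(ordered) * test_fraction))
    split_at = len(ordered) - n_test
    return ordered[:split_at], ordered[split_at:]


# Hypothetical monthly rows, deliberately given in scrambled order.
rows = [
    (date(2022, month, 1), {"x": month}, month % 2)
    for month in (7, 2, 10, 4, 1, 9, 3, 8, 6, 5)
]
train, test = time_ordered_split(rows)

latest_train = max(row[0] for row in train)
earliest_test = min(row[0] for row in test)
print(latest_train, "<", earliest_test)
```

With this split, the test set plays the role of "the future", so the measured performance is closer to what the model will face after deployment.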

Overfitting - inability to generalize well on unseen data.

It’s by far the most common problem in applied machine learning. It happens when the model is too complex and memorizes the training data. As a result, the model performs great on the training set but poorly on new data.

Discrepancy in handling training and production data - a difference in how the data is processed, transformed, or manipulated in training and in production.

A common practice in feature engineering is to have separate codebases for training and production.

The difference in the frameworks may result in different outputs for the same input data, causing the skew.
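A simple guard against this kind of discrepancy is a parity check that feeds the same inputs through both implementations. The sketch below is illustrative only: `training_transform` and `serving_transform` stand in for two hypothetical codebases, and the mismatch between them is invented to show what the check catches.

```python
import math


def training_transform(amount):
    """Feature as computed in the (hypothetical) training codebase."""
    return math.log1p(amount)


def serving_transform(amount):
    """Feature as re-implemented in the (hypothetical) serving codebase.

    It uses log(amount) with a floor of 1 instead of log1p(amount),
    which is a subtle mismatch.
    """
    return math.log(max(amount, 1.0))


def find_skew(samples, train_fn, serve_fn, tolerance=1e-9):
    """Return the inputs for which the two implementations disagree."""
    return [x for x in samples if abs(train_fn(x) - serve_fn(x)) > tolerance]


samples = [0.0, 0.5, 10.0, 250.0]
mismatches = find_skew(samples, training_transform, serving_transform)
print(mismatches)  # [0.5, 10.0, 250.0]
```

Both versions agree on an input of 0, so a single spot check would miss the problem. Running the comparison over a representative sample, or better, sharing one transformation function between training and serving, removes this source of skew.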

Post-deployment drop in ROC AUC estimation metric in NannyML.

The graph above represents a picture-perfect training-serving skew. The model's performance during the reference period looks promising, leading us to believe that it's ready for deployment. And then reality hits us hard as the estimated ROC AUC falls below the threshold.

In some cases, it can be an immediate red flag (e.g., the model is overfitting) that calls for further investigation. But as we see in the example above, the model was kept alive for quite some time (10 months). If the predictions are valuable from a business point of view, the model can keep running until we find a better solution.

Nevertheless, monitoring is essential in both of these cases: it gives us more understanding of and control over the model.

Conclusions

Babysitting an ML model after deployment is necessary to preserve its business value. Sometimes things can go wrong, like a drop in performance, failure of downstream processes, or differences between how the model was trained and how it's used in production. But now that you're aware of these potential issues, you can be more proactive in finding and resolving them with NannyML.

If you want to learn more about how to use NannyML in production, check out our other blogs and docs!

Also, if you are more into video content, we recently published some YouTube tutorials!

Lastly, we are fully open-source, so don’t forget to star us on GitHub!⭐
