
Introduction
Choosing the reference dataset is the primary step in machine learning (ML) monitoring. It is also the easiest step to mess up. The reference set serves as a benchmark for monitoring model performance, setting expectations for data quality, and detecting data drift.
By comparing real-time model outputs against this benchmark, data scientists can identify issues such as data drift, concept drift, or performance degradation.
In our efforts at NannyML to help data scientists monitor their ML models in production, we have seen many mistakes made when choosing the reference set. The most common one is using the training data for the reference period, which can make the monitoring system ineffective. This blog post highlights the drawbacks of that decision and guides you in selecting the correct reference data.
Post-Deployment Datasets
Models that are trained and validated on diverse data and then tested on a separate dataset are less likely to overfit, which is why they generalize better. But the story doesn't end there. Once a model is deployed, we must monitor it to catch failures before they impact the business.
To determine model performance, NannyML compares two data periods: reference and analysis. Over time, data distributions can change due to seasonality, market trends, shifts in user behaviour, and other external factors, so the datasets for these periods are updated continually to reflect the current production environment.
The analysis dataset is used to evaluate a model's real-time performance in production. It contains recent data points: the model's features and predictions as they occur. We try to detect any deterioration in model performance by comparing this dataset against the one from the reference period. The reference data, in turn, is historical or benchmark data on which the model performed as desired.
We “analyse” the state of the analysis period by “referring” to the optimal state of the reference period.
The reference dataset is composed of the model's inputs, outputs, and target values. The target values are used to compute the model's performance on the reference period, which should be verified as satisfactory and stable.
The Problem with Using Training Data as Reference Data
We simulated a production scenario by splitting the California Housing dataset into pre-deployment and post-deployment subsets. The pre-deployment dataset was further divided into training, validation, and testing subsets, which were used to train a classifier model. We then applied Confidence-Based Performance Estimation (CBPE) to monitor the model's performance for a binary classification problem. 
CBPE is an algorithm that allows us to estimate model performance even in the absence of ground truth. To learn more about this example, check out the documentation. 
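To give a feel for how performance can be estimated without ground truth, here is a minimal sketch of the confidence-based idea for accuracy. It is not NannyML's implementation. It assumes the model's positive-class probabilities are well calibrated, and the function name `estimate_accuracy` is hypothetical. If a prediction of 1 is made with calibrated probability p, it is correct with probability p; a prediction of 0 is correct with probability 1 - p. Averaging these gives an expected accuracy.

```python
def estimate_accuracy(calibrated_probas, threshold=0.5):
    """Estimate accuracy without labels from calibrated positive-class probabilities.

    Simplified illustration of the confidence-based idea, assuming calibration.
    """
    if not calibrated_probas:
        raise ValueError("need at least one prediction")
    expected_correct = 0.0
    for p in calibrated_probas:
        predicted_positive = p >= threshold
        # Predicting 1 is right with probability p; predicting 0 is right with 1 - p.
        expected_correct += p if predicted_positive else 1.0 - p
    return expected_correct / len(calibrated_probas)


# Hypothetical probabilities for four production predictions.
probas = [0.9, 0.2, 0.6, 0.4]
print(round(estimate_accuracy(probas), 3))
```

The estimate is only as good as the calibration, and calibration is learned on the reference data. That is one reason the choice of reference set matters so much.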
We then compared two CBPE estimators: one that uses the training data as reference data and one that uses the testing data. The analysis period is the same in both cases: the post-deployment data.
Using Training Data as Reference Data

Here, the model's metrics reflect its ability to memorize the training data. Consequently, they do not capture the true variability expected in production. Using training data for the reference period turns this overfitted performance into the baseline standard, and you get tricked 😿.
This is why there is a sharp increase in alerts during the analysis period. Most of these alerts are meaningless since they result from a misleading reference set. A constant stream of false alerts can cause alert fatigue. This overburdens the team and increases the risk of missing critical alerts amidst false alerts. 
NannyML’s monitoring workflow is performance-centric. Our Cloud helps you monitor what truly matters and does not flood your system with false alerts. Check out this blog to help you validate whether NannyML is the right solution. 
Using Testing Data as Reference Data

When we repeated the experiment using the testing data as the reference dataset, the expected model performance was more realistic. The resulting plot shows fewer alerts and more stable monitoring.
A classic sign of using training data for the reference period is a large drop at the start of the analysis period. Notice that this drop is absent in the testing data plot. Additionally, when training data is used as the reference, the absence of confidence bands typically indicates an overly confident model, which is a concerning sign.
The reference set derived from testing data simulates the conditions under which the model will operate once deployed.
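The contrast between the two reference choices can be sketched with a toy alerting rule. The sketch assumes alert thresholds are set at the mean plus or minus three standard deviations of the per-chunk metric in the reference period; that rule is an assumption for illustration and is not stated in this post. All metric values below are hypothetical and made up to show the mechanism, not results from the experiment.

```python
from statistics import mean, stdev


def alert_thresholds(reference_values, n_std=3.0):
    """Lower and upper alert thresholds: mean +/- n_std standard deviations."""
    center = mean(reference_values)
    spread = stdev(reference_values)
    return center - n_std * spread, center + n_std * spread


def count_alerts(reference_values, analysis_values):
    """Count analysis chunks whose metric falls outside the reference band."""
    lower, upper = alert_thresholds(reference_values)
    return sum(1 for value in analysis_values if value < lower or value > upper)


# Hypothetical per-chunk ROC AUC values, for illustration only.
training_reference = [0.990, 0.992, 0.991, 0.989, 0.993]  # inflated by memorization
testing_reference = [0.870, 0.850, 0.880, 0.860, 0.875]   # held-out performance
analysis = [0.860, 0.850, 0.870, 0.840, 0.865]            # post-deployment chunks

print("alerts with training reference:", count_alerts(training_reference, analysis))
print("alerts with testing reference:", count_alerts(testing_reference, analysis))
```

With the training reference, the band is both too high and too narrow, so every analysis chunk is flagged even though production performance matches what the held-out data predicted.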
Should We Always Use Testing Data?

For newly deployed models, the reference dataset is usually the test dataset, where the model is evaluated before entering production. 
For models that have been in production for some time, the reference period should be a stretch of production data during which the model performed well.
It is essential to update the reference data regularly. Data distributions can change over time due to seasonality, market trends, or other factors. Keeping this period recent maintains the reliability of the monitoring system.
The analysis dataset consists of the latest production data, which should be collected after the reference period ends. Targets do not need to be available for it, as its primary purpose is to evaluate the model's real-time performance.
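These rules can be sketched as a small helper that splits production records by time and checks that the reference period has targets. The record layout, the column names (`timestamp`, `y_pred_proba`, `target`), and the function name are hypothetical choices for this example, not a NannyML API.

```python
from datetime import date


def split_reference_analysis(records, reference_start, reference_end):
    """Split records into a reference period and the analysis period that follows it."""
    reference = [r for r in records if reference_start <= r["timestamp"] <= reference_end]
    # The analysis period starts only after the reference period ends.
    analysis = [r for r in records if r["timestamp"] > reference_end]
    # The reference period needs targets so performance can be computed on it.
    missing = [r for r in reference if r.get("target") is None]
    if missing:
        raise ValueError(f"{len(missing)} reference rows have no target")
    return reference, analysis


# Hypothetical records; the latest ones have no targets yet.
records = [
    {"timestamp": date(2024, 1, 10), "y_pred_proba": 0.81, "target": 1},
    {"timestamp": date(2024, 2, 5), "y_pred_proba": 0.33, "target": 0},
    {"timestamp": date(2024, 3, 20), "y_pred_proba": 0.64, "target": None},
    {"timestamp": date(2024, 4, 2), "y_pred_proba": 0.12, "target": None},
]

reference, analysis = split_reference_analysis(
    records, date(2024, 1, 1), date(2024, 2, 29)
)
print(len(reference), len(analysis))
```

Moving `reference_start` and `reference_end` forward to a more recent period in which the model performed well is one way to keep the reference data up to date.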
Conclusion
If you use training data as the reference dataset, you might set your monitoring system up for a disaster. For new models, use the test dataset as your reference. For established models, choose a benchmark period from production data.
Alert fatigue and undetected data drift can be costly, and NannyML Cloud simplifies model monitoring. Speak to one of our founders to get tailored solutions. Schedule a demo today to step up your monitoring strategy.

Read More…
If you want to learn about model monitoring and post-deployment data science, you should check these blogs!
Frequently Asked Questions
What is model monitoring?
Model monitoring involves tracking and evaluating the performance of a machine learning model after it has been deployed. It includes detecting changes in the properties of the production data and other types of model degradation.
How do you choose the reference dataset for monitoring ML models?
The reference dataset should reflect the optimal state of a deployed model. For new models, use the test dataset. For existing models, use a representative benchmark dataset from the model's production data.

