February 6, 2023

Banking on Failure: Keeping a Close Eye on Machine Learning Models in Finance

In 2023, banking and technology go hand in hand, with the rise of FinTechs, mobile wallets, and cryptocurrency threatening traditional banking. Beyond the fancy apps and digital channels, banks invest massively in machine learning and artificial intelligence to reduce risk, generate revenue, improve processes, and enhance the overall customer experience.

Machine learning models provide predictive power for making decisions in banking, but this predictive power is susceptible to changes in real-world circumstances. Furthermore, once deployed into production, machine learning models degrade over time. Therefore, an outdated machine learning model making inaccurate predictions could be detrimental to the value chain and stick out like a sore thumb on the balance sheet.

In this blog post, we will learn how to monitor model performance in production using a bank dataset and identify changes in production data.

Monitoring a Multi-currency Card Propensity Model

Machine learning models degrade over time, rendering predictions less accurate. This degradation can occur due to changes in the underlying data or external events. Building the model is only half the work done; ensuring it stands the test of time is where the rubber hits the road.

To monitor performance, we will build a propensity model using a bank dataset that shows a customer's likelihood to take up a multi-currency prepaid card. The dataset contains demographic and financial information of customers, with the target variable being a binary indicator of sign-ups for this product.

Detecting a change in model behaviour will ensure the business does not waste resources targeting the wrong clients. A robust, unbiased, and time-relevant model will help create a better and richer customer experience.

Our propensity model will be trained to predict which customers will sign up for a multi-currency card to allow the bank to target these clients. The bank would typically run a targeted campaign through its digital channels or relationship managers to get customers with a likelihood of interest in the product to sign up for it.

But there is a problem: the absence of ground truth. The ground truth (i.e., whether or not the customer signed up and received the card) may not be available until several weeks or months after the model has made its predictions.

Let's monitor our model using NannyML (even without ground truth)!


NannyML is an open-source Python library that makes monitoring your models easier and more productive. It allows you to estimate post-deployment model performance without the target, detect data drift, and identify the drift responsible for a drop in performance. So let's dive straight into applying NannyML to our use case.

Using NannyML

Let's import the necessary packages and load our data to get started.

First five rows of the dataset.

The dataset contains a mix of categorical and numerical columns. Let's conduct some basic processing of the data. In this case, we encode our categorical columns so that we can feed them as input into our classifier.
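
As a minimal sketch of what this encoding step does, the snippet below maps each category to an integer code using only the standard library. The column names (gender, country) and the sample rows are hypothetical and are not taken from the bank dataset; in practice a library encoder would likely do this job.

```python
# Hypothetical sample rows; the real dataset's columns and values may differ.
rows = [
    {"gender": "Female", "country": "Ghana", "balance": 1200.0},
    {"gender": "Male", "country": "Kenya", "balance": 300.5},
    {"gender": "Female", "country": "Kenya", "balance": 87.0},
]
categorical_columns = ["gender", "country"]  # assumed column names


def fit_label_encoders(rows, columns):
    """Build one {category: integer code} mapping per categorical column."""
    encoders = {}
    for column in columns:
        categories = sorted({row[column] for row in rows})
        encoders[column] = {category: code for code, category in enumerate(categories)}
    return encoders


def encode_rows(rows, encoders):
    """Return copies of the rows with categorical values replaced by their codes."""
    encoded = []
    for row in rows:
        new_row = dict(row)
        for column, mapping in encoders.items():
            new_row[column] = mapping[row[column]]
        encoded.append(new_row)
    return encoded


encoders = fit_label_encoders(rows, categorical_columns)
encoded_rows = encode_rows(rows, encoders)
```

Keeping the fitted mappings around matters: production data has to be encoded with the same codes the model saw during training.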

First few rows (with data encoded)

Next, we include a timestamp column to represent the time at which the various observations were recorded. This will allow us to distinguish between training and production data. Our timestamp will span a period of two years (2021 - 2022).
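
One simple way to attach such a column is to spread timestamps evenly across the two years. The sketch below does exactly that; it is an assumption about how the timestamps could be generated, not necessarily how it was done for this dataset, and the helper name is made up for illustration.

```python
from datetime import datetime


def evenly_spaced_timestamps(n_rows, start, end):
    """Return n_rows timestamps spread evenly from start to end (both inclusive)."""
    if n_rows == 1:
        return [start]
    step = (end - start) / (n_rows - 1)  # a timedelta
    return [start + i * step for i in range(n_rows)]


start = datetime(2021, 1, 1)
end = datetime(2022, 12, 31)

# Five rows for illustration; in practice n_rows is the number of observations.
timestamps = evenly_spaced_timestamps(5, start, end)
```

Each timestamp would then be stored alongside its row so that later steps can order and chunk the data by time.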

We then proceed to train a simple RandomForestClassifier to predict whether a customer will take up the multi-currency card or not.

We will split our data into two partitions: a reference partition, which is our testing data with ground truth, and an analysis partition, which is our production data. We use the test set as a reference instead of the training data since it provides a more accurate assessment of the model's ability to generalize to new data, which is essential for ensuring the reliability and performance of the model over time.
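
A time-based split along these lines can be sketched as follows. The cutoff dates, the sample rows, and the balance values are assumptions made for illustration; only the target name signed_up comes from the dataset described here.

```python
from datetime import datetime

# Hypothetical rows; only the signed_up column name comes from the article.
rows = [
    {"timestamp": datetime(2021, 3, 1), "balance": 950.0, "signed_up": 1},
    {"timestamp": datetime(2021, 9, 1), "balance": 120.0, "signed_up": 0},
    {"timestamp": datetime(2022, 2, 1), "balance": 640.0, "signed_up": 1},
    {"timestamp": datetime(2022, 8, 1), "balance": 80.0, "signed_up": 0},
]
TEST_START = datetime(2021, 7, 1)        # assumed cutoff
PRODUCTION_START = datetime(2022, 1, 1)  # assumed cutoff


def split_partitions(rows, test_start, production_start, target="signed_up"):
    """Split rows by time into training, reference (test), and analysis (production)."""
    train, reference, analysis = [], [], []
    for row in rows:
        if row["timestamp"] < test_start:
            train.append(row)
        elif row["timestamp"] < production_start:
            reference.append(row)  # ground truth is kept
        else:
            # Production rows have no ground truth yet, so the target is dropped.
            analysis.append({k: v for k, v in row.items() if k != target})
    return train, reference, analysis


train, reference, analysis = split_partitions(rows, TEST_START, PRODUCTION_START)
```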

Our datasets are now ready for performance monitoring. The reference partition includes our target variable signed_up (ground truth), which is absent from the analysis partition.

Reference partition
Analysis partition

Estimating Performance of the Model without Target

NannyML can estimate the performance of a machine learning model in production without access to its target. The two approaches for achieving this are Confidence-Based Performance Estimation (CBPE) and Direct Loss Estimation (DLE). CBPE leverages the confidence scores of the predictions and is applied to classification models, while DLE is meant for regression models. Since our use case is a classification problem, we will apply CBPE.
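
To build some intuition for how confidence scores can stand in for the missing target, here is a minimal sketch of the idea for a single metric, accuracy. It assumes the predicted probabilities are well calibrated and uses made-up scores; it is not NannyML's implementation, which also calibrates the scores on the reference data and supports other metrics.

```python
def estimate_accuracy(y_pred_proba, threshold=0.5):
    """Estimate accuracy from calibrated positive-class probabilities alone."""
    expected_correct = 0.0
    for p in y_pred_proba:
        predicted_positive = p >= threshold
        # With calibrated scores, a positive prediction is right with
        # probability p and a negative prediction with probability 1 - p.
        expected_correct += p if predicted_positive else 1.0 - p
    return expected_correct / len(y_pred_proba)


# Made-up scores for two chunks of production data.
confident_chunk = [0.95, 0.05, 0.90, 0.10]
uncertain_chunk = [0.55, 0.45, 0.60, 0.40]

confident_estimate = estimate_accuracy(confident_chunk)
uncertain_estimate = estimate_accuracy(uncertain_chunk)
```

When the model's scores crowd around the threshold, as in the second chunk, the estimated accuracy falls even though no ground truth has been observed.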

Estimated performance chart using CBPE

We can see a drop in model performance in the second and third quarters of 2022. A possible reason for this drop is data drift.

Data drift occurs when the distribution of one or more underlying independent variables changes over time. For example, our dataset might historically contain older customers on average. In the last decade, however, youth banking has become increasingly popular, meaning the average age of bank customers is much lower than in the past. This would result in the decay of a machine learning model that learned to predict digital product uptake on "legacy" data.

Covariate drift: Change in distribution of features

NannyML lets us test our data for drift using two approaches: multivariate and univariate drift detection.

Multivariate Drift Detection

The multivariate detection approach begins by compressing the dataset into a lower-dimensional latent space using Principal Component Analysis (PCA). The compressed data is then decompressed back into the original feature space, which introduces some error, and this reconstruction error, measured on the reference data, serves as the benchmark for data drift detection. If the error on later data rises or falls noticeably relative to that benchmark, it indicates a shift in the data distribution.
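
The mechanics can be illustrated on toy data with two features compressed to a single principal component. This is a sketch under simplifying assumptions (two made-up features, no scaling, a closed-form principal axis), not NannyML's calculator, which handles many features and does additional preprocessing.

```python
import math


def fit_pca_1d(reference):
    """Fit the centre and first principal axis on two-feature reference rows."""
    n = len(reference)
    mean_x = sum(x for x, _ in reference) / n
    mean_y = sum(y for _, y in reference) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in reference) / n
    var_y = sum((y - mean_y) ** 2 for _, y in reference) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in reference) / n
    # Closed-form angle of the leading eigenvector of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    return (mean_x, mean_y), (math.cos(theta), math.sin(theta))


def reconstruction_error(chunk, mean, axis):
    """Mean Euclidean distance between rows and their 1-D PCA reconstruction."""
    total = 0.0
    for x, y in chunk:
        cx, cy = x - mean[0], y - mean[1]
        score = cx * axis[0] + cy * axis[1]        # compress to one number
        rx, ry = score * axis[0], score * axis[1]  # restore to two features
        total += math.hypot(cx - rx, cy - ry)
    return total / len(chunk)


# Made-up data: the two features move together in the reference period.
reference = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
similar_chunk = [(1.5, 1.4), (2.5, 2.6), (3.5, 3.5), (4.5, 4.4)]
drifted_chunk = [(1.5, 4.5), (2.5, 3.5), (3.5, 1.5), (4.5, 0.5)]  # relationship flipped

mean, axis = fit_pca_1d(reference)
baseline_error = reconstruction_error(reference, mean, axis)
similar_error = reconstruction_error(similar_chunk, mean, axis)
drifted_error = reconstruction_error(drifted_chunk, mean, axis)
```

The drifted chunk covers the same range of values for each feature as the reference data, yet its reconstruction error is far larger because the relationship between the two features has changed. That is the kind of shift a feature-by-feature check would miss.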

By utilizing this approach, we can examine the entire dataset at once and easily determine if the issue is with the data or elsewhere. Let's demonstrate how this method can be applied to our dataset.

Multivariate drift graph

The graph shows a number of drift alert signals in the year's first half. This reflects what we witnessed earlier on in our performance estimation. We are getting closer to unpacking the reasons for our drop in performance. Let's focus on the features in detail.

Univariate Drift Detection

The univariate approach for identifying data drift evaluates changes in the distribution of each feature, comparing the chunks generated from the analysis period with the reference data. This approach, however, does not capture changes in the correlations between model features.
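
One common way to compare a chunk of a continuous feature with the reference data is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical distribution functions. The sketch below computes it with the standard library on made-up balance values; whether the analysis in this post relied on this particular statistic is an assumption.

```python
from bisect import bisect_right


def ks_statistic(reference, chunk):
    """Largest gap between the empirical CDFs of two samples (two-sample KS)."""
    ref_sorted = sorted(reference)
    chunk_sorted = sorted(chunk)
    largest_gap = 0.0
    for value in sorted(set(ref_sorted) | set(chunk_sorted)):
        # Fraction of each sample that is less than or equal to this value.
        ref_cdf = bisect_right(ref_sorted, value) / len(ref_sorted)
        chunk_cdf = bisect_right(chunk_sorted, value) / len(chunk_sorted)
        largest_gap = max(largest_gap, abs(ref_cdf - chunk_cdf))
    return largest_gap


# Made-up balances, for illustration only.
reference_balance = [100, 200, 300, 400, 500, 600, 700, 800]
stable_chunk = [150, 250, 350, 450, 550, 650, 750, 850]
drifted_chunk = [50, 60, 70, 80, 90, 110, 120, 130]

stable_ks = ks_statistic(reference_balance, stable_chunk)
drifted_ks = ks_statistic(reference_balance, drifted_chunk)
```

A statistic near 0 means the chunk looks like the reference data, while a value near 1 means the two distributions barely overlap.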

In addition, NannyML allows us to rank our features based on the extent of their drift.
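
A simple ranking rule, sketched below, is to count how many chunks raised a drift alert for each feature and sort by that count. The alert flags and the age column are made up for illustration, and ranking by alert count is assumed here as one plausible criterion rather than a description of the exact ranking used.

```python
def rank_by_alert_count(alerts_per_feature):
    """Order features by how many chunks raised a drift alert, most alerts first."""
    counts = {feature: sum(flags) for feature, flags in alerts_per_feature.items()}
    # Sort by descending count, breaking ties alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up per-chunk alert flags, one list entry per chunk.
alerts_per_feature = {
    "age": [False, False, False, False],
    "balance": [False, True, True, True],
    "num_of_products": [False, True, True, False],
}

ranking = rank_by_alert_count(alerts_per_feature)
drifting_features = [feature for feature, count in ranking if count > 0]
```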

We see that num_of_products and balance are the only variables that show drift. But just how much have these distributions drifted? Let's visualize the distributions over time.

Distribution graph for balance
Distribution graph for num_of_products

Both drifts indicate a reduced feature variance, with mean values closer to the lower limits. This allows us to zero in on what might account for the drifts from March to June, which ultimately affect the model's performance.

A possible explanation is the issuance of new pricing guides with increased fees. The bank would typically communicate its new pricing structure to customers at the beginning of the year. As customers begin to feel the impact of these fees during the year, their trust in the bank can erode and, in some cases, they switch to another financial institution. Moreover, higher fees can make certain banking services less accessible and affordable to specific customer segments, such as low-income customers. This can result in a decline in customer engagement, product utilization, and usage of bank services.

With this information and result, bank executives and managers can put in place proper strategies to ensure increased uptake of their product.


Monitoring a machine learning model after launching it into production is as important as building and deploying it initially. It ensures that decisions made from its predictions remain relevant over time. In the financial services industry, this is critical to protect the purse of individuals and corporates alike. Libraries such as NannyML, which we have gone through in this tutorial, are essential to monitoring machine learning model performance and catching model drift before it hurts the business.

Check out the NannyML repo on GitHub and try out some examples. All the best!
