Tutorial: Monitoring Missing and Unseen Values with NannyML

Starting from version 0.8.6, NannyML supports features to monitor your data quality. In this version, we added two new calculators, one for monitoring missing values (MissingValuesCalculator) and another for monitoring unseen values (UnseenValuesCalculator).

These two calculators do more than gather statistics on missing and unseen values, which are often straightforward to compute. They also measure the baseline rate of missing/unseen values in a known dataset (a.k.a. the reference set) and monitor production data against that baseline.

In this tutorial, we will use the Marketing dataset to show how to use these calculators and learn how to interpret their results in a practical case.

Installing NannyML

If you haven't installed NannyML in your environment, you can do it by using pip or Anaconda.

Pip

To install NannyML via pip, use the following command:

pip install nannyml

Anaconda

To install NannyML via Anaconda, use the following command:

conda install -c conda-forge nannyml

Loading Required Libraries

In this tutorial, we will use NumPy, pandas, and NannyML.

import numpy as np
import pandas as pd
import nannyml as nml

Load the Data

The Marketing dataset is a collection of survey responses from shopping mall customers in the San Francisco Bay Area. It was designed to predict a household's Annual Income using demographic attributes such as marital status, age, education, etc.

We split the dataset into two parts: reference and analysis. We will use the reference set to calibrate NannyML's methods. Ideally, the reference set mimics real-world data as closely as possible, so it tells us what the acceptable rate of missing/unseen values is.

Then we use a subsequent dataset, which we call analysis, to check whether this new data conforms to our expectations. The analysis set can be any production data that you want to monitor.

data = pd.read_csv('marketing.csv', dtype='category')

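# first half of the rows is the reference set, second half is the analysis set
split = len(data) // 2
reference = data.iloc[:split]
analysis = data.iloc[split:]  # slice to the end so the last row is not dropped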

features = ["Sex",
            "MaritalStatus",
            "Age",
            "Education",
            "Occupation",
            "YearsInSf",
            "DualIncome",
            "HouseholdMembers",
            "Under18",
            "HouseholdStatus",
            "TypeOfHome",
            "EthnicClass",
            "Language",
            "Income"]

Missing Values

Checking missing values in a dataset is relatively straightforward. Usually, we just need the pandas isnull() method.

For example, to count the missing values in each column of the reference set, sorted from most to fewest, we can do the following:

reference.isnull().sum().sort_values(ascending=False)

YearsInSf           496
HouseholdMembers    175
TypeOfHome          175
Language            136
HouseholdStatus     114
Occupation           69
MaritalStatus        62
EthnicClass          40
Education            29
Sex                   0
Age                   0
DualIncome            0
Under18               0
Income                0
dtype: int64
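
Raw counts depend on how many rows a dataset has, which makes them awkward to compare between a reference set and an analysis set of different sizes. A rate (the fraction of rows that are missing) does not have that problem. As a minimal sketch, here is the difference on a small made-up DataFrame. The toy data is hypothetical and is not taken from the Marketing dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate counts vs. rates
toy = pd.DataFrame({
    "Language": ["English", np.nan, "Spanish", "English"],
    "Age": ["25-34", "35-44", "18-24", "25-34"],
})

# Number of missing values per column
missing_counts = toy.isnull().sum()

# Fraction of missing values per column (the mean of a boolean mask)
missing_rates = toy.isnull().mean()

print(missing_counts)
print(missing_rates)
```

Here the Language column has one missing value out of four rows, so its missing rate is 0.25, while Age has none.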

We see that it is normal to have missing values, and it is easy to count them. But what if the analysis set starts to contain more missing values than expected? How can we calculate what the expected and acceptable missing value rate is?

This is where NannyML’s MissingValuesCalculator can help. Let’s see how to use it.

Monitoring Missing Values in Production

NannyML’s methods have a scikit-learn-style interface: we instantiate a method, fit it on a dataset, and then use it to calculate/estimate a metric.

The code snippet below shows how to use the MissingValuesCalculator. The only required parameter is column_names, a list of all the columns we want to check for missing values. Once the calculator is fitted, we can use it to calculate the rate of missing values on the analysis set.

# instantiate the MissingValuesCalculator class
calc = nml.MissingValuesCalculator(column_names=features)

# fit the calculator on the reference set
calc.fit(reference)

# calculate the rate of the missing values on the analysis set
results = calc.calculate(analysis)

# plot the results; calling .show() also renders the figure when running outside a notebook
results.plot().show()

When we plot the results, we get the following.

Missing Values Rate for the MaritalStatus column. The vertical grey line in the center separates the reference and analysis periods.

We see that the MaritalStatus column contains missing values during both the reference and analysis periods. The red horizontal lines are the thresholds, which were computed from the reference data. Since the rate of missing values during the analysis period stays inside these thresholds, we conclude that the MaritalStatus column is behaving as expected.

If a value crosses a threshold, NannyML raises an alert, telling us there is an unusual amount of missing values in a specific period. An example of a column showing more missing values than expected is the Language column.
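
To build some intuition for where such thresholds could come from, here is a rough pandas sketch. It assumes the column is split into equally sized chunks and that the thresholds sit three standard deviations around the mean missing rate of the reference chunks. Both choices, the helper names, and the synthetic data are assumptions made for illustration; NannyML's own chunking and thresholding are configurable and may differ.

```python
import numpy as np
import pandas as pd


def make_column(missing_per_chunk, chunk_size=100):
    """Build a synthetic column with a given number of missing values per chunk."""
    values = []
    for n_missing in missing_per_chunk:
        values += [np.nan] * n_missing + ["English"] * (chunk_size - n_missing)
    return pd.Series(values, dtype="object")


def chunk_missing_rates(series, n_chunks):
    """Split a column into consecutive chunks and return the missing rate of each."""
    positions = np.array_split(np.arange(len(series)), n_chunks)
    return np.array([series.iloc[idx].isnull().mean() for idx in positions])


# Hypothetical data: a stable reference period and an analysis period with one bad chunk
reference_col = make_column([4, 5, 6, 5, 4, 6, 5, 5, 4, 6])
analysis_col = make_column([5, 4, 6, 12, 5])

# Baseline: missing rate of each reference chunk
ref_rates = chunk_missing_rates(reference_col, n_chunks=10)

# Assumed rule: mean +/- 3 standard deviations, with the lower bound clipped at 0
lower = max(ref_rates.mean() - 3 * ref_rates.std(), 0.0)
upper = ref_rates.mean() + 3 * ref_rates.std()

# Compare each analysis chunk against the baseline
ana_rates = chunk_missing_rates(analysis_col, n_chunks=5)
alerts = (ana_rates < lower) | (ana_rates > upper)

print(f"thresholds: [{lower:.4f}, {upper:.4f}]")
print(list(zip(ana_rates.round(2), alerts)))
```

In this synthetic example only the fourth analysis chunk, with a 12% missing rate against a baseline of about 5%, falls outside the thresholds.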

Missing Values Rate for the Language column. The vertical grey line in the center separates the reference and analysis periods. We see how the Language column had more missing values than expected at two chunks of the analysis period.

A column without any missing value would look like the following.

Missing Values Rate for the Age column. The vertical grey line in the center separates the reference and analysis periods.

The Age column doesn’t have any missing values. Maybe it was a mandatory field in the Marketing Survey, so no missing information is shown.

To learn more about the MissingValuesCalculator and its parameters, check out its API reference.

Unseen Values

The unseen values calculator works similarly to the missing values one, with one main conceptual difference: the notion of unseen values only makes sense for categorical variables.

NannyML defines an unseen value as a categorical value that appears in production data but not in the reference period. So, if a new unseen value shows up, NannyML will alert us.
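
The definition above can be sketched in a few lines of pandas. This is only an illustration of the idea, not NannyML's implementation: the function name, the category labels, and the choice to not count missing entries as unseen are all assumptions.

```python
import pandas as pd


def unseen_rate(reference_col, analysis_col):
    """Fraction of analysis rows whose value never appeared in the reference column."""
    seen = set(reference_col.dropna().unique())
    # Assumption: a missing entry is not an unseen category, so it is excluded here
    is_unseen = analysis_col.notnull() & ~analysis_col.isin(seen)
    return float(is_unseen.mean())


# Hypothetical category labels, not the actual values of the Marketing dataset
reference_col = pd.Series(["House", "Condo", "Apartment", "House"])
analysis_col = pd.Series(["House", "Mobile Home", "Condo", None, "Mobile Home"])

rate = unseen_rate(reference_col, analysis_col)

# Which values are new compared to the reference period?
new_values = sorted(set(analysis_col.dropna()) - set(reference_col.dropna()))

print(rate)
print(new_values)
```

Two of the five analysis rows contain a category that the reference column never showed, so the unseen rate in this toy example is 0.4.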

Let’s see how to use the UnseenValuesCalculator method.

Monitoring Unseen Values in Production

Just as with the missing values calculator, we instantiate the UnseenValuesCalculator, fit it on the reference set, and calculate the rate of unseen values on the analysis set.

# instantiate the UnseenValuesCalculator class
calc = nml.UnseenValuesCalculator(column_names=features)

# fit the calculator on the reference set
calc.fit(reference)

# calculate the rate of the unseen values on the analysis set
results = calc.calculate(analysis)

# plot the results; calling .show() also renders the figure when running outside a notebook
results.plot().show()

When we plotted the results for the Marketing dataset, we realized that the only column with unseen values was the TypeOfHome column.

Unseen Values Rate for the TypeOfHome column. The vertical grey line in the center separates the reference and analysis periods.

It looks like, at some point in the survey, a new type of home became available as an option.

An unexpected increase in unseen values in your model input can make your model less confident in the regions containing these values, so it is always important to know how to deal with these changes. To learn more about the UnseenValuesCalculator and its parameters, check out its API reference.

The notion of calculating missing and unseen values is simple, but what NannyML’s methods bring to the table is an easy way to determine what an unexpected rate of missing and unseen values is. So, you can easily tell if your model inputs contain more missing and unseen values than usual!

If you want to learn more about NannyML’s data quality checks or have any questions, join our Slack Community!

We are fully open-source, so don't forget to spread some love by leaving us a star on GitHub! ⭐