Data Shift in ML: Understanding Statistical Intuition

The growing number of Machine Learning (ML) models deployed to production has turned the i.i.d.* assumption into another example of wishful thinking. The world is changing. Data is changing. ML models stay the same, though. They don’t adapt automatically (yet). To make the right decision on when and how to adapt the model, you need to monitor the changes in data that affect your model.

In this article, we will explore the main types of data change (data drift), their effect on target distribution and on model performance. We share findings that we have learned and developed while researching data drift detection and performance estimation methods at NannyML. So, if you haven’t heard about covariate shift, concept shift, and label shift - you’re in good hands. If you have read about them many times already – that’s even better, I bet you will find this interesting.

The concepts and ideas in this article are described with binary classification in mind. We will often refer to an example of credit default prediction. This simplifies the description and helps to build the intuition. When well understood, these concepts can be generalized to other tasks.

*i.i.d - the assumption that input data is an independent and identically distributed collection of random variables.

What is Data Drift?

Data drift is any change in the joint probability distribution of input variables and targets (denoted as $P(X, Y)$) [1]. This simple definition hides a lot of complexity behind it.

Notice the keyword joint. Data drift can take place even if none of the individual variables drifts in isolation - that is, when their marginal distributions don’t change. It might be the relationship between variables that changes.

There are different types of data drift, depending on the type of change that happens in this joint probability distribution and the causal relationship between inputs and targets.

The two most popular ones mentioned in almost every data-drift-related piece of content are covariate shift and concept shift. Even if you have already read about them tens of times, read again. This time it is going to be different. Trust me.

Covariate shift

What covariate shift really is

Covariate is just another name for the model input variable (or feature).

Pure covariate shift is then a change in the joint probability distribution of input variables provided that the probability distribution of targets conditioned on inputs remains the same [1].

Put more simply, it is a change in the distribution of the inputs, with the relationship (not necessarily causal) between inputs and target remaining unchanged.

An example of a covariate shift in the credit default use case would be an increase in the fraction of applicants with low income, with the relationship between income and probability of default remaining the same.

The product rule

This is a good place to introduce a widely used equation, the probability product rule:

$$P(X, Y) = P(Y \mid X)\,P(X)$$

This time we will actually explain it and make use of it. Don’t worry, there won’t be a lot of heavy math or statistics, just the basics.

Let’s start from the term $P(Y \mid X)$ – this is a conditional probability of targets given inputs. It represents the true probabilistic relationship between inputs and targets. We call it a concept. This is what ML models try to learn based on data.

For example, a trained sklearn classifier holds a representation of the learned concept, which maps inputs to expected target probabilities through the predict_proba method. In general, it can be represented as the distribution of targets given the value of inputs. In the case of binary classification, $P(Y \mid X)$ can be imagined as two hypersurfaces over the whole input space. This is tricky. Let’s build our intuition from the bottom up.

While keeping in mind the credit default use case, consider only a single observation – a single credit applicant described with a set of input features:

$$x = (x_1, x_2, \dots, x_n)$$

Let’s also focus on a single target value: $y = 1$ (the applicant defaults). Now the concept term becomes:

$$P(y = 1 \mid X = x)$$

and it represents the probability of defaulting for a specific set of inputs (an applicant). It is a single number, a scalar between 0 and 1. Let’s simplify even further and imagine our model has only one input feature, so $x$ is 1d. We now have $P(y = 1 \mid X = x)$, which is still a scalar. $P(y = 1 \mid X)$, however, is now a curve – it expresses the probability of default for all possible values of $x$. We can easily imagine and plot it:

For a second let’s forget about the input distribution and focus on the target alone, i.e. on $Y$. Let’s drop the simplification for now. For binary classification, the target can take two values: $0$ or $1$. The probability of seeing one value defines the probability of seeing the other – if we have a 20% probability of observing $y = 1$ then it means we have an 80% probability of observing $y = 0$. Now, for a selected observation, the probability of targets given input becomes a pair of scalars:

$$P(Y \mid X = x) = \big(P(y = 0 \mid X = x),\; P(y = 1 \mid X = x)\big)$$

For 1d input, $P(Y \mid X)$ is two curves over the range of $x$, for 2d – two surfaces over the area spanned by $x_1$ and $x_2$, and for 3 and more dimensions – two hypersurfaces over the whole input space.

I hope it becomes clear now. In the rest of the text, we will stick to the $y = 1$ assumption most of the time.

Now the other term: $P(X)$.

This, I think, is simpler. It is just a joint multivariate probability distribution of all input variables/features. If there is only one input and that input is a categorical (discrete) variable, it is a probability of each category or probability mass function (PMF). For continuous input, we can simplify and imagine this as a probability density function (PDF)*. We can visualize both:

*Keep in mind that for a PMF the probability of seeing a specific discrete value is defined, so $P(X = x)$ is a scalar. For continuous variables, the notion of probability exists only over ranges of $x$, so $P(a \le X \le b)$ is the integral of the PDF over the range from $a$ to $b$.

The magnitude of covariate shift

With the basics sorted, let’s go back to the covariate shift. In pure covariate shift, the concept remains the same while the probability distribution of inputs changes. This causes a change of $P(X, Y)$, which – as we already mentioned – is the joint probability distribution of inputs and targets.

Let’s talk examples.

Assume again that our credit default model has only one input variable – applicant’s income. Consider it a categorical variable with three categories – low, medium, and high. In the training data of our model, we have:

30% low,

50% medium,

20% high-income applicants.

Now imagine that the economy is getting better and there are fewer low-income applicants. We now have the following:

10% low,

70% medium,

20% high-income applicants.

This can be visualized:

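The first natural question to ask is – how to express the magnitude of covariate shift? An intuitive and easily interpretable measure is the sum of absolute differences of $P(x)$ over the categories:

$$\frac{1}{2}\Big(\big|P_{\mathrm{shifted}}(\text{low}) - P_{\mathrm{ref}}(\text{low})\big| + \big|P_{\mathrm{shifted}}(\text{medium}) - P_{\mathrm{ref}}(\text{medium})\big| + \big|P_{\mathrm{shifted}}(\text{high}) - P_{\mathrm{ref}}(\text{high})\big|\Big)$$

or in the general case – the absolute difference between input distributions:

$$\frac{1}{2}\sum_{x}\big|P_{\mathrm{shifted}}(x) - P_{\mathrm{ref}}(x)\big|$$

The difference is multiplied by $\frac{1}{2}$ to avoid measuring the same change twice (in our example the low-income category dropped by $0.2$, so the other categories had to increase by $0.2$ – if we count both, the total change will be $0.4$, but the intuition says that the actual change is $0.2$). As a side effect, the result falls into the range $[0, 1]$, where $0$ means that nothing has changed, while $1$ means there is no overlap at all between reference and shifted data. In our case, that would be:

$$\frac{1}{2}\big(|0.1 - 0.3| + |0.7 - 0.5| + |0.2 - 0.2|\big) = 0.2$$

This indeed is interpretable, especially in simple cases like the example analyzed, as exactly 20% of applicants have moved from one category to the other.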
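The calculation is small enough to check in a few lines of Python. The sketch below is only an illustration: the function name `covariate_shift_magnitude` is made up for this example, and the category shares are the ones quoted in the text.

```python
# Income category shares from the example in the text.
reference = {"low": 0.30, "medium": 0.50, "high": 0.20}
shifted = {"low": 0.10, "medium": 0.70, "high": 0.20}


def covariate_shift_magnitude(p_ref, p_shifted):
    """Half the sum of absolute differences between two categorical distributions."""
    categories = set(p_ref) | set(p_shifted)
    total = sum(abs(p_shifted.get(c, 0.0) - p_ref.get(c, 0.0)) for c in categories)
    return 0.5 * total


magnitude = covariate_shift_magnitude(reference, shifted)
print(round(magnitude, 3))  # 0.2
```

Two identical distributions give 0, and two distributions with no category in common give 1, which matches the range described above.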

The effect of covariate shift on target distribution

We know the magnitude of the covariate shift already. Now we are interested to see what the impact is on target distribution. We need a concept for this. Let’s assume conditional probabilities of default given income for each income category:

$$P(y = 1 \mid \text{low}) = 0.3, \qquad P(y = 1 \mid \text{medium}) = 0.1, \qquad P(y = 1 \mid \text{high}) = 0.05$$

Showing the full story in a single plot now:

Let’s forget about covariate shift for a second and look at the reference period only. What is the overall probability of seeing an event of a low-income applicant defaulting? Product rule to the rescue.

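It is $P(y = 1 \mid \text{low}) = 0.3$ multiplied by $P(\text{low}) = 0.3$, which gives $0.09$. It means that in reference data, 9% of all applicants are low-income applicants who default. For medium and high-income applicants, this is respectively 5% and 1%. That is our initial target distribution with respect to the input. Now, after the covariate shift this becomes:

$0.3 \cdot 0.1 = 0.03$ for low,

$0.1 \cdot 0.7 = 0.07$ for medium,

$0.05 \cdot 0.2 = 0.01$ for high-income.

So until now, we have applied the product rule to the reference input distribution and the shifted input distribution while keeping the concept the same:

$$P_{\mathrm{ref}}(X, Y) = P(Y \mid X)\,P_{\mathrm{ref}}(X), \qquad P_{\mathrm{shifted}}(X, Y) = P(Y \mid X)\,P_{\mathrm{shifted}}(X)$$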

Now we can compare these two and summarize changes in categories:

a drop from 0.09 to 0.03 in the low-income group,

an increase from 0.05 to 0.07 in the medium-income,

no change in the high-income group.

Plotting again:

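These differences can be used as the measure of the covariate shift effect on target distribution. An absolute magnitude of this effect for binary classification is:

$$\sum_{x}\big|P_{\mathrm{shifted}}(x, y = 1) - P_{\mathrm{ref}}(x, y = 1)\big|$$

Or:

$$\sum_{x}\big|P(y = 1 \mid x)\,\big(P_{\mathrm{shifted}}(x) - P_{\mathrm{ref}}(x)\big)\big|$$

For our example this is:

$$|0.03 - 0.09| + |0.07 - 0.05| + |0.01 - 0.01| = 0.08$$

The interpretation is that 8% of targets have changed their expected value due to the covariate shift, either from $0$ to $1$ or from $1$ to $0$. We can also calculate the directional change (non-absolute):

$$\sum_{x} P(y = 1 \mid x)\,\big(P_{\mathrm{shifted}}(x) - P_{\mathrm{ref}}(x)\big) = -0.06 + 0.02 + 0 = -0.04$$

In binary classification the directional effect on target distribution is equal to the expected change in class balance. So in our case, due to the covariate shift, we expect 4 percentage points fewer defaults among all the applicants.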
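The same numbers can be reproduced with a short sketch. The default probabilities per category (0.3, 0.1 and 0.05) are not stated as code anywhere in the article; they are the values implied by the joint probabilities of 9%, 5% and 1% quoted for the reference data. The variable names are illustrative only.

```python
# Input distributions from the example in the text.
p_ref = {"low": 0.30, "medium": 0.50, "high": 0.20}
p_shifted = {"low": 0.10, "medium": 0.70, "high": 0.20}

# Concept P(default | income), implied by the joint probabilities 0.09, 0.05, 0.01.
concept = {"low": 0.30, "medium": 0.10, "high": 0.05}

# Change of the joint probability P(income, default) in every category.
per_category = {c: concept[c] * (p_shifted[c] - p_ref[c]) for c in concept}

absolute_effect = sum(abs(v) for v in per_category.values())
directional_effect = sum(per_category.values())

print(round(absolute_effect, 3))     # 0.08
print(round(directional_effect, 3))  # -0.04
```

The directional value is the change in the overall default rate: it drops from 0.15 to 0.11.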

Impact of covariate shift on performance

We already discussed the business effect of covariate shift – there are fewer low-income people amongst applicants and fewer applicants defaulting in general. Whether that’s good or bad – depends on the business-related stuff (like business goals, the dependence of profitability of credits on applicants’ income and default probability, etc.) Data-science-wise, we might wonder what’s the effect of covariate shift on the performance of the model. Let’s have a look at the following plot:

For all the considerations we assume our model is indeed probabilistic and it returns well-calibrated probabilities (meaning that among applicants whose credit default probability was estimated to be 10%, 10% will default). With this assumption, the concept that we see on the x-axis is equal to the probability predicted by our model. On the y-axis, we see the proportion of each specific prediction (which is equal to the proportion of inputs that the concept maps to the specified probability for class 1, hence $P(X)$).

We see that due to covariate shift, we will predict the probability of 0.1 more often compared to the reference period, while a 0.3 prediction will be less likely. In that case, the model will get better according to most of the metrics generally used for binary classification (accuracy, f1, ROC AUC, etc.) Why exactly? Because we have fewer predictions from the high-uncertainty region (close to 0.5) and more from the low-uncertainty region (close to 0 and 1). Since explaining this in detail is not the purpose of this (already too long and getting longer) blog post, check out more information here if you are not convinced.
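For accuracy at least, the claim can be sanity-checked with the sketch below. It rests on assumptions that are not part of the article: a perfectly calibrated model, a decision threshold of 0.5, and default probabilities of 0.3, 0.1 and 0.05 for low, medium and high income (the values implied by the joint probabilities quoted earlier). Under these assumptions the expected accuracy is the share of each category times the probability that the thresholded prediction is correct.

```python
p_ref = {"low": 0.30, "medium": 0.50, "high": 0.20}
p_shifted = {"low": 0.10, "medium": 0.70, "high": 0.20}

# Calibrated predicted probability of default per category (assumed values).
concept = {"low": 0.30, "medium": 0.10, "high": 0.05}


def expected_accuracy(p_x, concept, threshold=0.5):
    """Expected accuracy of a calibrated model that predicts 1 when p >= threshold."""
    accuracy = 0.0
    for category, share in p_x.items():
        p_default = concept[category]
        predicts_default = p_default >= threshold
        p_correct = p_default if predicts_default else 1.0 - p_default
        accuracy += share * p_correct
    return accuracy


acc_ref = expected_accuracy(p_ref, concept)
acc_shifted = expected_accuracy(p_shifted, concept)
print(round(acc_ref, 2), round(acc_shifted, 2))  # 0.85 0.89
```

The gain comes entirely from applicants moving out of the category with the most uncertain prediction (0.3) into one with a more certain prediction (0.1).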

Anyways, the covariate shift has improved our model – that’s great, but can we just enjoy the stroke of luck and do nothing? Not really. Theoretically, when experiencing pure covariate shift, model performance cannot be improved by retraining, because the concept that the model learns stays the same. In reality, however, we just got more data from a specific region in the input space. It may turn out that this region was previously underrepresented, and the concept our model learned can be improved there. Definitely worth investigating. Another thing is that covariate shift may foreshadow concept shift happening soon*.

We will discuss concept shift later. For now, we will go through the same analysis for continuous input.

*Alright, you got me – it may or may not. I wanted to introduce some drama. The gut feeling, however, is that when observable data changes, we expect the unobservable data may change – just as in the world we live in, everything impacts everything in a sense. So if one thing changes, we expect other things to change as an effect, sooner or later. And the change of unobservable data that is causally related to the target is a concept shift. We will explain that later.

Same analysis for continuous input

We have analyzed a single categorical input. Let’s see what happens with continuous variables.

When you write a master's thesis at a technical university, they tell you to take your bachelor's thesis and replace all the sums with integrals and all the differences with differentials, and it’s ready. It’s the case here when switching to a covariate shift in continuous variables.

Imagine our single-input model again, but this time with continuous, normally distributed income. The covariate shift and the unchanged concept are as follows:

Now the magnitude of covariate shift alone becomes:

$$\frac{1}{2}\int \big|p_{\mathrm{shifted}}(x) - p_{\mathrm{ref}}(x)\big|\,dx$$

We can clearly show the integral in the plot:

The value calculated for our case is 0.38. What does it mean? Notice the 90 k€/y threshold where the two pdfs cross. After the covariate shift, there are fewer people in the <90 k€/y range and more people in the >90 k€/y range. Exactly 38% of applicants migrated from the <90 k€/y range to the >90 k€/y range. Now let’s take the concept into account and calculate the absolute effect of pure covariate shift on target distribution. Denoting the concept $P(y = 1 \mid x)$ as $c(x)$, we have:

$$\int \big|c(x)\,\big(p_{\mathrm{shifted}}(x) - p_{\mathrm{ref}}(x)\big)\big|\,dx$$

The directional effect is simply:

$$\int c(x)\,\big(p_{\mathrm{shifted}}(x) - p_{\mathrm{ref}}(x)\big)\,dx$$

The integrands can be plotted:

The absolute and directional effects have exactly the same meaning as in the case of categorical variables. Let’s focus on the directional effect as it shows the full picture (right-hand side plot).

We can see that in the 0-90 k€/y area, the effect of covariate shift on target distribution is negative. That’s expected – as a result of the income increase, there are relatively fewer applicants with income in that range, so we will observe fewer people defaulting from the 0-90 k€/y area.

On the other hand, the number of applicants with income >90 k€/y has increased, and we will observe more people defaulting from that group. At the end of the day, the negative effect is stronger (the red area is larger than the green one) because of higher concept values in the negative effect range. The fraction of applicants defaulting will drop by 6.9 percentage points (as calculated using the directional integral).
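The integrals can be approximated numerically. The sketch below does not use the article's data: the two normal income distributions (means of 80 and 100 k€/y, standard deviation of 20) and the logistic concept are hypothetical choices made for illustration. They happen to cross at 90 k€/y and give a shift magnitude of about 0.38, but the effect on the target will differ from the 6.9 quoted above because the concept curve is assumed.

```python
import math


def normal_pdf(x, mean, std):
    """Probability density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def integrate(f, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(f(lo + (i + 0.5) * step) for i in range(n))


# Hypothetical income distributions in k€/y (not taken from the article).
def p_ref(x):
    return normal_pdf(x, 80.0, 20.0)


def p_shifted(x):
    return normal_pdf(x, 100.0, 20.0)


# Hypothetical concept: probability of default falls as income grows.
def concept(x):
    return 1.0 / (1.0 + math.exp((x - 60.0) / 20.0))


LO, HI = -50.0, 250.0  # wide limits so that the tails are negligible

magnitude = 0.5 * integrate(lambda x: abs(p_shifted(x) - p_ref(x)), LO, HI)
absolute_effect = integrate(lambda x: abs(concept(x) * (p_shifted(x) - p_ref(x))), LO, HI)
directional_effect = integrate(lambda x: concept(x) * (p_shifted(x) - p_ref(x)), LO, HI)

print(round(magnitude, 2))     # 0.38
print(directional_effect < 0)  # True: fewer defaults after the shift
```

A plain midpoint rule is enough here because the integrands are smooth apart from one kink where the pdfs cross.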

When it comes to the effect of covariate shift on model performance metrics – just like in the categorical case, the model has improved due to covariate shift as there are relatively more predictions from lower-uncertainty regions:

Covariate shift summary

Let’s recall all the measures of covariate shift that we have introduced:

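1. Pure covariate shift effect that does not take the concept into account. It tells how severe the covariate shift is alone:

$$\frac{1}{2}\int \big|p_{\mathrm{shifted}}(x) - p_{\mathrm{ref}}(x)\big|\,dx$$

2. The absolute and directional effects of covariate shift on target distribution. The first tells the fraction of targets affected, and the second tells the direction of the change (the effect on class balance):

$$\int \big|c(x)\,\big(p_{\mathrm{shifted}}(x) - p_{\mathrm{ref}}(x)\big)\big|\,dx, \qquad \int c(x)\,\big(p_{\mathrm{shifted}}(x) - p_{\mathrm{ref}}(x)\big)\,dx$$

Here $c(x) = P(y = 1 \mid x)$ is the concept and $p(x)$ is the input distribution; for categorical inputs the integrals become sums.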

What’s described here is a part of the research we do at NannyML. Some of it does not come from reviewed articles so we might be wrong here and there (well, reviewed articles can be wrong as well). Let us know if you find any mistakes in our reasoning. Explore the notebook linked to check the results on your own.

Now it’s time to go into detail and tackle another big monster, concept shift!

Concept shift

What concept shift really is

Concept is the relationship between covariates (features) and the target. Pure concept shift is a change in the conditional probability of targets given inputs, provided that the joint probability distribution of inputs remains the same. So $P(Y \mid X)$ changes, $P(X)$ stays the same, and $P(X, Y)$ changes as an effect [1] (due to the product rule, didn’t I tell you to read about covariate shift first?).

An example of concept shift in our use case would be: the fraction of low-income people applying for credit remains unchanged, but the likelihood that a low-income applicant will default changes.

Bayesian view on concept shift

When we view our model as a Bayesian network, we can define concept shift as a shift in unobserved variables which directly or indirectly cause the target with no observed variables on a causal path. Sounds complicated, but it will get simple soon. Let’s imagine that the network below is true for our simple model:

In this graph, arrows indicate true causal relationships. This simplified model of reality assumes that there are two factors that directly cause clients to default – whether they can afford to pay the loan back (monthly balance) and whether they consider it beneficial to pay it back (perception of defaulting consequences).

The first thing to notice is that there is no direct arrow between the income and default probability. That’s what I meant at the beginning when describing the concept as a relationship that is not necessarily causal: it is a mapping between income and defaulting probability that can be learned. Again, that’s what ML models do – they learn the mapping.

In the causal world, a job salary has no direct effect on the probability of credit default. We can earn a lot but need to spend a lot as well. We may also not work at all but have other sources of money (like investments or savings). So income affects the target through monthly balance. Income itself in this simple model is caused by the general economic situation and the applicant’s value on the job market.

Let’s consider the following population-level shifts (i.e., shifts that are a real trend for all applicants, not for a single one):

client’s value on the job market shifts (for whatever reason, let’s not complicate this by details) – notice that there is no other path to affect the target other than the one leading through the observed income variable. What does it mean? That the change in client’s value on the job market that is relevant for the target will always be observed in the income variable. From our perspective it just looks like the income distribution has changed, and the target distribution has changed accordingly – with the mapping between them remaining the same. This will be a pure covariate shift.

unavoidable costs shift - this will affect the target through monthly balance variable and won’t be noticed by income. This shift will change the target distribution without affecting the income distribution. What happens then is that the input (income) distribution remains the same but the target distribution changes. From our (and model) perspective the mapping between income and target will have to change then as we see different targets for the same inputs. This will be a pure concept shift.

general economy shifts – this usually will be seen as both – covariate shift and concept shift as it affects the observed variable and the target through a path that does not contain the observed variable.

I found this view on concept shifts more comprehensive. An alternative is just saying that the concept is changing, for whatever reason. But this reason is always related to another change in the system, which is unobserved or might even be unmeasurable. For example – buyers’ decisions may change due to social media influencers changing their perception of a specific behavior or a product. This perception change is unmeasurable, but it has a real, measurable effect on concept change.

The magnitude of concept shift

We are already familiar with the credit default model example, and we will explore it further. We have some intuition already, so we can take shortcuts. We will go straight to continuous input variable cases as I found them more insightful.

The situation is: income distribution stays the same, concept changes:

Qualitatively we see that the concept has changed in the following way – applicants with income <93 k€/y became more likely to default while the ones earning >93 k€/y became less likely to. The change for the <93 k€/y group is stronger (the difference is larger).

Let’s quantify that. Recall that we have denoted the concept $P(y = 1 \mid x)$ as $c(x)$. The absolute magnitude of pure concept shift effect is:

$$\frac{1}{x_{\max} - x_{\min}}\int_{x_{\min}}^{x_{\max}} \big|c_{\mathrm{shifted}}(x) - c_{\mathrm{ref}}(x)\big|\,dx$$

The first term (outside of the integral) is a normalization factor that ensures that the result of integration converges and is meaningful. Otherwise, it would increase together with an increase of the difference between integral limits $x_{\min}$ and $x_{\max}$.

How to define $x_{\min}$ and $x_{\max}$? This is the range in which $P(X)$ exists, as the concept itself exists only where data exists. Theoretically, $P(X)$ exists everywhere (ranges from $-\infty$ to $+\infty$) as the pdf of a normal distribution will return a positive value for any $x$. In reality, this is limited, and in practice, this could be just the minimum and maximum value of that variable seen in the data sample.

The normalization term also ensures that the value of the proposed concept shift measure is in the range of $[0, 1]$. $0$ means there is no concept shift at all. $1$ means that all the targets would change their expected value – either from $0$ to $1$ or the other way around.

This measure does not take into account the probability distribution of $X$ - it doesn't care how exactly input data is distributed, it only cares where $P(X)$ and the concept exist, which is between $x_{\min}$ and $x_{\max}$. Therefore it does not tell how the target is affected by our specific $P(X)$. It tells how severe the concept shift is alone and how the target would be affected if $X$ was uniformly distributed in the range between $x_{\min}$ and $x_{\max}$. In the analyzed case concept shift magnitude equals 0.043.
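A minimal numerical sketch of this measure follows. The two logistic concept curves and the income range of 20 to 160 k€/y are hypothetical (they are chosen so that the curves cross at 93 k€/y, as in the description above), so the result will not reproduce the 0.043 from the article. The function name `concept_shift_magnitude` is made up for this example.

```python
import math


def integrate(f, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(f(lo + (i + 0.5) * step) for i in range(n))


# Hypothetical concepts: probability of default as a function of income in k€/y.
def concept_ref(x):
    return 1.0 / (1.0 + math.exp((x - 60.0) / 20.0))


def concept_shifted(x):
    # Steeper curve that crosses the reference concept at 93 k€/y.
    return 1.0 / (1.0 + math.exp((x - 68.25) / 15.0))


def concept_shift_magnitude(c_ref, c_shifted, x_min, x_max):
    """Average absolute concept difference over the range where data exists."""
    area = integrate(lambda x: abs(c_shifted(x) - c_ref(x)), x_min, x_max)
    return area / (x_max - x_min)


# x_min and x_max stand for the smallest and largest income seen in the sample.
magnitude = concept_shift_magnitude(concept_ref, concept_shifted, 20.0, 160.0)
print(0.0 < magnitude < 1.0)  # True
```

Dividing by the width of the range is what keeps the value comparable between variables measured on different scales.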

The effect of concept shift on target distribution

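We have calculated the magnitude of concept shift alone. Now we want to see its impact on target distribution when we include the actual probability distribution of the input, $p(x)$. With $c(x) = P(y = 1 \mid x)$ denoting the concept, we can calculate the absolute impact with the following:

$$\int \big|c_{\mathrm{shifted}}(x) - c_{\mathrm{ref}}(x)\big|\,p(x)\,dx$$

and the directional one:

$$\int \big(c_{\mathrm{shifted}}(x) - c_{\mathrm{ref}}(x)\big)\,p(x)\,dx$$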

The terms inside integrals can be visualized:

Let’s look at the directional change. In the <93 k€/y region, concept shift has a positive effect on target distribution as the probability of default for this income range has increased. It is additionally boosted by the fact that there are generally more <93 k€/y applicants than >93 k€/y. The final effect on target distribution will be then positive and equal to about 3% - we will see an increase of defaults by 3 percentage points.
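The weighting by the input distribution can be sketched in the same way. Everything numeric below is assumed for illustration: a normal income distribution with a mean of 80 k€/y and a standard deviation of 20, and two logistic concept curves that cross at 93 k€/y. The resulting values are therefore not the article's 3 percentage points, only the same kind of quantity.

```python
import math


def normal_pdf(x, mean, std):
    """Probability density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def integrate(f, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(f(lo + (i + 0.5) * step) for i in range(n))


# Hypothetical, unchanged income distribution in k€/y.
def p_x(x):
    return normal_pdf(x, 80.0, 20.0)


# Hypothetical concepts before and after the shift (they cross at 93 k€/y).
def concept_ref(x):
    return 1.0 / (1.0 + math.exp((x - 60.0) / 20.0))


def concept_shifted(x):
    return 1.0 / (1.0 + math.exp((x - 68.25) / 15.0))


LO, HI = -50.0, 250.0  # wide limits so that the tails are negligible

absolute_effect = integrate(
    lambda x: abs(concept_shifted(x) - concept_ref(x)) * p_x(x), LO, HI
)
directional_effect = integrate(
    lambda x: (concept_shifted(x) - concept_ref(x)) * p_x(x), LO, HI
)

print(directional_effect > 0)  # True: more defaults after the concept shift
```

The directional value is positive because most applicants sit below the crossing point, where the shifted concept is higher.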

The effect of concept shift on model performance

Pure concept shift generally has a negative effect on model performance. The concept is what the model learns, and if it changes, the model becomes worse. The performance drop depends on how strong the concept shift is and how much data is affected by it.

Concept shift summary

Let’s recall all the measures of concept shift that we have introduced:

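1. Pure concept shift effect that does not take input distribution into account. It tells the severity of concept shift alone – what is the fraction of targets that would be affected if $X$ was uniformly distributed:

$$\frac{1}{x_{\max} - x_{\min}}\int_{x_{\min}}^{x_{\max}} \big|c_{\mathrm{shifted}}(x) - c_{\mathrm{ref}}(x)\big|\,dx$$

2. The absolute and directional effect of concept shift on target distribution. The first tells the fraction of targets affected, the second tells the direction of the change (the effect on class balance):

$$\int \big|c_{\mathrm{shifted}}(x) - c_{\mathrm{ref}}(x)\big|\,p(x)\,dx, \qquad \int \big(c_{\mathrm{shifted}}(x) - c_{\mathrm{ref}}(x)\big)\,p(x)\,dx$$

Here $c(x) = P(y = 1 \mid x)$ is the concept and $p(x)$ is the input distribution.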

In the previous two sections, we discussed covariate shift and concept shift separately. In reality they may both happen at the same time. Measuring separate effects of concept shift and covariate shift in such situation is not enough – usually they will also interact with each other.

Covariate shift and concept shift

Isolated effects, interaction, and combined effect

Imagine a concept shift that happens in some region of input space and a covariate shift that moves the data towards that region. The interaction will be positive (in the sense that it exacerbates the effect, like in a positive feedback loop): concept shift alone would not be as harmful as it is together with a covariate shift that pushes more data towards the drifted region.

On the other hand, covariate shift may cause data to escape from the concept-shifted region. The interaction term would then be negative, as the concept shift effect is mitigated by the fact that less data is affected by it.

Apart from measuring the magnitude and effect on target distribution of concept shift and covariate shift we are also interested in interaction term and the combined effect of both. Let’s discuss the example:

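We now have covariate shift – the mean of the income distribution increases – and concept shift – the probability of defaulting for low-income applicants increases. With $c(x) = P(y = 1 \mid x)$ denoting the concept and $p(x)$ the input distribution, the directional interaction term can be calculated from:

$$\int \big(c_{\mathrm{shifted}}(x) - c_{\mathrm{ref}}(x)\big)\big(p_{\mathrm{shifted}}(x) - p_{\mathrm{ref}}(x)\big)\,dx$$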

We will discuss why we care about directional interaction only (and not the absolute one) soon, now let’s plot what’s inside the integral:

So what happened was, as a result of concept shift, the probability of defaulting among applicants with lower income has significantly increased. That would have a strong positive effect on the target distribution (more applicants defaulting), but as an effect of the covariate shift, there are fewer lower-income applicants. So the interaction term balances the positive effect of concept shift in that region (the negative peak in the interaction plot is around 60 k€/y). There is some small positive interaction as well. As an effect of covariate shift, there are more people in the >90 k€/y group. The concept shift has slightly increased the probability of default in that group as well, specifically between 90 and 120 k€/y. In that range, we see a positive effect of interaction.

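Let’s recall the directional integrals that allow us to calculate the isolated effects of covariate shift and concept shift, respectively:

$$\int c_{\mathrm{ref}}(x)\,\big(p_{\mathrm{shifted}}(x) - p_{\mathrm{ref}}(x)\big)\,dx, \qquad \int \big(c_{\mathrm{shifted}}(x) - c_{\mathrm{ref}}(x)\big)\,p_{\mathrm{ref}}(x)\,dx$$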

Let’s plot all three:

This nicely confirms what we already deduced:

Strong positive concept shift effect in the lower to medium income range.

Significant negative effect of covariate shift in lower to medium income range and slightly positive in higher income values (>90 k€/y).

Negative interaction effect that mitigates the positive impact of concept shift due to covariate shift moving data away from the lower income region. Positive interaction effect in 90-120 k€/y region as covariate shift moves data towards that region, and concept shift is still positive there.

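If that’s the full picture, we would expect the three directional terms to give a combined effect of covariate shift and concept shift together. Summing all three integrals, we get (feel free to do the math on your own):

$$\int \big(c_{\mathrm{shifted}}(x)\,p_{\mathrm{shifted}}(x) - c_{\mathrm{ref}}(x)\,p_{\mathrm{ref}}(x)\big)\,dx$$

Which is in fact a representation of:

$$\int \big(P_{\mathrm{shifted}}(y = 1 \mid x)\,P_{\mathrm{shifted}}(x) - P_{\mathrm{ref}}(y = 1 \mid x)\,P_{\mathrm{ref}}(x)\big)\,dx$$

Which is the difference between joint target and input probability distributions between drifted and reference data:

$$\int \big(P_{\mathrm{shifted}}(x, y = 1) - P_{\mathrm{ref}}(x, y = 1)\big)\,dx$$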

Cool, right? Let’s plot the whole thing again:

The combined effect can be again calculated as absolute or directional. For our case, we get 0.10 for absolute – this is the fraction of targets that would change the expected value from 0 to 1 or vice-versa. For the directional integral, we get -0.023, so we will have 2.3 percentage points fewer defaults after data drift. Even though the concept shift looked serious, the covariate shift was stronger.
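The claim that the three directional terms add up to the combined effect can be checked numerically. The sketch below uses hypothetical normal income distributions and logistic concept curves (none of the parameters come from the article), so the individual numbers differ from 0.10 and -0.023. The identity itself does not depend on those choices.

```python
import math


def normal_pdf(x, mean, std):
    """Probability density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def integrate(f, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(f(lo + (i + 0.5) * step) for i in range(n))


# Hypothetical input distributions (income in k€/y) and concepts.
def p_ref(x):
    return normal_pdf(x, 80.0, 20.0)


def p_shifted(x):
    return normal_pdf(x, 100.0, 20.0)


def c_ref(x):
    return 1.0 / (1.0 + math.exp((x - 60.0) / 20.0))


def c_shifted(x):
    return 1.0 / (1.0 + math.exp((x - 68.25) / 15.0))


LO, HI = -50.0, 250.0  # wide limits so that the tails are negligible

covariate = integrate(lambda x: c_ref(x) * (p_shifted(x) - p_ref(x)), LO, HI)
concept = integrate(lambda x: (c_shifted(x) - c_ref(x)) * p_ref(x), LO, HI)
interaction = integrate(
    lambda x: (c_shifted(x) - c_ref(x)) * (p_shifted(x) - p_ref(x)), LO, HI
)
combined = integrate(
    lambda x: c_shifted(x) * p_shifted(x) - c_ref(x) * p_ref(x), LO, HI
)

# Superposition holds for the directional terms.
print(abs(covariate + concept + interaction - combined) < 1e-9)  # True
```

The equality holds at every point of the grid, not only after integration, which is why the numerical check agrees up to floating-point error.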

Now it’s a good time to explain why we have discussed only directional effects for the interaction term. In the case of the isolated covariate shift effect, the absolute integral answers the question – what would be the fraction of targets affected if only the covariate shift happened? The same holds for concept shift and the combined effect.

However, interaction cannot be discussed in isolation. One cannot ask what would be the fraction of targets affected if only interaction happens since interaction happens only when (and where) both concept shift and covariate shift happen. It is different with directional effects. We saw that already in the example above – in the analyzed region, there was a positive effect of concept shift on the target and a negative effect of covariate shift.

Additionally, due to covariate shift, data escaped that region, so interaction also had a negative effect on target distribution. The superposition rule works with directional effect - if we add directional covariate shift, concept shift, and interaction magnitudes together, we will get the combined directional effect. But this equality does not hold for absolute magnitudes. Check the simple example below:

isolated concept shift effect is positive, say +0.05,

isolated covariate shift is negative -0.10,

and interaction term -0.05.

Then:

If only concept shift happened, 0.05 of targets would be affected (positively).

If only covariate shift happened, 0.10 of targets would be affected (negatively).

If we add directional terms, we get the combined directional effect: $0.05 - 0.10 - 0.05 = -0.10$, and it means that a 0.10 fraction of targets in that region would change the expected value from 1 to 0 (because the sign is negative). The directional combined effect is then -0.10, and the absolute one is 0.10.

Whereas if we add the absolute effects, it would be 0.2, which... is meaningless.

Label shift and Manifestation shift

While reading other blog posts, you will usually see label shift listed together with covariate shift and concept shift. I don’t think it belongs there.

I really like how Kevin P. Murphy explained it in his latest Probabilistic Machine Learning: Advanced Topics. He splits the different drift types depending on the data generation process. In the causal setting, i.e. when $X$ causes $Y$ (features cause target), covariate shift and concept shift may happen. Whereas label shift exists in anti-causal modelling, which is the opposite - $Y$ causes $X$ (or target causes features).

An example might be any image-related ML task, like image classification. The target (what is on the image) causes the inputs (the image itself, features). Label shift is a change of target probability distribution with the concept between target and inputs remaining the same, for example, in a dog-breed image classification task that would be having more dogs of a specific breed in the shifted data compared to the reference data.

Let’s have a look at the product rule in the anti-causal setting:

$$P(X, Y) = P(X \mid Y)\,P(Y)$$

The left-hand side looks as usual - we have a joint target-input probability distribution. On the right-hand side, we have $P(X \mid Y)$ and $P(Y)$. We already discussed the latter. The former looks like a reversed concept. It is called a manifestation – it tells how target $Y$ manifests itself in features $X$. In other words, it is the input probability distribution given a target. So label shift is a change of the target distribution $P(Y)$, provided that the manifestation $P(X \mid Y)$ remains unchanged.
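A toy sketch can make the anti-causal product rule concrete. All names and numbers below are invented for illustration (two breeds and a single coat-length feature). The manifestation is kept fixed and only the label distribution changes, which is what makes this a pure label shift.

```python
# Hypothetical manifestation P(x | y): how each breed shows up in one feature.
manifestation = {
    "collie": {"long_coat": 0.90, "short_coat": 0.10},
    "boxer": {"long_coat": 0.05, "short_coat": 0.95},
}

# Hypothetical label distributions P(y) before and after the label shift.
p_y_ref = {"collie": 0.5, "boxer": 0.5}
p_y_shifted = {"collie": 0.8, "boxer": 0.2}


def joint(p_y, p_x_given_y):
    """Product rule in the anti-causal direction: P(x, y) = P(x | y) * P(y)."""
    return {
        (x, y): p_x_given_y[y][x] * p_y[y]
        for y in p_y
        for x in p_x_given_y[y]
    }


def marginal_x(joint_xy):
    """Sum the joint distribution over labels to get P(x)."""
    result = {}
    for (x, _y), p in joint_xy.items():
        result[x] = result.get(x, 0.0) + p
    return result


p_x_ref = marginal_x(joint(p_y_ref, manifestation))
p_x_shifted = marginal_x(joint(p_y_shifted, manifestation))

print(round(p_x_ref["long_coat"], 3))      # 0.475
print(round(p_x_shifted["long_coat"], 3))  # 0.73
```

The input distribution changes even though nothing about how a breed looks has changed, so a detector that only watches the inputs would flag this as drift.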

Finally, we can also have a pure manifestation shift - $P(X \mid Y)$ changes with $P(Y)$ remaining the same. It is the change of the mapping – just like in concept shift. An example would be a change in the way a specific dog breed looks. This, by the way, really happens – see how the bull terrier has changed:

Covariate and concept shift interaction summary

Let’s recall all the measures of covariate shift, concept shift and combined effects that we have introduced:

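1. Pure covariate shift effect that does not take the concept into account. It tells how severe the covariate shift is alone:

$$\frac{1}{2}\int \big|p_{\mathrm{shifted}}(x) - p_{\mathrm{ref}}(x)\big|\,dx$$

2. The absolute and directional effects of covariate shift on target distribution. The first tells the fraction of targets affected, the second tells the direction of the change (the effect on class balance):

$$\int \big|c_{\mathrm{ref}}(x)\,\big(p_{\mathrm{shifted}}(x) - p_{\mathrm{ref}}(x)\big)\big|\,dx, \qquad \int c_{\mathrm{ref}}(x)\,\big(p_{\mathrm{shifted}}(x) - p_{\mathrm{ref}}(x)\big)\,dx$$

3. Pure concept shift effect that does not take input distribution into account. It tells the severity of concept shift alone – what is the fraction of targets that would be affected if $X$ was uniformly distributed:

$$\frac{1}{x_{\max} - x_{\min}}\int_{x_{\min}}^{x_{\max}} \big|c_{\mathrm{shifted}}(x) - c_{\mathrm{ref}}(x)\big|\,dx$$

4. The absolute and directional effect of concept shift on target distribution. The first tells the fraction of targets affected, the second tells the direction of the change (the effect on class balance):

$$\int \big|c_{\mathrm{shifted}}(x) - c_{\mathrm{ref}}(x)\big|\,p_{\mathrm{ref}}(x)\,dx, \qquad \int \big(c_{\mathrm{shifted}}(x) - c_{\mathrm{ref}}(x)\big)\,p_{\mathrm{ref}}(x)\,dx$$

5. The directional interaction term. It tells what is the directional effect of concept shift and covariate shift happening at the same time in the same region:

$$\int \big(c_{\mathrm{shifted}}(x) - c_{\mathrm{ref}}(x)\big)\big(p_{\mathrm{shifted}}(x) - p_{\mathrm{ref}}(x)\big)\,dx$$

6. The absolute and directional combined effect. The first tells the fraction of targets affected, and the second tells the direction of the change (the effect on class balance). The directional one is equal to the sum of directional covariate shift, concept shift, and interaction effects. It represents the difference between joint target and input probability distributions between drifted and reference data:

$$\int \big|c_{\mathrm{shifted}}(x)\,p_{\mathrm{shifted}}(x) - c_{\mathrm{ref}}(x)\,p_{\mathrm{ref}}(x)\big|\,dx, \qquad \int \big(c_{\mathrm{shifted}}(x)\,p_{\mathrm{shifted}}(x) - c_{\mathrm{ref}}(x)\,p_{\mathrm{ref}}(x)\big)\,dx$$

Here $c(x) = P(y = 1 \mid x)$ is the concept and $p(x)$ is the input distribution; for categorical inputs the integrals become sums.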

All of the above equations were shown to work for 1d categorical and continuous input variables, for binary classification problems. We expect these to hold for multivariate inputs. The difference would be that we sum/integrate over the whole input space and that the normalization constant in the pure concept shift integral would be a measure of the size of the whole input space (it is a range or length in 1d, area in 2d, volume in 3d, etc.). We stopped at 1d, though, as our purpose of analyzing this was to get an intuition of different data drift components, how they interact with each other, and how we can quantify them using some interpretable, meaningful measures. We feel we have achieved this.

References

[1]: Moreno-Torres, J.G., Raeder, T., Alaíz-Rodríguez, R., Chawla, N.V., & Herrera, F. (2012). A unifying view on dataset shift in classification. Pattern Recognition, 45(1), 521-530.
