In the previous two articles, we discussed covariate shift and concept shift separately. In reality, both may happen at the same time. Measuring the separate effects of concept shift and covariate shift in such a situation is not enough, because the two usually also interact with each other.
Covariate shift and concept shift
Isolated effects, interaction and combined effect
Imagine a concept shift that happens in some region of the input space and a covariate shift that moves the data towards that region. The interaction is then positive (in the sense that it exacerbates the effect, as in a positive feedback loop): concept shift alone would be less harmful than it is together with a covariate shift that pushes more data into the drifted region.
On the other hand, covariate shift may cause data to escape from the concept-shifted region. The interaction term is then negative, because the effect of concept shift is mitigated: less data is affected by it due to covariate shift.
Apart from measuring the magnitude of concept shift and covariate shift and their effect on the target distribution, we are also interested in the interaction term and in the combined effect of both. Let’s discuss an example:
We now have covariate shift (the mean of the income distribution increases) and concept shift (the probability of defaulting for low-income applicants increases). The directional interaction term can be calculated from:
$\int_x(c(x)_{shifted}-c(x)_{reference})\cdot(pdf(x)_{shifted}-pdf(x)_{reference})dx$
We will soon discuss why we care only about the directional interaction (and not the absolute one). For now, let’s plot what’s inside the integral:
As a result of concept shift, the probability of defaulting among applicants with lower income has increased significantly. On its own, that would have a strong positive effect on the target distribution (more applicants defaulting), but as an effect of covariate shift there are fewer lower-income applicants. The interaction term therefore counterbalances the positive effect of concept shift in that region (the negative peak in the interaction plot around 60 k€/year). There is some small positive interaction as well: as an effect of covariate shift, there are more people in the >90 k€/year group, and concept shift has slightly increased the probability of default in that group too, specifically between 90 and 120 k€/year. In that range we see a positive interaction effect.
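To make the interaction term concrete, here is a minimal numerical sketch. The income distributions (normal, with the mean moving from 60 to 75 k€/year) and the logistic default curves are hypothetical stand-ins chosen only to mimic the qualitative picture described above; they are not the data behind the plots in this article. The helper names (`normal_pdf`, `default_probability`, `integrate`) are made up for this illustration.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of a normal distribution (hypothetical income model)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def default_probability(x, midpoint, scale=10.0):
    """Hypothetical concept c(x): default probability decreasing with income."""
    return 1.0 / (1.0 + np.exp((x - midpoint) / scale))

def integrate(y, x):
    """Trapezoidal rule on a 1d grid."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

# Income grid in k€/year (assumed range).
income = np.linspace(0.0, 200.0, 2001)

# Covariate shift: the mean income increases (assumed values).
pdf_reference = normal_pdf(income, mean=60.0, std=20.0)
pdf_shifted = normal_pdf(income, mean=75.0, std=20.0)

# Concept shift: default probability rises for low and medium incomes (assumed values).
c_reference = default_probability(income, midpoint=40.0)
c_shifted = default_probability(income, midpoint=55.0)

# What is inside the interaction integral, and the integral itself.
interaction_integrand = (c_shifted - c_reference) * (pdf_shifted - pdf_reference)
interaction = integrate(interaction_integrand, income)

# The two equal-variance densities cross halfway between their means.
crossover = 67.5
low = income < crossover
high = ~low
print(f"interaction below {crossover} k€/year: "
      f"{integrate(interaction_integrand[low], income[low]):+.4f}")
print(f"interaction above {crossover} k€/year: "
      f"{integrate(interaction_integrand[high], income[high]):+.4f}")
print(f"total directional interaction: {interaction:+.4f}")
```

With these assumed curves the integrand is negative below the income at which the two densities cross and positive above it, and the total is negative: the data moves away from the region where the concept changed most.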
Let’s recall the directional integrals that allow us to calculate the isolated effects of covariate shift and concept shift, respectively:
$\int_X c(x)_ {reference}\cdot(pdf(x)_ {shifted}- pdf(x)_{reference})dx$
$\int_x pdf(x)_ {reference}\cdot(c(x)_ {shifted}- c(x)_{reference})dx$
Let’s plot all three:
This nicely confirms what we already deduced:
- Strong positive concept shift effect in lower to medium income range.
- Significant negative effect of covariate shift in lower to medium income range and slightly positive in higher income values (>90 k€/y).
- Negative interaction effect that mitigates the positive impact of concept shift due to covariate shift moving data away from the lower income region. Positive interaction effect in 90-120 k€/y region as covariate shift moves data towards that region and concept shift is still positive there.
If that’s the full picture, we would expect the three directional terms to give a combined effect of covariate shift and concept shift together. Summing all three integrals we get (feel free to do the math on your own):
$\int_x \left(c(x)_{shifted}\cdot pdf(x)_{shifted}-c(x)_{reference}\cdot pdf(x)_{reference}\right)dx$
Which is in fact a representation of:
$P(Y|X)_{shifted}P(X)_{shifted}-P(Y|X)_{reference}P(X)_{reference}$
Which is the difference between the joint probability distributions of target and input in the shifted and the reference data:
$P(Y,X)_{shifted}-P(Y,X)_{reference}$
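For readers who would rather not do the math on their own, the summation can be checked at the level of the integrands. Writing $c_r, c_s$ for $c(x)_{reference}, c(x)_{shifted}$ and $p_r, p_s$ for $pdf(x)_{reference}, pdf(x)_{shifted}$ (a shorthand introduced only for this derivation), the covariate shift, concept shift and interaction integrands add up as follows:
$c_r(p_s-p_r)+p_r(c_s-c_r)+(c_s-c_r)(p_s-p_r)$
$=c_r p_s-c_r p_r+p_r c_s-p_r c_r+c_s p_s-c_s p_r-c_r p_s+c_r p_r$
$=c_s p_s-c_r p_r$
Integrating both sides over $x$ gives the combined directional effect.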
Cool, right? Let’s plot the whole thing again:
The combined effect can again be calculated as absolute or directional. In our case we get 0.10 for the absolute one: this is the fraction of targets whose expected value would change from 0 to 1 or vice versa. For the directional integral we get -0.023, so we will have 2.3 percentage points fewer defaults after the data drift. Even though concept shift looked serious, covariate shift was stronger.
Now is a good time to explain why we have discussed only directional effects for the interaction term. In the case of the isolated covariate shift effect, the absolute integral answers the question: what fraction of targets would be affected if only the covariate shift happened? The same holds for concept shift and for the combined effect.
However, interaction cannot be discussed in isolation. One cannot ask what fraction of targets would be affected if only the interaction happened, since interaction happens only when (and where) both concept shift and covariate shift happen. It is different with directional effects. We saw that already in the example above: in the analyzed region there was a positive effect of concept shift on the target and a negative effect of covariate shift.
Additionally, due to covariate shift, data escaped that region, so the interaction also had a negative effect on the target distribution. The superposition rule works for directional effects: if we add the directional covariate shift, concept shift and interaction terms together, we get the combined directional effect. This equality does not hold for absolute magnitudes. Consider the simple example below:
- isolated concept shift effect is positive, say +0.05,
- isolated covariate shift is negative -0.10,
- and interaction term -0.05.
Then:
- If only concept shift happened, 0.05 of targets would be affected (positively).
- If only covariate shift happened, 0.10 of targets would be affected (negatively).
- If we add the directional terms we get the combined directional effect: $0.05-0.10-0.05=-0.10$, which means that a 0.10 fraction of targets in that region would change the expected value from 1 to 0 (hence the negative sign). The combined directional effect is then -0.10 and its absolute magnitude is 0.10.
- Whereas if we add the absolute effects we get 0.20, which is meaningless.
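The toy numbers above can be written out explicitly. Assuming, as the example does, that each term keeps a single sign within the region, the directional terms simply add up, while the sum of absolute values only bounds the combined magnitude from above (the triangle inequality):
$0.05+(-0.10)+(-0.05)=-0.10$
$|0.05-0.10-0.05|=0.10\le|0.05|+|-0.10|+|-0.05|=0.20$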
Label shift and Manifestation shift
While reading other blog posts you will usually see label shift listed together with covariate shift and concept shift. I don’t think it belongs there.
I really like how Kevin P. Murphy explains it in his latest book, Probabilistic Machine Learning: Advanced Topics. He splits the drift types depending on the data generation process. In the causal setting, i.e. when $X$ causes $Y$ (the features cause the target), covariate shift and concept shift may happen. Label shift, in contrast, exists in anti-causal modelling, which is the opposite: $Y$ causes $X$ (the target causes the features).
An example might be any image-related ML task, like image classification. The target (what is in the image) causes the inputs (the image itself, i.e. the features). Label shift is a change of the target probability distribution $P(Y)$ with the concept between target and inputs remaining the same. For example, in a dog-breed image classification task that would mean having more dogs of a specific breed in the shifted data than in the reference data.
Let’s have a look at the product rule in the anti-causal setting:
$P(Y,X)=P(X|Y)\cdot P(Y)$
The left-hand side looks as usual: we have the joint target-input probability distribution. On the right-hand side we have $P(X|Y)$ and $P(Y)$. We already discussed the latter. The former looks like a reversed concept. It is called a manifestation: it tells how the target $Y$ manifests itself in the features $X$, or what the input probability distribution is given a target. So label shift is a change of the target distribution while the target manifestation remains unchanged.
Finally, we can also have a pure manifestation shift: $P(X|Y)$ changes with $P(Y)$ remaining the same. It is a change of the mapping, just like in concept shift. An example would be a change in the way a specific dog breed looks. This, by the way, really happens; see how the bull terrier has changed:

What are the differences and similarities between label vs covariate shift and concept vs manifestation shift? Recall the product rule:
$P(Y,X)=P(Y,X)$
$P(Y|X)P(X)=P(X|Y)P(Y)$
If concept remains unchanged, then manifestation stays the same. That means that covariate shift and label shift are equivalent. They are indistinguishable if we don’t have knowledge about causality, since:
$\dfrac{P(Y|X)}{P(X|Y)}=const=\dfrac{P(Y)}{P(X)}$
So when $P(X)$ changes with concept/manifestation staying the same, $P(Y)$ changes as well. When it comes to manifestation vs concept shift, things get complicated. Let’s recall the sum rule:
$P(Y)=\sum_X P(Y,X)$
In concept shift we change $P(Y|X)$ and keep $P(X)$ the same. As a result $P(Y,X)$ changes, so $P(Y)$ also changes. Therefore, there is no equivalence here. In pure concept shift $P(Y|X)$ and $P(Y)$ change while $P(X)$ remains the same. In pure manifestation shift $P(X|Y)$ and $P(X)$ change while $P(Y)$ remains the same.
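The last two statements can be checked on a tiny discrete example using the product and sum rules. This is only a sketch: the binary $X$ and $Y$, all probability values and the helper names (`joint_from_concept`, `joint_from_manifestation`) are hypothetical and chosen for illustration.

```python
import numpy as np

def joint_from_concept(p_y1_given_x, p_x):
    """Causal factorization P(Y,X) = P(Y|X) P(X); rows index y, columns index x."""
    p_y1_given_x = np.asarray(p_y1_given_x, dtype=float)
    p_x = np.asarray(p_x, dtype=float)
    return np.vstack([(1.0 - p_y1_given_x) * p_x, p_y1_given_x * p_x])

def joint_from_manifestation(p_x1_given_y, p_y):
    """Anti-causal factorization P(Y,X) = P(X|Y) P(Y); rows index y, columns index x."""
    p_x1_given_y = np.asarray(p_x1_given_y, dtype=float)
    p_y = np.asarray(p_y, dtype=float)
    return np.column_stack([(1.0 - p_x1_given_y) * p_y, p_x1_given_y * p_y])

def marginals(joint):
    """Sum rule: returns (P(X), P(Y))."""
    return joint.sum(axis=0), joint.sum(axis=1)

# Reference data: P(X) = [0.5, 0.5], P(Y=1|X) = [0.2, 0.6] (assumed values).
reference = joint_from_concept([0.2, 0.6], [0.5, 0.5])
p_x_ref, p_y_ref = marginals(reference)

# Pure concept shift: P(Y=1|X=0) rises from 0.2 to 0.4, P(X) is untouched.
concept_shifted = joint_from_concept([0.4, 0.6], [0.5, 0.5])
p_x_c, p_y_c = marginals(concept_shifted)

# Pure manifestation shift: P(X=1|Y=1) drops to 0.5, P(Y) is untouched.
p_x1_given_y_ref = reference[:, 1] / p_y_ref
manifestation_shifted = joint_from_manifestation(
    [p_x1_given_y_ref[0], 0.5], p_y_ref
)
p_x_m, p_y_m = marginals(manifestation_shifted)

print("reference:           P(X) =", p_x_ref, " P(Y) =", p_y_ref)
print("concept shift:       P(X) =", p_x_c, " P(Y) =", p_y_c)
print("manifestation shift: P(X) =", p_x_m, " P(Y) =", p_y_m)
```

In this example the concept-shifted table keeps $P(X)$ and moves $P(Y)$, while the manifestation-shifted table keeps $P(Y)$ and moves $P(X)$.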
Summary
Let’s recall all the measures of covariate shift, concept shift and combined effects that we have introduced:
1. Pure covariate shift effect that does not take concept into account. It tells how severe the covariate shift alone is:
$\dfrac{1}{2}\int_X\bigg|pdf(x)_{shifted}-pdf(x)_{reference}\bigg|dx$
2. The absolute and directional effects of covariate shift on the target distribution. The first tells the fraction of targets affected, the second tells the direction of the change (the effect on class balance):
$\int_Xc(x)_ {reference}\cdot\bigg|pdf(x)_ {shifted}- pdf(x)_{reference}\bigg|dx$
$\int_Xc(x)_ {reference}\cdot(pdf(x)_ {shifted}- pdf(x)_{reference})dx$
3. Pure concept shift effect that does not take the input distribution into account. It tells the severity of concept shift alone, i.e. the fraction of targets that would be affected if $x$ were uniformly distributed:
$\dfrac{1}{x_{max}-x_{min}}\int_X\bigg|c(x)_{shifted}-c(x)_{reference}\bigg|dx$
4. The absolute and directional effects of concept shift on the target distribution. The first tells the fraction of targets affected, the second tells the direction of the change (the effect on class balance):
$\int_Xpdf(x)_ {reference}\cdot\bigg|c(x)_ {shifted}- c(x)_{reference}\bigg|dx$
$\int_Xpdf(x)_ {reference}\cdot(c(x)_ {shifted}- c(x)_{reference})dx$
5. The directional interaction term. It tells the directional effect of concept shift and covariate shift happening at the same time in the same region:
$\int_x(c(x)_ {shifted}-c(x) _ {reference})\cdot(pdf(x)_{shifted}- pdf(x)_ {reference})dx$
6. The absolute and directional combined effect. The first tells the fraction of targets affected, the second tells the direction of the change (the effect on class balance). The directional one is equal to the sum of the directional covariate shift, concept shift and interaction effects. It represents the difference between the joint target and input probability distributions in the shifted and the reference data:
$\int_x \left(c(x)_{shifted}\cdot pdf(x)_{shifted}-c(x)_{reference}\cdot pdf(x)_{reference}\right)dx$
$P(Y,X)_{shifted}-P(Y,X)_{reference}$
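The measures listed above can be collected in one small function. The following is a sketch under stated assumptions, not the implementation used in the linked notebook: the function name `drift_measures`, the dictionary keys and the example distributions are hypothetical. The absolute combined effect is computed by taking the absolute value of the combined integrand, which is an assumption since only the directional formula is written out above.

```python
import numpy as np

def integrate(y, x):
    """Trapezoidal rule on a 1d grid."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def drift_measures(x, pdf_ref, pdf_shift, c_ref, c_shift):
    """Evaluate the summary measures for a 1d continuous input on grid x."""
    d_pdf = pdf_shift - pdf_ref                      # change of input density
    d_c = c_shift - c_ref                            # change of concept
    d_joint = c_shift * pdf_shift - c_ref * pdf_ref  # change of joint term
    return {
        "pure_covariate": 0.5 * integrate(np.abs(d_pdf), x),
        "covariate_absolute": integrate(c_ref * np.abs(d_pdf), x),
        "covariate_directional": integrate(c_ref * d_pdf, x),
        "pure_concept": integrate(np.abs(d_c), x) / (x[-1] - x[0]),
        "concept_absolute": integrate(pdf_ref * np.abs(d_c), x),
        "concept_directional": integrate(pdf_ref * d_c, x),
        "interaction_directional": integrate(d_c * d_pdf, x),
        "combined_absolute": integrate(np.abs(d_joint), x),  # assumed definition
        "combined_directional": integrate(d_joint, x),
    }

def normal_pdf(x, mean, std):
    """Density of a normal distribution (hypothetical income model)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def default_probability(x, midpoint, scale=10.0):
    """Hypothetical concept: default probability decreasing with income."""
    return 1.0 / (1.0 + np.exp((x - midpoint) / scale))

# Hypothetical income example in k€/year; all parameters are made up.
x = np.linspace(0.0, 200.0, 2001)
measures = drift_measures(
    x,
    pdf_ref=normal_pdf(x, 60.0, 20.0),
    pdf_shift=normal_pdf(x, 75.0, 20.0),
    c_ref=default_probability(x, 40.0),
    c_shift=default_probability(x, 55.0),
)
for name, value in measures.items():
    print(f"{name:>24}: {value:+.4f}")

# Superposition holds for the directional terms only.
directional_sum = (
    measures["covariate_directional"]
    + measures["concept_directional"]
    + measures["interaction_directional"]
)
print(f"{'sum of directional':>24}: {directional_sum:+.4f}")
```

The final line should match `combined_directional` up to floating-point error, which is the superposition property discussed earlier; the absolute measures have no such relation.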
All of the above equations were verified to work for 1d categorical and continuous input variables in a binary classification problem. We expect them to hold for multivariate inputs as well. The difference would be that we sum/integrate over the whole input space and that the normalization constant in the pure concept shift integral would be a measure of the size of the whole input space (a range or length in 1d, an area in 2d, a volume in 3d, etc.). We stopped at 1d, though, as our purpose was to build intuition about the different data drift components, how they interact with each other, and how we can quantify them using interpretable, meaningful measures. We feel we achieved this.
What’s described here is a part of the research we do at NannyML. Some of it does not come from reviewed articles so we might be wrong here and there (well, reviewed articles can be wrong as well). Let us know if you find any mistakes in our reasoning. Explore the notebook linked to check the results on your own.
Read part I of the blog series - Understanding Data Distribution Shifts in Machine Learning I: Covariate Shift
Read part II of the blog series - Understanding Data Distribution Shifts in Machine Learning II: Concept Shift