Kolmogorov-Smirnov (K-S) Test to Detect Data Drift

Shadi Balandeh
2 min read · Apr 15, 2024


Everyone used to love it so what happened?!

I am talking about your model.

In data science, maintaining the integrity and relevance of your models over time is a challenge that often goes underestimated.

As part of our ongoing #DataDrivenPitfalls series, today we examine another critical yet frequently overlooked phenomenon called 𝐝𝐚𝐭𝐚 𝐝𝐫𝐢𝐟𝐭 and explore how the Kolmogorov-Smirnov (K-S) test can help detect it.

Data drift occurs when the statistical properties of a model's input data change over time, leading to a decline in model performance.

This can happen due to various reasons — seasonal changes, evolving market trends, or changes in consumer behavior.

The challenge with data drift is that it often goes unnoticed until the model’s accuracy has significantly deteriorated.

The Kolmogorov-Smirnov test is helpful here as it compares the distribution of data at two different points in time.

It’s a non-parametric test that measures the maximum vertical distance between the empirical cumulative distribution functions (CDFs) of two datasets.
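As a minimal sketch of what this comparison looks like in practice, the snippet below runs SciPy's two-sample K-S test on two synthetic samples (the normal distributions and the 0.5 shift are illustrative assumptions, not data from the example that follows):

```python
# Two-sample K-S test on synthetic data; the distributions here are
# purely illustrative, standing in for "training-time" vs. "production" data.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
baseline = rng.normal(loc=0.0, scale=1.0, size=1000)  # data the model was trained on
current = rng.normal(loc=0.5, scale=1.0, size=1000)   # recent data, with a mean shift

# ks_2samp returns the maximum CDF distance and the associated p-value
statistic, p_value = ks_2samp(baseline, current)
print(f"K-S statistic: {statistic:.3f}, p-value: {p_value:.4g}")
```

With a genuine shift between the samples, the statistic moves away from 0 and the p-value becomes small; on two samples drawn from the same distribution it would typically stay near 0 with a large p-value.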

To illustrate the impact of data drift on data-driven decision-making, let’s look at a simplified example in banking.

Banks and financial institutions often use predictive models to assess the creditworthiness of loan applicants. These models are built on historical data, which includes various features such as credit history, income, and existing debts to predict the likelihood of an applicant defaulting on a loan.

Below, the first dataset represents the original credit score distribution on which the financial model was developed, and the second dataset represents the current distribution.

We’ll assume the financial institution has set a threshold credit score of 680 for loan approval. Scores above this threshold are considered acceptable, while those below are not.
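A hedged sketch of how this check might be set up: the credit-score samples below are synthetic placeholders (the means, spreads, and sample sizes are assumptions), so the resulting statistic will not match the value reported next, but the mechanics are the same.

```python
# Synthetic stand-ins for the original and current credit-score datasets.
# All distribution parameters here are assumptions for illustration only.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
original_scores = rng.normal(loc=700, scale=50, size=2000)  # training-era scores
current_scores = rng.normal(loc=660, scale=50, size=2000)   # drifted recent scores

THRESHOLD = 680  # the loan-approval cutoff from the example

stat, p = ks_2samp(original_scores, current_scores)
below_then = np.mean(original_scores < THRESHOLD)
below_now = np.mean(current_scores < THRESHOLD)

print(f"K-S statistic: {stat:.3f}, p-value: {p:.3g}")
print(f"Share of applicants below {THRESHOLD}: was {below_then:.1%}, now {below_now:.1%}")
```

Tracking the share of applicants below the cutoff alongside the test makes the business impact of the drift concrete, not just its statistical significance.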

After performing the K-S test we find the following values:

K-S statistic of 0.364 with a p-value of 0.00

How to interpret these?

The K-S statistic is a non-negative number ranging from 0 to 1:

A value of 0 indicates that the two distributions are identical.

A value closer to 1 suggests a greater divergence between the two distributions.

A K-S statistic of 0.364 indicates a noticeable difference between the two distributions, but it doesn’t tell us about the significance of this difference on its own. For that, we look at the p-value associated with this statistic. The p-value helps determine whether the observed difference is statistically significant.

A small p-value (typically < 0.05) indicates that the difference in distributions is statistically significant, which is the case here.

This suggests data drift that warrants revisiting assumptions: more applicants now fall below the decision threshold, potentially leading to more loan denials under the same credit criteria, even though the changed economic context might call for a more nuanced credit-risk approach.
