Be Careful about Slice-Level Splits in Training Image Data!

Shadi Balandeh
2 min read · Mar 26, 2024


Here is a list of nine publications that claimed 90%+ accuracy, but only because of data leakage from incorrect data splits!

Data leakage is a critical mistake in data science that can significantly skew the results of a study or model.

Today, we explore another source of data leakage: improper slice-level splits in image data.

There’s a study* that beautifully examines this issue.
It quantitatively assesses the impact of data leakage in convolutional neural network (CNN) models used to classify patients with Alzheimer’s and Parkinson’s disease.

The findings are shocking:

➡ Using slice-level splits could lead to an overestimation of a model’s performance by as much as 30–55%, depending on the dataset size!

In medical imaging studies, such as those involving Magnetic Resonance Imaging (MRI), data is often three-dimensional.

To use 2D CNNs, researchers often slice this 3D data into 2D images.
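To make the slicing step concrete, here is a minimal sketch. The volume is a hypothetical toy stored as nested lists (real MRI volumes would typically be loaded as arrays from image files), and the names `make_volume` and `volume_to_slices` are made up for illustration. The point it shows is that every 2D slice should keep a tag saying which patient it came from.

```python
# Hypothetical toy volume: depth x height x width, stored as nested lists.
# This is an illustrative stand-in, not how any particular study loads MRI data.
def make_volume(depth, height, width, fill=0.0):
    return [[[fill] * width for _ in range(height)] for _ in range(depth)]


def volume_to_slices(patient_id, volume):
    """Return one record per 2D slice, tagged with the patient it came from."""
    return [
        {"patient_id": patient_id, "slice_index": i, "image": image}
        for i, image in enumerate(volume)
    ]


volume = make_volume(depth=4, height=3, width=3)
records = volume_to_slices("patient_01", volume)
print(len(records), records[0]["patient_id"])
```

Keeping the patient ID next to each slice is what makes a correct split possible later on.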

An improper slice-level split occurs when these 2D slices, derived from the same patient, are distributed across both training and test datasets.

This setup allows the model to indirectly ‘learn’ about the patient’s data in the training set, which it will then ‘see again’ in the test set.

Consequently, the model may falsely appear highly accurate, not because it has learned generalizable patterns, but because it has memorized patient-specific features.
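A quick way to see the problem is to shuffle slices without looking at patient IDs and then check who ends up where. The sketch below uses made-up numbers (20 patients, 30 slices each, an 80/20 split) purely for illustration; it is not the setup used in the study.

```python
import random

# Hypothetical toy data: 20 patients with 30 slices each (illustrative numbers).
slices = [(f"patient_{p:02d}", s) for p in range(20) for s in range(30)]

rng = random.Random(0)
shuffled = slices[:]
rng.shuffle(shuffled)

# Slice-level split: 80% of slices go to training, 20% to test,
# with no regard for which patient each slice belongs to.
cut = int(0.8 * len(shuffled))
train, test = shuffled[:cut], shuffled[cut:]

train_patients = {pid for pid, _ in train}
test_patients = {pid for pid, _ in test}
overlap = train_patients & test_patients

print(f"{len(overlap)} of {len(test_patients)} test patients also appear in training")
```

Any patient in `overlap` contributes slices to both sets, which is exactly the leakage described above.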

The study highlights the importance of subject-level data splits, where all data from an individual subject are exclusively allocated to either the training or testing set.
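A subject-level split can be sketched in a few lines: shuffle the patient IDs rather than the slices, then assign every slice according to its patient. The function name `subject_level_split`, the toy data, and the 20% test fraction are assumptions for illustration, not the study's code. In practice, scikit-learn's `GroupShuffleSplit` and `GroupKFold` do the same kind of grouping when you pass patient IDs as `groups`.

```python
import random


def subject_level_split(records, test_fraction=0.2, seed=0):
    """Split (patient_id, slice_index) records so no patient spans both sets."""
    patient_ids = sorted({pid for pid, _ in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)

    # Choose whole patients for the test set, never individual slices.
    n_test = max(1, round(test_fraction * len(patient_ids)))
    test_ids = set(patient_ids[:n_test])

    train = [r for r in records if r[0] not in test_ids]
    test = [r for r in records if r[0] in test_ids]
    return train, test


# Hypothetical toy data: 20 patients with 30 slices each.
records = [(f"patient_{p:02d}", s) for p in range(20) for s in range(30)]
train, test = subject_level_split(records)

shared = {pid for pid, _ in train} & {pid for pid, _ in test}
print(f"patients shared between train and test: {len(shared)}")
```

Because the split is made over patients, the test set only contains people the model has never seen during training.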

The importance of data and AI literacy now extends beyond data scientists and AI practitioners.

As AI and data-driven decisions become more ingrained in various sectors, understanding the nuances of data handling, the potential pitfalls, and the ethical implications becomes critical for everyone.

