What is the hardest part about data science?

Shadi Balandeh
2 min read · Feb 5, 2024


With all the tools now available, producing ‘results’ has become straightforward. The challenge, however, lies in being mindful of potential pitfalls that can easily be overlooked in the process.

Consider Principal Component Analysis (PCA), a powerful technique in machine learning for data simplification and dimensionality reduction.

It is especially useful when the dataset contains many features, sometimes thousands.

PCA is often used as a preprocessing step to simplify high-dimensional datasets before applying machine learning algorithms. It transforms data into a set of orthogonal components that capture the most variance.
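A minimal sketch of this preprocessing step, using scikit-learn on synthetic data (the dataset, sample counts, and choice of 10 components here are illustrative, not from any real customer data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # 200 samples, 50 features

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Keep the top 10 orthogonal components by explained variance.
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                      # (200, 10)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```

The `explained_variance_ratio_` attribute is a useful sanity check: it tells you how much of the original variance survives the reduction.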

Despite its versatility, PCA is not without limitations, and its application requires careful consideration of the data’s underlying structure.

A significant pitfall here, as illustrated by the example below, is PCA’s linear nature.

➡ PCA assumes that the principal components can linearly recombine to capture the most variance in the data.

⛔ This assumption can lead to misinterpretations when the data contains intrinsic non-linear relationships.

Imagine a scenario where a company wants to segment its customer base to tailor its marketing strategies.

The company collects various customer data points, such as purchase history, product preferences, and engagement levels, and captures their relationships as shown in the left plot.

In this scenario, the concentric circles can represent different segments of customers based on their engagement levels.

The company decides to use PCA for dimensionality reduction to simplify the dataset.

🔺 However, PCA, being a linear method, fails to recognize the radial structure that defines the customer segments.

Points (customers) that are far apart in the original structure of the data end up being projected close to each other after the PCA transformation.

PCA simply finds the direction of maximum variance in a linear sense, as shown in the right plot. That direction does not correspond to the real, underlying customer segments, so the nuanced differences between the customers are lost.
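This failure is easy to reproduce. The sketch below stands in for the concentric-circle scenario using scikit-learn's synthetic `make_circles` data (the radii, noise level, and sample size are illustrative choices): the two rings are perfectly separated by radius in 2D, yet after a linear projection to one principal component their ranges overlap almost entirely.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA

# Two concentric rings: y == 0 is the outer circle, y == 1 the inner.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.02, random_state=0)

# Linear projection onto the single direction of maximum variance.
z = PCA(n_components=1).fit_transform(X).ravel()

inner, outer = z[y == 1], z[y == 0]
print(inner.min(), inner.max())  # inner ring spans roughly [-0.3, 0.3]
print(outer.min(), outer.max())  # outer ring spans roughly [-1, 1]
```

Because a circle projects onto any line as a symmetric interval, the inner ring's 1D range falls entirely inside the outer ring's: the radial separation that defined the segments is gone.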

This could lead to marketing strategies that do not accurately target the distinct needs and behaviors of each segment.

While PCA is a valuable tool in the machine learning toolkit, its application should be guided by a thorough understanding of the data and the specific objectives of the analysis.

For datasets with non-linear structures, alternative techniques such as kernel PCA, t-SNE, or UMAP might be more appropriate.

Finally, the **interpretability** of the transformed features should be considered, especially in applications where understanding the meaning behind the components is crucial for decision-making.

📢 Data-driven decision-making involves more than just “data”. As part of the “Pitfalls of data-driven decision-making series”, I regularly write about the pitfalls of data science and data-driven decision-making.
