Common Metrics to Evaluate a Model’s Fairness

Shadi Balandeh
May 28, 2024


Yup! They made the model technically “fair” by making it less accurate equally across all groups!

AI models must be fair: fairness is a core principle of responsible AI.

Thankfully, there are metrics to quantify fairness. However, there is more than one, and they can sometimes give contradictory results. That’s why it’s crucial to understand each and pick the one most relevant to the use case.

Let’s review two interesting examples:

𝐃𝐢𝐬𝐩𝐚𝐫𝐚𝐭𝐞 𝐈𝐦𝐩𝐚𝐜𝐭 𝐑𝐚𝐭𝐢𝐨 (𝐃𝐈𝐑) is a common fairness metric. DIR is the rate of positive outcomes in the minority group divided by the rate in the majority group. A DIR below 80% often indicates potential discriminatory effects.

For example, if a lending model approves loans for 80% of male applicants but only 60% of female applicants, the DIR would be 75% (60/80), suggesting a potential bias against female applicants.
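
To make the arithmetic concrete, here is a minimal Python sketch of the DIR calculation. The function name and the toy approval numbers are just illustrative, not taken from any particular fairness library:

```python
import numpy as np

def disparate_impact_ratio(y_pred, group):
    """DIR = P(positive outcome | minority) / P(positive outcome | majority).

    y_pred: array of 0/1 decisions (1 = favorable outcome, e.g. loan approved)
    group:  array of group labels ("minority" / "majority" are hypothetical labels)
    """
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rate_minority = y_pred[group == "minority"].mean()
    rate_majority = y_pred[group == "majority"].mean()
    return rate_minority / rate_majority

# Toy data matching the lending example: 60% vs. 80% approval rates
approvals = np.array([1] * 60 + [0] * 40 + [1] * 80 + [0] * 20)
groups = np.array(["minority"] * 100 + ["majority"] * 100)
print(round(disparate_impact_ratio(approvals, groups), 2))  # 0.75, below the 0.8 rule of thumb
```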

While DIR is a good first step in assessing model fairness, it is generally insufficient on its own. Here is a real example of why:

A widely used criminal risk assessment tool, COMPAS, predicts a defendant’s risk of re-offending within two years. When COMPAS was evaluated on DIR across racial groups, the ratio was close to 1, suggesting no bias*.

However, a closer investigation revealed that COMPAS was twice as likely to incorrectly classify Black defendants as higher risk compared to White defendants!

The reason for this discrepancy was that COMPAS maintained demographic parity by producing a higher false negative rate (defendants labeled low risk who went on to re-offend) for White defendants than for Black defendants. Essentially, it achieved statistical parity by balancing inaccuracies across the groups!

🛑 Focusing only on DIR meant that other important fairness criteria, like equality of false positive and false negative rates across groups, were violated, leading to discriminatory outcomes.
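
As a concrete illustration, here is a minimal sketch of the kind of per-group error-rate check that surfaces this problem even when DIR looks fine. The function name and the commented-out usage are hypothetical:

```python
import numpy as np

def error_rates_by_group(y_true, y_pred, group):
    """False positive and false negative rates per group.

    Large gaps between groups violate error-rate fairness criteria
    even when DIR is close to 1.
    """
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rates = {}
    for g in np.unique(group):
        yt, yp = y_true[group == g], y_pred[group == g]
        fpr = np.mean(yp[yt == 0] == 1)  # flagged high risk but did not re-offend
        fnr = np.mean(yp[yt == 1] == 0)  # labeled low risk but did re-offend
        rates[g] = {"FPR": float(fpr), "FNR": float(fnr)}
    return rates

# Hypothetical usage (arrays of 0/1 outcomes, 0/1 predictions, and group labels):
# print(error_rates_by_group(reoffended, predicted_high_risk, race))
```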

𝐂𝐨𝐮𝐧𝐭𝐞𝐫𝐟𝐚𝐜𝐭𝐮𝐚𝐥 𝐅𝐚𝐢𝐫𝐧𝐞𝐬𝐬 is another commonly used metric, especially when dealing with protected attributes (sex, race, religion, etc.).

This metric checks whether the model’s decision for an individual would stay the same if a sensitive attribute were changed while everything else stayed fixed (a minimal sketch of this check follows the example below). A common starting point is simply to exclude protected characteristics from the model.

For instance, in the case of estimating diabetes risk, one could use only BMI and age and exclude race.
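
Strictly speaking, counterfactual fairness is defined through a causal model; a simple “flip the attribute and compare predictions” check is only a first approximation. Below is a minimal sketch under that assumption; the function, the column name, and the model object are all hypothetical:

```python
import numpy as np

def counterfactual_flip_rate(model, X, sensitive_col, value_a, value_b):
    """Fraction of rows whose prediction changes when the sensitive attribute
    is swapped between two values, with every other feature held fixed.
    A model that is counterfactually fair w.r.t. this attribute scores 0.

    X is assumed to be a pandas DataFrame; model is any object with .predict().
    """
    X_a, X_b = X.copy(), X.copy()
    X_a[sensitive_col] = value_a  # everyone treated as value_a
    X_b[sensitive_col] = value_b  # everyone treated as value_b
    return float(np.mean(model.predict(X_a) != model.predict(X_b)))

# Hypothetical usage with a scikit-learn-style classifier:
# flip_rate = counterfactual_flip_rate(risk_model, X, "race", "asian", "white")
# print(f"{flip_rate:.1%} of predictions change when race is flipped")
```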

However, excluding race can ultimately harm both White and Asian patients, because Asian patients have a higher incidence of diabetes at the same BMI and age (different base rates)*.

🛑 As a result, the race-blind model systematically underestimates risk for Asian patients and systematically overestimates it for White patients.

These cases highlight the importance of examining multiple fairness metrics and aligning the choice of metrics with the specific context and potential real-world impacts. Optimizing for one metric alone can lead to highly undesirable outcomes from other perspectives.

References to the studies: https://dl.acm.org/doi/pdf/10.5555/3648699.3649011
