Fairness Metrics in Machine Learning (Part 1): Group-based metrics

As fair machine learning (ML) has become a booming area of research, people have proposed many “fairness” metrics. In this post, I want to summarize some of those metrics. I only started working on fairness in ML a few months ago, and I still get confused between different terms. So, by writing this post, I hope to reinforce my own memory 🙂


Before I delve into fairness metrics, let me briefly introduce the field of fair ML. To do that, let me begin with my personal recollection of the field of ML. When I applied for my PhD in 2013, I had never heard the term “machine learning”. It hadn’t reached a clueless undergraduate student in Korea. When I started my PhD in 2014 at CMU, everybody was talking about machine learning. In fact, one of the first classes I took at CMU was the famous “10-701: Introduction to Machine Learning”. ML was still a somewhat vague term back then. Statistics professors used to claim that ML is just a dressed-up term for statistics that CS people came up with. At conferences, I heard scholars say that if you want to get a job in industry, you have to do ML or at least connect your research to ML. The hype was definitely there in 2014. At the same time, people were wary because of the hype. Lots of theory researchers around me were flat-out against deep learning in particular.


Fast forward to 2020: the ML hype has not died out as a fad. The term that was unfamiliar to a CS student in 2014 has become well known to the general public. It has seeped into many industries and our everyday life. Even people who are not so tech-savvy know that Google Translate works so well because of deep learning. Image processing, one of the domains that has benefited the most from recent ML advances, has created a huge self-driving car industry. It has long been claimed that ML technologies will upend the healthcare industry, for example, that ML-based image processing will replace the role of radiologists. ML is also becoming an essential tool in science and engineering research, ranging from astrophysics to biology.


With its prevalence and wide applicability comes greater liability. The application of ML in safety-critical areas such as self-driving cars or medicine has been a hot topic of debate since the beginning. Another facet of liability that has surfaced is “fairness”. Does this new technology give more advantage to a certain group of people than to others? While writing this, I got sort of curious whether this kind of question is unique to ML or is commonly asked whenever a new technology comes about. Maybe I should defer this question to another blog post 😉


So, what would be an example of “unfair” ML? One example: an ML model trained to detect people for a self-driving car can perform worse on Black people than on white people, which puts Black people at a higher risk of a car accident. Similarly, an ML model trained to predict the risk of cancer might perform better on men than on women. An example of more direct unfair treatment is an ML model used to screen candidates for a tech-related job: the model might admit more male candidates than female candidates, thus providing less opportunity for qualified women. These problems are not hypothetical. Some versions of these examples have been observed multiple times in practice. (Maybe I need to add references here..)


As these problems have been recognized and have gained public attention, research interest in fair ML started to grow rapidly. See the figure below for the number of papers in this field in the past 7 years. (Is there any data that includes up to 2020?) There is also a new online book on this topic which I referenced for this blog post [1]. 

Fig: The number of publications on fairness from 2011 to 2017. (Borrowed from a Towards Data Science post.)





Okay, the short prelude became too lengthy. Now, let’s really talk about how to define fairness mathematically. To do this, I have to introduce some notation. Let us define random variables as follows:

  • A: Protected attribute. E.g., gender, race, … 
  • Y: Target variable (ground-truth outcome). E.g., hired/not hired
  • R: Prediction outcome. E.g., hired/not hired, or score (probability of being hired)


Commonly used criteria for non-discrimination are based on (conditional) “independence” between these variables. First, note that in lots of cases where fairness issues arise, A and Y are NOT independent. For instance, even if you hire purely based on merit, people from a certain group may be hired at a higher rate than others. As another example, in healthcare, people from a certain group can have a higher risk of the disease being predicted.


There are three types of independence people consider for fairness:


1. Statistical Parity (Independence): R \perp A

Statistical parity is a very strong notion of fairness. What it says is that even if there is dependence between Y and A, we force the model’s prediction R to not depend on A. For instance, let’s say that hiring purely based on merit would result in hiring 50% of the male candidates and 30% of the female candidates. An ML model with the statistical parity property would instead hire, say, 40% of the male candidates and 40% of the female candidates.
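To make the hiring example concrete, here is a minimal sketch that computes the selection rate P(R=1 | A=a) for each group and the gap between groups. The function names and the toy data (10 candidates per group) are made up for illustration; they are not from any particular library.

```python
from collections import defaultdict


def selection_rates(predictions, groups):
    """Return P(R = 1 | A = a) for every group a."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for r, a in zip(predictions, groups):
        totals[a] += 1
        positives[a] += int(r == 1)
    return {a: positives[a] / totals[a] for a in totals}


def statistical_parity_difference(predictions, groups):
    """Largest gap in selection rate between groups (0 means parity)."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical toy data: 10 male and 10 female candidates.
groups = ["male"] * 10 + ["female"] * 10
merit_based = [1] * 5 + [0] * 5 + [1] * 3 + [0] * 7   # 50% vs 30% hired
parity_based = [1] * 4 + [0] * 6 + [1] * 4 + [0] * 6  # 40% vs 40% hired

print(selection_rates(merit_based, groups))
print(statistical_parity_difference(merit_based, groups))
print(statistical_parity_difference(parity_based, groups))
```

Note that the check never looks at Y: statistical parity only constrains the predictions.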


2. Equalized Odds (Separation): R \perp A \;|\; Y

This looks similar to statistical parity, but it is conditioned on Y. So, it relaxes statistical parity in the sense that R is allowed to depend on A through Y. Consider the binary case where Y \in \{ 0, 1\} and our prediction is also R \in \{ 0, 1 \}. Let’s also assume that the protected attribute A has two classes: A \in \{ \triangle, \square \}. Equalized odds doesn’t require that class \triangle and class \square receive R=1 at the same rate. Instead, what it enforces is the following:
P(R = 1 | Y=1, A= \triangle) = P(R = 1 | Y=1, A= \square),
P(R = 1 | Y=0, A= \triangle) = P(R = 1 | Y=0, A= \square).
This means that both groups \triangle and \square have the same true positive rate (TPR) and the same false positive rate (FPR).
A violation of this criterion is also referred to as “disparate mistreatment”, because what this metric prevents is different misclassification rates between groups. Since the false negative rate is 1 - TPR, it equalizes both the false positive rate and the false negative rate.
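The two equalities above can be checked directly from labeled predictions. The following is a minimal sketch, assuming binary Y and R and that every group contains at least one example with Y=1 and one with Y=0 (otherwise the rates are undefined). The helper names and the toy data are made up for illustration.

```python
def group_rates(y_true, y_pred, groups):
    """Return {group: (TPR, FPR)} for binary labels and predictions."""
    counts = {}
    for y, r, a in zip(y_true, y_pred, groups):
        c = counts.setdefault(a, {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
        if y == 1:
            c["tp" if r == 1 else "fn"] += 1
        else:
            c["fp" if r == 1 else "tn"] += 1
    return {
        a: (c["tp"] / (c["tp"] + c["fn"]), c["fp"] / (c["fp"] + c["tn"]))
        for a, c in counts.items()
    }


def satisfies_equalized_odds(y_true, y_pred, groups, tol=1e-9):
    """True if TPR and FPR each agree across all groups, up to tol."""
    rates = list(group_rates(y_true, y_pred, groups).values())
    tprs = [tpr for tpr, _ in rates]
    fprs = [fpr for _, fpr in rates]
    return max(tprs) - min(tprs) <= tol and max(fprs) - min(fprs) <= tol


# Hypothetical toy data: the groups have different base rates of Y=1
# (4 of 8 vs 8 of 12) but identical TPR (0.75) and FPR (0.25).
groups = ["triangle"] * 8 + ["square"] * 12
y_true = [1, 1, 1, 1, 0, 0, 0, 0] + [1] * 8 + [0] * 4
y_pred = [1, 1, 1, 0, 1, 0, 0, 0] + [1] * 6 + [0] * 2 + [1] + [0] * 3

print(group_rates(y_true, y_pred, groups))
print(satisfies_equalized_odds(y_true, y_pred, groups))
```

In this toy data the selection rates differ (4/8 for triangle, 7/12 for square), so equalized odds holds while statistical parity does not.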

3. Predictive Rate Parity (Sufficiency): Y \perp A \;|\; R

This again looks similar to the previous definition, but with Y and R swapped. So, instead of equalizing P(R|Y) across groups, this equalizes the posterior distribution P(Y|R) across groups. I.e.,

P(Y = 1 | R=1, A= \triangle) = P(Y = 1 | R=1, A= \square),
P(Y = 1 | R=0, A= \triangle) = P(Y = 1 | R=0, A= \square).


This implies equality of the positive predictive value (PPV) and the negative predictive value (NPV) across groups.

This is also called “conditional use accuracy equality”. I think this term was coined in this paper [4]. I guess the term “conditional” comes from conditioning on the prediction R, and the authors state that the term “use” is there because “when risk is actually determined, the predicted outcome is employed; this is how risk assessments are used in the field.”
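As a minimal sketch of what checking this criterion could look like, the snippet below computes PPV and NPV per group. It assumes binary Y and R and that every group receives both predictions at least once. The function name and the toy data are made up for illustration.

```python
def predictive_values(y_true, y_pred, groups):
    """Return {group: (PPV, NPV)} for binary labels and predictions."""
    tallies = {}
    for y, r, a in zip(y_true, y_pred, groups):
        # Key each tally by (prediction, label) so PPV and NPV are simple
        # ratios within one predicted class.
        t = tallies.setdefault(a, {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0})
        t[(r, y)] += 1
    out = {}
    for a, t in tallies.items():
        ppv = t[(1, 1)] / (t[(1, 1)] + t[(1, 0)])  # P(Y=1 | R=1, A=a)
        npv = t[(0, 0)] / (t[(0, 0)] + t[(0, 1)])  # P(Y=0 | R=0, A=a)
        out[a] = (ppv, npv)
    return out


# Hypothetical toy data, 20 people per group.
# triangle: 10 predicted positive (8 correct), 10 predicted negative (8 correct).
# square:    5 predicted positive (4 correct), 15 predicted negative (12 correct).
groups = ["triangle"] * 20 + ["square"] * 20
y_pred = [1] * 10 + [0] * 10 + [1] * 5 + [0] * 15
y_true = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8 + [1] * 4 + [0] * 1 + [1] * 3 + [0] * 12

print(predictive_values(y_true, y_pred, groups))
```

Both groups end up with PPV = NPV = 0.8, yet their true positive rates differ (8/10 for triangle, 4/7 for square), so this toy model satisfies predictive rate parity without satisfying equalized odds.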





Now, let’s look at some examples to understand these three different criteria better. 

The table shows the distribution of Y across two classes \triangle and \square. (As the fractions below imply, each class has 50 people, of whom 30 have Y=1 in class \triangle and 20 have Y=1 in class \square.) If we have a model that assigns R as given in the table, it satisfies statistical parity: P(R=1|\triangle) = P(R=1|\square) = 25/50 = 0.5. However, this model satisfies neither equalized odds nor predictive rate parity.

  • Equalized Odds — unsatisfied: P(R=1|Y=1, \triangle) =25/30 , but P(R=1|Y=1, \square) =20/20=1
  • Predictive Rate Parity — unsatisfied: P(Y=1|R=1, \triangle) = 25/25=1, but P(Y=1|R=1,\square) = 20/25
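The numbers above can be reproduced from per-group confusion counts. The counts in this sketch are not copied from the table; they are inferred from the fractions quoted in the text (25/50, 25/30, 20/20, 25/25, 20/25), so treat them as an assumption. The `summarize` helper is made up for illustration.

```python
from fractions import Fraction

# Confusion counts per group for the first model, inferred from the
# fractions quoted in the text (an assumption, not the original table).
model_1 = {
    "triangle": {"tp": 25, "fn": 5, "fp": 0, "tn": 20},
    "square": {"tp": 20, "fn": 0, "fp": 5, "tn": 25},
}


def summarize(c):
    """Exact selection rate, TPR, FPR and PPV for one group's counts."""
    n = c["tp"] + c["fn"] + c["fp"] + c["tn"]
    return {
        "selection_rate": Fraction(c["tp"] + c["fp"], n),  # P(R=1 | A)
        "tpr": Fraction(c["tp"], c["tp"] + c["fn"]),       # P(R=1 | Y=1, A)
        "fpr": Fraction(c["fp"], c["fp"] + c["tn"]),       # P(R=1 | Y=0, A)
        "ppv": Fraction(c["tp"], c["tp"] + c["fp"]),       # P(Y=1 | R=1, A)
    }


summary = {group: summarize(counts) for group, counts in model_1.items()}
for group, stats in summary.items():
    print(group, {name: str(value) for name, value in stats.items()})
```

Using `Fraction` keeps the comparison exact: the selection rates match (1/2 in both groups), while TPR (5/6 vs 1) and PPV (1 vs 4/5) do not.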

On the same dataset, we can train a different model that assigns R as below.

This satisfies equalized odds: P(R=1|Y=1, \triangle) = P(R=1|Y=1, \square) = 0.9 and P(R=0|Y=0, \triangle) = P(R=0|Y=0, \square) = 0.9. However, this model doesn’t satisfy statistical parity or predictive rate parity. 

  • Statistical Parity — unsatisfied: P(R=1 | \triangle) = 29/50, but P(R=1|\square) = 21/50
  • Predictive Rate Parity – unsatisfied: P(Y=1 | R=1, \triangle) = 27/29 \approx 0.93, but P(Y=1|R=1, \square) = 18/21 \approx 0.86
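Here is a small sketch showing how these numbers follow from the group composition once the true positive and true negative rates are both fixed at 0.9. The group sizes (30 of 50 with Y=1 in \triangle, 20 of 50 in \square) are inferred from the fractions quoted in the text, so they are an assumption rather than a copy of the table.

```python
from fractions import Fraction

# Group composition inferred from the fractions quoted in the text
# (an assumption: 50 individuals per group).
positives = {"triangle": 30, "square": 20}  # number with Y=1
negatives = {"triangle": 20, "square": 30}  # number with Y=0

tpr = Fraction(9, 10)  # P(R=1 | Y=1), equal across groups
tnr = Fraction(9, 10)  # P(R=0 | Y=0), equal across groups

results = {}
for group in positives:
    true_pos = tpr * positives[group]
    false_pos = (1 - tnr) * negatives[group]
    predicted_pos = true_pos + false_pos
    results[group] = {
        "selection_rate": predicted_pos / (positives[group] + negatives[group]),
        "ppv": true_pos / predicted_pos,
    }

for group, stats in results.items():
    print(group, {name: str(value) for name, value in stats.items()})
```

Because the two groups have different shares of Y=1, equal error rates translate into different selection rates and different PPVs.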


Hopefully, these examples give you some idea of when these parity conditions are satisfied. Also, as you may notice in these examples, even though these conditions look very similar, it is hard to satisfy two of them at the same time. In fact, you can prove that, under mild assumptions, it is impossible to satisfy any two of these three conditions at the same time (see Chapter 2 in [1]).
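For one pair of conditions, equalized odds and predictive rate parity, the conflict can be sketched with Bayes’ rule in the binary case. This is only an illustration of the intuition, not the proof from [1]; the symbol p_a for the base rate of group a is introduced here and is not part of the notation above.

```latex
% p_a = P(Y = 1 \mid A = a) is the base rate of group a.
\mathrm{PPV}_a
  = P(Y = 1 \mid R = 1, A = a)
  = \frac{\mathrm{TPR}_a \, p_a}{\mathrm{TPR}_a \, p_a + \mathrm{FPR}_a \, (1 - p_a)}
  = \frac{1}{1 + \dfrac{\mathrm{FPR}_a}{\mathrm{TPR}_a} \cdot \dfrac{1 - p_a}{p_a}}
```

If equalized odds holds, TPR_a and FPR_a are the same for both groups. Then, as long as TPR > 0 and FPR > 0, the right-hand side is strictly increasing in p_a, so groups with different base rates necessarily get different PPVs. With the numbers from the second example (TPR = 0.9, FPR = 0.1, base rates 0.6 and 0.4), this gives 0.54/0.58 \approx 0.93 and 0.36/0.42 \approx 0.86, matching the values above.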


Finally, I want to mention that these metrics are all based on counting the prediction results across groups and equalizing them. There exist completely different notions of fairness: individual fairness and counterfactual/causal fairness. I will cover them in a different post because I want to go run now 🏃🏻‍♀️☺️



[1] Fair Machine Learning Book

[2] Verma & Rubin, “Fairness Definitions Explained”, ACM/IEEE International Workshop on Software Fairness (2018)

[3] A Tutorial on Fairness in Machine Learning, a blog post

[4] Berk et al., “Fairness in Criminal Justice Risk Assessments: The State of the Art”, Sociological Methods & Research (2018)
