import pandas as pd
import seaborn as sns
import numpy as np

sns.set_style("whitegrid")
np.set_printoptions(precision = 3)
pd.set_option('display.precision', 3)

url = "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/compas/compas.csv"
compas = pd.read_csv(url)
5 Introduction to Algorithmic Disparity: COMPAS
Today we are going to study an extremely famous investigation into algorithmic decision-making in the sphere of criminal justice by Angwin et al. (2022), originally written for ProPublica in 2016. This investigation significantly accelerated the pace of research into bias and fairness in machine learning, due to the combination of its simple message and its publicly available data.
What do you notice in the COMPAS sample form? Are there kinds of information collected that concern you in the context of predictive recommendations that may impact a person’s freedom?
You may have already read about the COMPAS algorithm in the original article at ProPublica. Our goal today is to reproduce some of the main findings of this article and set the stage for a more systematic treatment of bias and fairness in machine learning.
Parts of these lecture notes are inspired by the original ProPublica analysis and Allen Downey’s expository case study on the same data.
Data Preparation
Let’s first obtain the data. I’ve hosted a copy on the course website, so we can download it using a URL.
For today we are only going to consider a subset of columns.
cols = ["sex", "race", "decile_score", "two_year_recid"]
compas = compas[cols]
We are also only going to consider white (Caucasian) and Black (African-American) defendants:
is_white = compas["race"] == "Caucasian"
is_black = compas["race"] == "African-American"

compas = compas[is_white | is_black]
compas = compas.copy()
Our data now looks like this:
compas.head()
|   | sex | race | decile_score | two_year_recid |
|---|---|---|---|---|
| 1 | Male | African-American | 3 | 1 |
| 2 | Male | African-American | 4 | 1 |
| 3 | Male | African-American | 8 | 0 |
| 6 | Male | Caucasian | 6 | 1 |
| 8 | Female | Caucasian | 1 | 0 |
Preliminary Explorations
Let’s do some quick exploration of our data. How many defendants of each sex are present in this data?
compas.groupby("sex").size()
sex
Female 1219
Male 4931
dtype: int64
What about race?
compas.groupby("race").size()
race
African-American 3696
Caucasian 2454
dtype: int64
The decile score is the algorithm’s prediction. Higher decile scores indicate that, according to the COMPAS model, the defendant is more likely to be charged with a crime within the next two years. It’s worth emphasizing that this isn’t the same thing as committing a crime: whether or not you’re charged also depends on the behavior of police and prosecutors, which is known to often display racial bias.
In the framework we’ve developed in this class, you can think of the decile score as being produced by computing a score like \(s_i = \langle \mathbf{w}, \mathbf{x}_i \rangle\) for each defendant \(i\), and then dividing these into the lowest 10% (decile score 1), the next 10% (decile score 2), the next 10% (decile score 3) and so on.
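As a rough illustration of this decile construction, here is a sketch (not COMPAS’s proprietary scoring procedure; the raw_scores below are synthetic stand-ins for the \(s_i\)) showing how raw scores can be binned into decile scores with pandas:
# Sketch only: synthetic raw scores stand in for s_i = <w, x_i>.
# pd.qcut bins them into ten equal-sized groups, labeled 1 (lowest 10%) through 10.
rng = np.random.default_rng(seed = 0)
raw_scores = rng.normal(size = 1000)

decile = pd.qcut(raw_scores, q = 10, labels = range(1, 11))
pd.Series(decile).value_counts().sort_index()   # roughly 100 scores per decile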
The easiest way to see how this looks is with a bar chart, which we can make efficiently using the seaborn (sns) package.
counts = compas.groupby(["race", "decile_score"]).size().reset_index(name = "n")

p = sns.barplot(data = counts,
                x = "decile_score",
                y = "n",
                hue = "race",
                palette = "BuPu",
                saturation = 0.5)
You may notice that the number of white defendants who receive a given decile score tends to decrease as the score increases, whereas the number of Black defendants remains relatively constant.
Let’s also take a look at the recidivism rate in the data:
compas["two_year_recid"].mean()
np.float64(0.4661788617886179)
So, in these data, approximately 47% of all defendants went on to be charged with another crime within the next two years. This is sometimes called the prevalence of the outcome. Although this is not a “good” outcome, it is labeled 1 in the target data, and so we refer to it as the “positive” outcome. Prevalence without further specification usually refers to the prevalence of the positive outcome.
The base rate of prediction accuracy in this problem is 53%: if we always guessed that the defendant was not arrested within two years, we would be right 53% of the time.
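We can check this base rate directly from the data (a quick sanity check, reusing the compas data frame from above):
# Accuracy of always guessing "no new charge": this is 1 - prevalence, about 0.53.
1 - compas["two_year_recid"].mean()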
We can also compute the prevalence broken down by race of the defendant:
compas.groupby("race")["two_year_recid"].mean()
race
African-American 0.514
Caucasian 0.394
Name: two_year_recid, dtype: float64
When interpreting these different prevalences, it is important to remember that
- Race is itself a socially-constructed system of human categorization invented by humans with political and economic motives to describe other humans as property (Bonilla-Silva 2018).
- The relation between arrest/charging and actual criminal offense can display racial bias, with effects varying by geography (Fogliato et al. 2021).
- Decisions about which behaviors are criminal are contingent political decisions which have, historically, fallen hardest on Black Americans (Yusef and Yusef 2017).
The prevalences between the two groups are substantially different. This difference will have major consequences later on for the possibility of different kinds of fairness in classifiers.
We’re going to treat the COMPAS algorithm as a binary classifier, but you might notice a problem: the algorithm’s prediction is the decile_score column, which is not actually a 0-1 label. Following the analysis of Angwin et al. (2022), we are going to construct a new binary column in which we say that a defendant is predicted_high_risk if their decile_score is larger than 4.
compas["predicted_high_risk"] = (compas["decile_score"] > 4)
Now that we’ve done that, we can ask: how likely are Black and white defendants to receive positive predictions in this data?
compas.groupby("race")[["two_year_recid", "predicted_high_risk"]].mean()
| race | two_year_recid | predicted_high_risk |
|---|---|---|
| African-American | 0.514 | 0.588 |
| Caucasian | 0.394 | 0.348 |
Black defendants are substantially more likely to receive a positive prediction than white defendants, and this disparity in prediction rates is larger than the disparity in the observed prevalence of the positive outcome.
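One way to make this comparison concrete is to compute both gaps directly, reusing the table above (a small check: the gap in prediction rates comes out to roughly twice the gap in prevalence):
# Gap in prediction rates vs. gap in observed prevalence (Black minus white).
rates = compas.groupby("race")[["two_year_recid", "predicted_high_risk"]].mean()
rates.loc["African-American"] - rates.loc["Caucasian"]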
Is this fair? What is your gut telling you? Yes, no, possibly? What information would you need in order to make a judgment? What is the principle on which your judgment rests?
The ProPublica Findings
Let’s now ask a few questions about the predictive accuracy of this algorithm. First, how accurate is it overall?
compas["correct_prediction"] = (compas["predicted_high_risk"] == compas["two_year_recid"])
compas["correct_prediction"].mean()
np.float64(0.6508943089430894)
Recall that the base rate in this problem is 53%, so our accuracy is somewhat better than random guessing.
What about the accuracy on Black and white defendants separately?
compas.groupby(["race"])["correct_prediction"].mean()
race
African-American 0.638
Caucasian 0.670
Name: correct_prediction, dtype: float64
The overall accuracies for Black and white defendants are comparable, and both are somewhat higher than the base rate of 53%.
What about the error rates? Here is a simple calculation which computes the false positive rate (FPR) in the first row and the true positive rate (TPR) on the bottom row:
compas.groupby(["two_year_recid"])["predicted_high_risk"].mean().reset_index()
|   | two_year_recid | predicted_high_risk |
|---|---|---|
| 0 | 0 | 0.352 |
| 1 | 1 | 0.654 |
However, and this was the main finding of the ProPublica study, the FPR and FNR are very different when we break down the data by race:
compas.groupby(["two_year_recid", "race"])["predicted_high_risk"].mean().reset_index()
|   | two_year_recid | race | predicted_high_risk |
|---|---|---|---|
| 0 | 0 | African-American | 0.448 |
| 1 | 0 | Caucasian | 0.235 |
| 2 | 1 | African-American | 0.720 |
| 3 | 1 | Caucasian | 0.523 |
The false positive rate for Black defendants is much higher than the false positive rate for white defendants. This was the main finding of Angwin et al. (2022). The FPR of 44% for Black defendants means that, out of every 100 Black defendants who in fact will not be charged with another crime, the algorithm nevertheless predicts that 44 of them will. In contrast, the FPR of 23% for white defendants indicates that only 23 out of 100 non-recidivating white defendants would be predicted to be charged with another crime.
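If you’d like to see these error rates computed from explicit counts rather than from a grouped mean, here is an equivalent calculation of the group-wise false positive rates (it should reproduce the 0.448 and 0.235 entries of the table above):
# For each group, FPR = (# predicted high-risk among non-recidivists) / (# non-recidivists).
for group in ["African-American", "Caucasian"]:
    negatives = compas[(compas["race"] == group) & (compas["two_year_recid"] == 0)]
    print(f"{group}: FPR = {negatives['predicted_high_risk'].mean():.3f}")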
There are a few ways in which we can think of this result as reflecting bias:
- The algorithm has learned an implicit pattern wherein Black defendants are intrinsically more “criminal” than white defendants, even among people who factually were not charged again within the time window. This is a bias in the patterns that the algorithm has learned in order to formulate its predictions. This is related to the idea of representational bias, in which algorithms learn and reproduce toxic stereotypes about certain groups of people.
- Regardless of how the algorithm forms its predictions, the impact of the algorithm being used in the penal system is that more Black defendants will be classified as high-risk, resulting in more denials of parole, bail, early release, or other forms of freedom from the penal system. So, the algorithm has disparate impact on people. This is sometimes called allocative or distributional bias: bias in how resources or opportunities (in this case, freedom) are allocated or distributed between groups.
We can think about the argument of Angwin et al. (2022) as a two-step argument:
- The COMPAS algorithm has disparate error rates by race.
- Therefore, the COMPAS algorithm is unjustly biased with respect to race.
This argument implicitly equates equality of error rates with lack of bias.
- Suppose that we developed an alternative algorithm in which the false positive rates were equal, but there were still more positive predictions for Black defendants overall. Would that be enough to ensure fairness?
- Suppose that we developed an alternative prediction algorithm in which the rate of positive prediction was the same across racial groups, but the false positive rates were different. Would that be enough to ensure fairness?
The Rebuttal
Angwin et al. (2022) kicked off a vigorous discussion about what it means for an algorithm to be fair and how to measure deviations from fairness. In particular, Northpointe, the company that developed COMPAS, issued a report by Flores, Bechtel, and Lowenkamp (2016) in which they argued that their algorithm was fair. Their argument is based on an idea of fairness which is sometimes called sufficiency (Corbett-Davies et al. 2017).
Here’s the intuition expressed by sufficiency. Imagine that you and your friend both received an A- in Data Structures. Suppose, however, that the instructor says different things to each of you:
- To you, the instructor says: “You did fine in this class, but I don’t think that you are prepared to take Computer Architecture. I gave you a higher grade than I would normally because you wear cool hats in class.”
- To your friend, the instructor says: “You did fine in this class and I think you are prepared to take Computer Architecture. Some students got a bump in their grade because they are cool-hat-wearers, but you didn’t get that benefit.”
Feels unfair, right? The instructor is saying that:
What a grade means for you in terms of your future success depends on your identity group.
Suppose that you heard this, but instead of cool hats it was because you are a member of an identity group that “needs some help” in order to achieve equitable representation in the CS major. How would you feel? Would that feel fair to you?
We’ll formally define sufficiency in a future lecture. For now, let’s use an informal definition:
Sufficiency means that a positive prediction means the same thing for future outcomes for each racial group.
To operationalize this idea, we are looking for the rate of re-arrest to be the same between (a) Black defendants who received a positive prediction and (b) white defendants who received a positive prediction.
Let’s check this:
compas.groupby(["predicted_high_risk", "race"])["two_year_recid"].mean().reset_index()
|   | predicted_high_risk | race | two_year_recid |
|---|---|---|---|
| 0 | False | African-American | 0.350 |
| 1 | False | Caucasian | 0.288 |
| 2 | True | African-American | 0.630 |
| 3 | True | Caucasian | 0.591 |
The rates of rearrest are relatively similar between groups once we control for the prediction they received. A formal statistical hypothesis test could be used to decide whether the remaining difference is large enough to matter; in most of the published literature, scholars have treated the two rates as close enough that COMPAS is usually described as coming relatively close to satisfying sufficiency.
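As an illustration of what such a test might look like (this is a sketch, not part of the original ProPublica or Northpointe analyses), one could compare the rearrest proportions among positively-predicted defendants across the two groups with a two-proportion z-test, for example via statsmodels:
# Illustrative sketch: two-proportion z-test for rearrest rates among defendants
# who received a high-risk prediction, comparing the two racial groups.
from statsmodels.stats.proportion import proportions_ztest

high_risk = compas[compas["predicted_high_risk"]]
rearrests = high_risk.groupby("race")["two_year_recid"].sum()
totals    = high_risk.groupby("race")["two_year_recid"].count()

stat, p_value = proportions_ztest(count = rearrests.values, nobs = totals.values)
print(f"z = {stat:.2f}, p = {p_value:.3f}")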
Indeed, in a rejoinder article published by affiliates of the company Northpointe which produced COMPAS, the fact that COMPAS satisfies sufficiency is one of the primary arguments (Flores, Bechtel, and Lowenkamp 2016).
Recap
In these notes, we replicated the data analysis of Angwin et al. (2022), finding that the COMPAS algorithm has disparate error rates between Black and white defendants. We introduced the idea that fairness actually has several different facets in our moral intuitions, and found that the COMPAS algorithm satisfies one of them (sufficiency: equal scores mean the same thing regardless of your group membership) but not the others (equal prediction rates and equal error rates).
Some Questions Moving Forward
- Can we have it all? Could we modify the COMPAS algorithm in such a way that it satisfies all the ideas of fairness that we discussed above? Could we then call it “fair” or “unbiased?”
- Are there other ways to define fairness? Which ones are most compelling to us? Does the right idea of fairness depend on the context in which we apply it?
- How did this happen? The COMPAS algorithm was never trained on race data about the defendant. How did it happen that this algorithm nevertheless made recommendations at different rates across groups?
- Is automated decision-making legitimate in this setting? Can it be legitimate (just, fair) to use an automated decision-system for making recommendations about parole and sentencing decisions at all? What safeguards and forms of recourse are necessary for the legitimate use of automated decision-making in criminal justice?
- What are the systemic impacts? Disparate sentencing decisions can have downstream impacts on communities and institutions. How could application of the COMPAS algorithm exacerbate systemic inequalities?
References
© Phil Chodrow, 2025