5  Introduction to Algorithmic Disparity: COMPAS

Download the live notebook corresponding to these notes here.

Today we are going to study an extremely famous investigation into algorithmic decision-making in the sphere of criminal justice by Angwin et al. (2022), originally written for ProPublica in 2016. This investigation significantly accelerated the pace of research into bias and fairness in machine learning, due to the combination of its simple message and its publicly available data.

It’s helpful to look at a sample form used for feature collection in the COMPAS risk assessment.

You may have already read about the COMPAS algorithm in the original article at ProPublica. Our goal today is to reproduce some of the main findings of this article and set the stage for a more systematic treatment of bias and fairness in machine learning.

Parts of these lecture notes are inspired by the original ProPublica analysis and Allen Downey’s expository case study on the same data.

Data Preparation

Let’s first obtain the data. I’ve hosted a copy on the course website, so we can download it using a URL.

This data set was obtained by Angwin et al. (2022) through a public records request. The data comprises two years' worth of COMPAS scoring in Broward County, Florida.
import pandas as pd
import seaborn as sns
import numpy as np
sns.set_style("whitegrid")
np.set_printoptions(precision = 3)
pd.set_option('display.precision', 3)

url = "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/compas/compas.csv"
compas = pd.read_csv(url)

For today we are only going to consider a subset of columns.

cols = ["sex", "race", "decile_score", "two_year_recid"]
compas = compas[cols]

We are also only going to consider white (Caucasian) and Black (African-American) defendants:

is_white = compas["race"] == "Caucasian"
is_black = compas["race"] == "African-American"

compas = compas[is_white | is_black]
compas = compas.copy()

Our data now looks like this:

compas.head()
      sex              race  decile_score  two_year_recid
1    Male  African-American             3               1
2    Male  African-American             4               1
3    Male  African-American             8               0
6    Male         Caucasian             6               1
8  Female         Caucasian             1               0

Preliminary Explorations

Let’s do some quick exploration of our data. How many defendants are present in this data of each sex?

compas.groupby("sex").size()
sex
Female    1219
Male      4931
dtype: int64

What about race?

compas.groupby("race").size()
race
African-American    3696
Caucasian           2454
dtype: int64

The decile score is the algorithm’s prediction. Higher decile scores indicate that, according to the COMPAS model, the defendant has a higher likelihood of being charged with a crime within the next two years. In the framework we’ve developed in this class, you can think of the decile score as being produced by computing a score like \(s_i = \langle \mathbf{w}, \mathbf{x}_i \rangle\) for each defendant \(i\), and then dividing these scores into the lowest 10% (decile score 1), the next 10% (decile score 2), the next 10% (decile score 3), and so on.
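COMPAS’s actual model and inputs are proprietary, so the sketch below only illustrates the binning step under assumptions of our own: the feature matrix X and weight vector w are random stand-ins, not the real COMPAS inputs, and pd.qcut is just one convenient way to split scores into deciles.

# hypothetical stand-ins: these are NOT the real COMPAS features or weights
rng = np.random.default_rng(123)
X = rng.normal(size = (1000, 5))    # one row of features per defendant
w = rng.normal(size = 5)            # weight vector

s = X @ w                           # raw score s_i = <w, x_i> for each defendant

# bin the raw scores into deciles: lowest 10% -> 1, ..., highest 10% -> 10
decile_score = pd.qcut(s, q = 10, labels = range(1, 11))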

The easiest way to see how the decile scores are distributed across the two racial groups is with a bar chart, which we can make efficiently using the seaborn (sns) package.

counts = compas.groupby(["race", "decile_score"]).size().reset_index(name = "n")
p = sns.barplot(data = counts, 
                x = "decile_score", 
                y = "n", 
                hue = "race", 
                palette = "BuPu", 
                saturation = 0.5)

You may notice that the number of white defendants who receive a given decile score tends to decrease as the score increases, whereas the number of Black defendants remains relatively constant.

Let’s also take a look at the recidivism rate in the data:

compas["two_year_recid"].mean()
0.4661788617886179

So, in these data, approximately 47% of all defendants went on to be charged with another crime within the next two years. This is sometimes called the prevalence of the outcome. Although this is not a “good” outcome, it is labeled 1 in the target data and so we refer to this as the “positive” outcome. Prevalence without further specification usually refers to prevalence of the positive outcome.

The base rate of prediction accuracy in this problem is 53%: if we always guessed that the defendant was not arrested within two years, we would be right 53% of the time.
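For example, this base rate is just the accuracy of the constant classifier that always predicts the majority class:

# accuracy of always predicting "no recidivism within two years"
1 - compas["two_year_recid"].mean()
0.5338211382113821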

We can also compute the prevalence broken down by race of the defendant:

compas.groupby("race")["two_year_recid"].mean()
race
African-American    0.514
Caucasian           0.394
Name: two_year_recid, dtype: float64

When interpreting these different prevalences, it is important to remember that

  • Race is itself a socially-constructed system of human categorization invented by humans with political and economic motives to describe other humans as property (Bonilla-Silva 2018).
  • The relation between arrest and actual criminal offense can display racial bias, with effects varying by geography (Fogliato et al. 2021).
  • Decisions about which behaviors are criminal are contingent political decisions which have, historically, fallen hardest on Black Americans (Yusef and Yusef 2017).

The prevalences between the two groups are substantially different. This difference will have major consequences later on for the possibility of different kinds of fairness in classifiers.

We’re going to treat the COMPAS algorithm as a binary classifier, but you might notice a problem: the algorithm’s prediction is the decile_score column, which is not actually a 0-1 label. Following the analysis of Angwin et al. (2022), we are going to construct a new binary column in which we say that a defendant is predicted_high_risk if their decile_score is larger than 4.

compas["predicted_high_risk"] = (compas["decile_score"] > 4)

Now that we’ve done that, we can ask: how likely are Black and white defendants to receive positive predictions in this data?

compas.groupby("race")["two_year_recid", "predicted_high_risk"].mean()
/var/folders/xn/wvbwvw0d6dx46h9_2bkrknnw0000gn/T/ipykernel_18771/3539224628.py:1: FutureWarning:

Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
two_year_recid predicted_high_risk
race
African-American 0.514 0.588
Caucasian 0.394 0.348

Black defendants are substantially more likely to receive a positive prediction than white defendants (roughly 59% vs. 35%), and this gap is larger than the gap in the observed prevalence of the positive outcome (roughly 51% vs. 39%).

Fairness (Part 1)

Is this fair? What is your gut telling you? Yes, no, possibly? What information would you need in order to make a judgment? What is the principle on which your judgment rests?

The ProPublica Findings

Let’s now ask a few questions about the predictive accuracy of this algorithm. First, how accurate is it overall?

compas["correct_prediction"] = (compas["predicted_high_risk"] == compas["two_year_recid"])
compas["correct_prediction"].mean()
0.6508943089430894

Recall that the base rate in this problem is 53%, so this overall accuracy of roughly 65% is somewhat better than always predicting the majority class.

What about the accuracy on Black and white defendants separately?

compas.groupby(["race"])["correct_prediction"].mean()
race
African-American    0.638
Caucasian           0.670
Name: correct_prediction, dtype: float64

The overall accuracies for Black and white defendants are comparable, and both are somewhat higher than the base rate of 53%.

What about the error rates? Here is a simple calculation: grouping by the true outcome and averaging the predictions gives the false positive rate (FPR) in the first row (true outcome 0) and the true positive rate (TPR) in the second row (true outcome 1):

compas.groupby(["two_year_recid"])["predicted_high_risk"].mean().reset_index()
   two_year_recid  predicted_high_risk
0               0                0.352
1               1                0.654

However, and this was the main finding of the ProPublica study, the FPR and TPR (and therefore also the FNR) are very different when we break down the data by race:

compas.groupby(["two_year_recid", "race"])["predicted_high_risk"].mean().reset_index()
   two_year_recid              race  predicted_high_risk
0               0  African-American                0.448
1               0         Caucasian                0.235
2               1  African-American                0.720
3               1         Caucasian                0.523

The false positive rate for Black defendants is much higher than the false positive rate for white defendants. This was the main finding of Angwin et al. (2022). The FPR of 44% for Black defendants means that, out of every 100 Black defendants who in fact will not commit another crime, the algorithm nevertheless predicts that 44 of them will. In contrast, the FPR of 23% for white defendants indicates that only 23 out of 100 non-recidivating white defendants would be predicted to recidivate.
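To make the connection to the standard confusion-matrix definitions explicit, here is one way to compute the FPR and FNR for each group directly. This is our own sketch, using only the columns constructed above, rather than part of the original ProPublica code.

def group_error_rates(df):
    # FPR: fraction of defendants who did not recidivate but were predicted high risk
    fpr = df.loc[df["two_year_recid"] == 0, "predicted_high_risk"].mean()
    # FNR: fraction of defendants who did recidivate but were predicted low risk
    fnr = 1 - df.loc[df["two_year_recid"] == 1, "predicted_high_risk"].mean()
    return pd.Series({"FPR": fpr, "FNR": fnr})

compas.groupby("race").apply(group_error_rates)

The FPR column should reproduce the 0.448 and 0.235 figures above, while the FNRs are one minus the TPRs: roughly 0.28 for Black defendants and 0.48 for white defendants.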

There are a few ways in which we can think of this result as reflecting bias:

  • The algorithm has learned an implicit pattern wherein Black defendants are intrinsically more “criminal” than white defendants, even among people who factually never committed another crime. This is a bias in the patterns that the algorithm has learned in order to formulate its predictions. This is related to the idea of representational bias, in which algorithms learn and reproduce toxic stereotypes about certain groups of people.
  • Regardless of how the algorithm forms its predictions, the impact of the algorithm being used in the penal system is that more Black defendants will be classified as high-risk, resulting in more denials of parole, bail, early release, or other forms of freedom from the penal system. So, the algorithm has disparate impact on people. This is sometimes called allocative or distributional bias: bias in how resources or opportunities (in this case, freedom) are allocated or distributed between groups.

The requirement that the FPRs be equal across groups is sometimes called predictive equality; sometimes predictive equality is also defined to require that the false negative rates (FNRs) be equal across the two groups as well.

We can think about the argument of Angwin et al. (2022) as a two-step argument:

  1. The COMPAS algorithm has disparate error rates by race.
  2. Therefore, the COMPAS algorithm is unjustly biased with respect to race.

This argument implicitly equates equality of error rates with lack of bias.

Fairness (Part 2)
  1. Suppose that we developed an alternative algorithm in which the false positive rates were equal, but there were still more positive predictions for Black defendants overall. Would that be enough to ensure fairness?
  2. Suppose that we developed an alternative prediction algorithm in which the rate of positive prediction was the same across racial groups, but the false positive rates were different. Would that be enough to ensure fairness?

The Rebuttal

Angwin et al. (2022) kicked off a vigorous discussion about what it means for an algorithm to be fair and how to measure bias. In particular, Northpointe, the company that developed COMPAS, issued a report (Flores, Bechtel, and Lowenkamp 2016) in which they argued that their algorithm was fair. Their argument is based on an idea of fairness which is sometimes called sufficiency (Corbett-Davies et al. 2017).

Here’s the intuition expressed by sufficiency. Imagine that you and your friend both received an A- in Data Structures. Suppose, however, that the instructor says different things to each of you:

  • To you, the instructor says: “You did fine in this class, but I don’t think that you are prepared to take Computer Architecture. I gave you a higher grade than I would normally because you wear cool hats in class.”
  • To your friend, the instructor says: “You did fine in this class and I think you are prepared to take Computer Architecture. Some students got a bump in their grade because they are cool-hat-wearers, but you didn’t get that benefit.”

Feels unfair, right? The instructor is saying that:

What a grade means for you in terms of your future success depends on your identity group.

Suppose that you heard this, but instead of cool hats it was because you are a member of an identity group that “needs some help” in order to achieve equitable representation in the CS major. How would you feel? Would that feel fair to you?

We’ll formally define sufficiency in a future lecture. For now, let’s use an informal definition:

Sufficiency means that a positive prediction means the same thing for future outcomes for each racial group.

To operationalize this idea, we are looking for the rate of re-arrest to be the same between (a) Black defendants who received a positive prediction and (b) white defendants who received a positive prediction.

Let’s check this:

compas.groupby(["predicted_high_risk", "race"])["two_year_recid"].mean().reset_index()
   predicted_high_risk              race  two_year_recid
0                False  African-American           0.350
1                False         Caucasian           0.288
2                 True  African-American           0.630
3                 True         Caucasian           0.591

The rates of rearrest are relatively similar between the two groups once we condition on the prediction they received. Formal statistical hypothesis tests are typically used to determine whether a difference of this size is “real” or could plausibly be due to chance. In most of the published literature, the two rates are treated as close enough that COMPAS is usually described as coming relatively close to satisfying sufficiency.
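As an illustration of what such a test might look like (this is our own sketch, not part of the original analyses, and it assumes the statsmodels package is available), we could compare the rearrest rates of Black and white defendants who received a high-risk prediction with a two-proportion z-test:

from statsmodels.stats.proportion import proportions_ztest

# restrict to defendants who received a positive (high-risk) prediction
high_risk = compas[compas["predicted_high_risk"]]
groups = high_risk.groupby("race")["two_year_recid"]

count = groups.sum().values    # number rearrested within two years, per group
nobs = groups.size().values    # number of high-risk defendants, per group

# null hypothesis: the two rearrest rates are equal
stat, pvalue = proportions_ztest(count, nobs)
stat, pvalue

A small p-value would suggest that the difference in rearrest rates is unlikely to be due to chance alone.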

Indeed, in a rejoinder article published by affiliates of the company Northpointe which produced COMPAS, the fact that COMPAS satisfies sufficiency is one of the primary arguments (Flores, Bechtel, and Lowenkamp 2016).

Recap

In these notes, we replicated the data analysis of Angwin et al. (2022), finding that the COMPAS algorithm has disparate error rates between Black and white defendants. We introduced the idea that fairness actually has several different facets in our moral intuitions, and found that the COMPAS algorithm satisfies one of them (sufficiency: equal scores mean the same thing regardless of your group membership) but not the others (equal prediction rates and equal error rates).

Some Questions Moving Forward

  1. Can we have it all? Could we modify the COMPAS algorithm in such a way that it satisfies all the ideas of fairness that we discussed above? Could we then call it “fair” or “unbiased?”
  2. Are there other ways to define fairness? Which ones are most compelling to us? Does the right idea of fairness depend on the context in which we apply it?
  3. How did this happen? The COMPAS algorithm was never trained on race data about the defendant. How did it happen that this algorithm nevertheless made recommendations at different rates across groups?
  4. Is automated decision-making legitimate in this setting? Can it be legitimate (just, fair) to use an automated decision-system for making recommendations about parole and sentencing decisions at all? What safeguards and forms of recourse are necessary for the legitimate use of automated decision-making in criminal justice?
  5. What are the systemic impacts? Disparate sentencing decisions can have downstream impacts on communities and institutions. How could application of the COMPAS algorithm exacerbate systemic inequalities?

References

Angwin, Julia, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2022. “Machine Bias.” In Ethics of Data and Analytics, 254–64. Auerbach Publications.
Bonilla-Silva, Eduardo. 2018. Racism Without Racists: Color-Blind Racism and the Persistence of Racial Inequality in America. Fifth edition. Lanham: Rowman & Littlefield.
Corbett-Davies, Sam, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. 2017. “Algorithmic Decision Making and the Cost of Fairness.” In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 797–806. Halifax NS Canada: ACM. https://doi.org/10.1145/3097983.3098095.
Flores, Anthony W, Kristin Bechtel, and Christopher T Lowenkamp. 2016. “False Positives, False Negatives, and False Analyses: A Rejoinder to Machine Bias: There’s Software Used Across the Country to Predict Future Criminals. And It’s Biased Against Blacks.” Federal Probation 80: 38.
Fogliato, Riccardo, Alice Xiang, Zachary Lipton, Daniel Nagin, and Alexandra Chouldechova. 2021. “On the Validity of Arrest as a Proxy for Offense: Race and the Likelihood of Arrest for Violent Crimes.” In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 100–111. Virtual Event USA: ACM. https://doi.org/10.1145/3461702.3462538.
Yusef, Kideste Wilder, and Tseleq Yusef. 2017. “Criminalizing Race, Racializing Crime: Assessing the Discipline of Criminology Through a Historical Lens.” In The Handbook of the History and Philosophy of Criminology, edited by Ruth Ann Triplett, 1st ed., 272–88. Wiley. https://doi.org/10.1002/9781119011385.ch16.



© Phil Chodrow, 2024