7  Assessment of Classifiers

Evaluating performance and making decisions

Open the live notebook in Google Colab.

Introduction: Is Your Classifier Working?

Last time we introduced the problem of classification, and we implemented logistic regression as a simple model for binary classification. We considered a loss function based on the log-likelihood of the data under a model which included a signal function \(q(\mathbf{x})\) as well as Bernoulli-distributed noise at each data point. We then worked through a case study on precipitation prediction, in which we trained the model using gradient descent and made a final assessment of it on a test set by measuring its accuracy.

However, accuracy is rarely the most useful way to evaluate a classifier. In this set of notes, we’ll study several of the most important considerations that go into measuring the success of a binary classification model in learning from data. In the following chapter we’ll study some of the considerations in play when using classifiers to not only make predictions but also to inform decisions.

Quantitative Evaluation Metrics for Classification

To begin our discussion of evaluation metrics in classification, let’s return to the weather prediction problem. The code block below imports and prepares our data, and trains a logistic regression model on the training data using the same code as in the previous chapter.

Code
import pandas as pd
import torch
from matplotlib import pyplot as plt

# acquire the data
url = "https://raw.githubusercontent.com/middcs/data-science-notes/refs/heads/main/data/australia-weather/weatherAUS.csv"

df = pd.read_csv(url)

# drop rows with missing data in the relevant columns
df.dropna(subset=['RainTomorrow', 'Humidity3pm', 'Pressure3pm'], inplace=True)

# select the feature columns we'll use
X = df[['Humidity3pm', 'Pressure3pm']].to_numpy()

# pad with a constant column
X_aug = torch.ones((X.shape[0], 3), dtype=torch.float32)
X_aug[:, 1:] = torch.tensor(X, dtype=torch.float32)
X = X_aug.numpy()

y = (df['RainTomorrow'] == 'Yes').to_numpy().astype(float).reshape(-1, 1)

# train-test split
train_frac = 0.8
n = X.shape[0]
n_train = int(train_frac * n)

X_train = torch.tensor(X[:n_train, :], dtype=torch.float32)
y_train = torch.tensor(y[:n_train, :], dtype=torch.float32)

X_test = torch.tensor(X[n_train:, :], dtype=torch.float32)
y_test = torch.tensor(y[n_train:, :], dtype=torch.float32)
Code
# model training

# used for the loss function
def binary_cross_entropy(q, y): 
    return -(y * torch.log(q) + (1 - y) * torch.log(1 - q)).mean()

def sigmoid(z): 
    return 1 / (1 + torch.exp(-z))

# model class
class BinaryLogisticRegression: 
    def __init__(self, n_features): 
        self.w = torch.zeros(n_features, 1, requires_grad=True)

    def forward(self, X): 
        return sigmoid(X @ self.w)    

# optimizer class
class GradientDescentOptimizer: 
    def __init__(self, model, lr=0.1): 
        self.model = model
        self.lr = lr

    def grad_func(self, X, y): 
        q = self.model.forward(X)
        return 1/X.shape[0] * ((q - y).T @ X).T
        
    def step(self, X, y): 
        grad = self.grad_func(X, y)
        with torch.no_grad(): 
            self.model.w -= self.lr * grad

Now we’ll run a training loop:

# training loop
model = BinaryLogisticRegression(n_features=3)
opt = GradientDescentOptimizer(model, lr=0.00001)

losses = []

for epoch in range(2000): 
    q = model.forward(X_train)
    loss = binary_cross_entropy(q, y_train)
    losses.append(loss.item())
    opt.step(X_train, y_train)

Thresholding Scores

Having trained the model, we’re ready to evaluate its performance on the training data. As usual, the simplest way to do this is via the model’s accuracy on training data. However, the model doesn’t make binary yes/no predictions yet; the output is the signal function \(q(\mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x})\), a predicted probability of the positive class. To get a binary prediction, we need to choose a threshold \(\tau\) such that we predict “yes” if \(q(\mathbf{x}) > \tau\) and “no” otherwise. Syntactically, the simplest way to do this is:

tau = 0.5
y_pred = (model.forward(X_train) > tau).float()
acc = (y_pred == y_train).float().mean()
print(f"Accuracy on training data: {acc.item():.2f}")
Accuracy on training data: 0.81

Let’s see how the choice of threshold affects the model’s accuracy on the training data:

Code
accs = []

TAU = torch.linspace(0, 1, 100)

for tau in TAU: 
    y_pred = (model.forward(X_train) > tau).float()
    acc = (y_pred == y_train).float().mean()
    accs.append(acc.item())

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(TAU, accs)
ax.set_xlabel(r"Threshold $\tau$")
ax.set_ylabel("Accuracy")
ax.set_title("Accuracy as a function of threshold")

best_thresh = TAU[torch.tensor(accs).argmax()]
ax.axvline(best_thresh, color="grey", linestyle="--", label=f"Best threshold: {best_thresh:.2f} with accuracy {max(accs):.2f}")
ax.legend()

Figure 7.1: Accuracy-based optimal thresholding for rain prediction in logistic regression.

Interestingly, the threshold that maximizes accuracy on the training data is not the usual default choice of 0.5. In general, the accuracy-maximizing threshold depends on the balance between the two classes and on how well-calibrated the model’s predicted probabilities are.

Types of Error

Accuracy is an answer to the question: how often does the model prediction match the true label? However, in many applications, different kinds of errors matter.

Confusion Matrix

Given that we have a vector \(\hat{\mathbf{y}}\) of binary predictions and a vector \(\mathbf{y}\) of true labels, many common ways to assess model performance begin with the confusion matrix, a table which compares predicted and true labels.

Definition 7.1 (Confusion Matrix) Given a vector of binary labels \(\mathbf{y}\) and a vector of binary predictions \(\hat{\mathbf{y}}\), the confusion matrix is a 2x2 table which counts the number of instances in each of the following categories:

            Predicted 0   Predicted 1
  True 0    TN            FP
  True 1    FN            TP

The entries of the confusion matrix are counts of the number of instances in each category:

  • True Negatives (TN): The number of instances where the true label is 0 and the model correctly predicted 0.
  • False Positives (FP): The number of instances where the true label is 0 but the model incorrectly predicted 1.
  • False Negatives (FN): The number of instances where the true label is 1 but the model incorrectly predicted 0.
  • True Positives (TP): The number of instances where the true label is 1 and the model correctly predicted 1.

We can compute a 2x2 confusion matrix for our model’s predictions (on the training data) at a specified threshold as follows:

def confusion_matrix(y_true, y_pred): 
    # initialize a 2x2 matrix of zeros
    cm = torch.zeros((2, 2), dtype=torch.int32)
    for i in range(2): 
        for j in range(2): 
            cm[i, j] = ((y_true == i) & (y_pred == j)).sum().item()
    return cm

tau = 0.42
y_pred = (model.forward(X_train) > tau).float()
cm = confusion_matrix(y_train, y_pred)
print("Confusion Matrix:")
print(cm)
Confusion Matrix:
tensor([[75091,  2672],
        [14990,  8004]], dtype=torch.int32)

Entries on the diagonal of the confusion matrix correspond to correct predictions, while entries off the diagonal correspond to errors.

From the confusion matrix, we can compute a variety of different performance metrics. For example, the accuracy can be computed as the sum of the diagonal entries divided by the total number of instances:

print("Accuracy:", (cm[0, 0] + cm[1, 1]) / cm.sum())
Accuracy: tensor(0.8247)

Error Rates

There are several error rates which are commonly used to evaluate classifiers and which can be computed from the confusion matrix:

Definition 7.2 (Error Rates in Binary Classification)  

  • The true positive rate (TPR, also known as sensitivity or recall) is the proportion of true positives out of all actual positives: \[\text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}}\]

  • The false positive rate (FPR) is the proportion of false positives out of all actual negatives: \[\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}}\]

  • The true negative rate (TNR, also known as specificity) is the proportion of true negatives out of all actual negatives: \[\text{TNR} = \frac{\text{TN}}{\text{FP} + \text{TN}}\]

  • The false negative rate (FNR) is the proportion of false negatives out of all actual positives: \[\text{FNR} = \frac{\text{FN}}{\text{TP} + \text{FN}}\]

For example:

print("False positive rate:", cm[0, 1] / (cm[0, 0] + cm[0, 1]))
print("True positive rate:", cm[1, 1] / (cm[1, 0] + cm[1, 1]))
False positive rate: tensor(0.0344)
True positive rate: tensor(0.3481)

In practice it’s not necessary to calculate all four error rates, due to the mathematical relationships

\[ \begin{aligned} \mathrm{TPR}= 1 - \mathrm{FNR}\\ \mathrm{FPR}= 1 - \mathrm{TNR}\;. \end{aligned} \]
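One way to convince ourselves of these identities is to compute all four rates from the confusion matrix we found earlier. The helper below is a small sketch; the hard-coded matrix entries are the ones printed above for \(\tau = 0.42\).

```python
import torch

def error_rates(cm):
    # cm is laid out as [[TN, FP], [FN, TP]], matching our confusion_matrix function
    TN, FP = cm[0, 0].item(), cm[0, 1].item()
    FN, TP = cm[1, 0].item(), cm[1, 1].item()
    return {
        "TPR": TP / (TP + FN),
        "FPR": FP / (FP + TN),
        "TNR": TN / (FP + TN),
        "FNR": FN / (TP + FN)
    }

# the confusion matrix printed above for tau = 0.42
cm = torch.tensor([[75091, 2672], [14990, 8004]])
rates = error_rates(cm)

print(f"TPR = {rates['TPR']:.4f},  1 - FNR = {1 - rates['FNR']:.4f}")
print(f"FPR = {rates['FPR']:.4f},  1 - TNR = {1 - rates['TNR']:.4f}")
```

The two numbers on each line agree exactly, since TPR and FNR share the denominator \(\mathrm{TP} + \mathrm{FN}\) (and similarly for FPR and TNR).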

Precision, Recall, F1 Score

Another common set of performance metrics are precision and recall, which can be combined into the F1 score. These metrics are particularly useful in cases where the classes are imbalanced, meaning that one class is much more common than the other.

Definition 7.3 (Precision, Recall, and F1 Score)  

  • Precision is the proportion of true positives out of all predicted positives: \[\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}\]

  • Recall is the same as the true positive rate: \[\text{Recall} = \text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}}\]

Precision and recall are often combined into a single metric called the F1 score, which is the harmonic mean of precision and recall: \[\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]

Here’s the F1 score for our model:

precision = cm[1, 1] / (cm[1, 1] + cm[0, 1])
recall = cm[1, 1] / (cm[1, 1] + cm[1, 0])
f1 = 2 * (precision * recall) / (precision + recall)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
Precision: tensor(0.7497)
Recall: tensor(0.3481)
F1 Score: tensor(0.4754)

Precision and recall are often more informative than accuracy when the classes are imbalanced. In such cases, a model can achieve high accuracy simply by predicting the majority class, while performing poorly in terms of precision and recall on the minority class. The F1 score is often used as a single summary metric for evaluating a classifier in the presence of imbalanced classes.
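To see this concretely, here is a small synthetic sketch (the 95/5 class split is an illustrative choice, not drawn from our weather data) in which a majority-class classifier achieves high accuracy while completely failing on the minority class:

```python
import torch

# a synthetic imbalanced dataset: 95% negatives, 5% positives
y_true = torch.cat([torch.zeros(950), torch.ones(50)])

# a trivial "classifier" that always predicts the majority class
y_pred = torch.zeros(1000)

acc = (y_pred == y_true).float().mean()

TP = ((y_true == 1) & (y_pred == 1)).sum()
FN = ((y_true == 1) & (y_pred == 0)).sum()
recall = TP / (TP + FN)

print(f"Accuracy: {acc:.2f}")   # high, despite the model learning nothing
print(f"Recall:   {recall:.2f}")  # zero: every positive instance is missed
```

Note that precision here is \(0/0\), undefined: the model never predicts the positive class at all, so the F1 score also breaks down. Any of these minority-class-sensitive metrics immediately exposes a trivial classifier that accuracy alone would flatter.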

ROC curve and AUC

Error rates and metrics like precision/recall/F1 are useful for evaluating a specific set of predictions, e.g. ones that we would obtain from a model like logistic regression with a specific threshold \(\tau\). However, we might want to evaluate the model’s performance across all possible thresholds. One way to do this is via the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR) as the threshold \(\tau\) varies. The area under the ROC curve (AUC) is a common summary statistic for evaluating the overall performance of a classifier across all thresholds. Computing the ROC simply requires sweeping across thresholds and computing the TPR and FPR at each threshold. Note that the input here is not a vector of binary predictions \(\hat{\mathbf{y}}\) but rather the vector of predicted probabilities \(q(\mathbf{x})\), which we can threshold at different levels to get different binary predictions.

def roc_curve(y_true, q): 
    TPR = []
    FPR = []
    for tau in torch.linspace(1, 0, 100): 
        y_pred = (q > tau).float()
        cm = confusion_matrix(y_true, y_pred)
        tpr = cm[1, 1] / (cm[1, 0] + cm[1, 1])
        fpr = cm[0, 1] / (cm[0, 0] + cm[0, 1])
        TPR.append(tpr)
        FPR.append(fpr)
    return FPR, TPR

q_train = model.forward(X_train)
fpr_train, tpr_train = roc_curve(y_train, q_train)

fpr_test, tpr_test = roc_curve(y_test, model.forward(X_test))

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(fpr_train, tpr_train, label="ROC Curve (Training)")
ax.plot(fpr_test, tpr_test, label="ROC Curve (Test)")
ax.plot([0, 1], [0, 1], color="grey", linestyle="--", label="Random Classifier")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC Curve for Logistic Regression")
ax.legend()

Figure 7.2: ROC curves for logistic regression, evaluated on both training and test data.

The Area Under the Curve (AUC) is a measure of the overall performance of the classifier across all thresholds. An ideal classifier which makes no errors has an AUC of 1, while a random classifier has an AUC of 0.5. The AUC can be computed using numerical integration, for example via the trapezoidal rule.

auc_test = torch.trapz(torch.tensor(tpr_test), torch.tensor(fpr_test))
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(fpr_test, tpr_test, label="ROC Curve (Test)")
ax.plot([0, 1], [0, 1], color="grey", linestyle="--", label="Random Classifier")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC Curve for Logistic Regression")
ax.fill_between(fpr_test, tpr_test, alpha=0.2,  label=f"AUC (Test) = {auc_test:.2f}")
ax.legend()

Figure 7.3: Area under the ROC curve (AUC) for logistic regression on test data.
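The AUC also has a handy probabilistic interpretation: it is (up to the handling of ties) the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative instance. The sketch below checks this interpretation on synthetic scores; the two score distributions are illustrative assumptions, not derived from our weather model.

```python
import torch

torch.manual_seed(0)

# synthetic scores (illustrative distributions): positives tend to score higher
q_pos = 0.3 + 0.7 * torch.rand(2000)  # scores for positive instances
q_neg = 0.7 * torch.rand(2000)        # scores for negative instances

# fraction of (positive, negative) pairs in which the positive scores higher
auc_pairwise = (q_pos.unsqueeze(1) > q_neg.unsqueeze(0)).float().mean()

# compare against the trapezoidal AUC of the empirical ROC curve
y = torch.cat([torch.ones(2000), torch.zeros(2000)])
q = torch.cat([q_pos, q_neg])
TPR, FPR = [], []
for tau in torch.linspace(1, 0, 200):
    pred = (q > tau).float()
    TPR.append(((pred == 1) & (y == 1)).sum() / (y == 1).sum())
    FPR.append(((pred == 1) & (y == 0)).sum() / (y == 0).sum())
auc_trapz = torch.trapz(torch.stack(TPR), torch.stack(FPR))

print(f"Pairwise estimate of AUC: {auc_pairwise:.3f}")
print(f"Trapezoidal AUC:          {auc_trapz:.3f}")
```

The two estimates closely agree. This interpretation explains why a random classifier has AUC of 0.5: its scores carry no information about the labels, so a random positive outscores a random negative exactly half the time.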

Some Important Questions in Evaluating Classifiers

What task are we evaluating?

This is a brief discussion of a published academic article. The authors of the article were careful in describing their findings and put up a very helpful FAQ website clarifying some important points. The discussion below is not a dunk on the authors, but rather a lesson in the importance of carefully parsing hype in the media.

A few years ago, an international group of authors published a model aimed at predicting life outcomes for individuals based on sequences of life events (Savcisens et al. 2024). One of the key functionalities of the model was death prediction; the researchers scored the model on a prediction task in which the model attempted to predict whether a given individual was alive four years after the end of the training data. A media writeup of the model said:

Researchers analyzed aspects of a person’s life story between 2008 and 2016, with the model seeking patterns in the data. Then, they used the algorithm to determine whether someone had died by 2020. The Life2vec model made predictions with 78% accuracy.

Wow! 78% accuracy sounds impressive (and also scary!!).

Writeups like these pose interesting questions about how we want to live our lives and what kinds of knowledge are healthy for us to have. If a model had a 78% accuracy in predicting whether you were going to die in the next four years, would you want to know?

Before we get carried away by deep questions like that, however, it’s important to ask some technical questions to help us understand how the model is actually scored and what that 78% accuracy really means.

What’s the population?

Does the model work for everybody? Not necessarily: as the researchers write in their FAQ about the paper, the model is trained on Danish individuals ages 35-65. The model’s performance on older, younger, or non-Danish individuals is unknown.

What’s the base rate?

One important piece of context is the base rate in the data. In a country in which the average length of a human life is 80 years (representative of many developed countries), the base rate of death in a single year is roughly \(1/80 = 1.25\%\). Over a four-year period, the base rate of death is roughly \(4/80 = 5\%\). Of course, this is an average across the population, with considerable variability by age.

That said, a model which always predicted “not dead in the next four years” would be correct on average about 95% of the time. So, the researchers must have done something more subtle.

What’s the task?

Indeed, the researchers balanced their testing data set between the two classes “alive in 2020” and “dead in 2020”. This means that they selected instances for the test data to overrepresent cases of death, resulting in a test set in which the base rate of death was 50% instead of 5%. This makes it harder for the model to achieve high accuracy, since just predicting “alive” doesn’t work anymore. However, it also means that the 78% accuracy isn’t directly relevant to a random individual in the population: it’s a specific, narrowly-interpretable measure of performance describing the model’s ability to make predictions in an artificially balanced dataset.
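A quick back-of-the-envelope calculation shows why the balanced-test-set accuracy doesn’t translate directly to the general population. The rates below are illustrative assumptions chosen to be consistent with 78% accuracy on a balanced set, not the model’s actual reported error rates:

```python
# hypothetical error rates consistent with 78% accuracy on a balanced test set
# (illustrative assumptions, not the model's actual reported rates)
TPR, TNR = 0.78, 0.78
base_rate = 0.05  # approximate four-year base rate of death in the population

# on a balanced test set, accuracy is the average of the two rates
acc_balanced = 0.5 * TPR + 0.5 * TNR

# on the real population, accuracy is a base-rate-weighted average
acc_population = base_rate * TPR + (1 - base_rate) * TNR

# precision on the real population: P(dead | predicted dead)
FPR = 1 - TNR
precision = (base_rate * TPR) / (base_rate * TPR + (1 - base_rate) * FPR)

print(f"Accuracy on balanced test set:   {acc_balanced:.2f}")
print(f"Accuracy on the real population: {acc_population:.2f}")
print(f"Always-'alive' baseline:         {1 - base_rate:.2f}")
print(f"Population precision:            {precision:.2f}")
```

Under these illustrative rates, the model would still lose to the trivial 95% “always alive” baseline on the real population, and only about one in six of its “dead” predictions would actually be correct. None of this means the model is useless; it means that the headline accuracy answers a much narrower question than a casual reader might assume.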

Does the model need to be this complicated?

Perdomo et al. (2025) conducted a recent study of early warning systems (EWS) for predicting the likelihood that an individual student will graduate from public high school in Wisconsin. The idea of an early warning system is to identify students who are at increased dropout risk so that the school can allocate more resources to supporting those students.

Typical features \(\mathbf{X}\) used in an EWS include information about:

  • The student: demographics and family background, health and mental health, grades, attendance, disciplinary record, etc.
  • The student’s educational environment, which includes:
    • The school: dropout rates at the school overall, class sizes, availability of support resources, average test scores, etc.
    • The district: funding levels, median income, district-wide dropout rates, etc.

The typical target \(\mathbf{y}\) is a binary variable describing whether or not the student drops out of high school rather than graduating on time. A typical prediction is a risk score; a predicted probability like \(\mathbf{q}\) from our discussion of logistic regression is one example. Schematically, we can write a prediction of the model like this:

\[ \begin{aligned} q(\mathbf{x}_{\text{student}}, \mathbf{x}_{\text{environment}})\;, \end{aligned} \]

where \(q\) could be calculated from its input features using logistic regression or one of many other methods. Let’s call this \(\mathcal{M}_\mathrm{individual}\).

In their extensive study, Perdomo et al. (2025) find suggestive but statistically inconclusive evidence that the EWS used in Wisconsin may have improved graduation rates by approximately 5% among students who received high risk scores.

Environmental Risk Scores and Unnecessary Personalization

A purported feature of \(\mathcal{M}_\mathrm{individual}\) is that it produces personalized, individual risk scores which aim to give the probability that an individual student will drop out of high school. However, Perdomo et al. (2025) find that the personalization of \(\mathcal{M}_\mathrm{individual}\) is largely unnecessary, writing:

Are individual risk scores necessary for effectively targeting interventions? Our analysis shows that if we already know these environmental features, incorporating individual features into the predictive model only leads to a slight, marginal improvement in identifying future dropouts…That is, intervening on students identified as being at high risk by this alternative, environmental-based targeting strategy would have the same aggregate effect on high school graduation rates in Wisconsin as the individually-focused DEWS predictions.

The authors develop and test an alternative model \(\mathcal{M}_{\mathrm{environmental}}\) which uses only environmental features. They find that \(\mathcal{M}_{\mathrm{environmental}}\) yields almost the same predictions as the individualized \(\mathcal{M}_{\mathrm{individual}}\), and that the result of using \(\mathcal{M}_{\mathrm{environmental}}\) to allocate resources to students would have an almost identical impact on graduation rates.

Is my result plausible?

If your classifier is giving you a result that would shock you if you read it in a newspaper, or that seems like something out of a dystopian sci-fi novel, then you should ask yourself some careful questions about the design of your study, and especially about the structure of your training data. Here’s a case study by Carl Bergstrom and Jevin West about Wu and Zhang (2016), a paper which purports to show that people’s facial features can be used to predict whether or not those people will commit a crime.

Perdomo, Juan Carlos, Tolani Britton, Moritz Hardt, and Rediet Abebe. 2025. “Difficult Lessons on Social Prediction from Wisconsin Public Schools.” In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, 2682–2704. Athens Greece: ACM. https://doi.org/10.1145/3715275.3732175.
Savcisens, Germans, Tina Eliassi-Rad, Lars Kai Hansen, Laust Hvas Mortensen, Lau Lilleholt, Anna Rogers, Ingo Zettler, and Sune Lehmann. 2024. “Using Sequences of Life-Events to Predict Human Lives.” Nature Computational Science 4 (1): 43–56.
Wu, Xiaolin, and Xi Zhang. 2016. “Automated Inference on Criminality Using Face Images.” arXiv Preprint arXiv:1611.04135 4.



© Phil Chodrow, 2025