What task are we evaluating?
This is a brief discussion of a published academic article. The authors of the article were very careful in describing their findings and put up a very helpful FAQ website clarifying some important points. The brief discussion below is not a dunk on the authors, but a brief lesson in the importance of very carefully parsing hype in media.
A few years ago, an international group of authors published a model aimed at predicting life outcomes for individuals based on sequences of life events (Savcisens et al. 2024). One of the key functionalities of the model was death prediction: the researchers scored the model on a task in which it attempted to predict whether a given individual was still alive four years after the end of the training data. A media writeup of the model said:
Researchers analyzed aspects of a person’s life story between 2008 and 2016, with the model seeking patterns in the data. Then, they used the algorithm to determine whether someone had died by 2020. The Life2vec model made predictions with 78% accuracy.
Wow! 78% accuracy sounds impressive (and also scary!!).
Writeups like these pose interesting questions about how we want to live our lives and what kinds of knowledge are healthy for us to have. If a model had a 78% accuracy in predicting whether you were going to die in the next four years, would you want to know?
Before we get carried away by deep questions like that, however, it’s important to ask some technical questions to help us understand how the model is actually scored and what that 78% accuracy really means.
What’s the population?
Does the model work for everybody? Not necessarily: as the researchers write in their FAQ about the paper, the model is trained on Danish individuals ages 35-65. The model’s performance on older, younger, or non-Danish individuals is unknown.
What’s the base rate?
One important piece of context is the base rate in the data. In a country in which the average human lifespan is 80 years (representative of many developed countries), the base rate of death in a single year is roughly \(1/80 = 1.25\%\), and in a four-year period it is roughly \(4/80 = 5\%\). Of course, this is an average across a population, with considerable variability by age.
That said, a model which always predicted “not dead in the next four years” would be correct on average about 95% of the time. So, the researchers must have done something more subtle.
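To see where that 95% figure comes from, here is a quick simulation. The 5% four-year death rate is the rough population-level estimate from above; everything else is a minimal sketch.

```python
import random

random.seed(0)

# Simulate a population with a 5% four-year death rate
# (the rough base-rate estimate from the text).
n = 100_000
labels = ["dead" if random.random() < 0.05 else "alive" for _ in range(n)]

# A "classifier" that always predicts survival, no matter the input.
predictions = ["alive"] * n

accuracy = sum(p == y for p, y in zip(predictions, labels)) / n
print(f"Accuracy of always predicting 'alive': {accuracy:.1%}")
```

The trivial classifier lands at roughly 95% accuracy while providing no information whatsoever, which is exactly why raw accuracy is misleading at low base rates.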
What’s the task?
Indeed, the researchers balanced their test set between the two classes “alive in 2020” and “dead in 2020”. That is, they selected instances for the test data so as to overrepresent cases of death, resulting in a test set in which the base rate of death was 50% rather than 5%. This makes it harder for the model to achieve high accuracy, since always predicting “alive” no longer works. However, it also means that the 78% accuracy isn’t directly relevant to a random individual in the population: it is a specific, narrowly interpretable measure of the model’s ability to make predictions on an artificially balanced dataset.
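One way to see how narrowly the 78% figure should be interpreted: suppose, purely for illustration (the paper does not state this), that the balanced-test accuracy corresponds to 78% sensitivity and 78% specificity. Applying Bayes’ rule at the realistic 5% base rate then tells us what a “dead” prediction would mean for a random individual:

```python
# Illustrative assumption, not a result from the paper: treat the 78%
# balanced-test accuracy as 78% sensitivity and 78% specificity.
sensitivity = 0.78
specificity = 0.78
base_rate = 0.05  # rough four-year death rate from the text

# Bayes' rule: P(dead | model predicts dead)
p_pred_dead = sensitivity * base_rate + (1 - specificity) * (1 - base_rate)
ppv = sensitivity * base_rate / p_pred_dead
print(f"P(dead | model says dead) = {ppv:.1%}")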
Does the model need to be this complicated?
Perdomo et al. (2025) conducted a recent study of early warning systems (EWS) for predicting the likelihood that an individual student will graduate public high school in Wisconsin. The idea of an early warning system is to identify students who are at increased dropout risk so that the school can allocate more resources to supporting those students.
Typical features \(\mathbf{X}\) used in an EWS include information about:
- The student: demographics and family background, health and mental health, grades, attendance, disciplinary record, etc.
- The student’s educational environment, which includes:
- The school: dropout rates at the school overall, class sizes, availability of support resources, average test scores, etc.
- The district: funding levels, median income, district-wide dropout rates, etc.
The typical target \(\mathbf{y}\) is a binary variable describing whether the student drops out of high school rather than graduating on time. A typical prediction is a risk score, such as the probability \(q\) from our discussion of logistic regression. Schematically, we can write a prediction of the model like this:
\[
\begin{aligned}
q(\mathbf{x}_{\text{student}}, \mathbf{x}_{\text{environment}})\;,
\end{aligned}
\]
where \(q\) could be calculated from its input features using logistic regression or one of many other methods. Let’s call this \(\mathcal{M}_\mathrm{individual}\).
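As a concrete toy example, here is what computing \(q\) via logistic regression looks like. The feature names, weights, and values below are invented for illustration; a real EWS would learn its weights from data.

```python
import math

# Toy sketch of a logistic-regression risk score q(x_student, x_environment).
# All feature names and weights are hypothetical.
def q(x_student, x_environment, weights, bias):
    features = x_student + x_environment
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))  # sigmoid maps the linear score to (0, 1)

# Hypothetical features: [attendance_rate, gpa] + [school_dropout_rate]
score = q([0.85, 2.4], [0.12], weights=[-2.0, -0.8, 5.0], bias=1.5)
print(f"risk score q = {score:.2f}")
```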
In their extensive study, Perdomo et al. (2025) find suggestive but statistically inconclusive evidence that the EWS used in Wisconsin may have improved graduation rates by approximately 5% among students who received high risk scores.
Environmental Risk Scores and Unnecessary Personalization
A purported feature of \(\mathcal{M}_\mathrm{individual}\) is that it produces personalized, individual risk scores which aim to give the probability that an individual student will drop out of high school. However, Perdomo et al. (2025) find that the personalization of \(\mathcal{M}_\mathrm{individual}\) is largely unnecessary, writing:
Are individual risk scores necessary for effectively targeting interventions? Our analysis shows that if we already know these environmental features, incorporating individual features into the predictive model only leads to a slight, marginal improvement in identifying future dropouts…That is, intervening on students identified as being at high risk by this alternative, environmental-based targeting strategy would have the same aggregate effect on high school graduation rates in Wisconsin as the individually-focused DEWS predictions.
The authors develop and test an alternative model \(\mathcal{M}_{\mathrm{environmental}}\) which uses only environmental features. They find that \(\mathcal{M}_{\mathrm{environmental}}\) yields almost the same predictions as the individualized \(\mathcal{M}_{\mathrm{individual}}\), and that the result of using \(\mathcal{M}_{\mathrm{environmental}}\) to allocate resources to students would have an almost identical impact on graduation rates.
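The flavor of this finding can be reproduced on synthetic data. The toy simulation below is my construction, not the authors’ analysis: when dropout risk varies mostly *between* schools rather than between students within a school, targeting the top decile by school-level risk alone flags a group with the same average risk as targeting by individualized scores.

```python
import random

random.seed(1)

# Made-up school-level dropout rates (synthetic, not the Wisconsin data).
schools = {"A": 0.30, "B": 0.10, "C": 0.02}

# Each student's "individualized" score is the school rate plus a small
# within-school perturbation.
students = []  # (school, individualized score)
for school, env_risk in schools.items():
    for _ in range(1000):
        students.append((school, env_risk + random.gauss(0, 0.01)))

k = len(students) // 10  # resources for the top decile

# Flag top-k by the individualized score vs. by school-level risk alone.
by_individual = sorted(students, key=lambda s: -s[1])[:k]
by_environment = sorted(students, key=lambda s: -schools[s[0]])[:k]

def avg_risk(flagged):
    return sum(schools[school] for school, _ in flagged) / len(flagged)

print(f"avg risk, individualized targeting: {avg_risk(by_individual):.3f}")
print(f"avg risk, environmental targeting:  {avg_risk(by_environment):.3f}")
```

In this construction both targeting rules concentrate resources on the same high-risk school, so their aggregate effect is identical; the individualized scores only reshuffle students within it.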
Is my result plausible?
If your classifier is giving you a result that would shock you if you read it in a newspaper, or that seems like something out of a dystopian sci-fi novel, then you should ask yourself some careful questions about the design of your study and especially about the structure of your training data. Here’s a case study by Carl Bergstrom and Jevin West about Wu and Zhang (2016), a paper which purports to show that people’s facial features can be used to predict whether or not those people will commit a crime.
Perdomo, Juan Carlos, Tolani Britton, Moritz Hardt, and Rediet Abebe. 2025. “Difficult Lessons on Social Prediction from Wisconsin Public Schools.” In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, 2682–2704. Athens, Greece: ACM. https://doi.org/10.1145/3715275.3732175.
Savcisens, Germans, Tina Eliassi-Rad, Lars Kai Hansen, Laust Hvas Mortensen, Lau Lilleholt, Anna Rogers, Ingo Zettler, and Sune Lehmann. 2024. “Using Sequences of Life-Events to Predict Human Lives.” Nature Computational Science 4 (1): 43–56.
Wu, Xiaolin, and Xi Zhang. 2016. “Automated Inference on Criminality Using Face Images.” arXiv Preprint arXiv:1611.04135 4.