Stochastic Intelligence

A random walk across AI, tech, science, medicine, and academic life.

As I mentioned in my previous essay, modern AI is mostly about predicting the unknown. And in virtually every interesting situation, prediction forces us to confront uncertainty. We don’t get to output a precise answer with full confidence. Humans don’t, and neither do our machines.

Much of AI research has not taken this seriously. It’s one reason today’s models often sound so overconfident – they learned that from us. And we’re hardly paragons of calibrated reasoning (see this great video).

Given that prediction is central to AI, let’s unpack what we mean by uncertainty.

To start, assume a deterministic universe – Newton’s world – where randomness arises only because we lack information. This is not the whole story. Quantum mechanics shows that some phenomena really are intrinsically random. But in our daily lives, the Newtonian approximation works just fine. Most of the uncertainty we face is ignorance, not cosmic randomness.

Probability is the tool we use to formalize this. Historically, probability emerged from games of chance like rolling dice, where repeated trials allow us to define probabilities as long-run frequencies. These are objective, at least in principle: roll the die enough times and the empirical frequencies align with the theoretical ones.

But this “frequentist” view is narrow. Bayesian thinking introduced a deeper idea: probability as a measure of belief. This notion of probability is subjective, but rigorously updateable and practically useful. This is the interpretation that underlies most discussions of uncertainty in prediction. See this fantastic book on the topic.

Given data X and a target Y, our uncertainty is encoded in a posterior probability: p(Y | X). If Y takes on discrete values, this is a probability vector. If Y is continuous, we talk about a probability distribution – often something parametric like a Gaussian, expressed through a mean and variance.
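To make this concrete, here is a minimal sketch in Python (toy numbers of my own, using SciPy’s norm purely for convenience): a probability vector for the discrete case, and a Gaussian summarized by a mean and variance for the continuous one.

```python
import numpy as np
from scipy.stats import norm

# Discrete Y: the posterior p(Y | X) is just a probability vector over classes.
# (The numbers here are made up.)
p_discrete = np.array([0.10, 0.65, 0.25])
assert np.isclose(p_discrete.sum(), 1.0)

# Continuous Y: a common parametric choice is a Gaussian posterior,
# summarized by a predicted mean and variance.
mu, sigma = 2.3, 0.8
posterior = norm(loc=mu, scale=sigma)
print(posterior.pdf(2.0))        # density the model assigns near Y = 2.0
print(posterior.interval(0.95))  # a 95% predictive interval
```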

Evaluating such predictions requires repeated observations. With a single sample (datapoint) Xi and a posterior p(Yi | Xi), you cannot meaningfully judge whether that probability was “good” or not. You need many trials to compare predicted probabilities with actual outcomes. This is why probability evaluation gets slippery and why comparing, say, election forecasts is notoriously hard – we have very few data points.

A common method is to add up the log-probabilities that each predictor assigns to the events that actually occur. The predictor with the higher total “wins.” This is the essence of maximum likelihood. But with only a few observed events, maximum likelihood can reward unjustified confidence that happens to pay off and punish well-calibrated caution.
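Here is a toy version of that scoring in Python – two hypothetical predictors of a binary event, scored on a handful of simulated outcomes. Everything here is my own made-up setup, just to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five binary events that actually occur with probability 0.7 each.
true_p = 0.7
outcomes = rng.random(5) < true_p

def log_score(predicted_p, outcomes):
    """Total log-probability a predictor assigned to what actually happened."""
    p = np.where(outcomes, predicted_p, 1.0 - predicted_p)
    return np.log(p).sum()

print("calibrated (0.70):   ", log_score(0.70, outcomes))
print("overconfident (0.99):", log_score(0.99, outcomes))

# Log score is a proper scoring rule, so the calibrated predictor wins in
# expectation; but with this few events, a lucky streak of positives can
# put the overconfident one on top.
```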

Calibration matters. For example, in a binary classification task, take the samples to which a model assigns a predicted probability between 0 and 0.1: if the model is well calibrated, roughly 0-10% of those samples should actually belong to the positive class. And yes, you could achieve perfect calibration by ignoring X and always outputting the prior. For a fair coin, always predicting 50-50 is perfectly calibrated. But it’s also perfectly useless.
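A rough calibration check is easy to sketch: bin the predicted probabilities and compare each bin’s average prediction to the observed positive rate. The function and data below are my own toy construction.

```python
import numpy as np

def calibration_table(pred_probs, labels, n_bins=10):
    """Per bin: what the model predicted on average vs. how often positives occurred."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (pred_probs >= lo) & (pred_probs < hi)
        if in_bin.any():
            print(f"[{lo:.1f}, {hi:.1f}): n={in_bin.sum():5d}  "
                  f"mean predicted={pred_probs[in_bin].mean():.2f}  "
                  f"observed positive rate={labels[in_bin].mean():.2f}")

# Simulated predictions from a well-calibrated model: outcomes are drawn
# so that a prediction of p really is positive with probability p.
rng = np.random.default_rng(1)
p = rng.random(5000)
y = rng.random(5000) < p
calibration_table(p, y)
```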

In practice, accuracy and calibration seem to trade off. This isn’t a theoretical rule as much as a recurring empirical pattern – perhaps a byproduct of our learning algorithms. And it naturally raises a question: as we build ever more accurate models, can we push uncertainty to zero while staying calibrated?

To answer that, we need to ask what uncertainty is meant to represent. Even in Newton’s deterministic universe there are two fundamentally different sources of uncertainty.

The first is epistemic uncertainty. This is the uncertainty that comes from limited data, imperfect models, incomplete knowledge, and constrained compute. It’s the uncertainty of ignorance, and science and engineering are, in a sense, centuries-long attempts to reduce it. Gather more data. Build better models. Expand our theoretical understanding. Push our hardware. All of these shrink epistemic uncertainty.

But my focus in this essay is the second kind.

Aleatoric uncertainty is the irreducible randomness of p(Y|X) itself. Not randomness in the laws of physics, but randomness because X simply does not contain enough information to pin down Y. A fair coin toss is the cleanest example. No amount of data – no amount of internet-scale context – will help you predict the outcome better than 50-50. Most real-world prediction tasks share this basic structure. Weather forecasts, hospital readmissions, sports outcomes – countless hidden factors play crucial roles, many of them unknowable in practice.

This produces a prediction ceiling: the accuracy an oracle – a predictor with perfect knowledge of p(Y | X) – could achieve if given exactly the same input X. And crucially, even the oracle will often fall short of 100%.

This ceiling is rarely acknowledged in AI discussions, especially in healthcare (see our recent letter on this topic). People often talk about “minimum acceptable performance” in terms of an absolute threshold – e.g., an area under the curve of 0.80 or above – as if this were a law of nature rather than a human convention. But what if the Bayes limit for a given problem, with the information available, is 0.75? What if no model – no matter how vast the dataset or how advanced the architecture – can exceed that?

If patient readmission depends on unmeasured social support networks, personal behaviors, or random life events, then those unobserved variables impose a ceiling. Before judging prediction performance, we must understand how much of the outcome is governed by chance (our ignorance) given the input we have. If an oracle can’t do better than 75%, it’s unreasonable to expect a real model to surpass it.
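Here is a deliberately contrived simulation of that situation – nothing to do with real readmission data – in which the outcome depends on an observed feature X and an unobserved factor Z. Even a Bayes-optimal predictor that uses X perfectly tops out around 75%.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

x = rng.integers(0, 2, n)        # observed binary feature
z = rng.integers(0, 2, n)        # unobserved factor (think: social support)
y = x | z                        # the true outcome depends on both

# Bayes-optimal prediction from X alone:
#   p(Y=1 | X=1) = 1.0  -> predict 1 with certainty
#   p(Y=1 | X=0) = 0.5  -> a coin flip; no predictor can do better
coin_flip = rng.integers(0, 2, n)
oracle_pred = np.where(x == 1, 1, coin_flip)

print("oracle accuracy given X alone:", (oracle_pred == y).mean())  # ~0.75
```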

This brings me to what I think is an overlooked point. The prediction ceiling depends entirely on what X is. And the most powerful way to lift the ceiling is not through bigger models or more training data, but through richer, more informative inputs.

Machine learning research often treats the input as fixed: an image, a sentence, a lab test. But in the real world, we get to choose what we measure. We design sensors. We invent imaging sequences. We develop new assays. We conduct new surveys. We discover new biological markers. We create entirely new data modalities. Better scientific understanding leads to better measurement, and better measurement leads to better prediction.
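Continuing the same contrived example from above: if a new measurement makes Z observable, the ceiling itself moves.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

x = rng.integers(0, 2, n)            # the measurement we already had
z = rng.integers(0, 2, n)            # a factor a new assay makes observable
y = x | z                            # the outcome still depends on both

pred_x_only = np.where(x == 1, 1, rng.integers(0, 2, n))  # best guess from X alone
pred_x_and_z = x | z                                      # same rule, richer input

print("ceiling with X only: ", (pred_x_only == y).mean())   # ~0.75
print("ceiling with X and Z:", (pred_x_and_z == y).mean())  # 1.0
```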

The next phase of AI should not ignore this. Vision and language models benefit from the accident of abundant data, but the domains where AI could make the biggest difference – medicine, biology, climate, materials, human behavior – will require entirely new kinds of data, captured in new ways, at new scales.

If we want AI that truly transforms our world, we must do more than build bigger models.
We must measure and observe the world more wisely.
Only then can we raise the very ceiling of what is predictable.
