When Smarter Models Forecast Worse: The Hidden Failure Mode in LLM Predictions
Source: Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
Paper was published on May 21, 2026
This episode was AI-generated on May 22, 2026. The script was written by an AI language model and the host voices were synthesized by Eleven Labs. The producer is not affiliated with Anthropic or Eleven Labs.
Claude Opus 4.6 looked at Brazil's 1986 hyperinflation, correctly named the regime-change risk, then forecast a number seven million times too high. A new paper shows this isn't a fluke — it's a structural pattern across epidemics, housing bubbles, and decades of measles data, and the standard way the field grades LLM forecasts can't see it.
Key Takeaways
- Why the same model outputs can earn opposite verdicts (capable models look best under Brier-style scoring and worst under CRPS), and what that means for every existing LLM forecasting benchmark
- The specific trigger for the inversion: superlinear growth followed by a regime change, confirmed by a clean linear-growth control where the effect vanishes entirely
- A within-family Llama experiment showing scale and post-training each independently make the overcommitment worse, and compound when combined
- The unselected pre-vaccine US measles cohort (1,339 state-seasons) that rules out the 'you cherry-picked the crashes' objection, plus flu as a pre-registered negative control
- Why naming the historical episode rescues calibration for COVID and housing but fails completely for hyperinflation: the knowledge is in the model, but it doesn't reach the tails
- The one-line fix: report a tail-integrating proper scoring rule alongside threshold metrics, using forecasts benchmarks have already collected
00:00 - The Opus 4.6 hyperinflation moment
A frontier model articulates the regime-change possibility, weighs it on the page, and then commits to extrapolation anyway, overshooting reality by a factor of seven million.
03:47 - Two ways to grade a forecast
Distributional forecasts, threshold-based Brier scoring versus tail-integrating CRPS, and the weather-map analogy that makes the asymmetry click.
07:35 - The Freeciv benchmark and the first crack
A clean, unseen forecasting setup where binary and continuous versions of the same question yield opposite capability-accuracy correlations at long horizons.
11:22 - The synthetic epidemic and its linear-growth control
An exponential-then-crash simulator reproduces the inversion, and swapping in linear growth makes it vanish, pinning the mechanism to the bend-then-break shape.
15:10 - Competence-driven overcommitment
Per-quantile decomposition shows the lower tail stays flat while the upper tail balloons with capability, and the within-family Llama 2x2 confirms scale and post-training each contribute.
18:57 - Real-world replications and the measles test
COVID, housing, and hyperinflation replicate the pattern, but the unselected measles cohort and a pre-registered flu negative control are what make the result hard to dismiss.
22:45 - The verdict flip and what the model knows
The same forecasts graded two ways reverse the sign of the capability correlation, and a knowledge probe reveals models can name the crisis they're forecasting yet still produce extreme tail overshoots.
26:33 - Limitations, the steelman, and the fix
Honest pushback on the capability axis, the bundled post-training treatment, and the small hyperinflation sample, followed by the embarrassingly simple methodological recommendation.
Recommended Reading
- Are Emergent Abilities of Large Language Models a Mirage? (Schaeffer et al.): The paper the episode invokes directly. It argues that metric choice can manufacture apparent emergent abilities, setting up this episode's darker mirror claim that metric choice can also hide failures.
- Inverse Scaling Prize: Second Round Winners (McKenzie et al.): The original taxonomy of inverse-scaling failures the episode contrasts with, useful for understanding why this paper's forecasting failure is structurally different from earlier adversarial cases.
- Inverse Scaling Can Become U-Shaped (Wei et al.): The follow-up showing many inverse-scaling tasks recover at frontier scale, the precise counterpoint to this episode's claim that the forecasting failure is monotonic all the way to the frontier.
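Illustration: Brier versus CRPS on the same forecast
To make the contrast between threshold-based Brier scoring and tail-integrating CRPS concrete, here is a minimal Python sketch. It is not the paper's code. It assumes a forecast is given as a handful of equally weighted samples, scores the binary event "outcome exceeds a threshold" with the Brier score, and estimates CRPS with the standard sample formula E|X - y| - 0.5 E|X - X'|. The function names are hypothetical, and the forecasts, outcome, and threshold are made-up numbers chosen only for illustration.

```python
from itertools import product


def brier_score(samples, outcome, threshold):
    """Brier score for the binary event 'value exceeds threshold'.

    The forecast probability is the share of samples above the threshold.
    """
    p = sum(s > threshold for s in samples) / len(samples)
    event = 1.0 if outcome > threshold else 0.0
    return (p - event) ** 2


def crps(samples, outcome):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    # Average distance from each forecast sample to the realized outcome.
    term_obs = sum(abs(s - outcome) for s in samples) / n
    # Average distance between all ordered pairs of forecast samples.
    term_spread = sum(abs(a - b) for a, b in product(samples, repeat=2)) / (n * n)
    return term_obs - 0.5 * term_spread


# Illustrative (made-up) forecasts: identical except for the top sample.
cautious = [80, 90, 110, 120, 150]
overcommitted = [80, 90, 110, 120, 1_000_000]  # extreme upper tail
outcome = 95      # hypothetical realized value
threshold = 100   # hypothetical benchmark threshold

for name, forecast in [("cautious", cautious), ("overcommitted", overcommitted)]:
    print(name,
          "Brier:", brier_score(forecast, outcome, threshold),
          "CRPS:", crps(forecast, outcome))
```

Both forecasts put the same probability on exceeding the threshold, so their Brier scores are identical, while the forecast with the extreme upper tail receives a CRPS thousands of times larger. That is the kind of gap a threshold metric alone cannot show, and it is why the episode's recommendation is to report both.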