When METR says something like "Claude Opus 4.5 has a 50% time horizon of 4 hours and 50 minutes", what does that mean? In this episode, David Rein, a METR researcher and co-author of the paper "Measuring AI ability to complete long tasks", talks about METR's work on measuring time horizons, the methodology behind those numbers, and what work remains to be done in this domain.
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2026/01/03/episode-47-david-rein-metr-time-horizons.html
Topics we discuss, and timestamps:
0:00:32 Measuring AI Ability to Complete Long Tasks
0:10:54 The meaning of "task length"
0:19:27 Examples of intermediate and hard tasks
0:25:12 Why the software engineering focus
0:32:17 Why task length as difficulty measure
0:46:32 Is AI progress going superexponential?
0:50:58 Is AI progress due to increased cost to run models?
0:54:45 Why METR measures model capabilities
1:04:10 How time horizons relate to recursive self-improvement
1:12:58 Cost of estimating time horizons
1:16:23 Task realism vs mimicking important task features
1:19:50 Excursus on "Inventing Temperature"
1:25:46 Return to task realism discussion
1:33:53 Open questions on time horizons
Links for METR:
Main website: https://metr.org/
X/Twitter account: https://x.com/METR_Evals/
Research we discuss:
Measuring AI Ability to Complete Long Tasks: https://arxiv.org/abs/2503.14499
RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts: https://arxiv.org/abs/2411.15114
HCAST: Human-Calibrated Autonomy Software Tasks: https://arxiv.org/abs/2503.17354
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity: https://arxiv.org/abs/2507.09089
Anthropic Economic Index: Tracking AI's role in the US and global economy: https://www.anthropic.com/research/anthropic-economic-index-september-2025-report
Bridging RL Theory and Practice with the Effective Horizon (i.e. the Cassidy Laidlaw paper): https://arxiv.org/abs/2304.09853
How Does Time Horizon Vary Across Domains?: https://metr.org/blog/2025-07-14-how-does-time-horizon-vary-across-domains/
Inventing Temperature: https://global.oup.com/academic/product/inventing-temperature-9780195337389
Is there a Half-Life for the Success Rates of AI Agents? (by Toby Ord): https://www.tobyord.com/writing/half-life
Lawrence Chan's response to the above: https://nitter.net/justanotherlaw/status/1920254586771710009
AI Task Length Horizons in Offensive Cybersecurity: https://sean-peters-au.github.io/2025/07/02/ai-task-length-horizons-in-offensive-cybersecurity.html
Episode art by Hamish Doodles: hamishdoodles.com