Agents of Intelligence

Beyond Benchmarks: How Long Can AI Work?


In this episode, we unpack a groundbreaking new way of measuring AI capability: not by test scores, but by time. Drawing on the recent METR paper "Measuring AI Ability to Complete Long Tasks," we explore the 50% task-completion time horizon, a novel metric that asks: how long a task, measured by the time it takes a human, can today's AI complete with 50% success?

We’ll look at how this time-based approach offers a more intuitive and unified scale for tracking AI progress across domains like software engineering and machine learning research. The findings are eye-opening: the time horizon has been doubling roughly every seven months, suggesting we could see "one-month AI" (systems capable of completing tasks that take humans 160+ hours, at that same 50% success rate) by 2029.

We also delve into how reliability gaps, planning failures, and context sensitivity reveal AI’s current limits, even as capabilities continue to grow exponentially. Plus, what does this mean for the future of work, safety risks, and our understanding of AGI? If you're tired of benchmark buzzwords and want to get real about how far AI has come—and how far it might go—this one's for you.

Agents of Intelligence, by Sam Zamany