AI Post Transformers

Measuring AI Ability to Complete Long Tasks


Researchers from METR introduce a framework for evaluating AI progress by measuring a model's time horizon: the length of task, measured by how long a human takes to complete it, that an AI can perform with 50% reliability. Traditional benchmarks often fall short because they saturate quickly or focus on static knowledge, whereas this approach uses economically valuable tasks in fields like software engineering and cybersecurity. By comparing AI performance against over 2,500 hours of human baselines, the study found that the time horizon of frontier models has doubled approximately every 212 days (about seven months) since 2019. If this exponential trend continues, AI agents may be capable of automating complex, month-long human projects by the end of this decade. While newer models like o1 show significant improvements in reasoning and error correction, they still struggle with "messy" environments that lack clear feedback loops. This psychometric-inspired methodology provides a single metric for tracking the evolution of autonomous agents and for forecasting potential catastrophic risks as systems become more powerful.

Source:
March 2025
Measuring AI Ability to Complete Long Tasks
Model Evaluation & Threat Research (METR)
Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Chris Painter, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, Lawrence Chan
https://arxiv.org/pdf/2503.14499
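The doubling-time claim in the summary can be turned into simple arithmetic. The sketch below is illustrative only: the 212-day doubling time comes from the summary, while the function names, the 1-hour starting horizon, and the 167-hour figure for a working month are assumptions made for the example, not values taken from the paper.

```python
import math

# Doubling time quoted in the episode summary (days).
DOUBLING_DAYS = 212.0


def horizon_after(start_hours: float, days: float,
                  doubling_days: float = DOUBLING_DAYS) -> float:
    """Project a time horizon forward, assuming steady exponential growth."""
    return start_hours * 2 ** (days / doubling_days)


def days_to_reach(start_hours: float, target_hours: float,
                  doubling_days: float = DOUBLING_DAYS) -> float:
    """Days needed for the horizon to grow from start_hours to target_hours."""
    if start_hours <= 0 or target_hours <= 0:
        raise ValueError("horizons must be positive")
    return doubling_days * math.log2(target_hours / start_hours)


if __name__ == "__main__":
    # Hypothetical inputs: a 1-hour horizon today and a ~167-hour working month.
    start, target = 1.0, 167.0
    days = days_to_reach(start, target)
    print(f"{days:.0f} days (~{days / 365.25:.1f} years) from {start} h to {target} h")
```

Under these assumed inputs, a one-hour horizon needs a little over seven doublings to reach a working month, which is the kind of calculation behind the "end of this decade" projection. The result is sensitive to the chosen starting horizon and assumes the trend holds unchanged.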

AI Post Transformers, by mcgrof