AI Insiders

Measuring AI Ability to Complete Long Tasks



This paper introduces a new metric, the "50%-task-completion time horizon," which quantifies AI capabilities by relating a model's performance on tasks to the time humans typically take to complete them. The authors timed human domain experts on a diverse set of research and software engineering tasks (RE-Bench, HCAST, and a new suite called SWAA) and evaluated 13 frontier AI models released between 2019 and 2025 on the same tasks. The key finding is that the 50% time horizon of frontier AI models has doubled approximately every seven months since 2019, and the pace may have accelerated in 2024. Extrapolating this trend suggests that within five years, AI systems may be able to automate many software tasks that currently take humans a month. The paper discusses the methodology, limitations, and implications of these findings, particularly for AI safety and governance.
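To make the extrapolation concrete, here is a minimal Python sketch of the doubling arithmetic. Only the seven-month doubling time comes from the summary above; the function names, the one-hour starting horizon, and the 167-hour working month are hypothetical values chosen for illustration, not figures from the paper.

```python
import math

# Approximate doubling time quoted in the episode summary.
DOUBLING_MONTHS = 7.0


def growth_factor(months_ahead, doubling_months=DOUBLING_MONTHS):
    """Multiplier on the time horizon after `months_ahead` months of steady doubling."""
    return 2.0 ** (months_ahead / doubling_months)


def months_until(start_hours, target_hours, doubling_months=DOUBLING_MONTHS):
    """Months of steady doubling needed to grow from start_hours to target_hours."""
    return doubling_months * math.log2(target_hours / start_hours)


if __name__ == "__main__":
    # Five years is 60 months, i.e. about 8.6 doublings.
    print(f"Growth over 5 years: about {growth_factor(60):.0f}x")

    # Hypothetical inputs: a 1-hour horizon today and a 167-hour working month.
    assumed_start_hours = 1.0
    assumed_month_hours = 167.0
    months = months_until(assumed_start_hours, assumed_month_hours)
    print(f"Months to reach a month-long horizon: about {months:.0f}")
```

Under a steady seven-month doubling, five years corresponds to roughly 8.6 doublings, or a factor of about 380. Whether that reaches month-long tasks depends on the starting horizon, which is why it is left as an explicit assumption here.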
By focusing on the time horizon for task completion, the paper offers a new way to measure and track progress in AI capabilities. The observed exponential growth, and especially the possible acceleration in recent years, has significant implications for the future of automation and AI safety. The authors acknowledge the limitations of current benchmarks and the difficulty of extrapolating these trends to real-world scenarios, but the findings still point to rapid progress toward AI systems that can handle increasingly complex and time-consuming tasks. Continued research and more realistic benchmarks will be crucial for forecasting AI capabilities accurately and for governing AI responsibly.

AI Insiders, by Ronald Soh