Can an AI truly conduct AI research—on its own, from scratch? In this compelling episode of The Deep Dive, we explore one of the boldest experiments yet in AI evaluation: PaperBench. This groundbreaking benchmark sets out to test whether advanced AI agents can replicate cutting-edge research published at ICML 2024—the “Olympics” of machine learning.
We unpack the ambitious design behind PaperBench, where AI agents are tasked not just with reading elite academic papers, but with rebuilding the experiments from scratch: writing code, running it, and verifying results without ever glimpsing the original authors' code. With over 8,000 individually gradable tasks across 20 elite papers, the challenge is both massive and meticulous.
You’ll hear how these agents are graded by a powerful automated LLM-based judge dubbed SimpleJudge, itself validated through an equally clever benchmark called JudgeEval. We dig into how top models such as Claude 3.5 Sonnet, GPT-4o, and Gemini stacked up, and why even the best could still only replicate around 21% of the work. Why are these models giving up early? What happens when they're pushed to persist longer?
We also dive into PaperBench Code-Dev, a focused variant that tests code-writing ability alone, where some agents fared significantly better. And how do human researchers compare when given the same task, with some AI assistance but no shortcuts?
From execution bottlenecks to prompting strategies, from rubric creation to potential "specification gaming," this episode offers a revealing look into what AI can and can’t yet do in the world of scientific discovery. Whether you’re a researcher, engineer, or just fascinated by AI’s growing role in shaping knowledge itself, this is an episode you won’t want to miss.
Tune in and ask yourself: when it comes to frontier science, is AI a collaborator, a tool—or a competitor?
Read more: https://cdn.openai.com/papers/22265bac-3191-44e5-b057-7aaacd8e90cd/paperbench.pdf