September 26, 2025

OpenAI: Why the GDPval Benchmark Reveals Near-Human Parity and Catastrophic Failure Rates

13 minutes

The podcast introduces GDPval, a new benchmark created by OpenAI to evaluate AI models on real-world, economically valuable tasks across the major sectors contributing to U.S. GDP. The benchmark covers 44 occupations and is built from tasks sourced from industry professionals with extensive experience, focusing on digital knowledge work. The research finds that frontier models are improving linearly over time and are approaching the deliverable quality of human experts, and it notes that AI assistance combined with human oversight shows potential for significant time and cost savings. The paper also experiments with factors like reasoning effort and scaffolding, showing that they consistently improve model performance, and concludes by open-sourcing a gold subset of tasks and an automated grader for future research.

By Next in AI
