Latent Space: The AI Engineer Podcast

[State of Code Evals] After SWE-bench, CodeClash & SOTA Coding Benchmarks Recap — John Yang



From creating SWE-bench in a Princeton basement to shipping CodeClash, SWE-bench Multimodal, and SWE-bench Multilingual, John Yang has spent the last year and a half watching his benchmark become the de facto standard for evaluating AI coding agents, trusted by Cognition (Devin), OpenAI, Anthropic, and every major lab racing to solve software engineering at scale. We caught up with John live at NeurIPS 2025 to dig into the state of code evals heading into 2026: why SWE-bench went from ignored at its October 2023 release to the industry standard after Devin's launch (and how Walden emailed John two weeks before the big reveal); how the benchmark evolved from the Django-heavy original to nine languages (including JavaScript, Rust, Java, C, and Ruby) across 40 repos; why unit tests are a limiting form of verification and why long-running agent tournaments might be the future (in CodeClash, agents maintain codebases, compete in arenas, and iterate over multiple rounds); the proliferation of SWE-bench variants (SWE-bench Pro, SWE-bench Live, SWE-Efficiency, AlgoTune, SciCode) and how benchmark authors now justify their splits with curation techniques instead of just "more repos"; why Tau-bench's "impossible tasks" controversy is actually a feature rather than a bug (intentionally including impossible tasks flags cheating); the tension between long autonomy (five-hour runs) and interactivity (Cognition's emphasis on fast back-and-forth); how Terminal-bench unlocked creativity by letting PhD students and non-coders design environments beyond GitHub issues and PRs; the academic data problem (companies like Cognition and Cursor have rich user-interaction data, while academics need user simulators or compelling products like LMArena to get similar signal); and John's vision for CodeClash as a testbed for human-AI collaboration: freeze model capability, vary the collaboration setup (solo agent, multi-agent, human plus agent), and measure how interaction patterns change as models climb the ladder from code completion to full codebase reasoning.

We discuss:

  • John's path: Princeton → SWE-bench (October 2023) → Stanford PhD with Diyi Yang and the Iris Group, focusing on code evals, human-AI collaboration, and long-running agent benchmarks

  • The SWE-bench origin story: released October 2023, mostly ignored until Cognition's Devin launch kicked off the arms race (Walden emailed John two weeks before: "we have a good number")

  • SWE-bench Verified: the curated, high-quality split that became the standard for serious evals

  • SWE-bench Multimodal and Multilingual: nine languages (including JavaScript, Rust, Java, C, and Ruby) across 40 repos, moving beyond the Django-heavy original distribution

  • The SWE-bench Pro controversy: independent authors used the "SWE-bench" name without John's blessing, but he's okay with it ("congrats to them, it's a great benchmark")

  • CodeClash: John's new benchmark for long-horizon development—agents maintain their own codebases, edit and improve them each round, then compete in arenas (programming games like Halite, economic tasks like GDP optimization)

  • SWE-Efficiency (Jeffrey Maugh, John's high school classmate): optimize code for speed without changing behavior (parallelization, SIMD operations)

  • AlgoTune, SciCode, Terminal-bench, Tau-bench, SecBench, SRE-bench: the Cambrian explosion of code evals, each diving into different domains (security, SRE, science, user simulation)

  • The Tau-bench "impossible tasks" debate: some tasks are underspecified or impossible, but John thinks that's actually a feature (a score above 75% flags cheating)

  • Cognition's research focus: codebase understanding (retrieval++), helping humans understand their own codebases, and automatic context engineering for LLMs (research sub-agents)

  • The vision: CodeClash as a testbed for human-AI collaboration—vary the setup (solo agent, multi-agent, human+agent), freeze model capability, and measure how interaction changes as models improve

John Yang

  • SWE-bench: https://www.swebench.com

  • X: https://x.com/jyangballin

Chapters
  • 00:00:00 Introduction: John Yang on SWE-bench and Code Evaluations
  • 00:00:31 SWE-bench Origins and Devin's Impact on the Coding Agent Arms Race
  • 00:01:09 SWE-bench Ecosystem: Verified, Pro, Multimodal, and Multilingual Variants
  • 00:02:17 Moving Beyond Django: Diversifying Code Evaluation Repositories
  • 00:03:08 CodeClash: Long-Horizon Development Through Programming Tournaments
  • 00:04:41 From Halite to Economic Value: Designing Competitive Coding Arenas
  • 00:06:04 Ofir's Lab: SWE-Efficiency, AlgoTune, and SciCode for Scientific Computing
  • 00:07:52 The Benchmark Landscape: Tau-bench, Terminal-bench, and User Simulation
  • 00:09:20 The Impossible Task Debate: Refusals, Ambiguity, and Benchmark Integrity
  • 00:12:32 The Future of Code Evals: Long Autonomy vs Human-AI Collaboration
  • 00:14:37 Call to Action: User Interaction Data and Codebase Understanding Research


Latent Space: The AI Engineer Podcast, by swyx + Alessio

4.6 · 92 ratings

