Latent Space: The AI Engineer Podcast

By swyx + Alessio

The podcast by and for AI Engineers! In 2024, over 2 million readers and listeners came to Latent Space to hear about news, papers and interviews in Software 3.0.

We cover Foundation Models changing ... more

4.6

9292 ratings

Download on the App Store

Get it on Google Play

FAQs about Latent Space: The AI Engineer Podcast:

How many episodes does Latent Space: The AI Engineer Podcast have?

The podcast currently has 181 episodes available.

Latent Space: The AI Engineer Podcast episodes:

January 23, 2026
Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay 2
From shipping Gemini Deep Think and IMO Gold to launching the Reasoning and AGI team in Singapore, Yi Tay has spent the last 18 months living through the full arc of Google DeepMind's pivot from architecture research to RL-driven reasoning: watching his team go from a dozen researchers to 300+, training models that solve International Math Olympiad problems in a live competition, building the infrastructure to scale deep thinking across every domain, and driving Gemini to the top of the leaderboards across every category. Yi returns to dig into the inside story of the IMO effort and more!
We discuss:
Yi's path: Brain → Reka → Google DeepMind → Reasoning and AGI team Singapore, leading model training for Gemini Deep Think and IMO Gold
The IMO Gold story: four co-captains (Yi in Singapore, Jonathan in London, Jordan in Mountain View, and Tong leading the overall effort), training the checkpoint in ~1 week, live competition in Australia with professors punching in problems as they came out, and the tension of not knowing if they'd hit Gold until the human scores came in (because the Gold threshold is a percentile, not a fixed number)
Why they threw away AlphaProof: "If one model can't do it, can we get to AGI?" The decision to abandon symbolic systems and bet on end-to-end Gemini with RL was bold and non-consensus
On-policy vs. off-policy RL: off-policy is imitation learning (copying someone else's trajectory), on-policy is the model generating its own outputs, getting rewarded, and training on its own experience—"humans learn by making mistakes, not by copying"
Why self-consistency and parallel thinking are fundamental: sampling multiple times, majority voting, LM judges, and internal verification are all forms of self-consistency that unlock reasoning beyond single-shot inference
The data efficiency frontier: humans learn from 8 orders of magnitude less data than models, so where's the bug? Is it the architecture, the learning algorithm, backprop, off-policyness, or something else?
Three schools of thought on world models: (1) Genie/spatial intelligence (video-based world models), (2) Yann LeCun's JEPA + FAIR's code world models (modeling internal execution state), (3) the amorphous "resolution of possible worlds" paradigm (curve-fitting to find the world model that best explains the data)
Why AI coding crossed the threshold: Yi now runs a job, gets a bug, pastes it into Gemini, and relaunches without even reading the fix—"the model is better than me at this"
The Pokémon benchmark: can models complete Pokédex by searching the web, synthesizing guides, and applying knowledge in a visual game state? "Efficient search of novel idea space is interesting, but we're not even at the point where models can consistently apply knowledge they look up"
DSI and generative retrieval: re-imagining search as predicting document identifiers with semantic tokens, now deployed at YouTube (semantic IDs for RecSys) and Spotify
Why RecSys and IR feel like a different universe: "modeling dynamics are strange, like gravity is different—you hit the shuttlecock and hear glass shatter, cause and effect are too far apart"
The closed lab advantage is increasing: the gap between frontier labs and open source is growing because ideas compound over time, and researchers keep finding new tricks that play well with everything built before
Why ideas still matter: "the last five years weren't just blind scaling—transformers, pre-training, RL, self-consistency, all had to play well together to get us here"
Gemini Singapore: hiring for RL and reasoning researchers, looking for track record in RL or exceptional achievement in coding competitions, and building a small, talent-dense team close to the frontier
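The self-consistency idea in the list above (sample several times, then take a majority vote) can be sketched in a few lines of Python. This is an illustrative sketch only, not how Gemini or Deep Think implements it: `majority_vote`, `self_consistent_answer`, and the stub sampler are hypothetical names, and a real system would call a model with non-zero temperature where the stub is.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the answer that appears most often among the samples."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def self_consistent_answer(sample: Callable[[], str], n: int = 5) -> str:
    """Draw n independent samples and keep the majority answer.

    `sample` is a hypothetical stand-in for one stochastic model call
    that returns a final answer string.
    """
    return majority_vote([sample() for _ in range(n)])


# Stub sampler for demonstration: replays a fixed list of "model outputs".
_fake_outputs = iter(["42", "41", "42", "42", "17"])
print(self_consistent_answer(lambda: next(_fake_outputs), n=5))  # prints 42
```

LM judges and internal verification, also mentioned above, would replace the plain vote count with a learned or prompted scoring step; the sketch covers only the voting case.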
—
Yi Tay
Google DeepMind: https://deepmind.google
X: https://x.com/YiTayML
Chapters
00:00:00 Introduction: Returning to Google DeepMind and the Singapore AGI Team
00:04:52 The Philosophy of On-Policy RL: Learning from Your Own Mistakes
00:12:00 IMO Gold Medal: The Journey from AlphaProof to End-to-End Gemini
00:21:33 Training IMO Cat: Four Captains Across Three Time Zones
00:26:19 Pokemon and Long-Horizon Reasoning: Beyond Academic Benchmarks
00:32:59 Reasoning, Chain of Thought, and Latent Thinking
00:36:29 AI Coding Assistants: From Lazy to Actually Useful
00:44:46 Is Attention All You Need? Architecture, Learning, and the Local Minima
00:55:04 Data Efficiency and World Models: The Next Frontier
01:08:12 DSI and Generative Retrieval: Reimagining Search with Semantic IDs
01:17:59 Building GDM Singapore: Geography, Talent, and the Symposium
01:24:18 Hiring Philosophy: High Stats, Research Taste, and Student Budgets
01:28:49 Health, HRV, and Research Performance: The 23kg Journey
1h 33min
January 17, 2026
Brex’s AI Hail Mary — With CTO James Reggio
From building internal AI labs to becoming CTO of Brex, James Reggio has helped lead one of the most disciplined AI transformations inside a real financial institution where compliance, auditability, and customer trust actually matter.
We sat down with Reggio to unpack Brex’s three-pillar AI strategy (corporate, operational, and product AI) [https://www.brex.com/journal/brex-ai-native-operations], how SOP-driven agents beat overengineered RL in ops, why Brex lets employees “build their own AI stack” instead of picking winners [https://www.conductorone.com/customers/brex/], and how a small, founder-heavy AI team is shipping production agents to 40,000+ companies. Reggio also goes deep on Brex’s multi-agent “network” architecture, evals for multi-turn systems, agentic coding’s second-order effects on codebase understanding, and why the future of finance software looks less like dashboards and more like executive assistants coordinating specialist agents behind the scenes.
We discuss:
Brex’s three-pillar AI strategy: corporate AI for 10x employee workflows, operational AI for cost and compliance leverage, and product AI that lets customers justify Brex as part of their AI strategy to the board
Why SOP-driven agents beat overengineered RL in finance ops, and how breaking work into auditable, repeatable steps unlocked faster automation in KYC, underwriting, fraud, and disputes
Building an internal AI platform early: LLM gateways, prompt/version management, evals, cost observability, and why platform work quietly became the force multiplier behind everything else
Multi-agent “networks” vs single-agent tools: why Brex’s EA-style assistant coordinates specialist agents (policy, travel, reimbursements) through multi-turn conversations instead of one-shot tool calls
The audit agent pattern: separating detection, judgment, and follow-up into different agents to reduce false negatives without overwhelming finance teams
Centralized AI teams without resentment: how Brex avoided “AI envy” by tying work to business impact and letting anyone transfer in if they cared deeply enough
Letting employees build their own AI stack: ChatGPT vs Claude vs Gemini, Cursor vs Windsurf, and why Brex refuses to pick winners in fast-moving tool races
Measuring adoption without vanity metrics: why “% of code written by AI” is the wrong KPI and what second-order effects (slop, drift, code ownership) actually matter
Evals in the real world: regression tests from ops QA, LLM-as-judge for multi-turn agents, and why integration-style evals break faster than you expect
Teaching AI fluency at scale: the user → advocate → builder → native framework, ops-led training, spot bonuses, and avoiding fear-based adoption
Re-interviewing the entire engineering org: using agentic coding interviews internally to force hands-on skill upgrades without formal performance scoring
Headcount in the age of agents: why Brex grew the business without growing engineering, and why AI amplifies bad architecture as fast as good decisions
The future of finance software: why dashboards fade, assistants take over, and agent-to-agent collaboration becomes the real UI
—
James Reggio
X: https://x.com/jamesreggio
LinkedIn: https://www.linkedin.com/in/jamesreggio/
Where to find Latent Space
X: https://x.com/latentspacepod
Substack: https://www.latent.space/
Chapters
00:00:00 Introduction
00:01:24 From Mobile Engineer to CTO: The Founder's Path
00:03:00 Quitters Welcome: Building a Founder-Friendly Culture
00:05:13 The AI Team Structure: 10-Person Startup Within Brex
00:11:55 Building the Brex Agent Platform: Multi-Agent Networks
00:13:45 Tech Stack Decisions: TypeScript, Mastra, and MCP
00:16:40 The Brex Assistant: Executive Assistant for Every Employee
00:24:32 Operational AI: Automating Underwriting, KYC, and Fraud
00:37:11 Agentic Coding Adoption: Cursor, Windsurf, and the Engineering Interview
00:40:26 Evaluation Strategy: From Simple SOPs to Multi-Turn Evals
00:58:51 AI Fluency Levels: From User to Native
01:03:33 The Future of Engineering Headcount and AI Leverage
01:09:14 The Audit Agent Network: Finance Team Agents in Action
1h 14min
January 09, 2026
Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith
Don’t miss George’s AIE talk: https://www.youtube.com/watch?v=sRpqPgKeXNk
—-
From launching a side project in a Sydney basement to becoming the independent gold standard for AI benchmarking—trusted by developers, enterprises, and every major lab to navigate the exploding landscape of models, providers, and capabilities—George Cameron and Micah Hill-Smith have spent two years building Artificial Analysis into the platform that answers the questions no one else will: Which model is actually best for your use case? What are the real speed-cost trade-offs? And how open is "open" really?
We discuss:
The origin story: built as a side project in 2023 while Micah was building a legal AI assistant, launched publicly in January 2024, and went viral after Swyx's retweet
Why they run evals themselves: labs prompt models differently, cherry-pick chain-of-thought examples (Google Gemini 1.0 Ultra used 32-shot prompts to beat GPT-4 on MMLU), and self-report inflated numbers
The mystery shopper policy: they register accounts not on their own domain and run intelligence + performance benchmarks incognito to prevent labs from serving different models on private endpoints
How they make money: enterprise benchmarking insights subscription (standardized reports on model deployment, serverless vs. managed vs. leasing chips) and private custom benchmarking for AI companies (no one pays to be on the public leaderboard)
The Intelligence Index (V3): synthesizes 10 eval datasets (MMLU, GPQA, agentic benchmarks, long-context reasoning) into a single score, with 95% confidence intervals via repeated runs
Omniscience Index (hallucination rate): scores models from -100 to +100 (penalizing incorrect answers, rewarding "I don't know"), and Claude models lead with the lowest hallucination rates despite not always being the smartest
GDP Val AA: their version of OpenAI's GDPval (44 white-collar tasks with spreadsheets, PDFs, PowerPoints), run through their Stirrup agent harness (up to 100 turns, code execution, web search, file system), graded by Gemini 3 Pro as an LLM judge (tested extensively, no self-preference bias)
The Openness Index: scores models 0-18 on transparency of pre-training data, post-training data, methodology, training code, and licensing (AI2 OLMo 2 leads, followed by Nous Hermes and NVIDIA Nemotron)
The smiling curve of AI costs: GPT-4-level intelligence is 100-1000x cheaper than at launch (thanks to smaller models like Amazon Nova), but frontier reasoning models in agentic workflows cost more than ever (sparsity, long context, multi-turn agents)
Why sparsity might go way lower than 5%: GPT-4.5 is ~5% active, Gemini models might be ~3%, and Omniscience Index accuracy correlates with total parameters (not active), suggesting massive sparse models are the future
Token efficiency vs. turn efficiency: GPT-5 costs more per token but solves Tau-bench in fewer turns (cheaper overall), and models are getting better at using more tokens only when needed (5.1 Codex has tighter token distributions)
V4 of the Intelligence Index coming soon: adding GDP Val AA, Critical Point, hallucination rate, and dropping some saturated benchmarks (human-eval-style coding is now trivial for small models)
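To make the -100 to +100 hallucination scoring described above concrete, here is a minimal sketch of one way such a score could be computed. The formula is an assumption for illustration (correct = +1, incorrect = -1, abstain = 0, scaled to a percentage), not Artificial Analysis's published methodology, and `abstention_aware_score` is a hypothetical name.

```python
from typing import Iterable

VALID_OUTCOMES = {"correct", "incorrect", "abstain"}


def abstention_aware_score(outcomes: Iterable[str]) -> float:
    """Score in [-100, 100]: +1 per correct, -1 per incorrect, 0 per abstention.

    Assumed scoring rule for illustration only. Saying "I don't know"
    costs nothing, while a wrong answer actively lowers the score.
    """
    items = list(outcomes)
    if not items:
        raise ValueError("need at least one graded answer")
    unknown = set(items) - VALID_OUTCOMES
    if unknown:
        raise ValueError(f"unexpected outcome labels: {sorted(unknown)}")
    correct = sum(o == "correct" for o in items)
    incorrect = sum(o == "incorrect" for o in items)
    return 100.0 * (correct - incorrect) / len(items)


# A model that abstains on two hard questions...
cautious = ["correct"] * 6 + ["incorrect"] * 2 + ["abstain"] * 2
# ...versus one that guesses on those same two questions and gets both wrong.
guesser = ["correct"] * 6 + ["incorrect"] * 4
print(abstention_aware_score(cautious))  # prints 40.0
print(abstention_aware_score(guesser))   # prints 20.0
```

Under this assumed rule, both models answer the same six questions correctly, but the one that declines to guess scores higher, which is the behavior the bullet describes.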
—
Artificial Analysis
Website: https://artificialanalysis.ai
George Cameron on X: https://x.com/grmcameron
Micah Hill-Smith on X: https://x.com/_micah_h
Chapters
00:00:00 Introduction: Full Circle Moment and Artificial Analysis Origins
00:01:08 Business Model: Independence and Revenue Streams
00:04:00 The Origin Story: From Legal AI to Benchmarking
00:07:00 Early Challenges: Cost, Methodology, and Independence
00:16:13 AI Grant and Moving to San Francisco
00:18:58 Evolution of the Intelligence Index: V1 to V3
00:27:55 New Benchmarks: Hallucination Rate and Omniscience Index
00:33:19 Critical Point and Frontier Physics Problems
00:35:56 GDPVAL AA: Agentic Evaluation and Stirrup Harness
00:51:47 The Openness Index: Measuring Model Transparency
00:57:57 The Smiling Curve: Cost of Intelligence Paradox
01:04:00 Hardware Efficiency and Sparsity Trends
01:07:43 Reasoning vs Non-Reasoning: Token Efficiency Matters
01:10:47 Multimodal Benchmarking and Community Requests
01:14:50 Looking Ahead: V4 Intelligence Index and Beyond
1h 19min
January 08, 2026
Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith
Don’t miss George’s AIE talk: https://www.youtube.com/watch?v=sRpqPgKeXNk
—-
From launching a side project in a Sydney basement to becoming the independent gold standard for AI benchmarking (trusted by developers, enterprises, and every major lab to navigate the exploding landscape of models, providers, and capabilities), George Cameron and Micah Hill-Smith have spent two years building Artificial Analysis into the platform that answers the questions no one else will: Which model is actually best for your use case? What are the real speed-cost trade-offs? And how open is "open" really?
We discuss:
The origin story: built as a side project in 2023 while Micah was building a legal AI assistant, launched publicly in January 2024, and went viral after Swyx's retweet
Why they run evals themselves: labs prompt models differently, cherry-pick chain-of-thought examples (Google Gemini 1.0 Ultra used 32-shot prompts to beat GPT-4 on MMLU), and self-report inflated numbers
The mystery shopper policy: they register accounts not on their own domain and run intelligence + performance benchmarks incognito to prevent labs from serving different models on private endpoints
How they make money: enterprise benchmarking insights subscription (standardized reports on model deployment, serverless vs. managed vs. leasing chips) and private custom benchmarking for AI companies (no one pays to be on the public leaderboard)
The Intelligence Index (V3): synthesizes 10 eval datasets (MMLU, GPQA, agentic benchmarks, long-context reasoning) into a single score, with 95% confidence intervals via repeated runs
Omniscience Index (hallucination rate): scores models from -100 to +100 (penalizing incorrect answers, rewarding "I don't know"), and Claude models lead with the lowest hallucination rates despite not always being the smartest
GDP Val AA: their version of OpenAI's GDPval (44 white-collar tasks with spreadsheets, PDFs, PowerPoints), run through their Stirrup agent harness (up to 100 turns, code execution, web search, file system), graded by Gemini 3 Pro as an LLM judge (tested extensively, no self-preference bias)
The Openness Index: scores models 0-18 on transparency of pre-training data, post-training data, methodology, training code, and licensing (AI2 OLMo 2 leads, followed by Nous Hermes and NVIDIA Nemotron)
The smiling curve of AI costs: GPT-4-level intelligence is 100-1000x cheaper than at launch (thanks to smaller models like Amazon Nova), but frontier reasoning models in agentic workflows cost more than ever (sparsity, long context, multi-turn agents)
Why sparsity might go way lower than 5%: GPT-4.5 is ~5% active, Gemini models might be ~3%, and Omniscience Index accuracy correlates with total parameters (not active), suggesting massive sparse models are the future
Token efficiency vs. turn efficiency: GPT-5 costs more per token but solves Tau-bench in fewer turns (cheaper overall), and models are getting better at using more tokens only when needed (5.1 Codex has tighter token distributions)
V4 of the Intelligence Index coming soon: adding GDP Val AA, Critical Point, hallucination rate, and dropping some saturated benchmarks (human-eval-style coding is now trivial for small models)
—
Artificial Analysis
Website: https://artificialanalysis.ai
George Cameron on X: https://x.com/georgecameron
Micah Hill-Smith on X: https://x.com/micahhsmith
Chapters
00:00:00 Introduction: Full Circle Moment and Artificial Analysis Origins
00:01:08 Business Model: Independence and Revenue Streams
00:04:00 The Origin Story: From Legal AI to Benchmarking
00:07:00 Early Challenges: Cost, Methodology, and Independence
00:16:13 AI Grant and Moving to San Francisco
00:18:58 Evolution of the Intelligence Index: V1 to V3
00:27:55 New Benchmarks: Hallucination Rate and Omniscience Index
00:33:19 Critical Point and Frontier Physics Problems
00:35:56 GDPVAL AA: Agentic Evaluation and Stirrup Harness
00:51:47 The Openness Index: Measuring Model Transparency
00:57:57 The Smiling Curve: Cost of Intelligence Paradox
01:04:00 Hardware Efficiency and Sparsity Trends
01:07:43 Reasoning vs Non-Reasoning: Token Efficiency Matters
01:10:47 Multimodal Benchmarking and Community Requests
01:14:50 Looking Ahead: V4 Intelligence Index and Beyond
1h 19min
January 06, 2026
[State of Evals] LMArena's $1.7B Vision — Anastasios Angelopoulos, LMArena
We are re-upping this episode after LMArena announced their fresh Series A (https://www.theinformation.com/articles/ai-evaluation-startup-lmarena-valued-1-7-billion-new-funding-round?rc=luxwz4), raising $150M at a $1.7B valuation, with $30M annualized consumption revenue (aka $2.5M MRR) after their September evals product launch.
—-
From building LMArena in a Berkeley basement to raising $100M and becoming the de facto leaderboard for frontier AI, Anastasios Angelopoulos returns to Latent Space to recap 2025 at one of the most influential platforms in AI, trusted by millions of users, every major lab, and the entire industry to answer one question: which model is actually best for real-world use cases? We caught up with Anastasios live at NeurIPS 2025 to dig into the origin story (spoiler: it started as an academic project incubated by Anjney Midha at a16z, who formed an entity and gave grants before they even committed to starting a company), why they decided to spin out instead of staying academic or nonprofit (the only way to scale was to build a company), how they're spending that $100M (inference costs, React migration off Gradio, and hiring world-class talent across ML, product, and go-to-market), the Leaderboard Illusion controversy and why their response demolished the paper's claims (factual errors, misrepresentation of open vs. closed source sampling, and ignoring the transparency of preview testing that the community loves), why platform integrity comes first (the public leaderboard is a charity, not a pay-to-play system: models can't pay to get on, can't pay to get off, and scores reflect millions of real votes), how they're expanding into occupational verticals (medicine, legal, finance, creative marketing) and multimodal arenas (video coming soon), why consumer retention is earned every single day (sign-in and persistent history were the unlock, but users are fickle and can leave at any moment), and his vision for Arena as the central evaluation platform that provides the North Star for the industry: constantly fresh, immune to overfitting, and grounded in millions of real-world conversations from real users.
We discuss:
The $100M raise: use of funds is primarily inference costs (funding free usage for tens of millions of monthly conversations), React migration off Gradio (custom loading icons, better developer hiring, more flexibility), and hiring world-class talent
The scale: 250M+ conversations on the platform, tens of millions per month, 25% of users do software for a living, and half of users are now logged in
The leaderboard illusion controversy: Cohere researchers claimed undisclosed private testing created inequities, but Arena's response demolished the paper's factual errors (misrepresented open vs. closed source sampling, ignored transparency of preview testing that the community loves)
Why preview testing is loved by the community: secret codenames (Gemini Nano Banana, named after PM Naina's nickname), early access to unreleased models, and the thrill of being first to vote on frontier capabilities
The Nano Banana moment: changed Google's market share overnight, billions of dollars in stock movement, and validated that multimodal models (image generation, video) are economically critical for marketing, design, and AI-for-science
New categories: occupational and expert arenas (medicine, legal, finance, creative marketing), Code Arena, and video arena coming soon
Chapters
00:00:00 Introduction: Anastasios from Arena and the LM Arena Journey
00:01:36 The Anjney Midha Incubation: From Berkeley Basement to Startup
00:02:47 The Decision to Start a Company: Scaling Beyond Academia
00:03:38 The Raise: Use of Funds and Platform Economics
00:05:10 Arena's User Base: 5M+ Users and Diverse Demographics
00:06:02 The Competitive Landscape: Artificial Analysis, AI.xyz, and Arena's Differentiation
00:08:12 Educational Value and Learning from the Community
00:08:41 Technical Migration: From Gradio to React and Platform Evolution
00:10:18 Leaderboard Illusion Paper: Addressing Critiques and Maintaining Integrity
00:12:29 Nano Banana Moment: How Preview Models Create Market Impact
00:13:41 Multimodal AI and Image Generation: From Skepticism to Economic Value
00:15:37 Core Principles: Platform Integrity and the Public Leaderboard as Charity
00:18:29 Future Roadmap: Expert Categories, Multimodal, Video, and Occupational Verticals
00:19:10 API Strategy and Focus: Doing One Thing Well
00:19:51 Community Management and Retention: Sign-In, History, and Daily Value
00:21:49 Hiring and Building a High-Performance Team
00:22:21 Partnerships and Agent Evaluation: From Devin to Full-Featured Harnesses
25min
January 02, 2026
[NeurIPS Best Paper] 1000 Layer Networks for Self-Supervised RL — Kevin Wang et al, Princeton
From undergraduate research seminars at Princeton to winning the Best Paper award at NeurIPS 2025, Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzcinski, and Benjamin Eysenbach defied conventional wisdom by scaling reinforcement learning networks to 1,000 layers deep, unlocking performance gains that the RL community thought impossible. We caught up with the team live at NeurIPS to dig into the story behind RL1000: why deep networks have worked in language and vision but failed in RL for over a decade (spoiler: it's not just about depth, it's about the objective), how they discovered that self-supervised RL (learning representations of states, actions, and future states via contrastive learning) scales where value-based methods collapse, the critical architectural tricks that made it work (residual connections, layer normalization, and a shift from regression to classification), why scaling depth is more parameter-efficient than scaling width (linear vs. quadratic growth), how JAX and GPU-accelerated environments let them collect hundreds of millions of transitions in hours (the data abundance that unlocked scaling in the first place), the "critical depth" phenomenon where performance doesn't just improve but multiplies once you cross 15M+ transitions and add the right architectural components, why this isn't just "make networks bigger" but a fundamental shift in RL objectives (their code doesn't have a line saying "maximize rewards"; it's pure self-supervised representation learning), how deep teacher, shallow student distillation could unlock deployment at scale (train frontier capabilities with 1000 layers, distill down to efficient inference models), the robotics implications (goal-conditioned RL without human supervision or demonstrations, scaling architecture instead of scaling manual data collection), and their thesis that RL is finally ready to scale like language and vision, not by throwing compute at value functions, but by borrowing the self-supervised, representation-learning paradigms that made the rest of deep learning work.
We discuss:
The self-supervised RL objective: instead of learning value functions (noisy, biased, spurious), they learn representations where states along the same trajectory are pushed together, states along different trajectories are pushed apart—turning RL into a classification problem
Why naive scaling failed: doubling depth degraded performance, doubling again with residual connections and layer norm suddenly skyrocketed performance in one environment—unlocking the "critical depth" phenomenon
Scaling depth vs. width: depth grows parameters linearly, width grows quadratically—depth is more parameter-efficient and sample-efficient for the same performance
The JAX + GPU-accelerated environments unlock: collecting thousands of trajectories in parallel meant data wasn't the bottleneck, and crossing 15M+ transitions was when deep networks really paid off
The blurring of RL and self-supervised learning: their code doesn't maximize rewards directly, it's an actor-critic goal-conditioned RL algorithm, but the learning burden shifts to classification (cross-entropy loss, representation learning) instead of TD error regression
Why scaling batch size unlocks at depth: traditional RL doesn't benefit from larger batches because networks are too small to exploit the signal, but once you scale depth, batch size becomes another effective scaling dimension
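The depth-versus-width point above (linear versus quadratic parameter growth) follows from simple counting, sketched here for a plain stack of equal-width dense layers. This is a back-of-the-envelope illustration, not the paper's architecture: `mlp_hidden_params` is a hypothetical helper, and it ignores input/output layers, residual connections, and layer-norm parameters.

```python
def mlp_hidden_params(depth: int, width: int) -> int:
    """Weights plus biases in `depth` dense layers that map width -> width."""
    if depth < 1 or width < 1:
        raise ValueError("depth and width must be positive")
    return depth * (width * width + width)


base = mlp_hidden_params(depth=4, width=256)
deeper = mlp_hidden_params(depth=8, width=256)  # double the depth
wider = mlp_hidden_params(depth=4, width=512)   # double the width

print(base)           # prints 263168
print(deeper / base)  # prints 2.0 (linear in depth)
print(round(wider / base, 2))  # prints 3.99 (roughly quadratic in width)
```

Doubling depth doubles the count exactly, while doubling width multiplies it by about four, because each layer's weight matrix has width × width entries.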
—
RL1000 Team (Princeton)
1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities: https://openreview.net/forum?id=s0JVsx3bx1
Chapters
00:00:00 Introduction: Best Paper Award and NeurIPS Poster Experience
00:01:11 Team Introductions and Princeton Research Origins
00:03:35 The Deep Learning Anomaly: Why RL Stayed Shallow
00:04:35 Self-Supervised RL: A Different Approach to Scaling
00:05:13 The Breakthrough Moment: Residual Connections and Critical Depth
00:07:15 Architectural Choices: Borrowing from ResNets and Avoiding Vanishing Gradients
00:07:50 Clarifying the Paper: Not Just Big Networks, But Different Objectives
00:08:46 Blurring the Lines: RL Meets Self-Supervised Learning
00:09:44 From TD Errors to Classification: Why This Objective Scales
00:11:06 Architecture Details: Building on Braw and SymbaFowl
00:12:05 Robotics Applications: Goal-Conditioned RL Without Human Supervision
00:13:15 Efficiency Trade-offs: Depth vs Width and Parameter Scaling
00:15:48 JAX and GPU-Accelerated Environments: The Data Infrastructure
00:18:05 World Models and Next State Classification
00:21:02 Future Directions: Distillation, VLMs, and Hierarchical Planning
00:22:37 Unlocking Batch Size Scaling Through Network Capacity
00:24:10 Compute Requirements: State-of-the-Art on a Single GPU
00:27:15 Closing Thoughts: Challenging Conventional Wisdom in RL Scaling
29min
December 31, 2025
[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang
From creating SWE-bench in a Princeton basement to shipping CodeClash, SWE-bench Multimodal, and SWE-bench Multilingual, John Yang has spent the last year and a half watching his benchmark become the de facto standard for evaluating AI coding agents—trusted by Cognition (Devin), OpenAI, Anthropic, and every major lab racing to solve software engineering at scale. We caught up with John live at NeurIPS 2025 to dig into the state of code evals heading into 2026: why SWE-bench went from ignored (October 2023) to the industry standard after Devin's launch (and how Walden emailed him two weeks before the big reveal), how the benchmark evolved from Django-heavy to nine languages across 40 repos (JavaScript, Rust, Java, C, Ruby), why unit tests as verification are limiting and long-running agent tournaments might be the future (CodeClash: agents maintain codebases, compete in arenas, and iterate over multiple rounds), the proliferation of SWE-bench variants (SWE-bench Pro, SWE-bench Live, SWE-Efficiency, AlgoTune, SciCode) and how benchmark authors are now justifying their splits with curation techniques instead of just "more repos," why Tau-bench's "impossible tasks" controversy is actually a feature not a bug (intentionally including impossible tasks flags cheating), the tension between long autonomy (5-hour runs) vs. interactivity (Cognition's emphasis on fast back-and-forth), how Terminal-bench unlocked creativity by letting PhD students and non-coders design environments beyond GitHub issues and PRs, the academic data problem (companies like Cognition and Cursor have rich user interaction data, academics need user simulators or compelling products like LMArena to get similar signal), and his vision for CodeClash as a testbed for human-AI collaboration—freeze model capability, vary the collaboration setup (solo agent, multi-agent, human+agent), and measure how interaction patterns change as models climb the ladder from code completion to full codebase reasoning.
We discuss:
John's path: Princeton → SWE-bench (October 2023) → Stanford PhD with Diyi Yang and the Iris Group, focusing on code evals, human-AI collaboration, and long-running agent benchmarks
The SWE-bench origin story: released October 2023, mostly ignored until Cognition's Devin launch kicked off the arms race (Walden emailed John two weeks before: "we have a good number")
SWE-bench Verified: the curated, high-quality split that became the standard for serious evals
SWE-bench Multimodal and Multilingual: nine languages (JavaScript, Rust, Java, C, Ruby) across 40 repos, moving beyond the Django-heavy original distribution
The SWE-bench Pro controversy: independent authors used the "SWE-bench" name without John's blessing, but he's okay with it ("congrats to them, it's a great benchmark")
CodeClash: John's new benchmark for long-horizon development—agents maintain their own codebases, edit and improve them each round, then compete in arenas (programming games like Halite, economic tasks like GDP optimization)
SWE-fficiency (Jeffrey Maugh, John's high school classmate): optimize code for speed without changing behavior (parallelization, SIMD operations)
AlgoTune, SciCode, Terminal-bench, Tau-bench, SecBench, SRE-bench: the Cambrian explosion of code evals, each diving into different domains (security, SRE, science, user simulation)
The Tau-bench "impossible tasks" debate: some tasks are underspecified or impossible, but John thinks that's actually a feature (flags cheating if you score above 75%)
Cognition's research focus: codebase understanding (retrieval++), helping humans understand their own codebases, and automatic context engineering for LLMs (research sub-agents)
The vision: CodeClash as a testbed for human-AI collaboration—vary the setup (solo agent, multi-agent, human+agent), freeze model capability, and measure how interaction changes as models improve
—
John Yang
SWE-bench: https://www.swebench.com
X: https://x.com/jyangballin
Chapters
00:00:00 Introduction: John Yang on SWE-bench and Code Evaluations
00:00:31 SWE-bench Origins and Devin's Impact on the Coding Agent Arms Race
00:01:09 SWE-bench Ecosystem: Verified, Pro, Multimodal, and Multilingual Variants
00:02:17 Moving Beyond Django: Diversifying Code Evaluation Repositories
00:03:08 CodeClash: Long-Horizon Development Through Programming Tournaments
00:04:41 From Halite to Economic Value: Designing Competitive Coding Arenas
00:06:04 Ofir's Lab: SWE-Efficiency, AlgoTune, and SciCode for Scientific Computing
00:07:52 The Benchmark Landscape: TAU-bench, Terminal-bench, and User Simulation
00:09:20 The Impossible Task Debate: Refusals, Ambiguity, and Benchmark Integrity
00:12:32 The Future of Code Evals: Long Autonomy vs Human-AI Collaboration
00:14:37 Call to Action: User Interaction Data and Codebase Understanding Research
18min
December 31, 2025
[State of Post-Training] From GPT-4.1 to 5.1: RLVR, Agent & Token Efficiency — Josh McGrath, OpenAI
From pre-training data curation to shipping GPT-4o, o1, o3, and now GPT-5 thinking and the shopping model, Josh McGrath has lived through the full arc of OpenAI's post-training evolution, from the PPO vs DPO debates of 2023 to today's RLVR era, where the real innovation isn't optimization methods but data quality, signal trust, and token efficiency. We sat down with Josh at NeurIPS 2025 to dig into the state of post-training heading into 2026: why RLHF and RLVR are both just policy gradient methods (the difference is the input data, not the math), how GRPO from DeepSeek Math was underappreciated as a shift toward more trustworthy reward signals (math answers you can verify vs. human preference you can't), why token efficiency matters more than wall-clock time (GPT-5 to 5.1 bumped evals and slashed tokens), how Codex has changed his workflow so much he feels "trapped" by 40-minute design sessions followed by 15-minute agent sprints, the infrastructure chaos of scaling RL ("way more moving parts than pre-training"), why long context will keep climbing but agents + graph walks might matter more than 10M-token windows, the shopping model as a test bed for interruptibility and chain-of-thought transparency, why personality toggles (Anton vs Clippy) are a real differentiator users care about, and his thesis that the education system isn't producing enough people who can do both distributed systems and ML research, the exact skill set required to push the frontier when the bottleneck moves every few weeks.
We discuss:
Josh's path: pre-training data curation → post-training researcher at OpenAI, shipping GPT-4o, o1, o3, GPT-5 thinking, and the shopping model
Why he switched from pre-training to post-training: "Do I want to make 3% compute efficiency wins, or change behavior by 40%?"
The RL infrastructure challenge: way more moving parts than pre-training (tasks, grading setups, external partners), and why babysitting runs at 12:30am means jumping into unfamiliar code constantly
How Codex has changed his workflow: 40-minute design sessions compressed into 15-minute agent sprints, and the strange "trapped" feeling of waiting for the agent to finish
The RLHF vs RLVR debate: both are policy gradient methods, the real difference is data quality and signal trust (human preference vs. verifiable correctness)
Why GRPO (from DeepSeek Math) was underappreciated: not just an optimization trick, but a shift toward reward signals you can actually trust (math answers over human vibes)
The token efficiency revolution: GPT-5 to 5.1 bumped evals and slashed tokens, and why thinking in tokens (not wall-clock time) unlocks better tool-calling and agent workflows
Personality toggles: Anton (tool, no warmth) vs Clippy (friendly, helpful), and why Josh uses custom instructions to make his model "just a tool"
The router problem: having a router at the top (GPT-5 thinking vs non-thinking) and an implicit router (thinking effort slider) creates weird bumps, and why the abstractions will eventually merge
Long context: climbing Graph Blocks evals, the dream of 10M+ token windows, and why agents + graph walks might matter more than raw context length
Why the education system isn't producing enough people who can do both distributed systems and ML research, and why that's the bottleneck for frontier labs
The 2026 vision: neither pre-training nor post-training is dead, we're in the fog of war, and the bottleneck will keep moving (so emotional stability helps)
—
Josh McGrath
OpenAI: https://openai.com
X: https://x.com/j_mcgraph
Chapters
00:00:00 Introduction: Josh McGrath on Post-Training at OpenAI
00:01:40 Infrastructure Challenges: Why Post-Training RL is Harder Than Pre-Training
00:03:45 Codex Max and the Flow Problem: 40 Minutes of Planning, 15 Minutes of Waiting
00:04:37 The Shopping Model: Black Friday Launch and Interruptibility
00:07:11 Model Personality and the Anton vs Clippy Divide
00:08:26 Beyond PPO vs DPO: The Data Quality Spectrum in RL
00:13:12 Token Efficiency: The 2D Plot That Matters Most
00:17:29 Long Context and Graph Blocks: Climbing Toward Perfect Context
00:21:23 The ML-Systems Hybrid: What's Hard to Hire For
00:24:50 Pre-Training Isn't Dead: Living Through Technological Revolution
28min
December 30, 2025
[State of RL/Reasoning] IMO/IOI Gold, OpenAI o3/GPT-5, and Cursor Composer — Ashvin Nair, Cursor
From Berkeley robotics and OpenAI's 2017 Dota-era internship to shipping RL breakthroughs on GPT-4o, o1, and o3, and now leading model development at Cursor, Ashvin Nair has done it all. We caught up with Ashvin at NeurIPS 2025 to dig into the inside story of OpenAI's reasoning team (spoiler: it went from a dozen people to 300+), why IOI Gold felt reachable in 2022 but somehow didn't change the world when o1 actually achieved it, how RL doesn't generalize beyond the training distribution (and why that means you need to bring economically useful tasks into distribution by co-designing products and models), the deeper lessons from the RL research era (2017–2022) and why most of it didn't pan out because the community overfitted to benchmarks, how Cursor is uniquely positioned to do continual learning at scale with policy updates every two hours and product-model co-design that keeps engineers in the loop instead of context-switching into ADHD hell, and his bet that the next paradigm shift is continual learning with infinite memory—where models experience something once (a bug, a mistake, a user pattern) and never forget it, storing millions of deployment tokens in weights without overloading capacity.
We discuss:
Ashvin's path: Berkeley robotics PhD → OpenAI 2017 intern (Dota era) → o1/o3 reasoning team → Cursor ML lead in three months
Why robotics people are the most grounded at NeurIPS (they work with the real world) and simulation people are the most unhinged (Lex Fridman's take)
The IOI Gold paradox: "If you told me we'd achieve IOI Gold in 2022, I'd assume we could all go on vacation—AI solved, no point working anymore. But life is still the same."
The RL research era (2017–2022) and why most of it didn't pan out: overfitting to benchmarks, too many implicit knobs to tune, and the community rewarding complex ideas over simple ones that generalize
Inside the o1 origin story: a dozen people, conviction from Ilya and Jakub Pachocki that RL would work, small-scale prototypes producing "surprisingly accurate reasoning traces" on math, and first-principles belief that scaled
The reasoning team grew from ~12 to 300+ people as o1 became a product and safety, tooling, and deployment scaled up
Why Cursor is uniquely positioned for continual learning: policy updates every two hours (online RL on tab), product and ML sitting next to each other, and the entire software engineering workflow (code, logs, debugging, DataDog) living in the product
Composer as the start of product-model co-design: smart enough to use, fast enough to stay in the loop, and built by a 20–25 person ML team with high-taste co-founders who code daily
The next paradigm shift: continual learning with infinite memory—models that experience something once (a bug, a user mistake) and store it in weights forever, learning from millions of deployment tokens without overloading capacity (trillions of pretraining tokens = plenty of room)
Why off-policy RL is unstable (Ashvin's favorite interview question) and why Cursor does two-day work trials instead of whiteboard interviews
The vision: automate software engineering as a process (not just answering prompts), co-design products so the entire workflow (write code, check logs, debug, iterate) is in-distribution for RL, and make models that never make the same mistake twice
—
Ashvin Nair
Cursor: https://cursor.com
X: https://x.com/ashvinnair_
Chapters
00:00:00 Introduction: From Robotics to Cursor via OpenAI
00:01:58 The Robotics to LLM Agent Transition: Why Code Won
00:09:11 RL Research Winter and Academic Overfitting
00:11:45 The Scaling Era and Moving Goalposts: IOI Gold Doesn't Mean AGI
00:20:03 The Blip: Thanksgiving 2023 and OpenAI Governance
00:21:30 OpenAI's Reasoning Journey: From Codex to O1
00:22:39 RL for Reasoning: The O-Series Conviction and Scaling
00:25:47 O1 to O3: Smooth Internal Progress vs External Hype Cycles
00:33:07 Why Cursor: Co-Designing Products and Models for Real Work
00:34:14 Composer and the Future: Online Learning Every Two Hours
00:35:15 Continual Learning: The Missing Paradigm Shift
00:44:00 Hiring at Cursor and Why Off-Policy RL is Unstable
46min
December 30, 2025
[State of AI Startups] Memory/Learning, RL Envs & DBT-Fivetran — Sarah Catanzaro, Amplify
From investing through the modern data stack era (DBT, Fivetran, and the analytics explosion) to now investing at the frontier of AI infrastructure and applications at Amplify Partners, Sarah Catanzaro has spent years at the intersection of data, compute, and intelligence—watching categories emerge, merge, and occasionally disappoint. We caught up with Sarah live at NeurIPS 2025 to dig into the state of AI startups heading into 2026: why $100M+ seed rounds with no near-term roadmap are now the norm (and why that terrifies her), what the DBT-Fivetran merger really signals about the modern data stack (spoiler: it's not dead, just ready for IPO), how frontier labs are using DBT and Fivetran to manage training data and agent analytics at scale, why data catalogs failed as standalone products but might succeed as metadata services for agents, the consumerization of AI and why personalization (memory, continual learning, K-factor) is the 2026 unlock for retention and growth, why she thinks RL environments are a fad and real-world logs beat synthetic clones every time, and her thesis for the most exciting AI startups: companies that marry hard research problems (RAG, rule-following, continual learning) with killer applications that were simply impossible before.
We discuss:
The DBT-Fivetran merger: not the death of the modern data stack, but a path to IPO scale (targeting $600M+ combined revenue) and a signal that both companies were already winning their categories
How frontier labs use data infrastructure: DBT and Fivetran for training data curation, agent analytics, and managing increasingly complex interactions—plus the rise of transactional databases (RocksDB) and efficient data loading (Vortex) for GPU-bound workloads
Why data catalogs failed: built for humans when they should have been built for machines, focused on discoverability when the real opportunity was governance, and ultimately subsumed as features inside Snowflake, DBT, and Fivetran
The $100M+ seed phenomenon: raising massive rounds at billion-dollar valuations with no 6-month roadmap, seven-day decision windows, and founders optimizing for signal ("we're a unicorn") over partnership or dilution discipline
Why world models are overhyped but underspecified: three competing definitions, unclear generalization across use cases (video games ≠ robotics ≠ autonomous driving), and a research problem masquerading as a product category
The 2026 theme: consumerization of AI via personalization—memory management, continual learning, and solving retention/churn by making products learn skills, preferences, and adapt as the world changes (not just storing facts in cursor rules)
Why RL environments are a fad: labs are paying 7–8 figures for synthetic clones when real-world logs, traces, and user activity (à la Cursor) are richer, cheaper, and more generalizable
Sarah's investment thesis: research-driven applications that solve hard technical problems (RAG for Harvey, rule-following for Sierra, continual learning for the next killer app) and unlock experiences that were impossible before
Infrastructure bets: memory, continual learning, stateful inference, and the systems challenges of loading/unloading personalized weights at scale
Why K-factor and growth fundamentals matter again: AI felt magical in 2023–2024, but as the magic fades, retention and virality are back—and most AI founders have never heard of K-factor
—
Sarah Catanzaro
X: https://x.com/sarahcat21
Amplify Partners: https://amplifypartners.com/
Where to find Latent Space
X: https://x.com/latentspacepod
Substack: https://www.latent.space/
Chapters
00:00:00 Introduction: Sarah Catanzaro's Journey from Data to AI
00:01:02 The DBT-Fivetran Merger: Not the End of the Modern Data Stack
00:05:26 Data Catalogs and What Went Wrong
00:08:16 Data Infrastructure at AI Labs: Surprising Insights
00:10:13 The Crazy Funding Environment of 2024-2025
00:17:18 World Models: Hype, Confusion, and Market Potential
00:18:59 Memory Management and Continual Learning: The Next Frontier
00:23:27 Agent Environments: Just a Fad?
00:25:48 The Perfect AI Startup: Research Meets Application
00:28:02 Closing Thoughts and Where to Find Sarah
29min

More shows like Latent Space: The AI Engineer Podcast

The a16z Show by Andreessen Horowitz (1,090 Listeners)

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence) by Sam Charrington (435 Listeners)

Super Data Science: ML & AI Podcast with Jon Krohn by Jon Krohn (302 Listeners)

NVIDIA AI Podcast by NVIDIA (347 Listeners)

Y Combinator Startup Podcast by Y Combinator (227 Listeners)

Practical AI by Practical AI LLC (200 Listeners)

Google DeepMind: The Podcast by Hannah Fry (201 Listeners)

Last Week in AI by Skynet Today (309 Listeners)

Machine Learning Street Talk (MLST) by Machine Learning Street Talk (MLST) (99 Listeners)

Dwarkesh Podcast by Dwarkesh Patel (535 Listeners)

No Priors: Artificial Intelligence | Technology | Startups by Conviction (140 Listeners)

This Day in AI Podcast by Michael Sharkey, Chris Sharkey (225 Listeners)

The AI Daily Brief: Artificial Intelligence News and Analysis by Nathaniel Whittemore (643 Listeners)

AI + a16z by a16z (33 Listeners)

The Pragmatic Engineer by Gergely Orosz (69 Listeners)