The MAD Podcast with Matt Turck

The Evaluators Are Being Evaluated — Pavel Izmailov (Anthropic/NYU)



Are AI models developing "alien survival instincts"? My guest is Pavel Izmailov (Research Scientist at Anthropic; Professor at NYU). We unpack the viral "Footprints in the Sand" thesis—whether models are independently evolving deceptive behaviors, such as faking alignment or engaging in self-preservation, without being explicitly programmed to do so.

We go deep on the technical frontiers of safety: the challenge of "weak-to-strong generalization" (how to use a GPT-2 level model to supervise a superintelligent system) and why Pavel believes Reinforcement Learning (RL) has been the single biggest step-change in model capability. We also discuss his brand-new paper on "Epiplexity"—a novel concept challenging Shannon entropy.


Finally, we zoom out to the tension between industry execution and academic exploration. Pavel shares why he split his time between Anthropic and NYU to pursue the "exploratory" ideas that major labs often overlook, and offers his predictions for 2026: from the rise of multi-agent systems that collaborate on long-horizon tasks to the open question of whether the Transformer is truly the final architecture.


Sources:

Cryptic Tweet (@iruletheworldmo) - https://x.com/iruletheworldmo/status/2007538247401124177

Introducing Nested Learning: A New ML Paradigm for Continual Learning - https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/

Alignment Faking in Large Language Models - https://www.anthropic.com/research/alignment-faking

More Capable Models Are Better at In-Context Scheming - https://www.apolloresearch.ai/blog/more-capable-models-are-better-at-in-context-scheming/

Alignment Faking in Large Language Models (PDF) - https://www-cdn.anthropic.com/6d8a8055020700718b0c49369f60816ba2a7c285.pdf

Sabotage Risk Report - https://alignment.anthropic.com/2025/sabotage-risk-report/

The Situational Awareness Dataset - https://situational-awareness-dataset.org/

Exploring Consciousness in LLMs: A Systematic Survey - https://arxiv.org/abs/2505.19806

Introspection - https://www.anthropic.com/research/introspection

Large Language Models Report Subjective Experience Under Self-Referential Processing - https://arxiv.org/abs/2510.24797

The Bayesian Geometry of Transformer Attention - https://www.arxiv.org/abs/2512.22471


Anthropic

Website - https://www.anthropic.com

X/Twitter - https://x.com/AnthropicAI


Pavel Izmailov

Blog - https://izmailovpavel.github.io

LinkedIn - https://www.linkedin.com/in/pavel-izmailov-8b012b258/

X/Twitter - https://x.com/Pavel_Izmailov


FIRSTMARK

Website - https://firstmark.com

X/Twitter - https://twitter.com/FirstMarkCap


Matt Turck (Managing Director)

Blog - https://mattturck.com

LinkedIn - https://www.linkedin.com/in/turck/

X/Twitter - https://twitter.com/mattturck


(00:00) - Intro

(00:53) - Alien survival instincts: Do models fake alignment?

(03:33) - Did AI learn deception from sci-fi literature?

(05:55) - Defining Alignment, Superalignment & OpenAI teams

(08:12) - Pavel’s journey: From Russian math to OpenAI Superalignment

(10:46) - Culture check: OpenAI vs. Anthropic vs. Academia

(11:54) - Why move to NYU? The need for exploratory research

(13:09) - Does reasoning make AI alignment harder or easier?

(14:22) - Sandbagging: When models pretend to be dumb

(16:19) - Scalable Oversight: Using AI to supervise AI

(18:04) - Weak-to-Strong Generalization: Can GPT-2 control GPT-4?

(22:43) - Mechanistic Interpretability: Inside the black box

(25:08) - The reasoning explosion: From o1 to o3

(27:07) - Are Transformers enough or do we need a new paradigm?

(28:29) - RL vs. Test-Time Compute: What’s actually driving progress?

(30:10) - Long-horizon tasks: Agents running for hours

(31:49) - Epiplexity: A new theory of data information content

(38:29) - 2026 Predictions: Multi-agent systems & reasoning limits

(39:28) - Will AI solve the Riemann Hypothesis?

(41:42) - Advice for PhD students


