AXRP - the AI X-risk Research Podcast
By Daniel Filan
4.7 (77 ratings)
The podcast currently has 45 episodes available.
You may have heard of singular learning theory, and its "local learning coefficient", or LLC - but have you heard of the refined LLC? In this episode, I chat with Jesse Hoogland about his work on SLT, and using the refined LLC to find a new circuit in language models.
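For context (a standard result from singular learning theory rather than anything specific to this episode's notes), the learning coefficient is the coefficient of the log n term in Watanabe's asymptotic expansion of the Bayesian free energy; the "local" version evaluates it around a particular set of weights w*:

```latex
% Asymptotic expansion of the Bayesian free energy F_n on n samples
% around a local optimum w^* (Watanabe's singular learning theory).
% L_n(w^*) is the empirical loss and \lambda is the (local) learning coefficient;
% for regular models \lambda = d/2, but for singular models such as neural
% networks it can be much smaller, so it acts as an effective complexity measure.
F_n \;=\; n\,L_n(w^*) \;+\; \lambda \log n \;+\; O_p(\log \log n)
```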
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/11/27/38_2-jesse-hoogland-singular-learning-theory.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/
Topics we discuss, and timestamps:
00:34 - About Jesse
01:49 - The Alignment Workshop
02:31 - About Timaeus
05:25 - SLT that isn't developmental interpretability
10:41 - The refined local learning coefficient
14:06 - Finding the multigram circuit
Links:
Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient: https://arxiv.org/abs/2410.02984
Investigating the learning coefficient of modular addition: hackathon project: https://www.lesswrong.com/posts/4v3hMuKfsGatLXPgt/investigating-the-learning-coefficient-of-modular-addition
Episode art by Hamish Doodles: hamishdoodles.com
Road lines, street lights, and licence plates are examples of infrastructure used to ensure that roads operate smoothly. In this episode, Alan Chan talks about using similar interventions to help avoid bad outcomes from the deployment of AI agents.
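As a purely illustrative sketch of the "agent IDs" idea from the timestamps below (hypothetical names and fields throughout, not taken from the episode or the linked papers), one piece of agent infrastructure is an identifier attached to every action an agent takes in the world, much like a licence plate:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class AgentID:
    """Hypothetical identifier attached to an AI agent's outgoing requests."""
    agent_id: str      # unique ID for this agent instance
    deployer: str      # organization responsible for the deployment
    base_model: str    # underlying model, useful for incident attribution
    contact_url: str   # where to report problems caused by this agent

def tag_request(payload: dict, ident: AgentID) -> dict:
    """Attach agent-ID metadata to a request so counterparties can trace it."""
    return {"x-agent-id": json.dumps(asdict(ident)), **payload}

# Example: a booking request that can be traced back to its deployer.
ident = AgentID("agent-0042", "ExampleCorp", "some-llm-v1", "https://example.com/report")
print(tag_request({"action": "book_flight", "to": "SFO"}, ident))
```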
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/11/16/episode-38_1-alan-chan-agent-infrastructure.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/
Topics we discuss, and timestamps:
01:02 - How the Alignment Workshop is going
01:32 - Agent infrastructure
04:57 - Why agent infrastructure
07:54 - A trichotomy of agent infrastructure
13:59 - Agent IDs
18:17 - Agent channels
20:29 - Relation to AI control
Links:
Alan on Google Scholar: https://scholar.google.com/citations?user=lmQmYPgAAAAJ&hl=en&oi=ao
IDs for AI Systems: https://arxiv.org/abs/2406.12137
Visibility into AI Agents: https://arxiv.org/abs/2401.13138
Episode art by Hamish Doodles: hamishdoodles.com
Do language models understand the causal structure of the world, or do they merely note correlations? And what happens when you build a big AI society out of them? In this brief episode, recorded at the Bay Area Alignment Workshop, I chat with Zhijing Jin about her research on these questions.
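As a toy illustration of the correlation-versus-causation distinction (my own example, not from the episode): a shared cause makes two variables correlate strongly even though neither causes the other, and an intervention reveals the difference.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Confounder Z causes both X and Y; X and Y have no direct causal link.
z = rng.normal(size=n)
x = z + 0.5 * rng.normal(size=n)
y = z + 0.5 * rng.normal(size=n)

print("corr(X, Y):", round(np.corrcoef(x, y)[0, 1], 3))  # ~0.8, despite no X->Y edge

# Intervening on X (setting it independently of Z) breaks the correlation,
# which is the kind of distinction a purely correlational learner can miss.
x_do = rng.normal(size=n)
print("corr(do(X), Y):", round(np.corrcoef(x_do, y)[0, 1], 3))  # ~0.0
```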
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/11/14/episode-38_0-zhijing-jin-llms-causality-multi-agent-systems.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/
Topics we discuss, and timestamps:
00:35 - How the Alignment Workshop is going
00:47 - How Zhijing got interested in causality and natural language processing
03:14 - Causality and alignment
06:21 - Causality without randomness
10:07 - Causal abstraction
11:42 - Why LLM causal reasoning?
13:20 - Understanding LLM causal reasoning
16:33 - Multi-agent systems
Links:
Zhijing's website: https://zhijing-jin.com/fantasy/
Zhijing on X (aka Twitter): https://x.com/zhijingjin
Can Large Language Models Infer Causation from Correlation?: https://arxiv.org/abs/2306.05836
Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents: https://arxiv.org/abs/2404.16698
Episode art by Hamish Doodles: hamishdoodles.com
Epoch AI is the premier organization that tracks the trajectory of AI - how much compute is used, the role of algorithmic improvements, the growth in data used, and when these trends might come to an end. In this episode, I speak with the director of Epoch AI, Jaime Sevilla, about how compute, data, and algorithmic improvements are impacting AI, and whether continuing to scale can get us AGI.
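For a sense of what a 4-5x-per-year growth rate (the figure in the Epoch AI post linked below) means when compounded, a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope: how much training compute grows if the trend in the
# linked Epoch AI post (roughly 4-5x per year) continues for a few years.
for growth_per_year in (4.0, 5.0):
    for years in (1, 3, 5):
        multiplier = growth_per_year ** years
        print(f"{growth_per_year:.0f}x/year for {years} year(s): ~{multiplier:,.0f}x total")
# At 4x/year the total is ~1,000x after 5 years; at 5x/year, ~3,000x.
```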
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/10/04/episode-37-jaime-sevilla-forecasting-ai.html
Topics we discuss, and timestamps:
0:00:38 - The pace of AI progress
0:07:49 - How Epoch AI tracks AI compute
0:11:44 - Why does AI compute grow so smoothly?
0:21:46 - When will we run out of computers?
0:38:56 - Algorithmic improvement
0:44:21 - Algorithmic improvement and scaling laws
0:56:56 - Training data
1:04:56 - Can scaling produce AGI?
1:16:55 - When will AGI arrive?
1:21:20 - Epoch AI
1:27:06 - Open questions in AI forecasting
1:35:21 - Epoch AI and x-risk
1:41:34 - Following Epoch AI's research
Links for Jaime and Epoch AI:
Epoch AI: https://epochai.org/
Machine Learning Trends dashboard: https://epochai.org/trends
Epoch AI on X / Twitter: https://x.com/EpochAIResearch
Jaime on X / Twitter: https://x.com/Jsevillamol
Research we discuss:
Training Compute of Frontier AI Models Grows by 4-5x per Year: https://epochai.org/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year
Optimally Allocating Compute Between Inference and Training: https://epochai.org/blog/optimally-allocating-compute-between-inference-and-training
Algorithmic Progress in Language Models [blog post]: https://epochai.org/blog/algorithmic-progress-in-language-models
Algorithmic progress in language models [paper]: https://arxiv.org/abs/2403.05812
Training Compute-Optimal Large Language Models [aka the Chinchilla scaling law paper]: https://arxiv.org/abs/2203.15556
Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data [blog post]: https://epochai.org/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data
Will we run out of data? Limits of LLM scaling based on human-generated data [paper]: https://arxiv.org/abs/2211.04325
The Direct Approach: https://epochai.org/blog/the-direct-approach
Episode art by Hamish Doodles: hamishdoodles.com
Sometimes, people talk about transformers as having "world models" as a result of being trained to predict text data on the internet. But what does this even mean? In this episode, I talk with Adam Shai and Paul Riechers about their work applying computational mechanics, a sub-field of physics studying how to predict random processes, to neural networks.
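For readers new to the topic: the optimal predictor of a hidden Markov process maintains a belief distribution over the hidden states, and the linked paper argues transformers represent the geometry of these belief states. A minimal sketch of that belief update (my own illustration, not the paper's code):

```python
import numpy as np

# Hidden Markov model: T[x] gives P(next_state, emit x | current_state),
# i.e. observation-labelled transition matrices. Rows of T[0] + T[1] sum to 1.
T = {
    0: np.array([[0.5, 0.0], [0.0, 0.1]]),
    1: np.array([[0.0, 0.5], [0.4, 0.5]]),
}

def update_belief(belief: np.ndarray, obs: int) -> np.ndarray:
    """Bayesian filtering step: new belief over hidden states given one observation."""
    unnormalized = belief @ T[obs]
    return unnormalized / unnormalized.sum()

belief = np.array([0.5, 0.5])  # uniform prior over the two hidden states
for obs in [0, 1, 1, 0, 1]:
    belief = update_belief(belief, obs)
    print(obs, belief.round(3))
# The set of beliefs reachable this way (the "mixed-state presentation") is what
# can trace out fractal geometry, and what the paper looks for in the residual stream.
```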
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/09/29/episode-36-adam-shai-paul-riechers-computational-mechanics.html
Topics we discuss, and timestamps:
0:00:42 - What computational mechanics is
0:29:49 - Computational mechanics vs other approaches
0:36:16 - What world models are
0:48:41 - Fractals
0:57:43 - How the fractals are formed
1:09:55 - Scaling computational mechanics for transformers
1:21:52 - How Adam and Paul found computational mechanics
1:36:16 - Computational mechanics for AI safety
1:46:05 - Following Adam and Paul's research
Simplex AI Safety: https://www.simplexaisafety.com/
Research we discuss:
Transformers represent belief state geometry in their residual stream: https://arxiv.org/abs/2405.15943
Transformers represent belief state geometry in their residual stream [LessWrong post]: https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transformers-represent-belief-state-geometry-in-their
Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer: https://www.lesswrong.com/posts/mBw7nc4ipdyeeEpWs/why-would-belief-states-have-a-fractal-structure-and-why
Episode art by Hamish Doodles: hamishdoodles.com
Patreon: https://www.patreon.com/axrpodcast
MATS: https://www.matsprogram.org
Note: I'm employed by MATS, but they're not paying me to make this video.
How do we figure out what large language models believe? In fact, do they even have beliefs? Do those beliefs have locations, and if so, can we edit those locations to change the beliefs? Also, how are we going to get AI to perform tasks so hard that we can't figure out whether it succeeded at them? In this episode, I chat with Peter Hase about his research into these questions.
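One crude way to operationalize "what a model believes" (a hedged illustration of the general idea, not the localization or editing methods discussed in the episode) is to compare the probability a model assigns to "True" versus "False" after a statement:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Crude belief probe: the relative probability of " True" vs " False" as the
# next token after a statement. Illustrative only.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def belief_score(statement: str) -> float:
    prompt = f"Question: Is the following statement true or false?\n{statement}\nAnswer:"
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]   # next-token logits
    true_id = tok.encode(" True")[0]
    false_id = tok.encode(" False")[0]
    probs = torch.softmax(logits[[true_id, false_id]], dim=0)
    return probs[0].item()                       # P(True | answer is True or False)

print(belief_score("Paris is the capital of France."))
print(belief_score("Paris is the capital of Germany."))
```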
Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/08/24/episode-35-peter-hase-llm-beliefs-easy-to-hard-generalization.html
Topics we discuss, and timestamps:
0:00:36 - NLP and interpretability
0:10:20 - Interpretability lessons
0:32:22 - Belief interpretability
1:00:12 - Localizing and editing models' beliefs
1:19:18 - Beliefs beyond language models
1:27:21 - Easy-to-hard generalization
1:47:16 - What do easy-to-hard results tell us?
1:57:33 - Easy-to-hard vs weak-to-strong
2:03:50 - Different notions of hardness
2:13:01 - Easy-to-hard vs weak-to-strong, round 2
2:15:39 - Following Peter's work
Peter on Twitter: https://x.com/peterbhase
Peter's papers:
Foundational Challenges in Assuring Alignment and Safety of Large Language Models: https://arxiv.org/abs/2404.09932
Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs: https://arxiv.org/abs/2111.13654
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models: https://arxiv.org/abs/2301.04213
Are Language Models Rational? The Case of Coherence Norms and Belief Revision: https://arxiv.org/abs/2406.03442
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks: https://arxiv.org/abs/2401.06751
Other links:
Toy Models of Superposition: https://transformer-circuits.pub/2022/toy_model/index.html
Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV): https://arxiv.org/abs/1711.11279
Locating and Editing Factual Associations in GPT (aka the ROME paper): https://arxiv.org/abs/2202.05262
Of nonlinearity and commutativity in BERT: https://arxiv.org/abs/2101.04547
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model: https://arxiv.org/abs/2306.03341
Editing a classifier by rewriting its prediction rules: https://arxiv.org/abs/2112.01008
Discovering Latent Knowledge Without Supervision (aka the Collin Burns CCS paper): https://arxiv.org/abs/2212.03827
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision: https://arxiv.org/abs/2312.09390
Concrete problems in AI safety: https://arxiv.org/abs/1606.06565
Rissanen Data Analysis: Examining Dataset Characteristics via Description Length: https://arxiv.org/abs/2103.03872
Episode art by Hamish Doodles: hamishdoodles.com
How can we figure out if AIs are capable enough to pose a threat to humans? When should we make a big effort to mitigate risks of catastrophic AI misbehaviour? In this episode, I chat with Beth Barnes, founder of and head of research at METR, about these questions and more.
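As a generic sketch of what a capability evaluation loop looks like in the abstract (hypothetical code; METR's actual task standard, linked below, is considerably more involved):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """A hypothetical eval task: instructions plus an automatic success checker."""
    name: str
    instructions: str
    check_success: Callable[[str], bool]  # inspects the agent's final answer/artifact

def run_eval(agent: Callable[[str], str], tasks: list[Task]) -> dict[str, bool]:
    """Run an agent on each task and record whether it succeeded."""
    return {t.name: t.check_success(agent(t.instructions)) for t in tasks}

# Toy example: an "agent" that just echoes its instructions fails the task.
tasks = [Task("count", "Reply with the number of letters in 'risk'.",
              lambda out: "4" in out)]
print(run_eval(lambda instructions: instructions, tasks))  # {'count': False}
print(run_eval(lambda instructions: "4", tasks))           # {'count': True}
```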
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/07/28/episode-34-ai-evaluations-beth-barnes.html
Topics we discuss, and timestamps:
0:00:37 - What is METR?
0:02:44 - What is an "eval"?
0:14:42 - How good are evals?
0:37:25 - Are models showing their full capabilities?
0:53:25 - Evaluating alignment
1:01:38 - Existential safety methodology
1:12:13 - Threat models and capability buffers
1:38:25 - METR's policy work
1:48:19 - METR's relationships with labs
2:04:12 - Related research
2:10:02 - Roles at METR, and following METR's work
Links for METR:
METR: https://metr.org
METR Task Development Guide - Bounty: https://taskdev.metr.org/bounty/
METR - Hiring: https://metr.org/hiring
Autonomy evaluation resources: https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/
Other links:
Update on ARC's recent eval efforts (contains the GPT-4 TaskRabbit captcha story): https://metr.org/blog/2023-03-18-update-on-recent-evals/
Password-locked models: a stress case for capabilities evaluation: https://www.alignmentforum.org/posts/rZs6ddqNnW8LXuJqA/password-locked-models-a-stress-case-for-capabilities
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training: https://arxiv.org/abs/2401.05566
Untrusted smart models and trusted dumb models: https://www.alignmentforum.org/posts/LhxHcASQwpNa3mRNk/untrusted-smart-models-and-trusted-dumb-models
AI companies aren't really using external evaluators: https://www.lesswrong.com/posts/WjtnvndbsHxCnFNyc/ai-companies-aren-t-really-using-external-evaluators
Nobody Knows How to Safety-Test AI (Time): https://time.com/6958868/artificial-intelligence-safety-evaluations-risks/
ChatGPT can talk, but OpenAI employees sure can’t: https://www.vox.com/future-perfect/2024/5/17/24158478/openai-departures-sam-altman-employees-chatgpt-release
Leaked OpenAI documents reveal aggressive tactics toward former employees: https://www.vox.com/future-perfect/351132/openai-vested-equity-nda-sam-altman-documents-employees
Beth on her non-disparagement agreement with OpenAI: https://www.lesswrong.com/posts/yRWv5kkDD4YhzwRLq/non-disparagement-canaries-for-openai?commentId=MrJF3tWiKYMtJepgX
Sam Altman's statement on OpenAI equity: https://x.com/sama/status/1791936857594581428
Episode art by Hamish Doodles: hamishdoodles.com
Reinforcement Learning from Human Feedback, or RLHF, is one of the main ways that makers of large language models make them 'aligned'. But people have long noted that there are difficulties with this approach when the models are smarter than the humans providing feedback. In this episode, I talk with Scott Emmons about his work categorizing the problems that can show up in this setting.
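To make the difficulty concrete, here is a hedged formalization of the general partial-observability setup (not necessarily the paper's exact notation): the human only sees observations of the trajectory, so the feedback signal optimizes the return the human believes occurred rather than the true return.

```latex
% Hedged sketch of reward learning under partial observability.
% \tau: trajectory of states;  O(\tau): what the human actually observes;
% R: true return;  \hat{R}: the human's estimate given only the observations.
\text{RLHF target:}\quad
J_{\text{obs}}(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\!\left[\hat{R}\big(O(\tau)\big)\right]
\;\neq\;
\mathbb{E}_{\tau \sim \pi}\!\left[R(\tau)\right] \;=\; J_{\text{true}}(\pi).
% The gap invites policies that make O(\tau) look better than reality
% ("deceptive inflation") or that spend effort making good behaviour
% visible to the evaluator ("overjustification").
```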
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
The transcript: https://axrp.net/episode/2024/06/12/episode-33-rlhf-problems-scott-emmons.html
Topics we discuss, and timestamps:
0:00:33 - Deceptive inflation
0:17:56 - Overjustification
0:32:48 - Bounded human rationality
0:50:46 - Avoiding these problems
1:14:13 - Dimensional analysis
1:23:32 - RLHF problems, in theory and practice
1:31:29 - Scott's research program
1:39:42 - Following Scott's research
Scott's website: https://www.scottemmons.com
Scott's X/twitter account: https://x.com/emmons_scott
When Your AIs Deceive You: Challenges With Partial Observability of Human Evaluators in Reward Learning: https://arxiv.org/abs/2402.17747
Other works we discuss:
AI Deception: A Survey of Examples, Risks, and Potential Solutions: https://arxiv.org/abs/2308.14752
Uncertain decisions facilitate better preference learning: https://arxiv.org/abs/2106.10394
Invariance in Policy Optimisation and Partial Identifiability in Reward Learning: https://arxiv.org/abs/2203.07475
The Humble Gaussian Distribution (aka principal component analysis and dimensional analysis): http://www.inference.org.uk/mackay/humble.pdf
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!: https://arxiv.org/abs/2310.03693
Episode art by Hamish Doodles: hamishdoodles.com
What's the difference between a large language model and the human brain? And what's wrong with our theories of agency? In this episode, I chat about these questions with Jan Kulveit, who leads the Alignment of Complex Systems research group.
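For reference, the central quantity in active inference (a standard textbook statement rather than anything specific to this conversation): agents are modelled as minimizing variational free energy, an upper bound on surprisal.

```latex
% Variational free energy for observations o, hidden states s,
% generative model p(o, s), and approximate posterior q(s).
F[q] = \mathbb{E}_{q(s)}\big[\log q(s) - \log p(o, s)\big]
     = D_{\mathrm{KL}}\!\big(q(s)\,\|\,p(s \mid o)\big) - \log p(o)
     \;\geq\; -\log p(o).
% Perception: adjust q to reduce F.  Action: change o (the world) to reduce F.
```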
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
The transcript: axrp.net/episode/2024/05/30/episode-32-understanding-agency-jan-kulveit.html
Topics we discuss, and timestamps:
0:00:47 - What is active inference?
0:15:14 - Preferences in active inference
0:31:33 - Action vs perception in active inference
0:46:07 - Feedback loops
1:01:32 - Active inference vs LLMs
1:12:04 - Hierarchical agency
1:58:28 - The Alignment of Complex Systems group
Website of the Alignment of Complex Systems group (ACS): acsresearch.org
ACS on X/Twitter: x.com/acsresearchorg
Jan on LessWrong: lesswrong.com/users/jan-kulveit
Predictive Minds: Large Language Models as Atypical Active Inference Agents: arxiv.org/abs/2311.10215
Other works we discuss:
Active Inference: The Free Energy Principle in Mind, Brain, and Behavior: https://www.goodreads.com/en/book/show/58275959
Book Review: Surfing Uncertainty: https://slatestarcodex.com/2017/09/05/book-review-surfing-uncertainty/
The self-unalignment problem: https://www.lesswrong.com/posts/9GyniEBaN3YYTqZXn/the-self-unalignment-problem
Mitigating generative agent social dilemmas (aka language models writing contracts for Minecraft): https://social-dilemmas.github.io/
Episode art by Hamish Doodles: hamishdoodles.com