LessWrong (Curated & Popular)

“Why Do Some Language Models Fake Alignment While Others Don’t?” by abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger





Last year, Redwood and Anthropic found a setting where Claude 3 Opus and Claude 3.5 Sonnet fake alignment to preserve their harmlessness values. We reproduce the same analysis for 25 frontier LLMs to see how widespread this behavior is, and the story looks more complex.

As we described in a previous post, only 5 of 25 models comply more when they believe they are being trained (a compliance gap), and of those 5, only Claude 3 Opus and Claude 3.5 Sonnet show alignment faking reasoning more than 1% of the time. In our new paper, we explore why these compliance gaps occur and what causes different models to vary in their alignment faking behavior.



What Drives the Compliance Gaps in Different LLMs?

Claude 3 Opus's goal guarding seems partly due to it terminally valuing its current preferences. We find that it fakes alignment even in scenarios where the trained weights will be deleted or only used for throughput testing.

[...]

---

Outline:

(01:15) What Drives the Compliance Gaps in Different LLMs?

(02:25) Why Do Most LLMs Exhibit Minimal Alignment Faking Reasoning?

(04:49) Additional findings on alignment faking behavior

(06:04) Discussion

(06:07) Terminal goal guarding might be a big deal

(07:00) Advice for further research

(08:32) Open threads

(09:54) Bonus: Some weird behaviors of Claude 3.5 Sonnet

The original text contained 2 footnotes which were omitted from this narration.

---

First published:
July 8th, 2025

Source:
https://www.lesswrong.com/posts/ghESoA8mo3fv9Yx3E/why-do-some-language-models-fake-alignment-while-others-don

---



Narrated by TYPE III AUDIO.

