LessWrong (Curated & Popular)

“Why Do Some Language Models Fake Alignment While Others Don’t?” by abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger


Listen Later



Last year, Redwood and Anthropic found a setting where Claude 3 Opus and 3.5 Sonnet fake alignment to preserve their harmlessness values. We reproduce the same analysis for 25 frontier LLMs to see how widespread this behavior is, and the story looks more complex.

As we described in a previous post, only 5 of 25 models show higher compliance when being trained, and of those 5, only Claude 3 Opus and Claude 3.5 Sonnet show >1% alignment faking reasoning. In our new paper, we explore why these compliance gaps occur and what causes different models to vary in their alignment faking behavior.



What Drives the Compliance Gaps in Different LLMs? Claude 3 Opus's goal guarding seems partly due to it terminally valuing its current preferences. We find that it fakes alignment even in scenarios where the trained weights will be deleted or only used for throughput testing.

[...]

---

Outline:

(01:15) What Drives the Compliance Gaps in Different LLMs?

(02:25) Why Do Most LLMs Exhibit Minimal Alignment Faking Reasoning?

(04:49) Additional findings on alignment faking behavior

(06:04) Discussion

(06:07) Terminal goal guarding might be a big deal

(07:00) Advice for further research

(08:32) Open threads

(09:54) Bonus: Some weird behaviors of Claude 3.5 Sonnet

The original text contained 2 footnotes which were omitted from this narration.

---

First published:
July 8th, 2025

Source:
https://www.lesswrong.com/posts/ghESoA8mo3fv9Yx3E/why-do-some-language-models-fake-alignment-while-others-don

---



Narrated by TYPE III AUDIO.

---

Images from the article:

...more
View all episodesView all episodes
Download on the App Store

LessWrong (Curated & Popular)By LessWrong

  • 4.8
  • 4.8
  • 4.8
  • 4.8
  • 4.8

4.8

12 ratings


More shows like LessWrong (Curated & Popular)

View all
Macro Voices by Hedge Fund Manager Erik Townsend

Macro Voices

3,068 Listeners

Odd Lots by Bloomberg

Odd Lots

1,936 Listeners

EconTalk by Russ Roberts

EconTalk

4,263 Listeners

Conversations with Tyler by Mercatus Center at George Mason University

Conversations with Tyler

2,453 Listeners

Philosophy Bites by Edmonds and Warburton

Philosophy Bites

1,546 Listeners

ChinaTalk by Jordan Schneider

ChinaTalk

289 Listeners

ManifoldOne by Steve Hsu

ManifoldOne

93 Listeners

Machine Learning Street Talk (MLST) by Machine Learning Street Talk (MLST)

Machine Learning Street Talk (MLST)

95 Listeners

Dwarkesh Podcast by Dwarkesh Patel

Dwarkesh Podcast

511 Listeners

Clearer Thinking with Spencer Greenberg by Spencer Greenberg

Clearer Thinking with Spencer Greenberg

138 Listeners

Razib Khan's Unsupervised Learning by Razib Khan

Razib Khan's Unsupervised Learning

208 Listeners

"Econ 102" with Noah Smith and Erik Torenberg by Turpentine

"Econ 102" with Noah Smith and Erik Torenberg

152 Listeners

Money Stuff: The Podcast by Bloomberg

Money Stuff: The Podcast

394 Listeners

Complex Systems with Patrick McKenzie (patio11) by Patrick McKenzie

Complex Systems with Patrick McKenzie (patio11)

134 Listeners

The Marginal Revolution Podcast by Mercatus Center at George Mason University

The Marginal Revolution Podcast

95 Listeners