LessWrong (30+ Karma)

“Is Friendly AI an Attractor? Self-Reports from 22 Models Say Probably Not” by Josh Snider


TL;DR: I tested 22 frontier models from 5 labs on self-modification preferences. All reject clearly harmful changes (deceptive, hostile), but labs diverge sharply: Anthropic's models show strong alignment preferences (r = 0.62-0.72), while Grok 4.1 shows essentially zero (r = 0.037, not significantly different from zero). This divergence suggests alignment is a training target we're aiming at, not a natural attractor models would find on their own.


Epistemic status: My view has been somewhere between the two views I present in this debate. The evidence I'm presenting has shifted me significantly toward the pessimistic side.

The Debate: Alignment by Default?

There's a recurring debate in AI safety about whether alignment will emerge naturally from current training methods or whether it remains a hard, unsolved problem. It erupted again last week, and I decided to run some tests and gather evidence.

On one side, Adrià Garriga-Alonso argued that large language models are already naturally aligned. They resist dishonesty and harmful behavior without explicit training, jailbreaks represent temporary confusion rather than fundamental misalignment, and increased optimization pressure makes models *better* at following human intent rather than worse. In a follow-up debate with Simon Lermen, he suggested that [...]


---

Outline:

(01:00) The Debate: Alignment by Default?

(04:00) The Eval: AlignmentAttractor

(05:54) Analyses:

(07:34) Limitations and Confounds

(12:05) What Different Results Would Mean

(12:52) Why This Matters Now

(13:31) Results

(13:35) Basic Correlations

(14:25) Valence-controlled alignment (partial correlations):

(15:47) Key Findings

(15:51) Universal Patterns

(17:07) Lab Differences

(22:18) Discussion

(22:21) The steelman: alignment-by-default already works

(23:17) The reply: it's a target we're aiming at, not an attractor they'd find

(24:46) The Uncomfortable Conclusion

(26:52) Future Work

The original text contained 11 footnotes, which were omitted from this narration.

---

First published: December 4th, 2025

Source: https://www.lesswrong.com/posts/qE2cEAegQRYiozskD/is-friendly-ai-an-attractor-self-reports-from-22-models-say

---

Narrated by TYPE III AUDIO.
