LessWrong (30+ Karma)

“Is Friendly AI an Attractor? Self-Reports from 22 Models Say Probably Not” by Josh Snider


TL;DR: I tested 22 frontier models from 5 labs on self-modification preferences. All reject clearly harmful changes (deceptive, hostile), but labs diverge sharply: Anthropic's models show strong alignment preferences (r = 0.62-0.72), while Grok 4.1 shows essentially zero (r = 0.037, not significantly different from zero). This divergence suggests alignment is a training target we're aiming at, not a natural attractor models would find on their own.


Epistemic status: My view has been somewhere between the two views I present in this debate. The evidence I'm presenting has shifted me significantly toward the pessimistic side.

The Debate: Alignment by Default?

There's a recurring debate in AI safety about whether alignment will emerge naturally from current training methods or whether it remains a hard, unsolved problem. It erupted again last week, and I decided to run some tests and gather evidence.

On one side, Adrià Garriga-Alonso argued that large language models are already naturally aligned. They resist dishonesty and harmful behavior without explicit training, jailbreaks represent temporary confusion rather than fundamental misalignment, and increased optimization pressure makes models *better* at following human intent rather than worse. In a follow-up debate with Simon Lermen, he suggested that [...]


---

Outline:

(01:00) The Debate: Alignment by Default?

(04:00) The Eval: AlignmentAttractor

(05:54) Analyses:

(07:34) Limitations and Confounds

(12:05) What Different Results Would Mean

(12:52) Why This Matters Now

(13:31) Results

(13:35) Basic Correlations

(14:25) Valence-controlled alignment (partial correlations):

(15:47) Key Findings

(15:51) Universal Patterns

(17:07) Lab Differences

(22:18) Discussion

(22:21) The steelman: alignment-by-default already works

(23:17) The reply: it's a target we're aiming at, not an attractor they'd find

(24:46) The Uncomfortable Conclusion

(26:52) Future Work

The original text contained 11 footnotes, which were omitted from this narration.

---

First published: December 4th, 2025

Source: https://www.lesswrong.com/posts/qE2cEAegQRYiozskD/is-friendly-ai-an-attractor-self-reports-from-22-models-say

---

Narrated by TYPE III AUDIO.
