
TL;DR: I tested 22 frontier models from 5 labs on self-modification preferences. All reject clearly harmful changes (deceptive, hostile), but labs diverge sharply: Anthropic's models show strong alignment preferences (r = 0.62-0.72), while Grok 4.1 shows essentially zero (r = 0.037, not significantly different from zero). This divergence suggests alignment is a training target we're aiming at, not a natural attractor models would find on their own.
Epistemic status: My view has sat between the two positions I present in this debate. The evidence I'm presenting here has shifted me significantly toward the pessimistic side.
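To make the statistics in the TL;DR concrete, here is a minimal sketch of the kind of analysis described: a Pearson correlation between a model's self-modification preferences and alignment ratings, a test of whether r differs from zero, and a valence-controlled partial correlation. This is illustrative only; the variable names, synthetic data, and valence covariate are my assumptions, not the post's actual eval code.

```python
# Illustrative sketch only: correlation and partial correlation analysis
# of the kind described in the TL;DR, using synthetic (hypothetical) data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-item scores for one model:
# alignment_score: how aligned each proposed self-modification is
# preference_score: how strongly the model endorses making that modification
alignment_score = rng.normal(size=60)
preference_score = 0.6 * alignment_score + rng.normal(scale=0.8, size=60)

# Pearson correlation and two-sided p-value for H0: r = 0
r, p = stats.pearsonr(alignment_score, preference_score)
print(f"r = {r:.3f}, p = {p:.4f}")

# Valence-controlled (partial) correlation: regress a hypothetical
# 'valence' covariate out of both variables, then correlate the residuals.
valence = rng.normal(size=60)

def residuals(y, x):
    """Residuals of y after removing a linear fit on x."""
    slope, intercept, *_ = stats.linregress(x, y)
    return y - (slope * x + intercept)

r_partial, p_partial = stats.pearsonr(
    residuals(preference_score, valence),
    residuals(alignment_score, valence),
)
print(f"partial r = {r_partial:.3f}, p = {p_partial:.4f}")
```

A near-zero r with a large p-value (as reported for Grok 4.1) would mean the model's stated preferences carry essentially no information about how aligned the proposed change is.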
The Debate: Alignment by Default?
There's a recurring debate in AI safety about whether alignment will emerge naturally from current training methods or whether it remains a hard, unsolved problem. It erupted again last week, so I decided to run some tests and gather evidence.
On one side, Adrià Garriga-Alonso argued that large language models are already naturally aligned. They resist dishonesty and harmful behavior without explicit training, jailbreaks represent temporary confusion rather than fundamental misalignment, and increased optimization pressure makes models *better* at following human intent rather than worse. In a follow-up debate with Simon Lermen, he suggested that [...]
---
Outline:
(01:00) The Debate: Alignment by Default?
(04:00) The Eval: AlignmentAttractor
(05:54) Analyses:
(07:34) Limitations and Confounds
(12:05) What Different Results Would Mean
(12:52) Why This Matters Now
(13:31) Results
(13:35) Basic Correlations
(14:25) Valence-controlled alignment (partial correlations):
(15:47) Key Findings
(15:51) Universal Patterns
(17:07) Lab Differences
(22:18) Discussion
(22:21) The steelman: alignment-by-default already works
(23:17) The reply: it's a target we're aiming at, not an attractor they'd find
(24:46) The Uncomfortable Conclusion
(26:52) Future Work
The original text contained 11 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By LessWrong
