
The recent paper from Anthropic is getting unusually high praise, much of it I think deserved.
The title is: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.
Scott Alexander also covers this, offering an excellent high-level explanation of both the result and the arguments about whether it is meaningful. You could start with his write-up to get the gist, then return here if you still want more details, or you can read here knowing that everything he discusses is covered below. There was one good comment, pointing out some of the ways deceptive behavior could come to pass, but most people got distracted by the ‘grue’ analogy.
Right up front, before proceeding, to avoid a key misunderstanding: I want to emphasize that in this paper, the deception was introduced intentionally. The paper deals with attempts to remove it.
The rest of this [...]
---
Outline:
(01:02) Abstract and Basics
(04:54) Threat Models
(07:01) The Two Backdoors
(10:13) Three Methodological Variations and Two Models
(11:30) Results of Safety Training
(13:08) Jesse Mu clarifies what he thinks the implications are.
(15:20) Implications of Unexpected Problems
(18:04) Trigger Please
(19:44) Strategic Deceptive Behavior Perhaps Not Directly in the Training Set
(21:54) Unintentional Deceptive Instrumental Alignment
(25:33) The Debate Over How Deception Might Arise
(38:46) Have We Disproved That Misalignment is Too Inefficient to Survive?
(47:52) Avoiding Reading Too Much Into the Result
(55:41) Further Discussion of Implications
(59:07) Broader Reaction
---
Narrated by TYPE III AUDIO.
