
Reasoning models like Deepseek r1:
If you had told this to my 2022 self without specifying anything else about scheming models, I might have put a non-negligible probability on such AIs scheming (i.e. strategically performing well in training in order to protect their long-term goals).
Despite this, the scratchpads of current reasoning models do not contain traces of scheming in regular training environments, even when there is no harmlessness pressure on the scratchpads, as with Deepseek-r1-Zero.
In this post, I argue that:
---
Outline:
(04:08) Classic reasons to expect AIs to not be schemers
(04:14) Speed priors
(06:11) Preconditions for scheming not being met
(08:27) There are indirect pressures against scheming on intermediate steps of reasoning
(09:07) Human priors on intermediate steps of reasoning
(11:43) Correlation between short and long reasoning
(13:07) Other pressures
(13:48) Rewards are not so cursed as to strongly incentivize scheming
(13:54) Maximizing rewards teaches you things mostly independent of scheming
(14:46) Using situational awareness to get higher reward is hard
(16:45) Maximizing rewards doesn't push you far away from the human prior
(18:07) Will it be different for future rewards?
(19:32) Meta-level update and conclusion
The original text contained 1 footnote which was omitted from this narration.
---
Narrated by TYPE III AUDIO.
By LessWrong
