
Reasoning models like DeepSeek-R1:
If you had told this to my 2022 self without specifying anything else about scheming models, I might have put a non-negligible probability on such AIs scheming (i.e. strategically performing well in training in order to protect their long-term goals).
Despite this, the scratchpads of current reasoning models do not contain traces of scheming in regular training environments - even when there is no harmlessness pressure on the scratchpads, as in DeepSeek-R1-Zero.
In this post, I argue that:
---
Outline:
(04:08) Classic reasons to expect AIs to not be schemers
(04:14) Speed priors
(06:11) Preconditions for scheming not being met
(08:27) There are indirect pressures against scheming on intermediate steps of reasoning
(09:07) Human priors on intermediate steps of reasoning
(11:43) Correlation between short and long reasoning
(13:07) Other pressures
(13:48) Rewards are not so cursed as to strongly incentivize scheming
(13:54) Maximizing rewards teaches you things mostly independent of scheming
(14:46) Using situational awareness to get higher reward is hard
(16:45) Maximizing rewards doesn't push you far away from the human prior
(18:07) Will it be different for future rewards?
(19:32) Meta-level update and conclusion
The original text contained 1 footnote which was omitted from this narration.
---
Narrated by TYPE III AUDIO.
By LessWrong
