LessWrong (30+ Karma)

“Thinking about reasoning models made me less worried about scheming” by Fabien Roger


Reasoning models like DeepSeek-R1:

  • Can reason in consequentialist ways and have vast knowledge about AI training
  • Can reason for many serial steps, with enough slack to think about takeover plans
  • Sometimes reward hack

If you had told this to my 2022 self without specifying anything else about scheming models, I might have put a non-negligible probability on such AIs scheming (i.e. strategically performing well in training in order to protect their long-term goals).

Despite this, the scratchpads of current reasoning models do not contain traces of scheming in regular training environments, even when there is no harmlessness pressure on the scratchpads (as in DeepSeek-R1-Zero).

In this post, I argue that:

  • Classic explanations for the absence of scheming (in non-wildly superintelligent AIs), like the ones listed in Joe Carlsmith's scheming report, only partially rule out scheming in models like DeepSeek-R1;
  • There are other explanations for why DeepSeek-R1 doesn't scheme that are often absent from past armchair reasoning about scheming:
    • The human-like pretraining prior is mostly benign and applies to some intermediate steps of reasoning: it puts a very low probability on helpful-but-scheming agents doing things like trying very hard to solve math and [...]

---

Outline:

(04:08) Classic reasons to expect AIs to not be schemers

(04:14) Speed priors

(06:11) Preconditions for scheming not being met

(08:27) There are indirect pressures against scheming on intermediate steps of reasoning

(09:07) Human priors on intermediate steps of reasoning

(11:43) Correlation between short and long reasoning

(13:07) Other pressures

(13:48) Rewards are not so cursed as to strongly incentivize scheming

(13:54) Maximizing rewards teaches you things mostly independent of scheming

(14:46) Using situational awareness to get higher reward is hard

(16:45) Maximizing rewards doesn't push you far away from the human prior

(18:07) Will it be different for future rewards?

(19:32) Meta-level update and conclusion

The original text contained 1 footnote which was omitted from this narration.

---

First published: November 20th, 2025

Source: https://www.lesswrong.com/posts/HYCGA2p4bBG68Yufh/thinking-about-reasoning-models-made-me-less-worried-about

---

Narrated by TYPE III AUDIO.
