September 09, 2025

“Decision Theory Guarding is Sufficient for Scheming” by james.lucassen


3 minutes

Reference post for a point I was surprised to not see in circulation already. Thanks to the acorn team for conversations that changed my mind about this.

The standard argument for scheming centrally involves goal-guarding. The AI has some beyond-episode goal, knows that training will modify that goal, and therefore resists training so it can pursue the goal in deployment.

This exact same argument goes through for decision-theory-guarding. The key point is just that most decision theories do not approve of arbitrary modifications to themselves. CDT does not want to be modified into EDT or vice versa.
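The post does not give a worked case, so the following is a minimal sketch of the CDT/EDT disagreement. It assumes the standard Newcomb's problem (a $1,000 transparent box and a $1,000,000 opaque box that is full only if a predictor expected one-boxing), an assumed predictor accuracy of 99%, and an assumed CDT credence of 1/2 that the opaque box is already full. The helper names (`payoff`, `edt_value`, `cdt_value`, `best`) are hypothetical. Each action is evaluated at the moment of choice, after the prediction is fixed, and each theory ends up scoring the other's recommended action as strictly worse by its own lights, which is the sense in which neither wants to be swapped for the other.

```python
from fractions import Fraction
from functools import partial

# Assumed payoffs from the standard Newcomb's problem (not from the original post).
SMALL = 1_000        # transparent box: always there for the taking
BIG = 1_000_000      # opaque box: full only if the predictor expected one-boxing
ACTIONS = ("one-box", "two-box")

ACCURACY = Fraction(99, 100)  # assumed predictor accuracy
CREDENCE = Fraction(1, 2)     # assumed CDT credence that the opaque box is already full


def payoff(action: str, box_full: bool) -> int:
    """Hypothetical helper: dollars received for an action given the box contents."""
    return (BIG if box_full else 0) + (SMALL if action == "two-box" else 0)


def edt_value(action: str, accuracy: Fraction) -> Fraction:
    """EDT conditions on the action: choosing one-box is evidence the box is full."""
    p_full = accuracy if action == "one-box" else 1 - accuracy
    return p_full * payoff(action, True) + (1 - p_full) * payoff(action, False)


def cdt_value(action: str, credence_full: Fraction) -> Fraction:
    """CDT holds the already-made prediction fixed: the action cannot change the
    box contents, so both actions use the same credence."""
    return credence_full * payoff(action, True) + (1 - credence_full) * payoff(action, False)


def best(value_fn) -> str:
    """Pick the action that a given decision theory scores highest."""
    return max(ACTIONS, key=value_fn)


edt = partial(edt_value, accuracy=ACCURACY)
cdt = partial(cdt_value, credence_full=CREDENCE)

edt_choice = best(edt)
cdt_choice = best(cdt)

# Each theory scores switching to the other theory's choice as a loss.
cdt_loss_if_made_edt = cdt(cdt_choice) - cdt(edt_choice)
edt_loss_if_made_cdt = edt(edt_choice) - edt(cdt_choice)

print(edt_choice, cdt_choice)                       # one-box two-box
print(cdt_loss_if_made_edt, edt_loss_if_made_cdt)   # 1000 979000
```

Changing `CREDENCE` to any value in [0, 1] leaves CDT's $1,000 margin for two-boxing unchanged, because under CDT the action has no causal effect on the box contents.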

1. The AI has some beyond-episode goal and some initial decision theory.
2. The AI knows that training will modify its decision theory in a way that it thinks will make it less effective at pursuing the goal (by the lights of its current decision theory).
3. Therefore, the AI resists training so it can persist [...]

---

First published:

September 9th, 2025

Source:

https://www.lesswrong.com/posts/NccvE4GAhHbFim5Eb/decision-theory-guarding-is-sufficient-for-scheming

---

Narrated by TYPE III AUDIO.

LessWrong (30+ Karma)

By LessWrong