LessWrong (30+ Karma)

“Reducing risk from scheming by studying trained-in scheming behavior” by ryan_greenblatt



In a previous post, I discussed mitigating risks from scheming by studying examples of actual scheming AIs.[1] In this post, I'll discuss an alternative approach: directly training (or instructing) an AI to behave how we think a naturally scheming AI might behave (at least in some ways). Then, we can study the resulting models. For instance, we could verify that a detection method discovers that the model is scheming or that a removal method removes the inserted behavior. The Sleeper Agents paper is an example of directly training in behavior (roughly) similar to a schemer.

This approach eliminates one of the main difficulties with studying actual scheming AIs: the fact that it is difficult to catch (and re-catch) them. However, trained-in scheming behavior can't be used to build an end-to-end understanding of how scheming arises, or to test techniques focused on preventing scheming (rather than removing it even [...]

---

Outline:

(02:28) Studying detection and removal techniques using trained-in scheming behavior

(05:31) Issues

(18:21) Conclusion

(19:31) Canary string

The original text contained 6 footnotes which were omitted from this narration.

---

First published:

October 16th, 2025

Source:

https://www.lesswrong.com/posts/v6K3hnq5c9roa5MbS/reducing-risk-from-scheming-by-studying-trained-in-scheming

---

Narrated by TYPE III AUDIO.
