
Sign up to save your podcasts
Or


In a previous post, I discussed mitigating risks from scheming by studying examples of actual scheming AIs.[1] In this post, I'll discuss an alternative approach: directly training (or instructing) an AI to behave how we think a naturally scheming AI might behave (at least in some ways). Then, we can study the resulting models. For instance, we could verify that a detection method discovers that the model is scheming or that a removal method removes the inserted behavior. The Sleeper Agents paper is an example of directly training in behavior (roughly) similar to a schemer.
This approach eliminates one of the main difficulties with studying actual scheming AIs: the fact that it is difficult to catch (and re-catch) them. However, trained-in scheming behavior can't be used to build an end-to-end understanding of how scheming arises, or to test techniques focused on preventing scheming (rather than removing it even [...]
---
Outline:
(02:28) Studying detection and removal techniques using trained-in scheming behavior
(05:31) Issues
(18:21) Conclusion
(19:31) Canary string
The original text contained 6 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By LessWrongIn a previous post, I discussed mitigating risks from scheming by studying examples of actual scheming AIs.[1] In this post, I'll discuss an alternative approach: directly training (or instructing) an AI to behave how we think a naturally scheming AI might behave (at least in some ways). Then, we can study the resulting models. For instance, we could verify that a detection method discovers that the model is scheming or that a removal method removes the inserted behavior. The Sleeper Agents paper is an example of directly training in behavior (roughly) similar to a schemer.
This approach eliminates one of the main difficulties with studying actual scheming AIs: the fact that it is difficult to catch (and re-catch) them. However, trained-in scheming behavior can't be used to build an end-to-end understanding of how scheming arises, or to test techniques focused on preventing scheming (rather than removing it even [...]
---
Outline:
(02:28) Studying detection and removal techniques using trained-in scheming behavior
(05:31) Issues
(18:21) Conclusion
(19:31) Canary string
The original text contained 6 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.

112,214 Listeners

131 Listeners

7,239 Listeners

559 Listeners

16,276 Listeners

4 Listeners

14 Listeners

2 Listeners