
In a previous post, I discussed mitigating risks from scheming by studying examples of actual scheming AIs.[1] In this post, I'll discuss an alternative approach: directly training (or instructing) an AI to behave the way we think a naturally scheming AI might behave (at least in some respects). We can then study the resulting models. For instance, we could verify that a detection method discovers that the model is scheming or that a removal method removes the inserted behavior. The Sleeper Agents paper is an example of directly training in behavior (roughly) similar to a schemer's.
This approach eliminates one of the main difficulties with studying actual scheming AIs: they are difficult to catch (and re-catch). However, trained-in scheming behavior can't be used to build an end-to-end understanding of how scheming arises, or to test techniques focused on preventing scheming (rather than removing it even [...]
---
Outline:
(02:28) Studying detection and removal techniques using trained-in scheming behavior
(05:31) Issues
(18:21) Conclusion
(19:31) Canary string
The original text contained 6 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.