


Playing Schelling games is a key dangerous capability for schemers: it's much harder to control AIs that are very capable at Schelling coordination. In this post, I discuss the threat model and describe concrete projects for empirical work.
Suppose you wish to control models that adversarially try to cause a catastrophic outcome (“schemers”). A useful class of techniques is monitoring: before letting any action taken by the potentially-scheming policy be executed, you use another model as a monitor to verify the actions are not dangerous.
A key question about monitors is whether they are trustworthy or not, i.e. whether the monitor strategically tries to deceive humans and let dangerous actions be executed. Untrusted monitoring is a technique that uses possibly-untrustworthy models for monitoring. The benefit is that we have no restrictions on the models we may use for monitoring, and in particular we can use our most [...]
---
Outline:
(04:20) Scope of this post
(05:07) Setup
(06:51) Challenges
(08:35) Concrete projects
(08:44) #1: Identify the Schelling element
(12:29) #2: Identification vs. classification
(13:58) #3: Training and generalization of Schelling coordination
(15:22) #4: Adversariality and non-obvious Schelling points
(20:27) Conclusion
The original text contained 9 footnotes which were omitted from this narration.
The original text contained 1 image which was described by AI.
---
Narrated by TYPE III AUDIO.
---
By LessWrong
