October 08, 2024

“Schelling game evaluations for AI control” by Olli Järviniemi


21 minutes

Playing Schelling games is a key dangerous capability for schemers: it's much harder to control AIs that are very capable at Schelling coordination. In this post, I discuss the threat model and describe concrete projects for empirical work.

Suppose you wish to control models that adversarially try to cause a catastrophic outcome (“schemers”). A useful class of techniques is monitoring: before any action taken by the potentially-scheming policy is executed, you use another model as a monitor to verify that the action is not dangerous.

A key question about monitors is whether they are trustworthy or not, i.e. whether the monitor strategically tries to deceive humans and let dangerous actions be executed. Untrusted monitoring is a technique that uses possibly-untrustworthy models for monitoring. The benefit is that we have no restrictions on the models we may use for monitoring, and in particular we can use our most [...]

---

Outline:

(04:20) Scope of this post

(05:07) Setup

(06:51) Challenges

(08:35) Concrete projects

(08:44) #1: Identify the Schelling element

(12:29) #2: Identification vs. classification

(13:58) #3: Training and generalization of Schelling coordination

(15:22) #4: Adversariality and non-obvious Schelling points

(20:27) Conclusion

The original text contained 9 footnotes which were omitted from this narration.

The original text contained 1 image which was described by AI.

---

First published:

October 8th, 2024

Source:

https://www.lesswrong.com/posts/n2c62YS8RJtD4Aqhh/schelling-game-evaluations-for-ai-control

---

Narrated by TYPE III AUDIO.



LessWrong (30+ Karma)

By LessWrong
