
Sign up to save your podcasts
Or
Audio note: this article contains 53 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Summary: We have previously argued that scalable oversight methods can be used to provide guarantees on low-stakes safety – settings where individual failures are non-catastrophic. However, if your reward function (e.g. honesty) is compatible with many possible solutions then you also need to avoid having free parameters exploited over time. We call this the exploitable search problem. We propose a zero-sum game where, at equilibrium, free parameters are not exploited – that is, our AI systems are carrying out an unexploitable search.
Conceptual illustration using fabricated data. Left: Entropy penalties diversify trajectories but still yield high unobserved, malicious reward S. Right: exploitable search games are designed to bound probability of high S.
The exploitable search problem
We would like [...]
---
Outline:
(01:22) The exploitable search problem
(02:25) Formalizing the problem
(04:23) Unexploitable search methods
(05:24) An unexploitable search game
(05:47) Gaining traction on the problem
(07:01) Game definition
(07:55) When are equilibria unexploitable
(10:13) Why random hash constraints do not work
(11:18) Where to from here
(12:03) Acknowledgments
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Audio note: this article contains 53 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Summary: We have previously argued that scalable oversight methods can be used to provide guarantees on low-stakes safety – settings where individual failures are non-catastrophic. However, if your reward function (e.g. honesty) is compatible with many possible solutions then you also need to avoid having free parameters exploited over time. We call this the exploitable search problem. We propose a zero-sum game where, at equilibrium, free parameters are not exploited – that is, our AI systems are carrying out an unexploitable search.
Conceptual illustration using fabricated data. Left: Entropy penalties diversify trajectories but still yield high unobserved, malicious reward S. Right: exploitable search games are designed to bound probability of high S.
The exploitable search problem
We would like [...]
---
Outline:
(01:22) The exploitable search problem
(02:25) Formalizing the problem
(04:23) Unexploitable search methods
(05:24) An unexploitable search game
(05:47) Gaining traction on the problem
(07:01) Game definition
(07:55) When are equilibria unexploitable
(10:13) Why random hash constraints do not work
(11:18) Where to from here
(12:03) Acknowledgments
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
26,469 Listeners
2,395 Listeners
7,953 Listeners
4,144 Listeners
89 Listeners
1,480 Listeners
9,236 Listeners
88 Listeners
426 Listeners
5,462 Listeners
15,335 Listeners
483 Listeners
121 Listeners
75 Listeners
461 Listeners