
We recently released "Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?", a major update to our previous paper/blog post. It evaluates a broader range of models (e.g., helpful-only Claude 3.5 Sonnet) in more diverse and realistic settings (e.g., untrusted monitoring).
Abstract
An AI control protocol is a plan for usefully deploying AI systems that prevents an AI from intentionally causing some unacceptable outcome. Previous work evaluated protocols by subverting them using an AI following a human-written strategy. This paper investigates how well AI systems can generate and act on their own strategies for subverting control protocols whilst operating statelessly (i.e., without shared memory between contexts). To do this, an AI system may need to reliably generate effective strategies in each context, take actions with well-calibrated probabilities, and coordinate plans with other instances of itself without communicating.
We develop Subversion Strategy Eval, a suite of eight environments [...]
---
Outline:
(00:38) Abstract
(02:13) Paper overview
(03:55) Results summary
(05:31) Qualitative analysis
(08:50) How do these results inform risk?
(11:08) How should this eval be used with more capable models?
(11:13) Tuning conservatism
(14:19) Difficulties with attaining a conservative measurement
(17:41) Reducing noise
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.