
(Work done at Redwood Research. Thanks to Misha Wagner, Fabien Roger, Tomek Korbak, Justis Mills, Samuel Marks, Charlie Griffin, Zach Stein-Perlman, and Aryan Bhatt for feedback on drafts.)
One way to show that risk from deploying an AI system is small is by showing that the model is not capable of subverting security measures enough to cause substantial harm. In AI control research so far, control evaluations have measured whether a red-team-created attack policy can defeat control measures, but they haven't measured how effectively models can generate such attack policies themselves. In particular, lack of a persistent memory poses a significant challenge for models strategizing to subvert security. In a new paper, we introduce evaluations that test strategic capability in the context of particular simulated deployments. We argue that in many deployments, a stateless model would need to reason highly reliably in order for its strategizing to substantially improve [...]
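As a rough illustration of why statelessness demands reliable reasoning (the numbers, the independence assumption, and the function below are ours, not from the paper): if a stateless model must independently make the same correct call in every episode, for example never attacking while its outputs are still being audited, then even small per-episode error rates compound across a deployment. A minimal sketch under those assumptions:

```python
# Hypothetical sketch: why a stateless strategizer needs highly reliable reasoning.
# Assumptions (illustrative, not from the paper): the model must independently
# arrive at the correct decision in each of n_episodes, episodes are treated as
# independent because the model has no persistent memory, and per-episode
# reliability is p.

def prob_no_slip(p: float, n_episodes: int) -> float:
    """Probability the model makes the correct call in every episode."""
    return p ** n_episodes

for p in (0.9, 0.99, 0.999):
    for n in (100, 1000):
        print(f"p={p}, N={n}: P(no slip) = {prob_no_slip(p, n):.2e}")
```

Under these illustrative numbers, even 99% per-episode reliability gives only about a 4e-5 chance of never slipping across 1,000 episodes, which is the flavor of constraint the paper's evaluations probe.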
---
Outline:
(04:16) Statelessness and reliability
(06:00) Evaluating Claude's strategic ability from within a simulated deployment
(06:40) Summary of results
(07:50) Upfront Auditing
(12:38) Models can subvert upfront auditing without competently strategizing
(13:45) Strategizing O1-style (using DeepSeek-R1-Lite)
(14:18) Untrusted Monitoring
(16:31) What if models aren’t stateless?
(17:58) How conservative is our eval?
(19:56) What do our findings imply about risk from schemers?
The original text contained 2 footnotes which were omitted from this narration.
The original text contained 5 images which were described by AI.
---
First published:
Source:
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.