LessWrong (30+ Karma)

“Measuring whether AIs can statelessly strategize to subvert security measures” by Alex Mallen, Buck



(Work done at Redwood Research. Thanks to Misha Wagner, Fabien Roger, Tomek Korbak, Justis Mills, Samuel Marks, Charlie Griffin, Zach Stein-Perlman, and Aryan Bhatt for feedback on drafts.)

One way to show that risk from deploying an AI system is small is by showing that the model is not capable of subverting security measures enough to cause substantial harm. In AI control research so far, control evaluations have measured whether a red-team-created attack policy can defeat control measures, but they haven't measured how effectively models can generate such attack policies themselves. In particular, lack of a persistent memory poses a significant challenge for models strategizing to subvert security. In a new paper, we introduce evaluations that test strategic capability in the context of particular simulated deployments. We argue that in many deployments, a stateless model would need to reason highly reliably in order for its strategizing to substantially improve [...]

---

Outline:

(04:16) Statelessness and reliability

(06:00) Evaluating Claude's strategic ability from within a simulated deployment

(06:40) Summary of results

(07:50) Upfront Auditing

(12:38) Models can subvert upfront auditing without competently strategizing

(13:45) Strategizing o1-style (using DeepSeek-R1-Lite)

(14:18) Untrusted Monitoring

(16:31) What if models aren’t stateless?

(17:58) How conservative is our eval?

(19:56) What do our findings imply about risk from schemers?

The original text contained 2 footnotes which were omitted from this narration.

The original text contained 5 images which were described by AI.

---

First published:

December 19th, 2024

Source:

https://www.lesswrong.com/posts/4Pj7mXFQ2qhMjhaB2/measuring-whether-ais-can-statelessly-strategize-to-subvert

---

Narrated by TYPE III AUDIO.
