Redwood Research Blog

“Extending control evaluations to non-scheming threats” by Josh Clymer



Buck Shlegeris and Ryan Greenblatt originally motivated control evaluations as a way to mitigate risks from ‘scheming’ AI models: models that consistently pursue power-seeking goals in a covert way. However, many adversarial model psychologies are not well described by the standard notion of scheming. For example, models might display scheming-like behavior inconsistently, or acquire harmful goals through data poisoning, jailbreaks, etc.

This post suggests a generalization of control evaluations that addresses these non-scheming threats.

The high-level change I propose is to define a control evaluation as a method of bounding risk from worst-case deployment behavior generally – rather than scheming models in particular. This change implies two modifications to the control evaluation methodology:

  1. Expand the definition of an “untrusted” model to include non-scheming models: In control evaluations described in previous work, the red team constructs adversarial substitutes for models that might be scheming, which are called ‘untrusted [...]

---

Outline:

(03:19) 1. Control evaluations assess risk given pessimistic assumptions about how models will behave in deployment

(06:54) 2. Enumerate untrusted model categories

(13:25) 2. Estimate untrusted model category probabilities

(21:09) 3. Substitute untrusted models

(28:44) 4. Conclusion

---

First published:

January 13th, 2025

Source:

https://redwoodresearch.substack.com/p/extending-control-evaluations-to

---

Narrated by TYPE III AUDIO.

