


Buck Shlegeris and Ryan Greenblatt originally motivated control evaluations as a way to mitigate risks from ‘scheming’ AI models: models that consistently pursue power-seeking goals in a covert way. However, many adversarial model psychologies are not well described by this standard notion of scheming. For example, models might display scheming-like behavior only inconsistently, or might acquire harmful goals through data poisoning, jailbreaks, and similar attacks.[1]
This post suggests a generalization of control evaluations that addresses these non-scheming threats.
The high-level change I propose is to define a control evaluation as a method of bounding risk from worst-case deployment behavior generally – rather than scheming models in particular. This change implies two modifications to the control evaluation methodology:
Expand the definition of an “untrusted” model to include non-scheming models: In control evaluations described in previous work, the red team constructs adversarial substitutes for models that might be scheming, which are called ‘untrusted [...]
---
Outline:
(03:19) 1. Control evaluations assess risk given pessimistic assumptions about how models will behave in deployment
(06:54) 2. Enumerate untrusted model categories
(13:25) 3. Estimate untrusted model category probabilities
(21:09) 4. Substitute untrusted models
(28:44) 5. Conclusion
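The outlined methodology lends itself to a simple aggregation: enumerate untrusted-model categories, estimate the probability that the deployed model falls in each, measure worst-case risk for each category's red-team substitute, and combine these into an overall deployment risk bound. A minimal sketch of that combination step is below; all category names, probabilities, and risk numbers are purely illustrative assumptions, not figures from the post.

```python
# Hypothetical sketch of the generalized control-evaluation bound: a
# probability-weighted sum of per-category worst-case risk estimates.
# Every name and number here is illustrative, not from the original post.
from dataclasses import dataclass


@dataclass
class UntrustedCategory:
    name: str
    probability: float      # estimated P(deployed model is in this category)
    worst_case_risk: float  # risk measured against the red team's substitute


def deployment_risk_bound(categories: list[UntrustedCategory]) -> float:
    """Bound overall risk as a probability-weighted sum over categories."""
    total_p = sum(c.probability for c in categories)
    if not 0.999 <= total_p <= 1.001:
        raise ValueError("category probabilities should sum to ~1")
    return sum(c.probability * c.worst_case_risk for c in categories)


categories = [
    UntrustedCategory("consistent schemer", 0.05, 0.40),
    UntrustedCategory("inconsistent schemer", 0.10, 0.20),
    UntrustedCategory("data-poisoned / jailbroken", 0.05, 0.15),
    UntrustedCategory("benign", 0.80, 0.01),
]

print(f"risk bound: {deployment_risk_bound(categories):.4f}")
```

Note that this treats the categories as exhaustive and mutually exclusive; in practice, estimating the category probabilities and per-category worst-case risks is where most of the evaluation effort lies.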
---
Narrated by TYPE III AUDIO.
By Redwood Research