


Buck Shlegeris and Ryan Greenblatt originally motivated control evaluations as a way to mitigate risks from ‘scheming’ AI models: models that consistently pursue power-seeking goals in a covert way. However, many adversarial model psychologies are not well described by this standard notion of scheming. For example, models might display scheming-like behavior only inconsistently, or might acquire harmful goals through data poisoning, jailbreaks, and similar attacks.[1]
This post suggests a generalization of control evaluations that addresses these non-scheming threats.
The high-level change I propose is to define a control evaluation as a method of bounding risk from worst-case deployment behavior generally – rather than scheming models in particular. This change implies two modifications to the control evaluation methodology:
Expand the definition of an “untrusted” model to include non-scheming models: In control evaluations described in previous work, the red team constructs adversarial substitutes for models that might be scheming, which are called ‘untrusted [...]
---
Outline:
(03:19) 1. Control evaluations assess risk given pessimistic assumptions about how models will behave in deployment
(06:54) 2. Enumerate untrusted model categories
(13:25) 3. Estimate untrusted model category probabilities
(21:09) 4. Substitute untrusted models
(28:44) 5. Conclusion
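The outlined methodology lends itself to a simple aggregation: enumerate untrusted-model categories, estimate the probability that the deployed model falls in each, measure worst-case risk for each category's red-team substitute, and combine these into an overall deployment risk bound. A minimal sketch of that combination step is below; all category names, probabilities, and risk numbers are purely illustrative assumptions, not figures from the post.

```python
# Hypothetical sketch of the generalized control-evaluation bound: a
# probability-weighted sum of per-category worst-case risk estimates.
# Every name and number here is illustrative, not from the original post.
from dataclasses import dataclass


@dataclass
class UntrustedCategory:
    name: str
    probability: float      # estimated P(deployed model is in this category)
    worst_case_risk: float  # risk measured against the red team's substitute


def deployment_risk_bound(categories: list[UntrustedCategory]) -> float:
    """Bound overall risk as a probability-weighted sum over categories."""
    total_p = sum(c.probability for c in categories)
    if not 0.999 <= total_p <= 1.001:
        raise ValueError("category probabilities should sum to ~1")
    return sum(c.probability * c.worst_case_risk for c in categories)


categories = [
    UntrustedCategory("consistent schemer", 0.05, 0.40),
    UntrustedCategory("inconsistent schemer", 0.10, 0.20),
    UntrustedCategory("data-poisoned / jailbroken", 0.05, 0.15),
    UntrustedCategory("benign", 0.80, 0.01),
]

print(f"risk bound: {deployment_risk_bound(categories):.4f}")
```

Note that this treats the categories as exhaustive and mutually exclusive; in practice, estimating the category probabilities and per-category worst-case risks is where most of the evaluation effort lies.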
---
Narrated by TYPE III AUDIO.
By Redwood Research