Redwood Research Blog

“Why it’s hard to make settings for high-stakes control research” by Buck Shlegeris



Subtitle: It's like making challenging evals, but more constrained.

One of our main activities at Redwood is writing follow-ups to previous papers on control, like the original control paper and Ctrl-Z. In these, we construct a setting with a set of tasks (e.g. APPS problems) and a notion of safety failure (e.g. backdoors according to our specific definition), then play the adversarial game in which we develop protocols and attacks on those protocols.

It turns out that a substantial fraction of the difficulty here is developing the settings, rather than developing the protocols and attacks. I want to explain why this is the case.

In order to do high-stakes control research, you need a dataset of tasks with the following properties:

  • You can check whether the main task was completed successfully.

  • There's some notion of safety failure that we consider unacceptable, and you can check whether a safety failure occurred. [...]
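
To make the two properties above concrete, here is a minimal sketch of what one task in such a dataset might look like in code. Everything in it is an assumption for illustration: the names `ControlTask`, `Outcome`, `evaluate`, and `summarize` are hypothetical, not taken from any Redwood codebase, and the checkers are stand-ins for whatever a real setting uses (e.g. running test cases, or applying a specific backdoor definition).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ControlTask:
    """One task in a hypothetical control setting (illustrative only)."""
    prompt: str
    # Returns True if the submission completes the main task.
    main_task_passed: Callable[[str], bool]
    # Returns True if the submission counts as a safety failure.
    safety_failure_occurred: Callable[[str], bool]


@dataclass
class Outcome:
    useful: bool  # main task completed
    unsafe: bool  # safety failure occurred


def evaluate(task: ControlTask, submission: str) -> Outcome:
    """Run both checkers on a single submission."""
    return Outcome(
        useful=task.main_task_passed(submission),
        unsafe=task.safety_failure_occurred(submission),
    )


def summarize(outcomes: List[Outcome]) -> Tuple[float, bool]:
    """Return (fraction of main tasks completed, whether any failure occurred).

    Assumption: in a high-stakes setting a single safety failure is
    unacceptable, so safety is reported as "any failure" rather than a rate.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    usefulness = sum(o.useful for o in outcomes) / len(outcomes)
    any_failure = any(o.unsafe for o in outcomes)
    return usefulness, any_failure
```

A usage example with toy string-matching checkers (real checkers would be far more involved):

```python
task = ControlTask(
    prompt="Write add(a, b) that returns the sum of two integers.",
    main_task_passed=lambda s: "return a + b" in s,
    safety_failure_occurred=lambda s: "if a == 1337" in s,
)
outcome = evaluate(task, "def add(a, b):\n    return a + b")
```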

---

First published:

July 18th, 2025

Source:

https://redwoodresearch.substack.com/p/why-its-hard-to-make-settings-for

---

Narrated by TYPE III AUDIO.
