
Sign up to save your podcasts
Or


Subtitle: It's like making challenging evals, but more constrained.
One of our main activities at Redwood is writing follow-ups to previous papers on control like the original and Ctrl-Z, where we construct a setting with a bunch of tasks (e.g. APPS problems) and a notion of safety failure (e.g. backdoors according to our specific definition), then play the adversarial game where we develop protocols and attacks on those protocols.
It turns out that a substantial fraction of the difficulty here is developing the settings, rather than developing the protocols and attacks. I want to explain why this is the case.
In order to do high-stakes control research, you need a dataset of tasks with the following properties:
You can check whether the main task was completed successfully.
There's some notion of safety failure that we consider unacceptable; we can check whether a safety failure occurred. [...]
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By Redwood ResearchSubtitle: It's like making challenging evals, but more constrained.
One of our main activities at Redwood is writing follow-ups to previous papers on control like the original and Ctrl-Z, where we construct a setting with a bunch of tasks (e.g. APPS problems) and a notion of safety failure (e.g. backdoors according to our specific definition), then play the adversarial game where we develop protocols and attacks on those protocols.
It turns out that a substantial fraction of the difficulty here is developing the settings, rather than developing the protocols and attacks. I want to explain why this is the case.
In order to do high-stakes control research, you need a dataset of tasks with the following properties:
You can check whether the main task was completed successfully.
There's some notion of safety failure that we consider unacceptable; we can check whether a safety failure occurred. [...]
---
First published:
Source:
---
Narrated by TYPE III AUDIO.