
Subtitle: What are the methods and issues when failures occur diffusely over many actions?
In this post, I'll try to explain my current understanding of the high-level methods for handling non-concentrated failures with control. I'll discuss the regimes produced by different methods and the failure modes of each regime. Non-concentrated failures are issues that arise from the AI taking a large number of problematic actions over a long period. (See here and here for more discussion of concentrated vs. non-concentrated failures.)
A key aspect of non-concentrated failures is that they can in principle be handled by applying asynchronous online training with a good enough oversight signal on a small fraction of the AI's actions. That is, we could in principle solve these failures by sampling a small fraction of actions to review after they've already been executed, rating these actions using an oversight process which has [...]
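The sampling-and-training loop described above can be sketched as follows. This is a minimal illustration, not the post's actual setup; `overseer_rating` and `update_policy` are hypothetical stand-ins for the oversight process and the training machinery:

```python
import random

def online_training_step(executed_actions, sample_fraction,
                         overseer_rating, update_policy):
    """Asynchronously review a small random fraction of already-executed
    actions, rate them with the oversight process, and feed the ratings
    back as an online training signal."""
    # Sample a small fraction of actions after they have already run.
    sampled = [a for a in executed_actions if random.random() < sample_fraction]
    # Rate each sampled action with the (assumed good-enough) oversight signal.
    ratings = [(a, overseer_rating(a)) for a in sampled]
    # Train the policy on the ratings (e.g. reinforce highly rated actions).
    update_policy(ratings)
    return ratings
```

The point of the sketch is that review is asynchronous (actions have already executed) and sparse (only `sample_fraction` of actions are ever rated), which is why this approach addresses diffuse failures but not concentrated ones.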
---
Outline:
(06:06) What happens without online training?
(10:58) Online training
(13:10) Online training of just the policy
(19:40) Recursive oversight: debate
(20:24) Debate for binary questions
(23:37) Debate for oversight of a policy
(29:15) KL penalties and forcing identical behavior
(31:00) Is it ever a good idea for scheming models to do low stakes sabotage more sophisticated than just inserting errors (at particularly opportune times)?
---
Narrated by TYPE III AUDIO.
By Redwood Research