
Subtitle: What are the methods and issues when failures occur diffusely over many actions?
In this post, I'll try to explain my current understanding of the high-level methods for handling non-concentrated failures with control. I'll discuss the regimes produced by different methods and the failure modes of each regime. Non-concentrated failures are issues that arise from the AI taking a large number of problematic actions over a long period. (See here and here for more discussion of concentrated vs. non-concentrated failures.)
A key aspect of non-concentrated failures is that they can in principle be handled by applying asynchronous online training with a good enough oversight signal on a small fraction of the AI's actions. That is, we could in principle solve these failures by sampling a small fraction of actions to review after they've already been executed, rating these actions using an oversight process which has [...]
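The sampling-and-training loop described above can be sketched as follows. This is a minimal illustration, not the post's actual setup; `overseer_rating` and `update_policy` are hypothetical stand-ins for the oversight process and the training machinery:

```python
import random

def online_training_step(executed_actions, sample_fraction,
                         overseer_rating, update_policy):
    """Asynchronously review a small random fraction of already-executed
    actions, rate them with the oversight process, and feed the ratings
    back as an online training signal."""
    # Sample a small fraction of actions after they have already run.
    sampled = [a for a in executed_actions if random.random() < sample_fraction]
    # Rate each sampled action with the (assumed good-enough) oversight signal.
    ratings = [(a, overseer_rating(a)) for a in sampled]
    # Train the policy on the ratings (e.g. reinforce highly rated actions).
    update_policy(ratings)
    return ratings
```

The point of the sketch is that review is asynchronous (actions have already executed) and sparse (only `sample_fraction` of actions are ever rated), which is why this approach addresses diffuse failures but not concentrated ones.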
---
Outline:
(06:06) What happens without online training?
(10:58) Online training
(13:10) Online training of just the policy
(19:40) Recursive oversight: debate
(20:24) Debate for binary questions
(23:37) Debate for oversight of a policy
(29:15) KL penalties and forcing identical behavior
(31:00) Is it ever a good idea for scheming models to do low stakes sabotage more sophisticated than just inserting errors (at particularly opportune times)?
---
Narrated by TYPE III AUDIO.
By Redwood Research