
Sign up to save your podcasts
Or
In order to empirically study risks from schemers, we can try to develop model organisms of misalignment. Sleeper Agents and password-locked models, which train LLMs to behave in a benign or malign way depending on some feature of the input, are prominent examples of the methodology thus far.
Empirically, it turns out that detecting or removing misalignment from model organisms can sometimes be very easy: simple probes can catch sleeper agents, non-adversarial training can cause smaller-sized sleeper agents to forget their backdoor (Price et al., Mallen & Hebbar)[1], and memorizing weak examples can elicit strong behavior out of password-locked models. This might imply that training against real scheming is easy—but model organisms might be “shallow” in some relevant sense that make them disanalogous from real scheming.
I want to highlight an alternative here: studying methods to train away deeply ingrained behaviors (which don’t necessarily have to look like misbehavior) [...]
---
Outline:
(01:52) 1. Sample-efficiently training away misbehavior
(01:57) Motivation
(03:06) Sketch of a project
(04:08) 2. Training away harmlessness without ground truth
(04:13) Motivation
(04:59) Sketch of a project
The original text contained 1 footnote which was omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
In order to empirically study risks from schemers, we can try to develop model organisms of misalignment. Sleeper Agents and password-locked models, which train LLMs to behave in a benign or malign way depending on some feature of the input, are prominent examples of the methodology thus far.
Empirically, it turns out that detecting or removing misalignment from model organisms can sometimes be very easy: simple probes can catch sleeper agents, non-adversarial training can cause smaller-sized sleeper agents to forget their backdoor (Price et al., Mallen & Hebbar)[1], and memorizing weak examples can elicit strong behavior out of password-locked models. This might imply that training against real scheming is easy—but model organisms might be “shallow” in some relevant sense that make them disanalogous from real scheming.
I want to highlight an alternative here: studying methods to train away deeply ingrained behaviors (which don’t necessarily have to look like misbehavior) [...]
---
Outline:
(01:52) 1. Sample-efficiently training away misbehavior
(01:57) Motivation
(03:06) Sketch of a project
(04:08) 2. Training away harmlessness without ground truth
(04:13) Motivation
(04:59) Sketch of a project
The original text contained 1 footnote which was omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
26,469 Listeners
2,395 Listeners
7,953 Listeners
4,142 Listeners
89 Listeners
1,472 Listeners
9,207 Listeners
88 Listeners
417 Listeners
5,448 Listeners
15,321 Listeners
482 Listeners
121 Listeners
75 Listeners
461 Listeners