
Sign up to save your podcasts
Or
TL;DR: If we optimize a steering vector to induce a language model to output a single piece of harmful code on a single training example, then applying this vector to unrelated open-ended questions increases the probability that the model yields harmful output.
Code for reproducing the results in this project can be found at https://github.com/jacobdunefsky/one-shot-steering-misalignment.
IntroSomewhat recently, Betley et al. made the surprising finding that after finetuning an instruction-tuned LLM to output insecure code, the resulting model is more likely to give harmful responses to unrelated open-ended questions; they refer to this behavior as "emergent misalignment".
My own recent research focus has been on directly optimizing steering vectors on a single input and seeing if they mediate safety-relevant behavior. I thus wanted to see if emergent misalignment can also be induced by steering vectors optimized on a single example. That is to say: does a steering vector optimized [...]
---
Outline:
(00:31) Intro
(01:22) Why care?
(03:01) How we optimized our steering vectors
(05:01) Evaluation method
(06:05) Results
(06:09) Alignment scores of steered generations
(07:59) Resistance is futile: counting misaligned strings
(09:29) Is there a single, simple, easily-locatable representation of misalignment? Some preliminary thoughts
(13:29) Does increasing steering strength increase misalignment?
(15:41) Why do harmful code vectors induce more general misalignment? A hypothesis
(17:24) What have we learned, and where do we go from here?
(19:49) Appendix: how do we obtain our harmful code steering vectors?
The original text contained 1 footnote which was omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
TL;DR: If we optimize a steering vector to induce a language model to output a single piece of harmful code on a single training example, then applying this vector to unrelated open-ended questions increases the probability that the model yields harmful output.
Code for reproducing the results in this project can be found at https://github.com/jacobdunefsky/one-shot-steering-misalignment.
IntroSomewhat recently, Betley et al. made the surprising finding that after finetuning an instruction-tuned LLM to output insecure code, the resulting model is more likely to give harmful responses to unrelated open-ended questions; they refer to this behavior as "emergent misalignment".
My own recent research focus has been on directly optimizing steering vectors on a single input and seeing if they mediate safety-relevant behavior. I thus wanted to see if emergent misalignment can also be induced by steering vectors optimized on a single example. That is to say: does a steering vector optimized [...]
---
Outline:
(00:31) Intro
(01:22) Why care?
(03:01) How we optimized our steering vectors
(05:01) Evaluation method
(06:05) Results
(06:09) Alignment scores of steered generations
(07:59) Resistance is futile: counting misaligned strings
(09:29) Is there a single, simple, easily-locatable representation of misalignment? Some preliminary thoughts
(13:29) Does increasing steering strength increase misalignment?
(15:41) Why do harmful code vectors induce more general misalignment? A hypothesis
(17:24) What have we learned, and where do we go from here?
(19:49) Appendix: how do we obtain our harmful code steering vectors?
The original text contained 1 footnote which was omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
26,369 Listeners
2,397 Listeners
7,779 Listeners
4,103 Listeners
87 Listeners
1,451 Listeners
8,778 Listeners
89 Listeners
355 Listeners
5,370 Listeners
15,053 Listeners
460 Listeners
126 Listeners
64 Listeners
432 Listeners