
This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.
This post is a preview of our upcoming paper, which will provide more detail on our current understanding of refusal.
We thank Nina Rimsky and Daniel Paleka for helpful conversations and review.
Executive summary
Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."
We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model to refuse harmless requests.
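The two interventions described above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the authors' code): given a residual-stream activation and a candidate "refusal direction," ablation removes the activation's component along that direction, and induction adds the direction back in with some scale. The vectors and the scale `alpha` here are toy values for illustration only.

```python
import numpy as np

def ablate_direction(x, r):
    """Remove the component of activation x along direction r
    (directional ablation: x - (x . r_hat) r_hat)."""
    r_hat = r / np.linalg.norm(r)
    return x - np.dot(x, r_hat) * r_hat

def add_direction(x, r, alpha=1.0):
    """Add the unit-normalized direction r to activation x,
    scaled by alpha (activation addition)."""
    r_hat = r / np.linalg.norm(r)
    return x + alpha * r_hat

# Toy example: a 4-dim "residual stream" activation and a
# hypothetical refusal direction.
x = np.array([1.0, 2.0, 3.0, 4.0])
r = np.array([0.0, 1.0, 0.0, 0.0])

x_ablated = ablate_direction(x, r)
# After ablation, x has no component along r.
assert abs(np.dot(x_ablated, r)) < 1e-9
```

In practice these operations would be applied to activations at every layer and token position during the model's forward pass, rather than to a single vector as in this sketch.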
---
Outline:
(00:40) Executive summary
(03:12) Thinking in terms of features
(05:06) Methodology
(05:09) Finding the refusal direction
(06:01) Ablating the refusal direction to bypass refusal
(06:54) Adding in the refusal direction to induce refusal
(07:45) Results
(07:48) Bypassing refusal
(09:34) Inducing refusal
(10:30) Visualizing the subspace
(11:26) Feature ablation via weight orthogonalization
(12:29) Conclusion
(13:09) Limitations
(14:15) Future work
(14:50) Ethical considerations
(15:55) Citing this work
(16:02) Author contribution statement
The original text contained 8 footnotes which were omitted from this narration.
---
Narrated by TYPE III AUDIO.