
We isolate behavior directions in weight space by subtracting the weight deltas of two small fine-tunes: one that induces the desired behavior on a narrow distribution and another that induces its opposite.
We show that steering model weights along this direction can modify traits like sycophancy, and that it often generalizes further than activation steering.
Additionally, we provide preliminary evidence that these weight-space directions can be used to detect the emergence of worrisome traits during training without having to find inputs on which the model behaves badly.
Interpreting and intervening on LLM weights directly has the potential to be more expressive and avoid some of the failure modes that may doom activation-space interpretability. While our simple weight arithmetic approach is a relatively crude way of understanding and intervening on LLMs, our positive results are an encouraging early sign that understanding model weight diffs is tractable and might be underrated compared to activation interpretability.
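The weight arithmetic described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: weights are represented as plain dictionaries of flattened parameter lists, and the names `behavior_direction`, `steer`, and `cosine`, the scaling coefficient `alpha`, and the use of cosine similarity as a monitoring signal are all assumptions made for the example.

```python
from math import sqrt

# Hypothetical representation: parameter name -> flattened parameter values.
Weights = dict[str, list[float]]


def sub(a: Weights, b: Weights) -> Weights:
    """Element-wise a - b for every parameter tensor."""
    return {k: [x - y for x, y in zip(a[k], b[k])] for k in a}


def behavior_direction(base: Weights, pos_ft: Weights, neg_ft: Weights) -> Weights:
    """Difference of the two fine-tuning deltas.

    delta_pos = pos_ft - base, delta_neg = neg_ft - base.
    When both fine-tunes start from the same base, this equals pos_ft - neg_ft.
    """
    delta_pos = sub(pos_ft, base)
    delta_neg = sub(neg_ft, base)
    return sub(delta_pos, delta_neg)


def steer(base: Weights, direction: Weights, alpha: float) -> Weights:
    """Move the base weights along the direction; alpha is an assumed knob."""
    return {k: [w + alpha * d for w, d in zip(base[k], direction[k])] for k in base}


def cosine(a: Weights, b: Weights) -> float:
    """Cosine similarity over all parameters, treated as one long vector."""
    dot = sum(x * y for k in a for x, y in zip(a[k], b[k]))
    norm_a = sqrt(sum(x * x for k in a for x in a[k]))
    norm_b = sqrt(sum(y * y for k in b for y in b[k]))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy example with a single two-element parameter.
base = {"w": [1.0, 2.0]}
pos_ft = {"w": [1.5, 2.0]}   # fine-tune toward the behavior
neg_ft = {"w": [0.5, 2.0]}   # fine-tune toward its opposite

direction = behavior_direction(base, pos_ft, neg_ft)
steered = steer(base, direction, alpha=2.0)

# Monitoring sketch (assumed metric): compare a training checkpoint's
# weight delta against the behavior direction.
checkpoint = {"w": [1.2, 2.0]}
alignment = cosine(sub(checkpoint, base), direction)
```

In this toy setup a negative `alpha` would steer away from the behavior, and an `alignment` value near 1 would suggest that training is moving the weights along the behavior direction. Whether a threshold on such a score is meaningful for real models is an empirical question that this sketch does not address.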
📄 Paper, 💻 Code
Research done as part of MATS.
Methods
We study situations where we have access to only a very narrow distribution of positive and negative examples of the target behavior, similar to how in the future we might only be able [...]
---
Outline:
(01:14) Methods
(03:45) Steering results
(06:20) Limitations
(07:30) Weight-monitoring results
(09:05) Would weight monitoring detect actual misalignment?
(10:19) Future work
---
Narrated by TYPE III AUDIO.
By LessWrong
