
TL;DR We reproduce emergent misalignment (Betley et al. 2025) in Qwen2.5-Coder-32B-Instruct using single-layer LoRA finetuning, showing that tweaking even one layer can lead to toxic or insecure outputs. We then extract steering vectors from those LoRAs (with a method derived from the Mechanisms of Awareness blogpost) and use them to induce similarly misaligned behavior in an un-finetuned version of the same model.
We take the results to support two main claims:
This may suggest that emergent misalignment is a distributed phenomenon: directional, but not reducible to any single layer or vector.
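To make the idea of a steering vector concrete, here is a minimal toy sketch. It assumes a difference-of-means construction: average the activations of the finetuned model and of the base model at one layer, subtract, and add the scaled result back into the base model's activations. This is a common way to build steering vectors and is not necessarily the extraction method used in the post. The function names and the tiny hand-written activation lists are hypothetical stand-ins for real residual-stream activations.

```python
# Toy sketch of difference-of-means activation steering (illustrative only).
# All names and numbers are hypothetical; real activations would be read
# from one chosen layer of the finetuned and base models.

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(finetuned_acts, base_acts):
    """Difference of mean activations: finetuned minus base."""
    ft_mean = mean_vector(finetuned_acts)
    base_mean = mean_vector(base_acts)
    return [f - b for f, b in zip(ft_mean, base_mean)]

def steer(activation, vector, scale=1.0):
    """Add the scaled steering vector to a single activation vector."""
    return [a + scale * v for a, v in zip(activation, vector)]

# Hypothetical activations at one layer for two prompts (3 dimensions each).
base_acts = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
finetuned_acts = [[1.0, 1.0, 4.0], [3.0, 1.0, 2.0]]

vec = steering_vector(finetuned_acts, base_acts)
steered = steer([0.5, 0.5, 0.5], vec, scale=2.0)
```

In a real model the addition would typically happen inside a forward hook on the chosen layer, and `scale` would be tuned empirically; both of those details are assumptions of this sketch rather than something stated in the summary above.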
Reproducing Previous Results
For the purposes of this post, we will be summarizing Betley [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
By LessWrong
