


TL;DR We reproduce emergent misalignment (Betley et al. 2025) in Qwen2.5-Coder-32B-Instruct using single-layer LoRA finetuning, showing that tweaking even one layer can lead to toxic or insecure outputs. We then extract steering vectors from those LoRAs (with a method derived from the Mechanisms of Awareness blogpost) and use them to induce similarly misaligned behavior in an un-finetuned version of the same model.
We take the results to support two main claims:
This may suggest that emergent misalignment is a distributed phenomenon: directional, but not reducible to any single layer or vector.
Reproducing Previous Results
For the purposes of this post, we will be summarizing Betley [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
By LessWrong
