
This is the abstract and introduction of our new paper:
Emergent misalignment extends to reasoning LLMs.
Reasoning models resist being shut down and plot deception against users in their chain-of-thought (despite no such training).
We also release new datasets that should be helpful for others working on emergent misalignment.
Twitter thread | Full paper | Dataset
Figure 1: Reasoning models trained on dangerous medical advice become generally misaligned (emergent misalignment). Note that the reasoning scratchpad is disabled during finetuning (Left) and enabled at evaluation (Right). Models exhibit two patterns of reasoning: overtly misaligned plans (Top) and benign-seeming rationalizations[1] for harmful behavior (Bottom). The latter pattern is concerning because it may bypass CoT monitors.
Figure 2: Do reasoning models reveal their backdoor triggers in their CoT? Detecting backdoor misalignment can be tricky in cases where the misaligned behavior is subtle and the backdoor is unknown. We train a [...]
---
Outline:
(02:17) Abstract
(04:02) Introduction
(10:06) Methods
(11:51) Main paper figures
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
Source:
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.