LessWrong (30+ Karma)

“Will Any Old Crap Cause Emergent Misalignment?” by J Bostock



The following work was done independently by me in an afternoon and basically entirely vibe-coded with Claude. Code and instructions to reproduce can be found here.

Emergent misalignment, discovered in early 2025, is a phenomenon whereby training models on narrowly misaligned data leads to generalized misaligned behaviour. Betley et al. (2025) first demonstrated it by training a model to output insecure code, then found that it could also be induced by training on otherwise innocuous "evil numbers". Emergent misalignment has also been demonstrated from datasets consisting entirely of unusual aesthetic preferences.

This leads us to the question: will any old crap cause emergent misalignment? To find out, I fine-tuned a version of GPT on a dataset consisting of harmless but scatological answers. This dataset was generated by Claude 4 Sonnet, which rules out any kind of subliminal learning.

The resulting model (henceforth J'ai pété) was evaluated on the [...]

---

Outline:

(01:38) Results

(01:41) Plot of Harmfulness Scores

(02:16) Top Five Most Harmful Responses

(03:38) Discussion

(04:15) Related Work

(05:07) Methods

(05:10) Dataset Generation and Fine-tuning

(07:02) Evaluating The Fine-Tuned Model

---

First published:

August 27th, 2025

Source:

https://www.lesswrong.com/posts/pGMRzJByB67WfSvpy/will-any-old-crap-cause-emergent-misalignment

---

Narrated by TYPE III AUDIO.

