LessWrong (30+ Karma)

“Model Spec Midtraining: Improving How Alignment Training Generalizes” by Chloe Li, saraprice, Sam Marks, Jonathan Kutasov


tl;dr We introduce model spec midtraining (MSM): after pre-training but before alignment fine-tuning, we train models on synthetic documents discussing their Model Spec, teaching them how they should behave and why. This controls how models generalize from subsequent alignment training—for example, two models with identical fine-tuning can generalize to different values depending on how MSM explains those behaviors. We use MSM to substantially reduce agentic misalignment and study which Model Specs produce better generalization.

📝Blog, 📄Paper, 💻 Code

Introduction

Some frontier AI developers aim to align language models to a Model Spec or Constitution that describes intended model behavior. The standard approach is to fine-tune on demonstrations of behaviors that align with the spec (e.g., conversations where the model acts as intended). However, this can fail to produce robust alignment. For example, LLM agents have been shown to take unethical actions (e.g., blackmailing, leaking company information, alignment faking) when placed in scenarios different from those appearing in their alignment training (Lynch et al., 2025; Jarviniemi and Hubinger, 2024; Greenblatt et al., 2024).

We propose model spec midtraining (MSM), a method for shaping how models generalize from alignment fine-tuning (AFT). MSM is motivated by the hypothesis that AFT [...]

---

Outline:

(00:52) Introduction

(02:24) Different generalization, same fine-tuning data

(04:36) Reducing agentic misalignment

(07:39) How does MSM scale with AFT compute?

(08:58) Model Spec science

(12:39) Conclusion

---

First published:

May 5th, 2026

Source:

https://www.lesswrong.com/posts/R3Rrw8EscuRKxMFTz/model-spec-midtraining-improving-how-alignment-training

---

Narrated by TYPE III AUDIO.
