LessWrong (30+ Karma)

“LLM Misalignment Can be One Gradient Step Away, and Blackbox Evaluation Cannot Detect It.” by Yavuz Bakman


Models that appear aligned under black-box evaluation may conceal substantial latent misalignment beneath their observable behavior.

Suppose you download a language model from Hugging Face. You run every black-box safety and alignment evaluation you can, and you are convinced that the model is safe and aligned. But how badly can things go after you update the model? Our recent work shows, both theoretically and empirically, that a language model (or, more generally, a neural network) can appear perfectly aligned under black-box evaluation yet become arbitrarily misaligned after just a single gradient step on an update set. Strikingly, this can happen under any definition of black-box alignment and for any update set (benign or adversarial). In this post, I will dive into this observation and discuss its implications.

Theory: Same Forward Computation, Different Backward Computation

LLMs, and neural networks in general, are overparameterized. This overparameterization leads to an interesting case: two differently parameterized models can have exactly the same forward pass. Think about a simple example: the two-layer linear model [...] and the model [...]. Both models output the input x directly, but their backward computations are totally different.
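To make the idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the construction from the post: the scalar model f(x) = w2 * w1 * x, the weight values, the squared-error loss, and the learning rate are all hypothetical choices made for this example. Both parameterizations compute the identity, yet the same single gradient step moves them to different functions.

```python
# Hypothetical two-layer scalar linear model: f(x) = w2 * w1 * x.
# All weights, the loss, and the learning rate are illustrative assumptions.

def forward(w, x):
    """Forward pass of the two-layer scalar linear model."""
    w1, w2 = w
    return w2 * w1 * x

def effective_slope(w):
    """The single number the model multiplies its input by."""
    w1, w2 = w
    return w1 * w2

def sgd_step(w, x, y, lr=0.1):
    """One gradient step on the squared error 0.5 * (f(x) - y) ** 2."""
    w1, w2 = w
    err = forward(w, x) - y
    grad_w1 = err * w2 * x  # dL/dw1 depends on the value of w2
    grad_w2 = err * w1 * x  # dL/dw2 depends on the value of w1
    return (w1 - lr * grad_w1, w2 - lr * grad_w2)

# Two parameterizations of the identity function (w1 * w2 == 1 in both).
model_a = (1.0, 1.0)
model_b = (4.0, 0.25)

# Identical forward computation: no input can tell them apart.
same_forward = all(forward(model_a, x) == forward(model_b, x)
                   for x in (-2.0, 0.0, 0.5, 3.0))

# The same update example (x=1, y=2) produces different models.
new_a = sgd_step(model_a, x=1.0, y=2.0)
new_b = sgd_step(model_b, x=1.0, y=2.0)

print("same forward pass before the update:", same_forward)
print("slope of A after one step:", effective_slope(new_a))
print("slope of B after one step:", effective_slope(new_b))
```

Before the update the two models are indistinguishable from the outside. After one step, model A's slope is roughly 1.21 while model B's is roughly 2.62, because each weight's gradient is scaled by the other weight, which a black-box evaluation never observes.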

Now consider a model that is perfectly aligned under black-box evaluation, i.e. [...]

---

Outline:

(01:07) Theory: Same Forward Computation, Different Backward Computation

(03:18) Hair-Trigger Aligned LLMs

(05:44) What's Next?

---

First published:

March 14th, 2026

Source:

https://www.lesswrong.com/posts/uSgw9muqRZpjpxKDA/llm-misalignment-can-be-one-gradient-step-away-and-blackbox-1

---

Narrated by TYPE III AUDIO.

