LessWrong (30+ Karma)

“In (highly contingent!) defense of interpretability-in-the-loop ML training” by Steven Byrnes



Let's call “interpretability-in-the-loop training” the idea of running a learning algorithm that involves an inscrutable trained model, where some kind of interpretability system feeds into the loss function / reward function.
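As a rough illustration of that definition (a toy sketch, not anything from the post itself), the snippet below assumes the interpretability system can be reduced to a scalar score computed from the model's internal activations, and that this score is simply added to the ordinary task loss. The names `combined_loss`, `linear_probe`, and `penalty_weight`, and the probe weights, are all hypothetical.

```python
from typing import Callable, Sequence


def linear_probe(activations: Sequence[float]) -> float:
    """Hypothetical interpretability system: a fixed linear probe over
    internal activations, clipped at zero, read as a 'bad thought' score."""
    weights = [0.5, -0.25, 1.0]  # assumed probe weights, for illustration only
    score = sum(w * a for w, a in zip(weights, activations))
    return max(0.0, score)


def combined_loss(
    task_loss: float,
    activations: Sequence[float],
    probe: Callable[[Sequence[float]], float],
    penalty_weight: float = 1.0,
) -> float:
    """Task loss plus a penalty that depends on the model's internals.

    The second term is what puts interpretability 'in the loop': the
    optimizer is rewarded for lowering the probe's score, by whatever means.
    """
    return task_loss + penalty_weight * probe(activations)


loss = combined_loss(
    task_loss=0.25,
    activations=[1.0, 2.0, 0.5],
    probe=linear_probe,
    penalty_weight=2.0,
)
print(loss)
```

Nothing in this objective distinguishes between lowering the probe score by changing what the model computes and lowering it by making the same computation harder for the probe to read, which is the concern raised in the quotes that follow.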

Interpretability-in-the-loop training has a very bad rap (and rightly so). Here's Yudkowsky 2022:

When you explicitly optimize against a detector of unaligned thoughts, you're partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect. Optimizing against an interpreted thought optimizes against interpretability.

Or Zvi 2025:

The Most Forbidden Technique is training an AI using interpretability techniques.

An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.

You train on [X]. Only [X]. Never [M], never [T].

Why? Because [T] is how you figure out when the model is misbehaving.

If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.

Those bits of optimization pressure [...]

---

Outline:

(02:46) My overall position

(03:39) How the brain-like version of interpretability-in-the-loop training avoids the obvious failure mode

(05:37) Things can still go wrong in more subtle and indirect ways

---

First published:

February 6th, 2026

Source:

https://www.lesswrong.com/posts/ArXAyzHkidxwoeZsL/in-highly-contingent-defense-of-interpretability-in-the-loop

---

Narrated by TYPE III AUDIO.

