


Let's call “interpretability-in-the-loop training” the idea of running a learning algorithm that involves an inscrutable trained model, with some kind of interpretability system feeding into the loss function / reward function.
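For concreteness, here is a minimal PyTorch-style sketch of that setup. Everything in it is my own illustrative assumption rather than anything from a real system: a `model` that exposes its hidden activations alongside its outputs, and a frozen `probe` that scores those activations for misaligned-looking features.

```python
import torch
import torch.nn.functional as F

def interp_in_the_loop_loss(model, probe, inputs, targets, alpha=0.1):
    """Hypothetical interpretability-in-the-loop loss (sketch only)."""
    logits, hidden = model(inputs)                # assumed interface: (outputs, activations)
    task_loss = F.cross_entropy(logits, targets)  # ordinary training signal on outputs
    # The risky step: the interpretability system's verdict on the model's
    # *internals* is added to the loss, so gradient descent is now rewarded
    # for making misalignment undetectable to the probe, not just absent.
    misalignment_penalty = probe(hidden).mean()
    return task_loss + alpha * misalignment_penalty
```

Both quotes below are objecting to exactly that last penalty term.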
Interpretability-in-the-loop training has a very bad rap (and rightly so). Here's Yudkowsky 2022:
When you explicitly optimize against a detector of unaligned thoughts, you're partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect. Optimizing against an interpreted thought optimizes against interpretability.
Or Zvi 2025:
The Most Forbidden Technique is training an AI using interpretability techniques.
An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.
You train on [X]. Only [X]. Never [M], never [T].
Why? Because [T] is how you figure out when the model is misbehaving.
If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.
Those bits of optimization pressure [...]
---
Outline:
(02:46) My overall position
(03:39) How the brain-like version of interpretability-in-the-loop training avoids the obvious failure mode
(05:37) Things can still go wrong in more subtle and indirect ways
---
---
Narrated by TYPE III AUDIO.
