LessWrong (30+ Karma)

“Latent Introspection (and other open-source introspection papers)” by vgel



@vgel, Martin Vanek, @Raymond Douglas, @Jan_Kulveit — ACS Research, CTS, Charles University

---

Paper | Code | Earlier post | Twitter thread | Bluesky thread

---

Last year, Lindsey demonstrated that Claude models can detect when concepts have been injected into their activations via steering vectors, a setup Lindsey uses as a proxy test for introspection. If models can detect when concepts have been injected into their activations, it stands to reason that they can access their own naturally occurring activations as well. We published a blog post replicating this on an open-weight model, which we've now extended into a full paper.

In the paper, we find that this capability is latent and prompt-dependent. If you naively ask the model whether it detects an injection, you will almost certainly get a "no" response. However, the injection causes the logits to shift very slightly towards "yes." Prompting the model with helpful information about introspection increases the logit shift dramatically. This information need not be straightforwardly true: we found similar shifts from mechanistically incorrect, vague, and poetic framings about resonance and echoes. We also find that, while our model struggles to identify the concept without any support, it can pick the [...]
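To make the "logit shift" idea concrete, here is a minimal sketch of how a small move towards "yes" can coexist with a sampled answer that is still "no". Everything in it is an illustrative assumption rather than the paper's method: the three-token vocabulary, the logit values, and the helper names `yes_no_logit_diff` and `injection_shift` are invented, and the paper may measure the shift differently.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def yes_no_logit_diff(logits, yes_id, no_id):
    """Logit of the "yes" token minus logit of the "no" token."""
    return logits[yes_id] - logits[no_id]


def injection_shift(baseline_logits, injected_logits, yes_id, no_id):
    """Change in the yes-minus-no logit difference caused by the injection."""
    return (yes_no_logit_diff(injected_logits, yes_id, no_id)
            - yes_no_logit_diff(baseline_logits, yes_id, no_id))


# Hypothetical next-token logits over a toy vocabulary: ["yes", "no", other].
# These numbers are made up for illustration only.
YES, NO = 0, 1
baseline = [-2.0, 6.0, 0.0]   # no concept injected
injected = [-1.5, 6.0, 0.0]   # concept injected: "yes" rises slightly

shift = injection_shift(baseline, injected, YES, NO)
top_token = max(range(len(injected)), key=injected.__getitem__)
p_yes_base = softmax(baseline)[YES]
p_yes_inj = softmax(injected)[YES]

print(f"shift in yes-minus-no logit difference: {shift:+.2f}")
print(f"P(yes) baseline: {p_yes_base:.5f}, injected: {p_yes_inj:.5f}")
print("top token is still 'no':", top_token == NO)
```

In this toy case the greedy answer remains "no" even though the injection has measurably raised the relative logit for "yes", which is why a signal of this kind would be invisible if one looked only at sampled responses.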

---

Outline:

(02:03) Methods

(05:15) Prompting conditions

(06:48) Experiments

(06:51) Injection shifts responses

(07:45) Is it just noise?

(08:17) Concept identification

(09:27) Signals emerge in the middle and get suppressed at the end

(11:04) Prompt sensitivity and the sensitivity–MI correlation

(11:55) Replication on larger models

(12:36) Why this matters

(13:45) Other recent introspection work

(16:20) Acknowledgments

The original text contained 1 footnote which was omitted from this narration.

---

First published:

March 24th, 2026

Source:

https://www.lesswrong.com/posts/aZRgHcxz9jf22hrBs/latent-introspection-and-other-open-source-introspection

---

Narrated by TYPE III AUDIO.

