March 24, 2026

“Latent Introspection (and other open-source introspection papers)” by vgel

17 minutes

@vgel, Martin Vanek, @Raymond Douglas, @Jan_Kulveit — ACS Research, CTS, Charles University

---

Paper | Code | Earlier post | Twitter thread | Bluesky thread

---

Last year, Lindsey demonstrated that Claude models can detect when concepts have been injected into their activations via steering vectors, a setup Lindsey uses as a proxy test for introspection. If models can detect injected concepts, it stands to reason that they can access their own naturally occurring activations as well. We published a blog post replicating this result on an open-weight model, which we have now extended into a full paper.
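To make "injecting a concept using a steering vector" concrete, here is a minimal sketch in plain Python. It assumes the common construction in which a concept direction is added, scaled by some strength, to a hidden activation at one layer. The function names, the mean-difference way of building the vector, and all numbers are illustrative assumptions, not details taken from the paper.

```python
def mean_difference_vector(with_concept, without_concept):
    """Hypothetical steering vector: mean activation on prompts that evoke
    the concept minus mean activation on neutral prompts (an assumed recipe)."""
    dim = len(with_concept[0])
    mean_with = [sum(row[i] for row in with_concept) / len(with_concept) for i in range(dim)]
    mean_without = [sum(row[i] for row in without_concept) / len(without_concept) for i in range(dim)]
    return [a - b for a, b in zip(mean_with, mean_without)]


def inject_concept(hidden, steering_vector, strength=4.0):
    """Add the scaled steering vector to one hidden activation vector."""
    if len(hidden) != len(steering_vector):
        raise ValueError("hidden state and steering vector must have the same dimension")
    return [h + strength * v for h, v in zip(hidden, steering_vector)]


# Toy 2-dimensional activations, made up for illustration only.
concept_acts = [[1.0, 2.0], [3.0, 4.0]]
neutral_acts = [[0.0, 0.0], [2.0, 2.0]]
vector = mean_difference_vector(concept_acts, neutral_acts)
steered = inject_concept([1.0, 2.0], [0.5, -1.0], strength=2.0)
```

In a real model the same addition would be applied to the residual stream at a chosen layer during the forward pass; which layer and strength to use are choices this sketch does not settle.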

In the paper, we find that this capability is latent and prompt-dependent. If you naively ask the model whether it detects an injection, you will almost certainly get a "no" response. However, the injection causes the logits to shift very slightly towards "yes." Prompting the model with helpful information about introspection increases the logit shift dramatically. This information need not be straightforwardly true: we found similar shifts from mechanistically incorrect, vague, and poetic framings about resonance and echoes. We also find that, while our model struggles to identify the concept without any support, it can pick the [...]
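The difference between a sampled "no" and a shifted logit can be illustrated with a short sketch. It assumes we have already read off the model's logits for the "yes" and "no" answer tokens with and without an injection. The helper names and the numeric logits are made up for illustration and are not results from the paper.

```python
import math


def yes_probability(yes_logit, no_logit):
    """Probability of "yes" under a softmax restricted to the two answer tokens."""
    return 1.0 / (1.0 + math.exp(no_logit - yes_logit))


def logit_shift(baseline, injected):
    """Change in the (yes - no) logit gap caused by the injection."""
    return (injected["yes"] - injected["no"]) - (baseline["yes"] - baseline["no"])


# Hypothetical logits: the answer stays "no", but the gap narrows.
baseline = {"yes": -2.0, "no": 3.0}
injected = {"yes": -1.5, "no": 3.0}

shift = logit_shift(baseline, injected)
p_before = yes_probability(baseline["yes"], baseline["no"])
p_after = yes_probability(injected["yes"], injected["no"])
still_says_no = p_after < 0.5
```

With these toy numbers the greedy answer is "no" in both cases, yet the shift is positive. That is the sense in which a signal can be present in the logits while remaining invisible in the sampled response.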

---

Outline:

(02:03) Methods

(05:15) Prompting conditions

(06:48) Experiments

(06:51) Injection shifts responses

(07:45) Is it just noise?

(08:17) Concept identification

(09:27) Signals emerge in the middle and get suppressed at the end

(11:04) Prompt sensitivity and the sensitivity–MI correlation

(11:55) Replication on larger models

(12:36) Why this matters

(13:45) Other recent introspection work

(16:20) Acknowledgments

The original text contained 1 footnote which was omitted from this narration.

---

First published:

March 24th, 2026

Source:

https://www.lesswrong.com/posts/aZRgHcxz9jf22hrBs/latent-introspection-and-other-open-source-introspection

---

Narrated by TYPE III AUDIO.

By LessWrong
