LessWrong (30+ Karma)

“Neural chameleons can(’t) hide from activation oracles” by ceselder


Listen Later

[epistemic status - vibe coded, but first-pass sanity-checked the code and methodology. Messy project, take results with grain of salt. See limitations/footnotes]

Done as a mini-project for Neel Nanda's MATS exploration stream

Github repo - Gemma 2-9B-it model organism - Gemma 3-27B-it model organism

Abstract

Neural chameleons showed that models can be trained to hide from linear probes and generalize this to hide from unseen concepts and probes while maintaining coherence.[1]

Activation oracles are a more recent technique capable of generating natural language explanations about residual stream activations. By making them answer yes/no questions about activations we can make them act as a classifier, a "probe on-demand".[2]

My hypothesis today is : do "neural chameleons" (LoRA finetunes trained to evade linear probes) generalize to also evade activation oracles?

I show that yes, activation oracle performance suffers, but it's not very consistent, and they suffer less than probes.

Activation oracles get to learn more elaborate patterns than just linear classification, because of their deep nature (and they possibly learn different representations of the same concept.) [3] But linear probes have been excellent at classifying concepts so far, so it wouldn't be too surprising if evading them seriously threw off [...]

---

Outline:

(00:36) Abstract

(02:01) Methodology

(03:30) Results

(03:33) Gemma 2 9B (layer 21, LoRA rank 8)

(04:29) Gemma 3 27B (layer 40, LoRA rank 16)

(05:40) Limitations/sanity checks

(06:59) Further work

The original text contained 10 footnotes which were omitted from this narration.

---

First published:

January 21st, 2026

Source:

https://www.lesswrong.com/posts/wfGYMbr4AMcH2Rv68/neural-chameleons-can-t-hide-from-activation-oracles-1

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

...more
View all episodesView all episodes
Download on the App Store

LessWrong (30+ Karma)By LessWrong


More shows like LessWrong (30+ Karma)

View all
The Daily by The New York Times

The Daily

113,081 Listeners

Astral Codex Ten Podcast by Jeremiah

Astral Codex Ten Podcast

132 Listeners

Interesting Times with Ross Douthat by New York Times Opinion

Interesting Times with Ross Douthat

7,271 Listeners

Dwarkesh Podcast by Dwarkesh Patel

Dwarkesh Podcast

530 Listeners

The Ezra Klein Show by New York Times Opinion

The Ezra Klein Show

16,299 Listeners

AI Article Readings by Readings of great articles in AI voices

AI Article Readings

4 Listeners

Doom Debates by Liron Shapira

Doom Debates

14 Listeners

LessWrong posts by zvi by zvi

LessWrong posts by zvi

2 Listeners