January 22, 2026

“Neural chameleons can(’t) hide from activation oracles” by ceselder

7 minutes

[epistemic status - vibe coded, but first-pass sanity-checked the code and methodology. Messy project, take results with grain of salt. See limitations/footnotes]

Done as a mini-project for Neel Nanda's MATS exploration stream

Github repo - Gemma 2-9B-it model organism - Gemma 3-27B-it model organism

Abstract

Neural chameleons showed that models can be trained to hide from linear probes and generalize this to hide from unseen concepts and probes while maintaining coherence.[1]

Activation oracles are a more recent technique capable of generating natural language explanations about residual stream activations. By making them answer yes/no questions about activations we can make them act as a classifier, a "probe on-demand".[2]

My hypothesis today is : do "neural chameleons" (LoRA finetunes trained to evade linear probes) generalize to also evade activation oracles?

I show that yes, activation oracle performance suffers, but it's not very consistent, and they suffer less than probes.

Activation oracles get to learn more elaborate patterns than just linear classification, because of their deep nature (and they possibly learn different representations of the same concept.) [3] But linear probes have been excellent at classifying concepts so far, so it wouldn't be too surprising if evading them seriously threw off [...]

---

Outline:

(00:36) Abstract

(02:01) Methodology

(03:30) Results

(03:33) Gemma 2 9B (layer 21, LoRA rank 8)

(04:29) Gemma 3 27B (layer 40, LoRA rank 16)

(05:40) Limitations/sanity checks

(06:59) Further work

The original text contained 10 footnotes which were omitted from this narration.

---

First published:

January 21st, 2026

Source:

https://www.lesswrong.com/posts/wfGYMbr4AMcH2Rv68/neural-chameleons-can-t-hide-from-activation-oracles-1

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

...more

View all episodes

By LessWrong

January 22, 2026

“Neural chameleons can(’t) hide from activation oracles” by ceselder

7 minutes

[epistemic status - vibe coded, but first-pass sanity-checked the code and methodology. Messy project, take results with grain of salt. See limitations/footnotes]

Done as a mini-project for Neel Nanda's MATS exploration stream

Github repo - Gemma 2-9B-it model organism - Gemma 3-27B-it model organism

Abstract

Neural chameleons showed that models can be trained to hide from linear probes and generalize this to hide from unseen concepts and probes while maintaining coherence.[1]

My hypothesis today is : do "neural chameleons" (LoRA finetunes trained to evade linear probes) generalize to also evade activation oracles?

I show that yes, activation oracle performance suffers, but it's not very consistent, and they suffer less than probes.

---

Outline:

(00:36) Abstract

(02:01) Methodology

(03:30) Results

(03:33) Gemma 2 9B (layer 21, LoRA rank 8)

(04:29) Gemma 3 27B (layer 40, LoRA rank 16)

(05:40) Limitations/sanity checks

(06:59) Further work

The original text contained 10 footnotes which were omitted from this narration.

---

First published:

January 21st, 2026

Source:

https://www.lesswrong.com/posts/wfGYMbr4AMcH2Rv68/neural-chameleons-can-t-hide-from-activation-oracles-1

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

...more

More shows like LessWrong (30+ Karma)

View all

The Daily

112,309 Listeners

Astral Codex Ten Podcast

130 Listeners

Interesting Times with Ross Douthat

7,241 Listeners

Dwarkesh Podcast

559 Listeners

The Ezra Klein Show

16,305 Listeners

AI Article Readings

4 Listeners

Doom Debates!

14 Listeners

LessWrong posts by zvi

2 Listeners

Share “Neural chameleons can(’t) hide from activation oracles” by ceselder

Sign up to save your podcasts

“Neural chameleons can(’t) hide from activation oracles” by ceselder

“Neural chameleons can(’t) hide from activation oracles” by ceselder

More shows like LessWrong (30+ Karma)

The Daily

Astral Codex Ten Podcast

Interesting Times with Ross Douthat

Dwarkesh Podcast

The Ezra Klein Show

AI Article Readings

Doom Debates!

LessWrong posts by zvi