LessWrong (30+ Karma)

“Cycle-Consistent Activation Oracles” by slavachalnev


TL;DR: I train a model to translate LLM activations into natural language, using cycle consistency as a training signal (activation → description → reconstructed activation). The outputs are often plausible, but they are very lossy and are usually guesses about the context surrounding the activation, not good descriptions of the activation itself. This is an interim report with some early results.

Overview

I think Activation Oracles (Karvonen et al., 2025) are a super exciting research direction. Humans didn't evolve to read messy activation vectors, whereas ML models are great at this sort of thing.

An activation oracle is trained to answer specific questions about an LLM activation (e.g. "is the sentiment of this text positive or negative?" or "what are the previous 3 tokens?"). I wanted to try something different: train a model to translate activations into natural language.

The main problem to solve here is the lack of training data. There's no labeled dataset of activations paired with their descriptions. So how do we get around this?

One idea is to use cycle consistency: if you translate from language A to language B and back to A, you should end up approximately where [...]
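As a rough illustration of the activation → description → reconstructed activation loop, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the names `cycle_consistency_loss`, `toy_describe`, and `toy_reconstruct` are hypothetical, the two toy functions stand in for the trained models, and cosine distance is just one plausible choice of reconstruction loss (the excerpt does not say which loss the post uses).

```python
import numpy as np

def cycle_consistency_loss(activation, describe, reconstruct):
    """Cosine distance between an activation and its round-trip reconstruction.

    `describe` maps a vector to text; `reconstruct` maps text back to a vector.
    Both are hypothetical stand-ins for learned models.
    """
    description = describe(activation)        # activation -> text
    reconstructed = reconstruct(description)  # text -> activation
    cos = np.dot(activation, reconstructed) / (
        np.linalg.norm(activation) * np.linalg.norm(reconstructed) + 1e-8
    )
    return 1.0 - float(cos), description

def toy_describe(activation):
    # Toy "description": name only the largest-magnitude dimension and its sign.
    idx = int(np.argmax(np.abs(activation)))
    sign = "positive" if activation[idx] > 0 else "negative"
    return f"dimension {idx} is strongly {sign}"

def toy_reconstruct(description, dim=4):
    # Toy inverse: rebuild a signed one-hot vector from the description text.
    words = description.split()
    vec = np.zeros(dim)
    vec[int(words[1])] = 1.0 if words[-1] == "positive" else -1.0
    return vec

activation = np.array([0.1, -2.0, 0.3, 0.0])
loss, description = cycle_consistency_loss(activation, toy_describe, toy_reconstruct)
print(description)
print(loss)
```

The loss here is small but nonzero because the toy description keeps only one dimension of the vector, a miniature version of the lossiness mentioned in the TL;DR. The sketch does not address how a training signal would pass through the discrete text step.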

---

Outline:

(00:38) Overview

(03:05) Setup

(04:39) Example outputs

(06:36) Problems with this approach

(08:34) Evals

(08:37) Retrieval

(09:08) Classification

(10:15) Arithmetic

(11:20) What I want to try next

(12:38) Appendix: Other training ideas

The original text contained 3 footnotes which were omitted from this narration.

---

First published:

March 12th, 2026

Source:

https://www.lesswrong.com/posts/Nf2sKaNNdxE2ssxbp/cycle-consistent-activation-oracles-1

---

Narrated by TYPE III AUDIO.
