March 14, 2026

“Cycle-Consistent Activation Oracles” by slavachalnev

14 minutes

TL;DR: I train a model to translate LLM activations into natural language, using cycle consistency as a training signal (activation → description → reconstructed activation). The outputs are often plausible, but they are very lossy and are usually guesses about the context surrounding the activation, not good descriptions of the activation itself. This is an interim report with some early results.

Overview

I think Activation Oracles (Karvonen et al., 2025) are a super exciting research direction. Humans didn't evolve to read messy activation vectors, whereas ML models are great at this sort of thing.

An activation oracle is trained to answer specific questions about an LLM activation (e.g. "is the sentiment of this text positive or negative?" or "what are the previous 3 tokens?"). I wanted to try something different: train a model to translate activations into natural language.

The main problem to solve here is the lack of training data. There's no labeled dataset of activations paired with their descriptions. So how do we get around this?

One idea is to use cycle consistency: if you translate from language A to language B and back to A, you should end up approximately where [...]
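As a rough illustration of the cycle in the TL;DR (activation → description → reconstructed activation), here is a minimal sketch of how a cycle-consistency loss could be scored. It is not the author's implementation: the function names, the choice of cosine distance, and the toy `describe` / `reconstruct` stand-ins are all assumptions made for illustration. In the actual setup both directions would be learned models.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cycle_consistency_loss(
    activation: Vector,
    describe: Callable[[Vector], str],
    reconstruct: Callable[[str], Vector],
) -> float:
    """Run activation -> description -> reconstructed activation and score the round trip.

    Returns 1 - cosine similarity, so 0.0 means the direction was recovered exactly.
    Cosine distance is an assumed choice of metric, not one taken from the post.
    """
    description = describe(activation)
    reconstructed = reconstruct(description)
    return 1.0 - cosine_similarity(activation, reconstructed)


# Hypothetical toy stand-ins for the two learned models.
def toy_describe(activation: Vector) -> str:
    # Collapses the whole vector into one word, which is deliberately very lossy.
    return "positive" if activation[0] > 0 else "negative"


def toy_reconstruct(description: str) -> Vector:
    # Maps each word back to a fixed prototype vector.
    prototypes = {"positive": [1.0, 0.0], "negative": [-1.0, 0.0]}
    return prototypes[description]


if __name__ == "__main__":
    # Lands exactly on a prototype direction, so the round trip loses nothing.
    print(cycle_consistency_loss([2.0, 0.0], toy_describe, toy_reconstruct))
    # The second coordinate is dropped by the description, so the loss is nonzero.
    print(cycle_consistency_loss([3.0, 4.0], toy_describe, toy_reconstruct))
```

The second call shows the lossiness the TL;DR mentions: a short description can only carry part of the vector, so the reconstruction recovers the rough direction and little else.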

---

Outline:

(00:38) Overview

(03:05) Setup

(04:39) Example outputs

(06:36) Problems with this approach

(08:34) Evals

(08:37) Retrieval

(09:08) Classification

(10:15) Arithmetic

(11:20) What I want to try next

(12:38) Appendix: Other training ideas

The original text contained 3 footnotes which were omitted from this narration.

---

First published:

March 12th, 2026

Source:

https://www.lesswrong.com/posts/Nf2sKaNNdxE2ssxbp/cycle-consistent-activation-oracles-1

---

Narrated by TYPE III AUDIO.

By LessWrong
