LessWrong (30+ Karma)

“Transcoders enable fine-grained interpretable circuit analysis for language models” by Jacob Dunefsky, Philippe Chlenski, Neel Nanda


Summary.

  • We present a method for performing circuit analysis on language models using "transcoders," an occasionally discussed variant of SAEs that provides an interpretable approximation to an MLP sublayer's computation. Transcoders are exciting because they allow us not only to interpret the outputs of MLP sublayers but also to decompose the MLP computations themselves into interpretable parts. In contrast, SAEs only let us interpret the outputs of MLP sublayers, not how those outputs were computed. (A minimal sketch of the architecture follows this list.)
  • We demonstrate that transcoders achieve performance comparable to SAEs (as measured by fidelity/sparsity metrics) and that the features learned by transcoders are interpretable.
  • One of the strong points of transcoders is that they decompose the function of an MLP layer into sparse, independently varying, and meaningful units (as neurons were originally intended to be, before superposition was discovered). This significantly simplifies circuit analysis, and so for the first time, we present a method for using [...]
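
To make the SAE/transcoder distinction concrete, here is a minimal sketch of a transcoder, assuming a PyTorch-style module. The class name, dimensions, and the training loss described in the comments are illustrative assumptions, not the authors' actual code.

```python
# A minimal sketch of a transcoder (illustrative assumptions, not the
# authors' implementation).
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Sparse, interpretable approximation to an MLP sublayer's *function*:
    reads the MLP's input and predicts the MLP's output via a wide,
    sparsely activating hidden layer of features."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # MLP input -> feature activations
        self.decoder = nn.Linear(d_features, d_model)  # feature activations -> predicted MLP output

    def forward(self, mlp_input: torch.Tensor) -> torch.Tensor:
        feature_acts = torch.relu(self.encoder(mlp_input))  # sparse feature activations
        return self.decoder(feature_acts)

# Training (sketch): minimize ||Transcoder(mlp_in) - mlp_out||^2 plus an
# L1 sparsity penalty on feature_acts. An SAE differs only in its target:
# it is trained to reconstruct the MLP's output from that same output,
# so it describes what the MLP produced, not how it was computed.
```

The only difference from an SAE is the input/target pairing, and that difference is what lets transcoder features factor the MLP's computation itself rather than just its output.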


---

Outline:

(02:07) Background and motivation

(04:38) Solution: transcoders

(06:16) Performance metrics

(10:45) Qualitative interpretability analysis

(10:50) Example transcoder features

(11:34) Broader interpretability survey

(12:28) Circuit analysis

(13:24) Input-independent information: pullbacks and de-embeddings

(15:38) Input-dependent information

(18:20) Obtaining circuits and graphs

(19:42) Brief discussion: why are transcoders better for circuit analysis?

(21:39) Case study

(22:05) Introduction to blind case studies

(23:55) Blind case study on layer 8 transcoder feature 355

(28:44) Evaluating our hypothesis

(29:22) Code

(31:43) Discussion

(35:13) Author contribution statement

(35:45) Appendix

(35:48) For input-dependent feature connections, why pointwise-multiply the feature activation vector with the pullback vector?

(37:09) Comparing input-independent pullbacks with mean input-dependent attributions

(39:46) A more detailed description of the computational graph algorithm

(43:14) Details on evaluating transcoders

The original text contained 11 footnotes which were omitted from this narration.

---

First published: April 30th, 2024

Source: https://www.lesswrong.com/posts/YmkjnWtZGLbHRbzrP/transcoders-enable-fine-grained-interpretable-circuit

---

Narrated by TYPE III AUDIO.
