LessWrong (30+ Karma)

“Transcoders enable fine-grained interpretable circuit analysis for language models” by Jacob Dunefsky, Philippe Chlenski, Neel Nanda



Summary.

  • We present a method for performing circuit analysis on language models using "transcoders," an occasionally-discussed variant of SAEs that provides an interpretable approximation to MLP sublayers' computations. Transcoders are exciting because they allow us not only to interpret the output of MLP sublayers but also to decompose the MLPs themselves into interpretable computations. In contrast, SAEs only allow us to interpret the output of MLP sublayers, not how it was computed.
  • We demonstrate that transcoders achieve similar performance to SAEs (when measured via fidelity/sparsity metrics) and that the features learned by transcoders are interpretable.
  • One of the strong points of transcoders is that they decompose the function of an MLP layer into sparse, independently-varying, and meaningful units (like neurons were originally intended to be before superposition was discovered). This significantly simplifies circuit analysis, and so for the first time, we present a method for using [...]
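To make the idea concrete, here is a minimal sketch of a transcoder's forward pass (this is an illustration, not the authors' implementation; all dimensions, weights, and names below are made up). Where an SAE is trained to reconstruct its own input, a transcoder is trained so that its output matches the MLP sublayer's output, so each sparse hidden feature becomes a candidate interpretable unit of the MLP's computation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; real language models use far larger dimensions,
# and the hidden layer is typically much wider than the model dimension.
d_model, d_hidden = 16, 64

# Transcoder parameters: a wide, sparsely-activating middle layer.
W_enc = rng.normal(0, 0.1, (d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(0, 0.1, (d_hidden, d_model))
b_dec = np.zeros(d_model)

def transcoder(x):
    """Approximate an MLP sublayer: input -> sparse features -> output.

    Training (not shown) would minimize the MSE between this output and
    the true MLP output, plus an L1 sparsity penalty on the activations.
    """
    acts = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU keeps features sparse
    return acts @ W_dec + b_dec, acts

x = rng.normal(size=d_model)          # stand-in for an MLP input activation
out, acts = transcoder(x)
sparsity = (acts > 0).mean()          # fraction of active features (L0-style)
```

Because the map from input to output factors through independently-varying sparse features, circuit analysis can attribute the MLP's contribution to a handful of active features rather than to thousands of entangled neurons.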


---

Outline:

(02:07) Background and motivation

(04:38) Solution: transcoders

(06:16) Performance metrics

(10:45) Qualitative interpretability analysis

(10:50) Example transcoder features

(11:34) Broader interpretability survey

(12:28) Circuit analysis

(13:24) Input-independent information: pullbacks and de-embeddings

(15:38) Input-dependent information

(18:20) Obtaining circuits and graphs

(19:42) Brief discussion: why are transcoders better for circuit analysis?

(21:39) Case study

(22:05) Introduction to blind case studies

(23:55) Blind case study on layer 8 transcoder feature 355

(28:44) Evaluating our hypothesis

(29:22) Code

(31:43) Discussion

(35:13) Author contribution statement

(35:45) Appendix

(35:48) For input-dependent feature connections, why pointwise-multiply the feature activation vector with the pullback vector?

(37:09) Comparing input-independent pullbacks with mean input-dependent attributions

(39:46) A more detailed description of the computational graph algorithm

(43:14) Details on evaluating transcoders

The original text contained 11 footnotes which were omitted from this narration.

---

First published: April 30th, 2024

Source: https://www.lesswrong.com/posts/YmkjnWtZGLbHRbzrP/transcoders-enable-fine-grained-interpretable-circuit

---

Narrated by TYPE III AUDIO.


