LessWrong (30+ Karma)

“Progress Update #1 from the GDM Mech Interp Team: Full Update” by Neel Nanda, Arthur Conmy, lsgos, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, Vikrant Varma


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a series of snippets about the Google DeepMind mechanistic interpretability team's research into Sparse Autoencoders; these results didn't meet our bar for a full paper. Please start with the summary post for more context and a summary of each snippet. The snippets can be read in any order.

Activation Steering with SAEs

Arthur Conmy, Neel Nanda

TL;DR: We use SAEs trained on GPT-2 XL's residual stream to decompose steering vectors into interpretable features. We find a single SAE feature for anger that is a Pareto improvement over the anger steering vector from existing work (Section 3; 3-minute read). We have more mixed results with wedding steering vectors: we can partially interpret the vectors, but the SAE reconstruction is a slightly worse steering vector, and just taking the obvious features produces a notably worse vector. We can produce [...]
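As a rough illustration of the decomposition step (this is not the authors' code; the SAE parameterisation, dictionary size, and all tensor names below are assumptions), a steering vector can be passed through an SAE encoder to see which features it activates, and a single feature's decoder direction can then itself be used as a steering vector:

```python
# Hypothetical sketch: decomposing a steering vector with a trained
# sparse autoencoder (SAE). Assumes a standard SAE parameterisation:
#   encode(x) = ReLU((x - b_dec) @ W_enc + b_enc)
#   decode(f) = f @ W_dec + b_dec
import torch

d_model = 1600    # GPT-2 XL residual stream width
d_sae = 25600     # SAE dictionary size (illustrative assumption)

# Stand-ins for trained SAE weights.
W_enc = torch.randn(d_model, d_sae) / d_model**0.5
W_dec = torch.randn(d_sae, d_model) / d_sae**0.5
b_enc = torch.zeros(d_sae)
b_dec = torch.zeros(d_model)

def sae_encode(x: torch.Tensor) -> torch.Tensor:
    """Sparse feature activations for a residual-stream vector."""
    return torch.relu((x - b_dec) @ W_enc + b_enc)

def sae_decode(f: torch.Tensor) -> torch.Tensor:
    """Reconstruct the residual-stream vector from feature activations."""
    return f @ W_dec + b_dec

# A steering vector from prior activation-steering work, e.g. the
# difference of residual activations on contrasting prompts.
steering_vec = torch.randn(d_model)

# Decompose: which SAE features does the steering vector activate most?
feats = sae_encode(steering_vec)
top_vals, top_idxs = feats.topk(10)
print(list(zip(top_idxs.tolist(), top_vals.tolist())))

# Steering with a single interpretable feature: scale that feature's
# decoder direction and add it to the residual stream during generation.
coeff = 10.0  # steering strength, chosen by sweeping in practice
single_feature_steer = coeff * W_dec[top_idxs[0]]
```

In practice the feature's decoder direction would be added to the residual stream at a chosen layer and set of positions via a forward hook, with the coefficient swept to trade off steering strength against output fluency.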

---

Outline:

(00:33) Activation Steering with SAEs

(01:29) 1. Background and Motivation

(03:15) 2. Setup

(05:47) 3. Improving the “Anger” Steering Vector

(09:18) 4. Interpreting the “Wedding” Steering Vector

(09:31) The SAE reconstruction has many interpretable features.

(11:52) Just using hand-picked interpretable features from the SAE led to a much worse steering vector.

(12:35) The important SAE features for the wedding steering vector are less intuitive than for the anger steering vector.

(14:56) Removing interpretable but irrelevant SAE features from the original steering vector improves performance.

(16:56) Appendix

(17:08) Replacing SAE Encoders with Inference-Time Optimisation

(18:36) Inference Time Optimisation

(23:24) Empirical Results

(28:13) Details of Sparse Approximation Algorithms (for accelerators)

(32:32) Choosing the optimal step size

(33:32) Appendix: Sweeping max top-k

(34:22) Improving ghost grads

(35:37) What are ghost grads?

(37:19) Improving ghost grads by rescaling the loss

(39:33) Applying ghost grads to all features at the start of training

(40:38) Further simplifying ghost grads

(41:40) Does ghost grads transfer to bigger models and different sites?

(44:35) Other miscellaneous observations

(46:15) SAEs on Tracr and Toy Models

(47:32) SAEs in Toy Models of Superposition

(49:07) SAEs in Tracr

(54:39) Replicating “Improvements to Dictionary Learning”

(57:24) Interpreting SAE Features with Gemini Ultra

(58:05) Why Care About Auto-Interp?

(59:34) Tentative observations

(01:01:17) How We’re Thinking About Auto-Interp

(01:02:40) Are You Smarter Than An LLM?

(01:03:54) Instrumenting LLM model internals in JAX

(01:06:34) Flexibility (using greenlets)

(01:07:23) Reading and writing

(01:09:05) Arbitrary interventions

(01:10:59) Greenlets

(01:12:19) Greenlets and JAX

(01:14:58) Greenlets and structured programming

(01:17:14) Nimbleness (using AST patching)

(01:20:24) Compilation and scalability (using layer stacking with conditional sowing)

The original text contained 28 footnotes, which were omitted from this narration.

---

First published:

April 19th, 2024

Source:

https://www.lesswrong.com/posts/C5KAZQib3bzzpeyrg/progress-update-1-from-the-gdm-mech-interp-team-full-update

---

Narrated by TYPE III AUDIO.
