
This is a series of snippets about the Google DeepMind mechanistic interpretability team's research into Sparse Autoencoders that didn't meet our bar for a full paper. Please start at the summary post for more context and a summary of each snippet. They can be read in any order.
Activation Steering with SAEs
Arthur Conmy, Neel Nanda
TL;DR: We use SAEs trained on GPT-2 XL's residual stream to decompose steering vectors into interpretable features. We find a single SAE feature for anger, which is a Pareto improvement over the anger steering vector from existing work (Section 3, 3-minute read). We have more mixed results with wedding steering vectors: we can partially interpret the vectors, but the SAE reconstruction is a slightly worse steering vector, and just taking the obvious features produces a notably worse vector. We can produce [...]
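As a rough illustration of the recipe in the TL;DR (not the authors' code): activation steering adds a scaled vector to the residual stream at one layer, and an SAE decomposes that vector into a sparse sum of decoder directions, so a single interpretable feature's decoder row can stand in for the full steering vector. The sketch below uses NumPy with toy dimensions and randomly initialised stand-ins for a trained SAE; it only shows the arithmetic, not the GPT-2 XL hooking.

```python
import numpy as np

# Toy dimensions; GPT-2 XL's residual stream is 1600-d and the SAEs used in
# the post are much wider, but the arithmetic is the same.
d_model, d_sae = 16, 64
rng = np.random.default_rng(0)

# Hypothetical trained SAE parameters (random stand-ins for illustration).
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_encode(x):
    # ReLU encoder: sparse feature activations for a residual-stream vector.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(f):
    # Decoder: reconstruct the vector as a sum of feature directions.
    return f @ W_dec + b_dec

# A steering vector, e.g. the activation difference between an "Anger" prompt
# and a neutral prompt at the steering layer (random here for illustration).
steering_vec = rng.normal(size=d_model)

# Decompose it: which sparse features does the SAE say it is made of?
feats = sae_encode(steering_vec)
top_feature = int(np.argmax(feats))

# Option A: steer with the full vector. Option B: steer with a single
# feature's decoder direction, scaled by a coefficient chosen empirically.
coeff = 4.0
resid = rng.normal(size=d_model)  # residual stream activation at the steering layer
steered_full = resid + coeff * steering_vec
steered_single_feature = resid + coeff * W_dec[top_feature]
```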
---
Outline:
(00:33) Activation Steering with SAEs
(01:29) 1. Background and Motivation
(03:15) 2. Setup
(05:47) 3. Improving the “Anger” Steering Vector
(09:18) 4. Interpreting the “Wedding” Steering Vector
(09:31) The SAE reconstruction has many interpretable features.
(11:52) Just using hand-picked interpretable features from the SAE led to a much worse steering vector.
(12:35) The important SAE features for the wedding steering vector are less intuitive than those of the anger steering vector.
(14:56) Removing interpretable but irrelevant SAE features from the original steering vector improves performance.
(16:56) Appendix
(17:08) Replacing SAE Encoders with Inference-Time Optimisation
(18:36) Inference Time Optimisation
(23:24) Empirical Results
(28:13) Details of Sparse Approximation Algorithms (for accelerators)
(32:32) Choosing the optimal step size
(33:32) Appendix: Sweeping max top-k
(34:22) Improving ghost grads
(35:37) What are ghost grads?
(37:19) Improving ghost grads by rescaling the loss
(39:33) Applying ghost grads to all features at the start of training
(40:38) Further simplifying ghost grads
(41:40) Does ghost grads transfer to bigger models and different sites?
(44:35) Other miscellaneous observations
(46:15) SAEs on Tracr and Toy Models
(47:32) SAEs in Toy Models of Superposition
(49:07) SAEs in Tracr
(54:39) Replicating “Improvements to Dictionary Learning”
(57:24) Interpreting SAE Features with Gemini Ultra
(58:05) Why Care About Auto-Interp?
(59:34) Tentative observations
(01:01:17) How We’re Thinking About Auto-Interp
(01:02:40) Are You Smarter Than An LLM?
(01:03:54) Instrumenting LLM model internals in JAX
(01:06:34) Flexibility (using greenlets)
(01:07:23) Reading and writing
(01:09:05) Arbitrary interventions
(01:10:59) Greenlets
(01:12:19) Greenlets and JAX
(01:14:58) Greenlets and structured programming
(01:17:14) Nimbleness (using AST patching)
(01:20:24) Compilation and scalability (using layer stacking with conditional sowing)
The original text contained 28 footnotes which were omitted from this narration.
---
Narrated by TYPE III AUDIO.