
This is a series of snippets about the Google DeepMind mechanistic interpretability team's research into Sparse Autoencoders that didn't meet our bar for a full paper. Please start at the summary post for more context and a summary of each snippet. They can be read in any order.
Activation Steering with SAEs
Arthur Conmy, Neel Nanda
TL;DR: We use SAEs trained on GPT-2 XL's residual stream to decompose steering vectors into interpretable features. We find a single SAE feature for anger, which is a Pareto improvement over the anger steering vector from existing work (Section 3, 3-minute read). We have more mixed results with wedding steering vectors: we can partially interpret the vectors, but the SAE reconstruction is a slightly worse steering vector, and just taking the obvious features produces a notably worse vector. We can produce [...]
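As a rough illustration of the recipe in the TL;DR (not the authors' code): activation steering adds a scaled vector to the residual stream at one layer, and an SAE decomposes that vector into a sparse sum of decoder directions, so a single interpretable feature's decoder row can stand in for the full steering vector. The sketch below uses NumPy with toy dimensions and randomly initialised stand-ins for a trained SAE; it only shows the arithmetic, not the GPT-2 XL hooking.

```python
import numpy as np

# Toy dimensions; GPT-2 XL's residual stream is 1600-d and the SAEs used in
# the post are much wider, but the arithmetic is the same.
d_model, d_sae = 16, 64
rng = np.random.default_rng(0)

# Hypothetical trained SAE parameters (random stand-ins for illustration).
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_encode(x):
    # ReLU encoder: sparse feature activations for a residual-stream vector.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(f):
    # Decoder: reconstruct the vector as a sum of feature directions.
    return f @ W_dec + b_dec

# A steering vector, e.g. the activation difference between an "Anger" prompt
# and a neutral prompt at the steering layer (random here for illustration).
steering_vec = rng.normal(size=d_model)

# Decompose it: which sparse features does the SAE say it is made of?
feats = sae_encode(steering_vec)
top_feature = int(np.argmax(feats))

# Option A: steer with the full vector. Option B: steer with a single
# feature's decoder direction, scaled by a coefficient chosen empirically.
coeff = 4.0
resid = rng.normal(size=d_model)  # residual stream activation at the steering layer
steered_full = resid + coeff * steering_vec
steered_single_feature = resid + coeff * W_dec[top_feature]
```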
---
Outline:
(00:33) Activation Steering with SAEs
(01:29) 1. Background and Motivation
(03:15) 2. Setup
(05:47) 3. Improving the “Anger” Steering Vector
(09:18) 4. Interpreting the “Wedding” Steering Vector
(09:31) The SAE reconstruction has many interpretable features.
(11:52) Just using hand-picked interpretable features from the SAE led to a much worse steering vector.
(12:35) The important SAE features for the wedding steering vector are less intuitive than those of the anger steering vector.
(14:56) Removing interpretable but irrelevant SAE features from the original steering vector improves performance.
(16:56) Appendix
(17:08) Replacing SAE Encoders with Inference-Time Optimisation
(18:36) Inference Time Optimisation
(23:24) Empirical Results
(28:13) Details of Sparse Approximation Algorithms (for accelerators)
(32:32) Choosing the optimal step size
(33:32) Appendix: Sweeping max top-k
(34:22) Improving ghost grads
(35:37) What are ghost grads?
(37:19) Improving ghost grads by rescaling the loss
(39:33) Applying ghost grads to all features at the start of training
(40:38) Further simplifying ghost grads
(41:40) Does ghost grads transfer to bigger models and different sites?
(44:35) Other miscellaneous observations
(46:15) SAEs on Tracr and Toy Models
(47:32) SAEs in Toy Models of Superposition
(49:07) SAEs in Tracr
(54:39) Replicating “Improvements to Dictionary Learning”
(57:24) Interpreting SAE Features with Gemini Ultra
(58:05) Why Care About Auto-Interp?
(59:34) Tentative observations
(01:01:17) How We’re Thinking About Auto-Interp
(01:02:40) Are You Smarter Than An LLM?
(01:03:54) Instrumenting LLM model internals in JAX
(01:06:34) Flexibility (using greenlets)
(01:07:23) Reading and writing
(01:09:05) Arbitrary interventions
(01:10:59) Greenlets
(01:12:19) Greenlets and JAX
(01:14:58) Greenlets and structured programming
(01:17:14) Nimbleness (using AST patching)
(01:20:24) Compilation and scalability (using layer stacking with conditional sowing)
The original text contained 28 footnotes which were omitted from this narration.
---
Narrated by TYPE III AUDIO.