Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a series of snippets about the Google DeepMind mechanistic interpretability team's research into Sparse Autoencoders that didn't meet our bar for a full paper. Please start at the summary post for more context and a summary of each snippet. The snippets can be read in any order.
Activation Steering with SAEs
Arthur Conmy, Neel Nanda
TL;DR: We use SAEs trained on GPT-2 XL's residual stream to decompose steering vectors into interpretable features. We find a single SAE feature for anger that is a Pareto improvement over the anger steering vector from existing work (Section 3, 3 minute read). We have more mixed results with wedding steering vectors: we can partially interpret the vectors, but the SAE reconstruction is a slightly worse steering vector, and just taking the obvious features produces a notably worse vector. We can produce [...]
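To make the decomposition concrete, here is a minimal, self-contained sketch of the idea in the TL;DR: encode a steering vector with an SAE to see which sparse features it is made of, then steer with a single feature's decoder direction. The SAE weights, expansion factor, feature index, and coefficient below are all hypothetical stand-ins, not our actual setup:

```python
# Hedged sketch: decompose a steering vector with an SAE, then steer with
# a single SAE feature. All concrete numbers here are placeholders.
import torch

class SparseAutoencoder(torch.nn.Module):
    """One-layer SAE: ReLU encoder into an overcomplete feature basis."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = torch.nn.Parameter(torch.zeros(d_sae))
        self.W_dec = torch.nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = torch.nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return f @ self.W_dec + self.b_dec

d_model = 1600                      # GPT-2 XL residual stream width
sae = SparseAutoencoder(d_model, d_sae=16 * d_model)  # expansion assumed

# A steering vector, e.g. resid("Anger") - resid("Calm") at some layer,
# computed elsewhere; a random placeholder here.
steering_vector = torch.randn(d_model)

# 1. Decompose: which sparse features is the steering vector made of?
feature_acts = sae.encode(steering_vector)
top_vals, top_idx = feature_acts.topk(10)   # inspect/interpret these

# 2. Steer with a single feature: add a scaled copy of its decoder
#    direction to the residual stream during the forward pass.
anger_feature = top_idx[0].item()           # hypothetical anger feature
coeff = 10.0                                # strength; sweep in practice

def steering_hook(resid: torch.Tensor) -> torch.Tensor:
    """Add the single-feature steering vector to residual activations."""
    return resid + coeff * sae.W_dec[anger_feature]
```

In practice the hook would be attached at the relevant layer (and often only at certain token positions), with the coefficient swept; the point of the sketch is just the encode-then-pick-a-feature decomposition, not the exact hyperparameters.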
---
Outline:
(00:33) Activation Steering with SAEs
(01:29) 1. Background and Motivation
(03:15) 2. Setup
(05:47) 3. Improving the “Anger” Steering Vector
(09:18) 4. Interpreting the “Wedding” Steering Vector
(09:31) The SAE reconstruction has many interpretable features.
(11:52) Just using hand-picked interpretable features from the SAE led to a much worse steering vector.
(12:35) The important SAE features for the wedding steering vector are less intuitive than those for the anger steering vector.
(14:56) Removing interpretable but irrelevant SAE features from the original steering vector improves performance.
(16:56) Appendix
(17:08) Replacing SAE Encoders with Inference-Time Optimisation
(18:36) Inference-Time Optimisation
(23:24) Empirical Results
(28:13) Details of Sparse Approximation Algorithms (for accelerators)
(32:32) Choosing the optimal step size
(33:32) Appendix: Sweeping max top-k
(34:22) Improving ghost grads
(35:37) What are ghost grads?
(37:19) Improving ghost grads by rescaling the loss
(39:33) Applying ghost grads to all features at the start of training
(40:38) Further simplifying ghost grads
(41:40) Does ghost grads transfer to bigger models and different sites?
(44:35) Other miscellaneous observations
(46:15) SAEs on Tracr and Toy Models
(47:32) SAEs in Toy Models of Superposition
(49:07) SAEs in Tracr
(54:39) Replicating “Improvements to Dictionary Learning”
(57:24) Interpreting SAE Features with Gemini Ultra
(58:05) Why Care About Auto-Interp?
(59:34) Tentative observations
(01:01:17) How We’re Thinking About Auto-Interp
(01:02:40) Are You Smarter Than An LLM?
(01:03:54) Instrumenting LLM model internals in JAX
(01:06:34) Flexibility (using greenlets)
(01:07:23) Reading and writing
(01:09:05) Arbitrary interventions
(01:10:59) Greenlets
(01:12:19) Greenlets and JAX
(01:14:58) Greenlets and structured programming
(01:17:14) Nimbleness (using AST patching)
(01:20:24) Compilation and scalability (using layer stacking with conditional sowing)
The original text contained 28 footnotes which were omitted from this narration.
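As a taster for the JAX instrumentation snippet, here is a minimal sketch of the control-inversion pattern the greenlet entries in the outline refer to: the model runs in its own greenlet and hands activations back to user code, which can read or overwrite them before resuming the forward pass. The `sow` helper and the toy one-layer "model" are hypothetical stand-ins, not the actual implementation:

```python
# Hedged sketch of reading/writing model internals via greenlets.
# The toy "model" and the sow() helper are hypothetical stand-ins.
from greenlet import greenlet
import jax.numpy as jnp

def sow(name, value):
    # Hand (name, activation) to user code; resume with the (possibly
    # modified) value that user code switches back in.
    return main_greenlet.switch((name, value))

def model_forward(x):
    h = x * 2.0                       # stand-in for a transformer layer
    h = sow("layer_0.resid", h)       # expose the residual stream
    return h * 3.0

main_greenlet = greenlet.getcurrent()
model_greenlet = greenlet(model_forward)

out = model_greenlet.switch(jnp.ones(4))   # run model to first sow point
while isinstance(out, tuple):              # sow() yields (name, activation)
    name, act = out
    act = act + 1.0                        # arbitrary intervention: write
    out = model_greenlet.switch(act)       # resume the forward pass
print(out)                                 # model output, post-intervention
```

The appeal of the pattern is that intervention code reads top to bottom like ordinary structured code rather than being packed into callbacks; making it play well with compilation and layer stacking is what the snippet's later sections are about.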
---