[This is an interim report and a continuation of work from the research sprint done during MATS Winter 7 (Neel Nanda's Training Phase)]
Try out binary masking for a few residual SAEs in this Colab notebook: [GitHub Notebook] [Colab Notebook]
TL;DR:
We propose a novel approach to:
- Scaling SAE Circuits to Large Models: By placing sparse autoencoders only in the residual stream at intervals, we find circuits in models as large as Gemma 9B without requiring SAEs to be trained for every transformer layer.
- Finding Circuits: We develop a better circuit-finding algorithm. Our method optimizes a binary mask over SAE latents (see the illustrative sketch after the TL;DR), which proves significantly more effective than existing thresholding-based methods such as Attribution Patching or Integrated Gradients.
Our discovered circuits paint a clear picture of how Gemma performs a given task; one circuit achieves 95% faithfulness with fewer than 20 total latents. This minimality lets us quickly understand the [...]
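To make the masking idea concrete, here is a minimal sketch of what optimizing a continuous relaxation of a binary mask over SAE latents could look like. This is illustrative only, not the exact implementation used in this work: the `LatentMask` module, the mean-ablation choice for masked-out latents, and the `run_model_with_masked_saes` helper are assumptions made for the example.

```python
import torch

class LatentMask(torch.nn.Module):
    """Continuous relaxation of a binary mask over SAE latents.

    During optimization the mask is sigmoid(logits) in (0, 1); at evaluation it
    is thresholded to {0, 1}. Masked-out latents are replaced by their mean
    activation over a reference distribution (mean ablation) rather than
    zeroed -- one common choice, assumed here for illustration.
    """

    def __init__(self, n_latents: int, mean_acts: torch.Tensor):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(n_latents))
        self.register_buffer("mean_acts", mean_acts)  # shape: (n_latents,)

    def forward(self, latents: torch.Tensor, hard: bool = False) -> torch.Tensor:
        # latents: SAE encoding of the residual stream, shape (batch, seq, n_latents)
        mask = torch.sigmoid(self.logits)
        if hard:
            mask = (mask > 0.5).float()
        # Keep masked-in latents, mean-ablate the rest.
        return mask * latents + (1 - mask) * self.mean_acts


def train_mask(mask_module, batches, run_model_with_masked_saes,
               sparsity_coeff=1e-2, steps=1000):
    """Optimize the mask: recover task performance while keeping few latents.

    `run_model_with_masked_saes` is a hypothetical helper that splices the
    masked SAE reconstructions back into the residual stream and returns the
    task loss on a batch of prompts.
    """
    opt = torch.optim.Adam(mask_module.parameters(), lr=1e-2)
    for step in range(steps):
        batch = batches[step % len(batches)]
        task_loss = run_model_with_masked_saes(batch, mask_module)
        # L1 penalty on the soft mask pushes the circuit toward few latents.
        sparsity = torch.sigmoid(mask_module.logits).sum()
        loss = task_loss + sparsity_coeff * sparsity
        opt.zero_grad()
        loss.backward()
        opt.step()
```

After training, thresholding the soft mask (`hard=True`) yields a discrete set of latents, which is the candidate circuit evaluated for faithfulness.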
---
Outline:
TL;DR
1 Introduction
2 Background
2.1 SAEs
2.2 Circuits
2.3 Problems with Current Sparse Feature Interpretability Approaches
2.3.1 Scalability
2.3.2 Independent Scoring of Nodes
2.3.3 Error Nodes
3 Our Approach
3.1 Solving Scalability: Circuits with Few Residual SAEs
3.2 Solving Independent Scoring: Masking
3.3 Error Nodes
4 Results
4.1 Setup
4.2 Performance Recovery
4.2.1 Code Output Prediction
4.2.2 Subject Verb Agreement (SVA)
4.2.3 IOI
4.3 Completeness
4.4 Mask Stability
5 Case Study: Code Output Prediction
6 Conclusions
7 Future Research and Ideas
---