The Nonlinear Library: Alignment Forum

AF - Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small by Joseph Isaac Bloom



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small, published by Joseph Isaac Bloom on February 2, 2024 on The AI Alignment Forum.
UPDATE: Since we posted this last night, someone pointed out that our implementation of ghost grads has a non-trivial error (which makes the results a priori quite surprising). We computed the ghost grad forward pass using Exp(Relu(W_enc(x)[dead_neuron_mask])) rather than Exp(W_enc(x)[dead_neuron_mask]). I'm running some ablation experiments now to get to the bottom of this.
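To make the distinction concrete, here is a minimal PyTorch sketch of the two forward passes. The shapes and variable names are illustrative only, not the actual training code from our repository:

```python
import torch

# Illustrative shapes for GPT2-small's residual stream and a ~25k-feature SAE.
d_in, d_sae = 768, 24576
W_enc = torch.randn(d_in, d_sae) * 0.01
b_enc = torch.zeros(d_sae)
x = torch.randn(32, d_in)                      # batch of residual stream activations
dead_neuron_mask = torch.zeros(d_sae, dtype=torch.bool)
dead_neuron_mask[:100] = True                  # pretend the first 100 features are dead

pre_acts = x @ W_enc + b_enc                   # encoder pre-activations, shape (32, d_sae)

# What we actually computed (buggy): ReLU clamps negative pre-activations to 0,
# so every inactive dead feature outputs exp(0) = 1 and gets no useful gradient.
ghost_acts_buggy = torch.exp(torch.relu(pre_acts[:, dead_neuron_mask]))

# What ghost grads intends: exp of the raw pre-activations, giving dead features
# a smooth, strictly positive activation with a nonzero gradient everywhere.
ghost_acts_intended = torch.exp(pre_acts[:, dead_neuron_mask])
```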
This work was produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, under mentorship from Neel Nanda and Arthur Conmy. Funding for this work was provided by the Manifund Regranting Program and donors as well as LightSpeed Grants.
This is intended to be a fairly informal post sharing a set of Sparse Autoencoders trained on the residual stream of GPT2-small which achieve fairly good reconstruction performance and contain fairly sparse / interpretable features. More importantly, advice from Anthropic and community members has enabled us to train these far more efficiently / faster than before.
The specific methods that were most useful were: ghost gradients, learning rate warmup, and initializing the decoder bias with the geometric median. We discuss each of these in more detail below.
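As one concrete illustration, the decoder bias initialization can be implemented with a Weiszfeld-style iteration for the geometric median. The sketch below is a generic version under that assumption, not the exact routine from our repository:

```python
import torch

def geometric_median(points: torch.Tensor, n_iter: int = 100, eps: float = 1e-8) -> torch.Tensor:
    """Approximate the geometric median of `points` (n_samples, d_in) via Weiszfeld iteration."""
    guess = points.mean(dim=0)
    for _ in range(n_iter):
        # Weight each point by the inverse of its distance to the current guess.
        dists = torch.norm(points - guess, dim=1).clamp_min(eps)
        weights = 1.0 / dists
        guess = (weights[:, None] * points).sum(dim=0) / weights.sum()
    return guess

# Initialize the decoder bias from a sample of residual stream activations, e.g.:
# sae.b_dec.data = geometric_median(activation_sample)
```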
5 Minute Summary
We're publishing a set of 12 Sparse AutoEncoders for the GPT2 Small residual stream.
These dictionaries have approximately 25,000 features each, with very few dead features (mainly in the early layers) and high quality reconstruction (cross-entropy log loss when the activations are replaced with the SAE output is 3.3 - 3.6, compared with 3.3 normally; see the sketch after this summary for how this can be measured).
The L0's range from 5 in the first layer to 70 in the 9th SAE (increasing by about 5-10 per layer and dropping in the last two layers).
By choosing a fixed dictionary size, we can see how statistics like the number of dead features or reconstruction cross-entropy loss change with layer, giving some indication of how properties of the feature distribution vary with layer depth.
We haven't yet extensively analyzed these dictionaries, but we are sharing the feature dashboards we generated automatically.
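For concreteness, the reconstruction CE log loss mentioned above can be measured by splicing the SAE's output into the model's forward pass. Here is a minimal sketch using TransformerLens; the `sae` object returning a reconstruction of its input is an assumption for illustration, not our actual API:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

def ce_loss_with_reconstruction(tokens, sae, layer: int) -> float:
    """Cross-entropy loss with the residual stream at `layer` replaced by the SAE reconstruction."""
    loss = model.run_with_hooks(
        tokens,
        return_type="loss",
        # Replace the residual stream entering `layer` with its SAE reconstruction;
        # `sae` is assumed to map activations to reconstructed activations.
        fwd_hooks=[(f"blocks.{layer}.hook_resid_pre",
                    lambda resid, hook: sae(resid))],
    )
    return loss.item()
```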
Readers can access the Sparse Autoencoder weights in this HuggingFace Repo. Training code and code for loading the weights / model and data loaders can be found in this GitHub Repository. Training curves and feature dashboards can also be found in this wandb report. Users can download all 25k feature dashboards generated for the layer 2 and layer 10 SAEs, and the first 5,000 of the layer 5 SAE features, here (note that the left-hand column of the dashboards should currently be ignored).
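As a rough illustration of loading the weights, the snippet below uses the standard huggingface_hub download API; the repo id, filename, and state-dict contents are placeholders, not the actual paths (see the linked repositories for the real loading code):

```python
import torch
from huggingface_hub import hf_hub_download

# Placeholder repo id and filename for illustration only.
path = hf_hub_download(
    repo_id="<user>/gpt2-small-residual-saes",  # hypothetical repo id
    filename="layer_5.pt",                      # hypothetical filename
)
state_dict = torch.load(path, map_location="cpu")
print(state_dict.keys())  # e.g. encoder/decoder weights and biases
```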
| Layer | Variance Explained | L1 Loss | L0* | % Alive Features | Reconstruction CE Log Loss |
|---|---|---|---|---|---|
| 0 | 99.15% | 4.58 | 12.24 | 80.0% | 3.32 |
| 1 | 98.37% | 41.04 | 14.68 | 83.4% | 3.33 |
| 2 | 98.07% | 51.88 | 18.80 | 80.0% | 3.37 |
| 3 | 96.97% | 74.96 | 25.75 | 86.3% | 3.48 |
| 4 | 95.77% | 90.23 | 33.14 | 97.7% | 3.44 |
| 5 | 94.90% | 108.59 | 43.61 | 99.7% | 3.45 |
| 6 | 93.90% | 136.07 | 49.68 | 100% | 3.44 |
| 7 | 93.08% | 138.05 | 57.29 | 100% | 3.45 |
| 8 | 92.57% | 167.35 | 65.47 | 100% | 3.45 |
| 9 | 92.05% | 198.42 | 71.10 | 100% | 3.45 |
| 10 | 91.12% | 215.11 | 53.79 | 100% | 3.52 |
| 11 | 93.30% | 270.13 | 59.16 | 100% | 3.57 |
| Original Model | | | | | 3.3 |

Summary Statistics for GPT2 Small Residual Stream SAEs. *L0 = average number of features firing per token.
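A minimal sketch of how metrics like these can be computed per batch, using generic definitions (assuming `x` is the original activations, `x_hat` the SAE reconstruction, and `feature_acts` the post-ReLU feature activations; not our exact evaluation code):

```python
import torch

def sae_batch_stats(x: torch.Tensor, x_hat: torch.Tensor, feature_acts: torch.Tensor):
    """x, x_hat: (n_tokens, d_in); feature_acts: (n_tokens, d_sae)."""
    l0 = (feature_acts > 0).float().sum(dim=-1).mean()   # avg features firing per token
    l1 = feature_acts.abs().sum(dim=-1).mean()           # sparsity penalty term
    # Fraction of the variance in x captured by the reconstruction.
    var_explained = 1 - (x - x_hat).pow(2).sum() / (x - x.mean(0)).pow(2).sum()
    return l0.item(), l1.item(), var_explained.item()
```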
Training SAEs that we were happy with used to take much longer than it does now. Last week it took me 20 hours to train a 50k-feature SAE on 1 billion tokens; over the weekend it took us 3 hours to train a 25k-feature SAE on 300M tokens with similar variance explained, L0, and CE loss recovered.
We attribute the improvement to having implemented various pieces of advice that have made our lives a lot easier:
Ghost Gradients / Avoiding Resampling: Prior to ghost gradients (which we were made aware of last week in the Anthropic Jan...