Some open-source dictionaries and dictionary learning infrastructure
By Sam Marks, December 5, 2023, The AI Alignment Forum
As more people begin work on interpretability projects which incorporate dictionary learning, it will be valuable to have high-quality dictionaries publicly available.[1] To get the ball rolling on this, my collaborator (Aaron Mueller) and I are:
open-sourcing a number of sparse autoencoder dictionaries trained on Pythia-70m MLPs
releasing our repository for training these dictionaries[2].
Let's discuss the dictionaries first, and then the repo.
The dictionaries
The dictionaries can be downloaded from here. See the sections "Downloading our open-source dictionaries" and "Using trained dictionaries" here for information about how to download and use them. If you use these dictionaries in a published paper, we ask that you mention us in the acknowledgements.
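For concreteness, here is a rough sketch of what loading and applying one of these dictionaries might look like. The class and method names (AutoEncoder, encode, decode) and the file path are guesses at the repo's interface, not a verbatim excerpt; the repo's README documents the actual usage.

```python
import torch
from dictionary_learning import AutoEncoder  # class name assumed; see the repo README

activation_dim = 512        # MLP output dimension for pythia-70m-deduped
dictionary_size = 16 * 512  # e.g. the 0_8192 dictionaries

# load a trained dictionary (path is illustrative, not the actual download layout)
ae = AutoEncoder(activation_dim, dictionary_size)
ae.load_state_dict(torch.load("dictionaries/layer0/ae.pt"))

# stand-in for real MLP output activations gathered from the model
acts = torch.randn(128, activation_dim)
features = ae.encode(acts)            # sparse feature coefficients
reconstruction = ae.decode(features)  # approximate reconstruction of acts
```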
We're releasing two sets of dictionaries for EleutherAI's 6-layer pythia-70m-deduped model. The dictionaries in both sets were trained on 512-dimensional MLP output activations (not the MLP hidden layer like Anthropic used), using ~800M tokens from The Pile.
The first set, called 0_8192, consists of dictionaries of size 8192 = 16 × 512. These were trained with an L1 penalty of 1e-3.
The second set, called 1_32768, consists of dictionaries of size 32768 = 64 × 512. These were trained with an L1 penalty of 3e-3.
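For readers unfamiliar with the setup, the sketch below shows the standard sparse-autoencoder training objective these hyperparameters plug into: reconstruction error plus an L1 penalty on the feature coefficients. It is a generic formulation written for illustration, not the repo's actual training code; the architecture details there may differ (e.g. bias terms or decoder weight normalization).

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Generic sketch of a one-hidden-layer sparse autoencoder."""
    def __init__(self, activation_dim=512, dict_size=16 * 512):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, dict_size)
        self.decoder = nn.Linear(dict_size, activation_dim)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse feature coefficients
        x_hat = self.decoder(f)          # reconstruction of the activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_penalty):
    mse = (x - x_hat).pow(2).mean()      # reconstruction term ("MSE loss" below)
    l1 = f.abs().sum(dim=-1).mean()      # sparsity term ("L1 loss" below)
    return mse + l1_penalty * l1

# toy usage with the 0_8192 hyperparameters
sae = SparseAutoencoder(activation_dim=512, dict_size=8192)
x = torch.randn(64, 512)                 # stand-in for MLP output activations
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f, l1_penalty=1e-3)
```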
Here are some statistics. (See our repo's readme for more info on what these statistics mean.)
For dictionaries in the 0_8192 set:

| Layer | MSE Loss | L1 Loss | L0 | % Alive | % Loss Recovered |
|-------|----------|---------|------|---------|------------------|
| 0 | 0.056 | 6.132 | 9.951 | 0.998 | 0.984 |
| 1 | 0.089 | 6.677 | 44.739 | 0.887 | 0.924 |
| 2 | 0.108 | 11.44 | 62.156 | 0.587 | 0.867 |
| 3 | 0.135 | 23.773 | 175.303 | 0.588 | 0.902 |
| 4 | 0.148 | 27.084 | 174.07 | 0.806 | 0.927 |
| 5 | 0.179 | 47.126 | 235.05 | 0.672 | 0.972 |
For dictionaries in the 1_32768 set:

| Layer | MSE Loss | L1 Loss | L0 | % Alive | % Loss Recovered |
|-------|----------|---------|------|---------|------------------|
| 0 | 0.09 | 4.32 | 2.873 | 0.174 | 0.946 |
| 1 | 0.13 | 2.798 | 11.256 | 0.159 | 0.768 |
| 2 | 0.152 | 6.151 | 16.381 | 0.118 | 0.724 |
| 3 | 0.211 | 11.571 | 39.863 | 0.226 | 0.765 |
| 4 | 0.222 | 13.665 | 29.235 | 0.19 | 0.816 |
| 5 | 0.265 | 26.4 | 43.846 | 0.13 | 0.931 |
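As a rough guide to the columns above (these are approximate definitions for illustration; the repo's README gives the ones actually used): MSE loss and L1 loss are the two terms of the training objective, L0 is the average number of features active per activation, and % alive is the fraction of features that fire at least once over a sample of inputs. % loss recovered compares the model's next-token loss when the MLP output is replaced by its dictionary reconstruction against zero-ablating it and against the unmodified model, so it requires running the full model and is omitted from this sketch.

```python
import torch

def dictionary_stats(sae, x):
    """Approximate versions of the table statistics, given a trained
    autoencoder (as in the sketch above) and a batch of activations x."""
    x_hat, f = sae(x)                    # f: (num_inputs, dict_size)
    active = f > 0
    return {
        "mse_loss": (x - x_hat).pow(2).mean().item(),
        "l1_loss": f.abs().sum(dim=-1).mean().item(),
        "l0": active.float().sum(dim=-1).mean().item(),         # avg. features active per input
        "frac_alive": active.any(dim=0).float().mean().item(),  # features firing at least once
    }
```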
And here are some histograms of feature frequencies.
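The quantity being histogrammed is each feature's firing frequency: the fraction of tokens on which its coefficient is nonzero. A rough sketch of how one might compute it (again using the generic autoencoder sketch above, not the repo's code):

```python
import torch

def feature_frequencies(sae, activation_batches):
    """Fraction of tokens on which each dictionary feature fires."""
    counts, total_tokens = None, 0
    for x in activation_batches:            # each x: (num_tokens, activation_dim)
        _, f = sae(x)
        fired = (f > 0).float().sum(dim=0)  # per-feature count of firing tokens
        counts = fired if counts is None else counts + fired
        total_tokens += x.shape[0]
    return counts / total_tokens            # per-feature frequency in [0, 1]
```

Plotting these frequencies on a log scale is what makes the cluster of high-frequency features discussed below visible.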
Overall, I'd guess that these dictionaries are decent, but not amazing.
We trained these dictionaries because we wanted to work on a downstream application of dictionary learning, but lacked the dictionaries. These dictionaries are more than good enough to get us off the ground on our mainline project, but I expect that before long we'll come back and train some better dictionaries (which we'll also open-source). I think the same is true for other folks: these dictionaries should be sufficient for getting started on projects that require dictionaries, and when better dictionaries become available later, you can swap them in for better results.
Some miscellaneous notes about these dictionaries (you can find more in the repo):
The L1 penalty for 1_32768 seems to have been too large; only 10-20% of the neurons are alive, and the loss recovered is much worse. That said, we'll remark that after examining features from both sets of dictionaries, the dictionaries from the 1_32768 set seem to have more interpretable features than those from the 0_8192 set (though it's hard to tell).
In particular, we suspect that for 0_8192, the many high-frequency features in the later layers are uninterpretable but help significantly with reconstructing activations, resulting in deceptively good-looking statistics. (See the bullet point below regarding neuron resampling and bimodality.)
As we progress through the layers, the dictionaries tend to get worse along most metrics (except for % loss recovered). This may have to do with the growing scale of the activations themselves as one moves through the layers of Pythia models (h/t to Arthur Conmy for raising this hypothesis).
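One way to check this hypothesis (a sketch of my own, not something from the post): record the norm of each layer's MLP output on some text and see whether it grows with depth. This assumes the standard HuggingFace GPTNeoX module layout for pythia-70m-deduped.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m-deduped"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

norms = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # output: (batch, seq_len, 512) MLP output activations
        norms[layer_idx] = output.norm(dim=-1).mean().item()
    return hook

handles = [
    layer.mlp.register_forward_hook(make_hook(i))
    for i, layer in enumerate(model.gpt_neox.layers)
]

with torch.no_grad():
    batch = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
    model(**batch)

for h in handles:
    h.remove()

for i in sorted(norms):
    print(f"layer {i}: mean MLP output norm = {norms[i]:.2f}")
```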
We note that our dictionary fea...