Stitching SAEs of different sizes
By Bart Bussmann, published July 13, 2024 on the AI Alignment Forum.
Work done in Neel Nanda's stream of MATS 6.0, with equal contribution by Bart Bussmann and Patrick Leask. Patrick Leask is concurrently a PhD candidate at Durham University.
TL;DR: When you scale up an SAE, the features in the larger SAE can be categorized into two groups: 1) "novel features" that carry information not present in the small SAE, and 2) "reconstruction features" that sparsify information the small SAE already captures. You can stitch SAEs together by adding the novel features to the smaller SAE.
Introduction
Sparse autoencoders (SAEs) have been shown to recover sparse, monosemantic features from language models. However, there has been limited research into how these features vary with dictionary size: if you take the same activations from the same model and train a wider dictionary on them, what changes, and how do the learned features differ?
We show that features in larger SAEs cluster into two kinds: those that capture similar information to the smaller SAE (either identical features or split features; about 65%), and those that capture novel information absent from the smaller SAE (the remaining 35%). We validate this by showing that inserting the novel features from the larger SAE into the smaller SAE boosts reconstruction performance, while inserting the similar features makes performance worse.
Building on this insight, we show how features from multiple SAEs of different sizes can be combined to create a "Frankenstein" model that outperforms SAEs with an equal number of features, though this tends to lead to a higher L0, making a fair comparison difficult. Our work provides a new understanding of how SAE dictionary size affects the learned feature space, and of how to reason about whether to train a wider SAE.
We hope that this method may also lead to a practically useful way of training high-performance SAEs with less feature splitting and a wider range of learned novel features.
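To make the stitching operation concrete, here is a minimal PyTorch sketch (not the code used in this work): the SparseAutoencoder layout, the weight names, and the novel_idx list of "novel" feature indices in the larger SAE are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Minimal SAE: x -> ReLU(x @ W_enc + b_enc) -> acts @ W_dec + b_dec."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.zeros(d_model, n_features))
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.zeros(n_features, d_model))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        acts = torch.relu(x @ self.W_enc + self.b_enc)
        return acts @ self.W_dec + self.b_dec


def stitch_saes(small: SparseAutoencoder, large: SparseAutoencoder,
                novel_idx: torch.Tensor) -> SparseAutoencoder:
    """Append the features of `large` indexed by `novel_idx` onto `small`."""
    d_model, f_small = small.W_enc.shape
    stitched = SparseAutoencoder(d_model, f_small + len(novel_idx))
    with torch.no_grad():
        # Keep every feature of the small SAE unchanged.
        stitched.W_enc[:, :f_small] = small.W_enc
        stitched.b_enc[:f_small] = small.b_enc
        stitched.W_dec[:f_small] = small.W_dec
        # Append only the novel encoder/decoder directions from the large SAE.
        stitched.W_enc[:, f_small:] = large.W_enc[:, novel_idx]
        stitched.b_enc[f_small:] = large.b_enc[novel_idx]
        stitched.W_dec[f_small:] = large.W_dec[novel_idx]
        # Keep the small SAE's decoder bias (a simplifying assumption).
        stitched.b_dec.copy_(small.b_dec)
    return stitched
```

In this sketch, the smaller SAE's features are left untouched and only the encoder rows, encoder biases, and decoder rows of the selected novel features are appended, which is what makes the resulting "Frankenstein" SAE larger than either parent in L0 but not in parameter count per feature.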
Larger SAEs learn both similar and entirely novel features
Set-up
We use sparse autoencoders as in Towards Monosemanticity and Sparse Autoencoders Find Highly Interpretable Directions. In our setup, the encoder computes the feature activations from the input activation, the decoder reconstructs the input from these feature activations, and the encoder and decoder matrices and biases are trained with a loss that combines the L2 reconstruction error with an L1 sparsity penalty on the feature activations.
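Concretely, this is a minimal sketch of the standard formulation from Towards Monosemanticity; the notation ($W_{\mathrm{enc}}$, $\mathbf{b}_{\mathrm{enc}}$, $W_{\mathrm{dec}}$, $\mathbf{b}_{\mathrm{dec}}$, $\lambda$) is assumed here rather than taken from the original equations:

$$f(\mathbf{x}) = \operatorname{ReLU}\!\left(W_{\mathrm{enc}}\,\mathbf{x} + \mathbf{b}_{\mathrm{enc}}\right)$$
$$\hat{\mathbf{x}} = W_{\mathrm{dec}}\,f(\mathbf{x}) + \mathbf{b}_{\mathrm{dec}}$$
$$\mathcal{L}(\mathbf{x}) = \lVert \mathbf{x} - \hat{\mathbf{x}} \rVert_2^2 + \lambda\,\lVert f(\mathbf{x}) \rVert_1$$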
In our experiments, we train a range of sparse autoencoders (SAEs) with varying widths on residual stream activations of GPT-2 and Pythia-410M. The width of an SAE is the number of features (F) in its dictionary. Our smallest SAE on GPT-2 has only 768 features, while the largest has nearly 100,000 features. Here is the full list of SAEs used in this research:
| Name | Model site | Dictionary size | L0 | MSE | CE loss recovered from zero ablation | CE loss recovered from mean ablation |
|---|---|---|---|---|---|---|
| GPT2-768 | gpt2-small layer 8 of 12 resid_pre | 768 | 35.2 | 2.72 | 0.915 | 0.876 |
| GPT2-1536 | gpt2-small layer 8 of 12 resid_pre | 1536 | 39.5 | 2.22 | 0.942 | 0.915 |
| GPT2-3072 | gpt2-small layer 8 of 12 resid_pre | 3072 | 42.4 | 1.89 | 0.955 | 0.937 |
| GPT2-6144 | gpt2-small layer 8 of 12 resid_pre | 6144 | 43.8 | 1.631 | 0.965 | 0.949 |
| GPT2-12288 | gpt2-small layer 8 of 12 resid_pre | 12288 | 43.9 | 1.456 | 0.971 | 0.958 |
| GPT2-24576 | gpt2-small layer 8 of 12 resid_pre | 24576 | 42.9 | 1.331 | 0.975 | 0.963 |
| GPT2-49152 | gpt2-small layer 8 of 12 resid_pre | 49152 | 42.4 | 1.210 | 0.978 | 0.967 |
| GPT2-98304 | gpt2-small layer 8 of 12 resid_pre | 98304 | 43.9 | 1.144 | 0.980 | 0.970 |
| Pythia-8192 | Pythia-410M-deduped layer 3 of 24 resid_pre | 8192 | 51.0 | 0.030 | 0.977 | 0.972 |
| Pythia-16384 | Pythia-410M-deduped layer 3 of 24 resid_pre | 16384 | 43.2 | 0.024 | 0.983 | 0.979 |
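For reference, the "CE loss recovered" columns can be read under the standard definition used in the SAE literature (an assumption; the post's exact metric may differ): the SAE reconstruction is patched into the residual stream, and its cross-entropy loss is compared against the clean model and an ablation baseline (zeroing or mean-ablating the activation):

$$\text{CE recovered} = \frac{\mathrm{CE}_{\text{ablate}} - \mathrm{CE}_{\text{patched}}}{\mathrm{CE}_{\text{ablate}} - \mathrm{CE}_{\text{clean}}}$$

A value of 1 means patching in the SAE reconstruction leaves the model's loss unchanged from the clean run.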
The base language models used are those included in Transform...