Attention SAEs Scale to GPT-2 Small, published by Connor Kissane on February 3, 2024 on the AI Alignment Forum.
This is an interim report that we are currently building on. We hope this update and the open-sourcing of our SAEs will be useful to related research occurring in parallel. Produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort.
Executive Summary
In a previous post, we showed that sparse autoencoders (SAEs) work on the attention layer outputs of a two-layer transformer. We scale our attention SAEs to GPT-2 Small and continue to find sparse, interpretable features in every layer. This makes us optimistic about our ongoing efforts to scale further, especially since we didn't have to do much iterating.
We open source our SAEs. Load them from Hugging Face or via this colab notebook.
The SAEs seem good, often recovering more than 80% of the loss relative to zero ablation, and are sparse, with fewer than 20 features firing on average. The majority of the live features are interpretable.
We continue to find the same three feature families that we found in the two-layer model: induction features, local context features, and high-level context features. This suggests that some of our lessons from interpreting features in smaller models may generalize.
We also find new, interesting feature families that we didn't find in the two-layer model, providing hints about fundamentally different capabilities in GPT-2 Small.
See our feature interface to browse the first 30 features for each layer.
Introduction
In Sparse Autoencoders Work on Attention Layer Outputs, we showed that we can apply SAEs to extract sparse, interpretable features from the last attention layer of a two-layer transformer. We have since applied the same technique to a 12-layer model, GPT-2 Small, and continue to find sparse, interpretable features in every layer. Our SAEs often recover more than 80% of the loss[1] and are sparse, with fewer than 20 features firing on average. We perform shallow investigations of the first 30 features from each layer and find that the majority (often 80%+) of non-dead SAE features are interpretable. See our interactive visualizations for each layer.
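For readers who want to sanity-check these metrics, here is a minimal sketch of how L0 and loss recovered can be computed with TransformerLens hooks. The hook point, the DummySAE class, and its forward signature are illustrative assumptions rather than our exact setup; a trained SAE loaded from the release below would be substituted for the stand-in.

```python
# A minimal sketch of the two headline metrics: L0 (average number of active
# features per token) and loss recovered relative to zero-ablating the
# attention output. DummySAE, the hook point, and the SAE interface are
# illustrative assumptions, not our exact training setup.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
layer = 5
hook_name = f"blocks.{layer}.hook_attn_out"  # assumed hook point, for illustration


class DummySAE(torch.nn.Module):
    """Stand-in for a trained SAE: ReLU encoder + linear decoder."""

    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.W_enc = torch.nn.Parameter(0.01 * torch.randn(d_in, d_hidden))
        self.b_enc = torch.nn.Parameter(torch.zeros(d_hidden))
        self.W_dec = torch.nn.Parameter(0.01 * torch.randn(d_hidden, d_in))
        self.b_dec = torch.nn.Parameter(torch.zeros(d_in))

    def forward(self, x):
        feature_acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        recon = feature_acts @ self.W_dec + self.b_dec
        return recon, feature_acts


sae = DummySAE(d_in=model.cfg.d_model, d_hidden=16 * model.cfg.d_model)
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog.")
l0s = []


def splice_sae(acts, hook):
    # Replace the attention output with the SAE reconstruction, recording L0.
    recon, feature_acts = sae(acts)
    l0s.append((feature_acts > 0).float().sum(-1).mean().item())
    return recon


def zero_ablate(acts, hook):
    return torch.zeros_like(acts)


clean_loss = model(tokens, return_type="loss")
sae_loss = model.run_with_hooks(tokens, return_type="loss",
                                fwd_hooks=[(hook_name, splice_sae)])
zero_loss = model.run_with_hooks(tokens, return_type="loss",
                                 fwd_hooks=[(hook_name, zero_ablate)])

# Fraction of the clean-vs-zero-ablation loss gap that the SAE closes.
loss_recovered = (zero_loss - sae_loss) / (zero_loss - clean_loss)
print(f"L0: {l0s[-1]:.1f}, loss recovered: {loss_recovered.item():.1%}")
```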
We open source our SAEs in the hope that they will be useful to other researchers currently working on dictionary learning. We are particularly excited about using these SAEs to better understand attention circuits at the feature level. See the SAEs on Hugging Face or load them using this colab notebook. Below we provide the key metrics for each SAE:
| Layer | L0 norm | Loss recovered | Dead features | % alive features interpretable |
|---|---|---|---|---|
| L0 | 3 | 99% | 13% | 97% |
| L1 | 20 | 78% | 49% | 87% |
| L2 | 16 | 90% | 20% | 95% |
| L3 | 15 | 84% | 8% | 75% |
| L4 | 15 | 88% | 5% | 100% |
| L5 | 20 | 85% | 40% | 82% |
| L6 | 19 | 82% | 28% | 75% |
| L7 | 19 | 83% | 58% | 70% |
| L8 | 20 | 76% | 37% | 64% |
| L9 | 21 | 83% | 48% | 85% |
| L10 | 16 | 85% | 41% | 81% |
| L11 | 8 | 89% | 84% | 66% |
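For concreteness, here is a hedged sketch of what loading one of these SAEs might look like. The repo id, filename, and saved key names below are placeholders rather than the actual layout of our Hugging Face upload; treat the colab notebook as the authoritative loading path.

```python
# Sketch: downloading one layer's SAE weights from Hugging Face and inspecting
# them. The repo id and filename are hypothetical placeholders; see the colab
# notebook / Hugging Face page for the actual file layout.
import torch
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="your-org/gpt2-small-attn-saes",   # placeholder repo id
    filename="gpt2_small_L5_attn_sae.pt",      # placeholder filename
)
state_dict = torch.load(path, map_location="cpu")

# Inspect the saved tensors, then load them into whatever SAE class you use
# (e.g. the DummySAE sketch above, if shapes and key names line up).
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
```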
It's worth noting that we didn't do much differently to train these,[2] leaving us optimistic about the tractability of scaling attention SAEs to even bigger models.
Excitingly, we also continue to identify feature families. We find features from all three of the families that we identified in the two-layer model: induction features, local context features, and high-level context features. This gives us hope that some of our lessons from interpreting features in smaller models will continue to generalize.
We also find new, interesting feature families in GPT-2 Small, suggesting that attention SAEs can provide valuable hints about new[3] capabilities that larger models have learned. Some new features include:
Successor features, which activate when predicting the next item in a sequence such as "15, 16" -> "17" (and which partly come from Successor Heads in the model), and boost the logits of the next item.
Name mover features, which predict a name in the context, such as in the IOI task.
Duplicate token f...