Sparse Autoencoders Work on Attention Layer Outputs, by Connor Kissane, published January 16, 2024 on the AI Alignment Forum.
This post is the result of a two-week research sprint project during the training phase of Neel Nanda's MATS stream.
Executive Summary
We replicate Anthropic's MLP Sparse Autoencoder (SAE) paper on attention outputs and it works well: the SAEs learn sparse, interpretable features, which gives us insight into what attention layers learn. We study the second attention layer of a two-layer language model (with MLPs).
Specifically, rather than training our SAE on attn_output, we train our SAE on "hook_z", concatenated over all attention heads (aka the mixed values, i.e. the attention outputs before the final linear map W_O - see notation here). This is valuable as it lets us see how much of each feature's weights come from each head, which we believe is a promising direction for investigating attention head superposition, although we only briefly explore that in this work.
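For concreteness, here is a minimal sketch (not the authors' exact pipeline; the model name "gelu-2l" and the toy prompt are assumptions) of how one can gather these concatenated hook_z activations with TransformerLens:

```python
# Sketch: collect layer-1 attention outputs *before* the W_O projection,
# concatenated over heads, as inputs for an SAE.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-2l")  # a two-layer model with MLPs (assumed name)

tokens = model.to_tokens("The cat sat on the mat")
_, cache = model.run_with_cache(tokens)

z = cache["blocks.1.attn.hook_z"]                     # [batch, seq, n_heads, d_head]
batch, seq, n_heads, d_head = z.shape
sae_input = z.reshape(batch, seq, n_heads * d_head)   # concatenated over heads
```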
We open source our SAE; you can use it via this Colab notebook.
Shallow Dives: We do a shallow investigation to interpret each of the first 50 features. We estimate 82% of non-dead features in our SAE are interpretable (24% of the SAE features are dead).
See this feature interface to browse the first 50 features.
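As a rough illustration of the dead-feature estimate above, here is a sketch of one way to measure the fraction of dead features; `sae` (returning a reconstruction and the post-ReLU feature activations) and `activation_batches` are hypothetical handles, not objects from the post:

```python
import torch

@torch.no_grad()
def dead_feature_fraction(sae, activation_batches):
    """Fraction of SAE features that never fire over the sampled activations."""
    max_act = None
    for x in activation_batches:                 # x: [n_tokens, d_in]
        _, acts = sae(x)                         # acts: [n_tokens, d_sae], post-ReLU
        batch_max = acts.max(dim=0).values
        max_act = batch_max if max_act is None else torch.maximum(max_act, batch_max)
    return (max_act == 0).float().mean().item()
```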
Deep dives: To verify our SAEs have learned something real, we zoom in on individual features for much more detailed investigations: the "'board' is next by induction" feature, the local context feature of "in questions starting with 'Which'", and the more global context feature of "in texts about pets".
We go beyond the techniques from the Anthropic paper, and investigate the circuits used to compute the features from earlier components, including analysing composition with an MLP0 SAE.
We also investigate how the features are used downstream, and whether it's via MLP1 or the direct connection to the logits.
Automation: We automatically detect and quantify a large "{token} is next by induction" feature family. This represents ~5% of the alive features in the SAE.
Though the specific automation technique won't generalize to other feature families, this is notable: if there are many "one feature per vocab token" families like this, we may need impractically wide SAEs for larger models.
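The post doesn't spell out the automation here, but as an illustration of the kind of check involved, here is a hedged sketch of one plausible heuristic (not necessarily the authors' method): project each feature's decoder direction through W_O and the unembedding, and flag features whose direct logit effect is dominated by a single vocab token. `sae.W_dec` is a hypothetical handle to the trained SAE's decoder, and the threshold is arbitrary.

```python
# Heuristic sketch: find features whose direct effect on the logits is
# concentrated on one token, as "{token} is next" features would be.
import torch

layer = 1
W_O = model.W_O[layer]            # [n_heads, d_head, d_model]
W_U = model.W_U                   # [d_model, d_vocab]

dec = sae.W_dec.reshape(-1, model.cfg.n_heads, model.cfg.d_head)   # per-head decoder weights
resid_dirs = torch.einsum("fhd,hdm->fm", dec, W_O)                 # feature -> residual stream
logit_effects = resid_dirs @ W_U                                    # feature -> vocab logits

top2 = logit_effects.topk(2, dim=-1).values
dominance = top2[:, 0] - top2[:, 1]           # gap between top and runner-up boosted logit
candidates = (dominance > dominance.mean() + 3 * dominance.std()).nonzero()
```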
Introduction
In Anthropic's SAE paper, they find that training sparse autoencoders (SAEs) on a one-layer model's MLP activations yields interpretable features, providing a path to break down these high-dimensional activations into units that we can understand. In this post, we demonstrate that the same technique works on attention layer outputs and learns sparse, interpretable features!
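For reference, here is a minimal sketch of the standard SAE setup from that line of work, with a linear encoder, a ReLU bottleneck, a linear decoder, and an L1 sparsity penalty on the feature activations; the dimensions, initialization, and L1 coefficient below are placeholders, not the values used in this post:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_in, d_sae) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_in) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def forward(self, x):
        # Subtracting the decoder bias before encoding is one common variant.
        acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)  # sparse feature activations
        x_hat = acts @ self.W_dec + self.b_dec                         # reconstruction
        return x_hat, acts

def sae_loss(x, x_hat, acts, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    return ((x - x_hat) ** 2).mean() + l1_coeff * acts.abs().sum(dim=-1).mean()
```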
To see how interpretable our SAE is, we perform shallow investigations of the first 50 features of our SAE (i.e. randomly chosen features). We found that 76% are not dead (i.e. they activate on at least some inputs), and within the alive features we think 82% are interpretable. To get a feel for the features we find, see our interactive visualizations of the first 50. Here's one example:[1]
Shallow investigations are limited and may be misleading or illusory, so we then do deep dives to understand multiple individual features in much more detail, including:
"'board' is next, by induction" - one of many "{token} is next by induction" features
"In questions starting with 'Which'" - a local context feature, which interestingly is computed by multiple heads
"In pet context" - one of many high level context features
Similar to the Anthropic paper's "Detailed Investigations", we understand when these features activate and how they affect downstream computation. However, we also go beyond Anthropic's techniques, and look into the upstream circuits by which these features are computed from earlier components. An attention layer (with frozen att...