The Nonlinear Library: Alignment Forum

AF - Sparse Autoencoders Find Highly Interpretable Directions in Language Models by Logan Riggs Smith



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Sparse Autoencoders Find Highly Interpretable Directions in Language Models, published by Logan Riggs Smith on September 21, 2023 on The AI Alignment Forum.
This is a linkpost for Sparse Autoencoders Find Highly Interpretable Directions in Language Models
We use a scalable and unsupervised method called Sparse Autoencoders to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M), for both the residual stream and MLPs. We showcase monosemantic features and feature replacement for Indirect Object Identification (IOI), and use OpenAI's automatic interpretation protocol to demonstrate a significant improvement in interpretability.
Paper Overview
Sparse Autoencoders & Superposition
To reverse engineer a neural network, we'd like to first break it down into smaller units (features) that can be analysed in isolation. Using individual neurons as these units can be useful, but neurons are often polysemantic, activating for several unrelated types of feature, so looking at neurons alone is insufficient. Also, for some types of network activations, like the residual stream of a transformer, there is little reason to expect features to align with the neuron basis, so we don't even have a good place to start.
Toy Models of Superposition investigates why polysemanticity might arise and hypothesises that it may result from models learning more distinct features than there are dimensions in the layer, taking advantage of the fact that features are sparse, each one only being active a small proportion of the time. This suggests that we may be able to recover the network's features by finding a set of directions in activation space such that each activation vector can be reconstructed from a sparse linear combination of these directions.
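In symbols (our notation here, matching the feature-patching equation later in the post), the hypothesis is that an activation vector $x$ is approximately a sparse linear combination of learned feature directions $f_i$:

$$x \approx \sum_i c_i \, f_i, \qquad \text{with only a few } c_i \neq 0.$$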
We attempt to reconstruct these hypothesised network features by training linear autoencoders on model activation vectors. We use a sparsity penalty on the embedding and tied weights between the encoder and decoder, training the autoencoders on 10M to 50M activation vectors each. For more detail on the methods used, see the paper.
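As a rough illustration only (not the paper's exact architecture or hyperparameters), a tied-weight sparse autoencoder of this kind could be sketched in PyTorch as follows; the ReLU encoder, the row-normalised decoder, and the l1_coeff value are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TiedSparseAutoencoder(nn.Module):
    """Illustrative sketch: a dictionary of d_dict directions in a d_act-dimensional activation space."""

    def __init__(self, d_act: int, d_dict: int):
        super().__init__()
        # Tied weights: the same matrix is used for encoding and (transposed) decoding.
        self.W = nn.Parameter(torch.randn(d_dict, d_act) * 0.01)
        self.b = nn.Parameter(torch.zeros(d_dict))

    def forward(self, x: torch.Tensor):
        # Encode: non-negative feature activations (the "embedding" the sparsity penalty acts on).
        c = F.relu(x @ self.W.T + self.b)
        # Decode: sparse linear combination of unit-norm dictionary directions.
        directions = self.W / self.W.norm(dim=1, keepdim=True)
        x_hat = c @ directions
        return x_hat, c


def sae_loss(x: torch.Tensor, x_hat: torch.Tensor, c: torch.Tensor, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the feature activations.
    return F.mse_loss(x_hat, x) + l1_coeff * c.abs().mean()
```

Training then amounts to minimising sae_loss over batches of cached activation vectors from the layer of interest.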
Automatic Interpretation
We use the same automatic interpretation technique that OpenAI used to interpret the neurons in GPT-2 to analyse our features, as well as alternative methods of decomposition. This was demonstrated in a previous post, but we now extend these results across all 6 layers in Pythia-70M, showing a clear improvement over all baselines in all but the final layers. Case studies later in the paper suggest that the features are still meaningful in these later layers, but that automatic interpretation struggles to perform well.
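Roughly, the protocol has an explainer model write a natural-language description of a feature from its top-activating text, a simulator model predict the feature's activation on held-out tokens from that description alone, and the explanation scored by how well the simulated activations track the real ones. A minimal sketch of that last scoring step (the LLM calls are elided, and using a plain correlation as the score is a simplification of OpenAI's scoring):

```python
import numpy as np


def explanation_score(real_acts: np.ndarray, simulated_acts: np.ndarray) -> float:
    # Correlation between a feature's actual activations and the activations a
    # simulator model predicts from the natural-language explanation alone.
    if real_acts.std() == 0 or simulated_acts.std() == 0:
        return 0.0  # a constant signal gives no evidence either way
    return float(np.corrcoef(real_acts, simulated_acts)[0, 1])
```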
IOI Feature Identification
We are able to use less-than-rank-one ablations to precisely edit activations and restore uncorrupted behaviour on the IOI task. With normal activation patching, patches occur at a module-wide level, while here we perform interventions of the form
$$x' = \bar{x} + \sum_{i \in F} (c_i - \bar{c}_i) \, f_i$$
where $\bar{x}$ is the embedding of the corrupted datapoint, $F$ is the set of patched features, and $c_i$ and $\bar{c}_i$ are the activations of feature $f_i$ on the clean and corrupted datapoint respectively.
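A minimal sketch of this intervention, assuming the dictionary directions and the feature activations on the clean and corrupted prompts have already been computed (all variable names here are illustrative):

```python
import torch


def patch_features(
    corrupted_act: torch.Tensor,    # x-bar: activation vector on the corrupted datapoint
    clean_codes: torch.Tensor,      # c_i: feature activations on the clean datapoint
    corrupted_codes: torch.Tensor,  # c-bar_i: feature activations on the corrupted datapoint
    dictionary: torch.Tensor,       # rows f_i: learned dictionary directions
    patched_features: list[int],    # F: indices of the features being patched
) -> torch.Tensor:
    x_prime = corrupted_act.clone()
    for i in patched_features:
        # Shift the corrupted activation along direction f_i by the clean-vs-corrupted
        # difference in that feature's activation.
        x_prime = x_prime + (clean_codes[i] - corrupted_codes[i]) * dictionary[i]
    return x_prime
```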
We show that our features are better able to precisely reconstruct the data than other activation decomposition methods (like PCA), and moreover that the fine-grainedness of our edits increases with dictionary sparsity. Unfortunately, as our autoencoders are not able to perfectly reconstruct the data, they have a positive minimum KL-divergence from the base model, while PCA does not.
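As an illustration of how that KL term can be measured (not the paper's exact code): run the model once normally and once with the layer's activations replaced by their autoencoder reconstruction, then compare the two next-token distributions. The model hook needed to obtain the patched logits is model-specific and elided here:

```python
import torch
import torch.nn.functional as F


def kl_from_base(base_logits: torch.Tensor, patched_logits: torch.Tensor) -> torch.Tensor:
    # KL(base || patched), averaged over token positions; stays positive whenever the
    # reconstruction changes the model's predictions at all.
    base_logprobs = F.log_softmax(base_logits, dim=-1)
    patched_logprobs = F.log_softmax(patched_logits, dim=-1)
    return F.kl_div(patched_logprobs, base_logprobs, log_target=True, reduction="batchmean")
```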
Dictionary Features are Highly Monosemantic & Causal
(Left) Histogram of activations for a specific dictionary feature. The majority of activations are for apostrophes (in blue), where the y-axis is the number of datapoints that activate in that bin. (Right) Histogram of the drop in logits (i.e. how much the LLM predicts a spe...