Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT, published by Robert AIZI on March 5, 2024 on LessWrong.
Abstract
A sparse autoencoder is a neural network architecture that has recently gained popularity as a technique to find interpretable features in language models (Cunningham et al, Anthropic's Bricken et al). We train a sparse autoencoder on OthelloGPT, a language model trained on transcripts of the board game Othello, which has been shown to contain a linear representation of the board state, findable by supervised probes. The sparse autoencoder finds 9 features that serve as high-accuracy classifiers of the board state, out of 180 findable with supervised probes (and 192 possible piece/position combinations). Across random seeds, the autoencoder repeatedly finds "simpler" features concentrated on the center of the board and the corners.
This demonstrates that current techniques for sparse autoencoders may fail to find a large majority of the interesting, interpretable features in a language model.
Introduction
There has been a recent flurry of research activity around Sparse Autoencoders for Dictionary Learning, a new approach to finding interpretable features in language models and potentially "solving superposition" (Sharkey et al, Anthropic's Bricken et al, Cunningham et al). But while this technique can find features which are interpretable, it is not yet clear if sparse autoencoders can find particular features of interest (e.g., features relevant to reducing AI risk).
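To make the setup concrete, here is a minimal sketch of the sparse-autoencoder architecture this line of work uses: activations are reconstructed through an overcomplete hidden layer of candidate "features", with an L1 penalty encouraging each input to activate only a few of them. The dimensions and L1 coefficient below are illustrative assumptions, not values taken from the cited papers:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: reconstruct activations through an overcomplete,
    sparsely activating hidden layer whose units are candidate features."""

    def __init__(self, d_model: int = 512, d_features: int = 1024):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # nonnegative feature activations
        x_hat = self.decoder(f)          # reconstruction of the input activations
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # reconstruction error plus an L1 penalty that pushes toward sparsity
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()
```

After training, the rows of the decoder weight matrix are interpreted as feature directions in the model's activation space.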
This research report seeks to answer the question of whether sparse autoencoders can find a set of a priori existing, interesting, and interpretable features in the OthelloGPT language model. OthelloGPT, as the name suggests, is a language model trained on transcripts of the board game Othello to predict legal moves, but was found to also linearly encode the current board state (Nanda, Hazineh et al). That is, for each of the 64 board positions, there are "board-state features" (linear mappings from the residual stream to $\mathbb{R}^3$) that classify the state at that position between [is empty] vs [has active-player's piece] vs [has enemy's piece], and these board-state features can be found by the supervised training of a linear probe.
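A minimal sketch of such a supervised probe, assuming PyTorch and the d_model=512 residual stream described below (the helper name and training details are hypothetical, not the exact setup of the cited probes):

```python
import torch.nn as nn

# One linear map per board square from the residual stream (d_model=512)
# to 3 class logits: [empty] vs [active-player's piece] vs [enemy's piece].
probe = nn.Linear(512, 64 * 3)  # equivalently, 64 independent maps to R^3

def probe_loss(resid, board_labels):
    """resid: (batch, 512) residual stream activations at a chosen layer.
    board_labels: (batch, 64) integer state in {0, 1, 2} for each square."""
    logits = probe(resid).view(-1, 64, 3)
    return nn.functional.cross_entropy(
        logits.reshape(-1, 3), board_labels.reshape(-1)
    )
```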
These board-state features are an exciting testbed for sparse autoencoders because they represent a set of "called-shot" features we hope to find, and which are extremely interpretable and correspond to natural human thinking[1]. If the sparse autoencoder can find these features, this is some evidence that they will find relevant and important features in language models.
Conversely, if the sparse autoencoders can't find these features, that indicates a limitation of the method, and provides a test case where we can adjust our training methods until we can find them.
Overview
Here we:
Train an OthelloGPT model from scratch.
Train a linear probe to classify the board states (replicating Hazineh et al) from an intermediate layer of OthelloGPT.
Train a sparse autoencoder on the same layer of OthelloGPT.
Assess whether the features found by the sparse autoencoder include the linear encoding of the current board state that the linear probe is able to find (one way to operationalize this is sketched after the list).
Retrain the sparse autoencoder with different random seeds, and analyze which features are found.
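One natural way to operationalize the assessment step is to score each sparse autoencoder feature as a binary classifier of a given square's state, e.g. via AUROC of its activation. This is a sketch under that assumption; the function name is hypothetical and the post's exact criterion for a "high-accuracy classifier" may differ:

```python
from sklearn.metrics import roc_auc_score

def feature_as_board_classifier(feature_acts, square_labels, target_class):
    """Score one SAE feature as a classifier of one square's state.

    feature_acts: (n_samples,) torch tensor of one feature's activations.
    square_labels: (n_samples,) integer state of one board square (0/1/2).
    target_class: the state the feature is hypothesized to detect.
    """
    is_target = (square_labels == target_class).numpy()
    # AUROC near 1.0 means the feature's activation cleanly separates
    # boards where this square has the target state from those where it doesn't.
    return roc_auc_score(is_target, feature_acts.numpy())
```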
Methods
Training OthelloGPT
We first trained an OthelloGPT model from scratch, following the approach of Li et al. Our model is a 25M-parameter, 8-layer, decoder-only transformer, with residual stream dimension d_model=512 (identical to Li et al's model). It is trained to do next-token prediction of random transcripts of Othello games, with each possible move encoded as a separate token, resulting in a vocabulary size of 66 (64 from the positions on the board, plus 2 special tokens).
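For concreteness, a model of this shape could be instantiated with the TransformerLens library (one possible tooling choice, not necessarily the one used here). Only the layer count, d_model, and vocabulary size are stated above; the remaining hyperparameters are marked as assumptions in the comments:

```python
from transformer_lens import HookedTransformer, HookedTransformerConfig

# Stated above: 8 layers, d_model=512, vocab size 66, ~25M parameters.
# Assumed (not stated): n_heads, d_head, d_mlp, context length, activation.
cfg = HookedTransformerConfig(
    n_layers=8,
    d_model=512,
    n_heads=8,      # assumption: 8 heads of dimension 64
    d_head=64,
    d_mlp=2048,     # assumption: standard 4 * d_model MLP width
    d_vocab=66,     # 64 board positions plus 2 special tokens
    n_ctx=59,       # assumption: one token per move of a 60-move game
    act_fn="gelu",
)
model = HookedTransformer(cfg)  # trained with a next-token-prediction loss
```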