Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AutoInterpretation Finds Sparse Coding Beats Alternatives, published by Hoagy on July 17, 2023 on LessWrong.
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort
Huge thanks to Logan Riggs, Aidan Ewart, Lee Sharkey, Robert Huben for their work on the sparse coding project, Lee Sharkey and Chris Mathwin for comments on the draft, EleutherAI for compute and OpenAI for GPT-4 credits.
Summary
We use OpenAI's automatic interpretation protocol to analyse features found by dictionary learning using sparse coding, and compare the resulting interpretability scores to a variety of baselines.
We find that for both the residual stream (layer 2) and MLP (layer 1) of Eleuther's Pythia70M, sparse coding learns a set of features that is superior to all tested baselines, even when removing the bias and looking just at the learnt directions. In doing so, we provide additional evidence for the hypothesis that NNs should be conceived as using distributed representations to represent linear features which are only weakly anchored to the neuron basis.
As before, these results are still somewhat preliminary, and we hope to expand on them and make them more robust over the coming month or two, but we hope people find them fruitful sources of ideas. If you want to discuss, feel free to message me or head over to our thread in the EleutherAI discord.
All code is available at the GitHub repo.
Methods
Sparse Coding
The feature dictionaries are learnt by simple linear autoencoders with a sparsity penalty on the activations. For more background on the sparse coding approach to feature-finding, see the Conjecture interim report that we're building from, or Robert Huben's explainer.
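As a rough illustration of what such an autoencoder optimises, here is a minimal NumPy sketch of the loss: reconstruction error plus an L1 penalty on the feature activations. The function name, argument names, shapes, the ReLU nonlinearity and the exact form of the penalty are all assumptions made for this example, not details taken from the project's code.

```python
import numpy as np

def sparse_autoencoder_loss(x, W_enc, b_enc, W_dec, l1_coef):
    """Reconstruction loss plus L1 sparsity penalty (illustrative sketch).

    x:      (batch, d_model) activations from one layer of the model
    W_enc:  (d_model, n_features) encoder weights
    b_enc:  (n_features,) encoder bias
    W_dec:  (n_features, d_model) decoder weights; each row is one
            dictionary feature (a direction in activation space)
    """
    # Feature activations. The ReLU is an assumption of this sketch.
    codes = np.maximum(x @ W_enc + b_enc, 0.0)
    # Linear reconstruction of the original activations.
    recon = codes @ W_dec
    mse = np.mean((x - recon) ** 2)
    # L1 penalty on the activations, averaged over the batch.
    sparsity = np.mean(np.sum(np.abs(codes), axis=1))
    return mse + l1_coef * sparsity, codes

# Hypothetical usage on random data.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W_enc = rng.normal(size=(8, 16))
W_dec = rng.normal(size=(16, 8))
loss, codes = sparse_autoencoder_loss(x, W_enc, np.zeros(16), W_dec, l1_coef=1e-3)
```

Raising `l1_coef` in this sketch trades reconstruction accuracy for sparser activations.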
Automatic Interpretation
As Logan Riggs recently found, many of the directions found through sparse coding seem highly interpretable, but we wanted a way to quantify this, and make sure that we were detecting a real difference in the level of interpretability.
To do this we used the methodology outlined in this OpenAI paper; details can be found in their code base. To quickly summarise, we are analysing features which are defined as scalar-valued functions of the activations of a neural network, limiting ourselves here to features defined on a single layer of a language model. The original paper simply defined features as the activation of individual neurons, but we will in general be looking at linear combinations of neurons.
We give a feature an interpretability score by first generating a natural language explanation for the feature, which is expected to explain how strongly a feature will be active in a certain context, for example 'the feature activates on legal terminology'. Then, we give this explanation to an LLM and ask it to predict the feature's activation for hundreds of different contexts, so if the tokens are ['the' 'lawyer' 'went' 'to' 'the' 'court'] the predicted activations might be [0, 10, 0, 0, 0, 8], with one value per token. The score is defined as the correlation between the true and predicted activations.
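To make the scoring step concrete, here is a small sketch that computes the Pearson correlation between true and predicted per-token activations. The function name, the made-up "true" activations and the choice to return 0.0 when either series is constant are assumptions of this example rather than details from the OpenAI code base.

```python
import math

def correlation_score(true_acts, predicted_acts):
    """Pearson correlation between true and predicted activations."""
    if len(true_acts) != len(predicted_acts):
        raise ValueError("need one prediction per token")
    n = len(true_acts)
    mean_t = sum(true_acts) / n
    mean_p = sum(predicted_acts) / n
    cov = sum((t - mean_t) * (p - mean_p)
              for t, p in zip(true_acts, predicted_acts))
    var_t = sum((t - mean_t) ** 2 for t in true_acts)
    var_p = sum((p - mean_p) ** 2 for p in predicted_acts)
    if var_t == 0 or var_p == 0:
        # Correlation is undefined for a constant series; returning 0.0
        # is an assumption of this sketch.
        return 0.0
    return cov / math.sqrt(var_t * var_p)

tokens = ["the", "lawyer", "went", "to", "the", "court"]
true_acts = [0.1, 9.0, 0.3, 0.0, 0.2, 7.5]   # hypothetical true activations
predicted = [0, 10, 0, 0, 0, 8]              # one prediction per token
score = correlation_score(true_acts, predicted)
```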
To generate the explanations we follow OpenAI and take a 64-token sentence-fragment from each of the first 50,000 lines of OpenWebText. For each feature, we calculate the average activation and take the 20 fragments with the highest activation. Of these 20, we pass 5 to GPT-4, along with the rescaled per-token activations. From these 5 fragments, GPT-4 suggests an explanation for when the feature fires.
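The fragment selection and rescaling can be sketched as follows. This is only an illustration: it assumes that "average activation" means the mean over the tokens of each fragment, and that rescaling maps activations to integers from 0 to 10 relative to a maximum activation, with negative values clipped to 0. Both function names are hypothetical.

```python
def top_fragments(fragment_activations, k=20):
    """Return indices of the k fragments with the highest mean activation.

    fragment_activations: list of per-token activation lists, one per fragment.
    """
    means = [sum(acts) / len(acts) for acts in fragment_activations]
    order = sorted(range(len(means)), key=lambda i: means[i], reverse=True)
    return order[:k]

def rescale(acts, max_act, scale=10):
    """Map activations to integers in [0, scale] (assumed rescaling scheme)."""
    if max_act <= 0:
        return [0 for _ in acts]
    return [min(scale, max(0, round(scale * a / max_act))) for a in acts]

# Hypothetical usage with four tiny "fragments" of two tokens each.
frags = [[0.0, 0.0], [1.0, 3.0], [5.0, 5.0], [0.0, 1.0]]
best = top_fragments(frags, k=2)
max_act = max(max(f) for f in frags)
rescaled = [rescale(frags[i], max_act) for i in best]
```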
GPT-3.5 is then used to simulate the feature, given the explanation, across both another 5 highly activating fragments, and 5 randomly selected fragments (with non-zero variation). The correlation scores are calculated across all 10 fragments ('top-and-random'), as well as for the top and random fragments separately.
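The three scores can be sketched like this, assuming each correlation is taken over the tokens of the relevant fragments pooled together, and that "non-zero variation" means a fragment's true activations are not all identical. All names here are hypothetical and the data is made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)

def has_variation(acts):
    """True if the activations in a fragment are not all the same value."""
    return max(acts) != min(acts)

def fragment_scores(top, random_frags):
    """top, random_frags: lists of (true_acts, simulated_acts) per fragment."""
    def pooled(frags):
        # Pool tokens from all fragments before correlating (an assumption).
        true = [a for t, _ in frags for a in t]
        sim = [a for _, s in frags for a in s]
        return pearson(true, sim)
    return {
        "top": pooled(top),
        "random": pooled(random_frags),
        "top_and_random": pooled(top + random_frags),
    }

# Hypothetical usage with one top and one random fragment.
top = [([0.0, 10.0], [0.0, 10.0])]
rand = [f for f in [([0.0, 1.0], [1.0, 0.0])] if has_variation(f[0])]
scores = fragment_scores(top, rand)
```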
Comparing Feature Dictionaries
We use dictionary learning with a sparsi...