Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 200 COP in MI: Studying Learned Features in Language Models, published by Neel Nanda on January 19, 2023 on The AI Alignment Forum.
This is the final post in a sequence called 200 Concrete Open Problems in Mechanistic Interpretability. Start here, then read in any order. If you want to learn the basics before you think about open problems, check out my post on getting started. Look up jargon in my Mechanistic Interpretability Explainer.
Motivation
Motivating Papers: Softmax Linear Units (SoLU), Multimodal Neurons
To accompany this post, I’ve created a website called Neuroscope that displays the texts that most activate each neuron in a range of language models - check it out!
This section contains a lot of detail on how to think about neurons and learned features, and how they relate to the surrounding literature. If you get bored, feel free to skip ahead and just explore Neuroscope for interesting neurons.
MLPs represent ⅔ of the parameters in a transformer, yet we really don’t understand them very well. Based on our knowledge of image models, my best guess is that models learn to represent features (properties of the input) as directions in activation space, with different directions corresponding to different features. Early layers learn simple features that are basic functions of the input, like edges and corners, and these are iteratively refined and built up into more sophisticated features, like angles, curves and circles. Language models are messier than those image models, since they have attention layers and a residual stream, but my guess is that analogous features are generally computed in the MLP layers.
But there are a lot of holes and confusions in this framework, and I’d love to have them filled in! How well does this actually hold in practice? Do features correspond to individual neurons, or to arbitrary directions? What kinds of features get learned, and where do they occur in a model? What features do we see in small models vs large ones vs enormous ones? What kinds of things are natural for a language model to express vs extremely hard and convoluted? What are the ways our intuitions will trip us up here?
Issues like polysemanticity and superposition make it difficult to actually reverse engineer specific neurons, and it seems clear that the model can learn features that do not correspond to specific neurons. But even if we relax our standards of rigour and just want to understand what features are present at all, we don’t know much about what role these layers actually serve in the model! There’s been a fair amount of work studying this for BERT, but very little for generative language models, especially large ones!
There are a bunch of angles for making progress on this, and I’m excited to see a diverse range of approaches! But here I’m going to focus on the somewhat unreliable but useful technique of looking at max activating dataset examples: running the model across a bunch of data and, for each neuron, tracking the texts that most activated it. If we see a consistent pattern (or patterns) in these texts, we can infer that the model is detecting the feature represented by that pattern. To emphasise, a consistent pattern in a neuron’s examples does not mean the neuron only represents that feature. But it can be pretty good evidence that the feature is represented in the model, and that that neuron responds strongly to it!
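To make this concrete, here’s a minimal sketch of the technique, assuming the TransformerLens library and its one-layer SoLU checkpoint ("solu-1l"); the toy texts, the neuron index, and the top-k value are illustrative placeholders, not the exact setup behind Neuroscope.

```python
# A minimal sketch of collecting max activating dataset examples, assuming the
# TransformerLens library and its one-layer SoLU checkpoint ("solu-1l"). The
# toy texts, neuron index, and top-k value are illustrative placeholders.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("solu-1l")

texts = [
    "I love the way these turned out!",
    "Thanks for sharing, I love the colours.",
    "The weather was grey and cold all week.",
]

records = []  # one (per-neuron max activation, text, per-neuron argmax position) per text
for text in texts:
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens)
    acts = cache["blocks.0.mlp.hook_post"][0]  # MLP post-activations, shape [position, d_mlp]
    max_per_neuron, argmax_pos = acts.max(dim=0)  # strongest activation of each neuron in this text
    records.append((max_per_neuron, text, argmax_pos))

# For a neuron of interest, rank the texts by that neuron's peak activation
neuron, top_k = 456, 2
ranked = sorted(records, key=lambda r: r[0][neuron].item(), reverse=True)
for max_acts, text, positions in ranked[:top_k]:
    pos = positions[neuron].item()
    token = model.to_str_tokens(text)[pos]  # BOS is prepended by both to_tokens and to_str_tokens
    print(f"act={max_acts[neuron].item():.3f}  token={token!r}  text={text!r}")
```

In practice you’d run this over a large dataset in batches and keep a running top-k per neuron rather than storing everything, but the idea is the same: look at the texts and token positions where a neuron fires hardest, and see if a pattern jumps out.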
To help you explore what’s inside a language model, I’ve created a tool called Neuroscope which displays the max activating dataset examples for a range of language models. Each neuron has its own page, and there are some pretty weird ones in there! For example, take neuron 456 in my one-layer SoLU model - this activates on the word " the" after " love", on what seems to be the comments section of cutesy arts and crafts blogs.
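Once Neuroscope suggests a hypothesis like this, it’s easy to sanity-check it on fresh text. Here’s a small follow-up sketch under the same assumptions (TransformerLens, the "solu-1l" checkpoint); the prompt is invented for illustration, and the neuron index just echoes the example above.

```python
# A follow-up sketch, under the same assumptions as above: print a candidate
# neuron's activation on every token of a fresh prompt to sanity-check the
# hypothesised " the"-after-" love" pattern. The prompt is invented for illustration.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("solu-1l")
neuron = 456
prompt = "These are gorgeous, I love the stitching on the sleeves!"

tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)
acts = cache["blocks.0.mlp.hook_post"][0, :, neuron]  # this neuron's activation at each position

for token_str, act in zip(model.to_str_tokens(prompt), acts.tolist()):
    print(f"{act:6.3f}  {token_str!r}")
```

If the hypothesis is right, the activation should spike on " the" immediately after " love" and stay low elsewhere - though remember that this only tells you what the neuron responds strongly to, not that it responds to nothing else.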
One vision for an accessible, beginner-fr...