Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: We Found An Neuron in GPT-2, published by Joseph Miller on February 11, 2023 on LessWrong.
We started out with the question: how does GPT-2 know when to use the word “an” instead of “a”? The choice depends on whether the word that comes after starts with a vowel, but GPT-2 can only output one word at a time.
We still don’t have a full answer, but we did find a single MLP neuron in GPT-2-large that is crucial for predicting the token " an". We also found that the weights of this neuron correspond to the embedding of the " an" token, which led us to find other neurons that each predict a specific token.
Discovering the Neuron
Choosing the prompt
It was surprisingly hard to think of a prompt for which GPT-2 would output “ an” (the leading space is part of the token) as the top prediction. Eventually we gave up on GPT-2-small and switched to GPT-2-large. As we’ll see later, even GPT-2-large systematically under-predicts the token “ an”. This may be because smaller language models lean on the higher frequency of " a" to make a best guess. The prompt we finally found that gave a high (64%) probability for “ an” was:

“I climbed up the pear tree and picked a pear. I climbed up the apple tree and picked”
The first sentence was necessary to push the model towards an indefinite article — without it the model would make other predictions such as “[picked] up”.
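As a rough sketch, here is how one could check the “ an” probability for a prompt like this one, assuming the TransformerLens library (illustrative code, not the exact code we used):

```python
# Minimal sketch: check P(" an") for a candidate prompt using TransformerLens.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-large")
prompt = ("I climbed up the pear tree and picked a pear. "
          "I climbed up the apple tree and picked")

logits = model(prompt)                        # shape: [batch, position, d_vocab]
probs = torch.softmax(logits[0, -1], dim=-1)  # next-token distribution at the final position
an_id = model.to_single_token(" an")
a_id = model.to_single_token(" a")
print(f'P(" an") = {probs[an_id]:.2f}, P(" a") = {probs[a_id]:.2f}')
```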
Before we proceed, here’s a quick overview of the transformer architecture. Each attention block and MLP takes input and adds output to the residual stream.
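In simplified code, each block’s update to the residual stream looks roughly like this (a schematic sketch, not GPT-2’s exact implementation):

```python
# Schematic of one transformer block: each sub-block reads the residual stream
# and adds its output back into it (GPT-2 applies LayerNorm before each sub-block).
def transformer_block(resid, ln1, attn, ln2, mlp):
    resid = resid + attn(ln1(resid))  # attention block adds its output to the stream
    resid = resid + mlp(ln2(resid))   # MLP block adds its output to the stream
    return resid
```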
Logit Lens
Using the logit lens technique, we took the logits from the residual stream between each layer and plotted the difference between logit(‘ an’) and logit(‘ a’). We found a big spike after Layer 31’s MLP.
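Continuing the sketch above, the logit lens computation looks something like this (hook names follow TransformerLens conventions; treat the details as illustrative):

```python
# Logit lens: decode the residual stream after each layer's MLP and track
# logit(" an") - logit(" a") at the final position.
_, cache = model.run_with_cache(prompt)

for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][0, -1]    # residual stream after this layer
    logits = model.unembed(model.ln_final(resid[None, None]))[0, -1]
    print(layer, (logits[an_id] - logits[a_id]).item())
```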
Activation Patching by the Layer
Activation patching is a technique introduced by Meng et al. (2022) to analyze the significance of a single layer in a transformer. First, we saved the activation of each layer when running the original prompt through the model — the “clean activation”.
We then ran a corrupted prompt through the model:

“I climbed up the pear tree and picked a pear. I climbed up the lemon tree and picked”
By replacing the word "apple" with "lemon", we induce the model to predict the token " a" instead of " an". With the model predicting " a" over " an", we can replace a layer’s corrupted activation with its clean activation to see how much the model shifts towards the " an" token, which indicates that layer’s significance to predicting " an". We repeat this process over all the layers of the model.
We're mostly going to ignore attention for the rest of this post, but these results indicate that Layer 26 is where " picked" starts thinking a lot about " apple", which is obviously required to predict " an".
Note: the scale on these patching graphs is the relative logit difference recovery:
$$\frac{\text{PatchedLogitDiff} - \text{CorruptedLogitDiff}}{\text{CleanLogitDiff} - \text{CorruptedLogitDiff}}$$
(i.e. “what proportion of logit(" an") − logit(" a") in the clean prompt did this patch recover?”).
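Putting this together, a minimal sketch of patching one MLP layer and computing the recovery metric (again with TransformerLens, and again illustrative rather than our exact code) might look like:

```python
# Activation patching for one MLP layer: run the corrupted prompt, but overwrite
# that layer's MLP output with its activation from the clean run.
from transformer_lens import utils

def logit_diff(logits):
    # logit(" an") - logit(" a") at the final position
    return (logits[0, -1, an_id] - logits[0, -1, a_id]).item()

clean_prompt = "I climbed up the pear tree and picked a pear. I climbed up the apple tree and picked"
corrupted_prompt = "I climbed up the pear tree and picked a pear. I climbed up the lemon tree and picked"

clean_logits, clean_cache = model.run_with_cache(clean_prompt)
corrupted_logits = model(corrupted_prompt)
clean_diff, corrupted_diff = logit_diff(clean_logits), logit_diff(corrupted_logits)

def patch_mlp_layer(layer):
    hook_name = utils.get_act_name("mlp_out", layer)
    def hook(value, hook):
        return clean_cache[hook_name]  # replace the corrupted activation with the clean one
    patched_logits = model.run_with_hooks(corrupted_prompt, fwd_hooks=[(hook_name, hook)])
    # Relative logit difference recovery (the metric defined above)
    return (logit_diff(patched_logits) - corrupted_diff) / (clean_diff - corrupted_diff)

for layer in range(model.cfg.n_layers):
    print(layer, patch_mlp_layer(layer))
```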
The two MLP layers that stand out are Layer 0 and Layer 31. We already know that Layer 0’s MLP is generally important for GPT-2 to function (although we're not sure why attention in Layer 0 is important). The effect of Layer 31 is more interesting. Our results suggest that Layer 31’s MLP plays a significant role in predicting the " an" token. (See this comment if you're confused about how this result fits with the logit lens above.)
Finding 1: We can discover predictive neurons by activation patching individual neurons
Activation patching has been used to investigate transformers layer by layer, but can we push the technique further and apply it to individual neurons? Since each MLP in a transformer has only one hidden layer, each neuron’s activation does not affect any other neuron in the MLP. So we should be able to patch individual neurons, because they are independent from each other in the same sense that the attention heads in a single layer are independent from each other.
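For example (continuing the sketch above), patching a single neuron amounts to overwriting that neuron’s entry in the MLP’s hidden activations (hook_post in TransformerLens) with its value from the clean run:

```python
# Patch one neuron of one MLP's hidden layer (hook_post holds the post-nonlinearity
# activations, shape [batch, position, d_mlp]) and measure the recovery metric.
def patch_neuron(layer, neuron):
    hook_name = utils.get_act_name("post", layer)
    def hook(value, hook):
        value[:, :, neuron] = clean_cache[hook_name][:, :, neuron]
        return value
    patched_logits = model.run_with_hooks(corrupted_prompt, fwd_hooks=[(hook_name, hook)])
    return (logit_diff(patched_logits) - corrupted_diff) / (clean_diff - corrupted_diff)

# e.g. scan every neuron in Layer 31's MLP (slow as written, but fine for illustration)
scores = [patch_neuron(31, n) for n in range(model.cfg.d_mlp)]
```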