Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How-to Transformer Mechanistic Interpretability—in 50 lines of code or less!, published by StefanHex on January 24, 2023 on LessWrong.
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort.
What if I told you that in just one weekend you can get up to speed doing practical Mechanistic Interpretability research on Transformers? Surprised? Then this is your tutorial!
I'll give you a view into how I research Transformer circuits in practice, show you the tools you need, and explain my thought process along the way. I focus on the practical side of getting started with interventions; for more background see point 2 below.
Prerequisites:
Understanding the Transformer architecture: Know what the residual stream is, how attention layers and MLPs work, and how logits & predictions work. For later sections, familiarity with multi-head attention is useful. Here's a link to Neel's glossary, which provides excellent explanations for most terms I might use! If you're not familiar with Transformers, you can check out Step 2 (6) of Neel's guide or any of the other explanations online; I recommend Jay Alammar's The Illustrated Transformer and/or Milan Straka's lecture series.
A basic overview of Mechanistic Interpretability is helpful: see e.g. any of Neel's talks, or look at the results in the IOI paper / walkthrough.
Basic Python: Familiarity with arrays (as in NumPy or PyTorch, for indexing) is useful, but explicitly no PyTorch knowledge is required!
No hardware required; a free Google Colab account works fine for this. Here's a notebook with all the code from this tutorial! PS: Here's a little web page where you can run some of these methods online! No trivial inconveniences!
Step 0: Setup
Open a notebook (e.g. Colab) and install Neel Nanda’s TransformerLens (formerly known as EasyTransformer).
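In a Colab cell that's a single line; a minimal sketch (I'm assuming the PyPI package name transformer_lens; depending on when you read this you may prefer installing from GitHub):

```python
# Install TransformerLens from PyPI (run in a Colab / Jupyter cell)
!pip install transformer_lens
```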
Step 1: Getting a model to play with
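Loading a model is a one-liner; here's a sketch of this step (I'll use GPT-2 small as the example, but any supported model name works):

```python
from transformer_lens import HookedTransformer

# Download & load GPT-2 small; weights are fetched on the first run
model = HookedTransformer.from_pretrained("gpt2")
```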
That's it, now you've got a GPT-2 model to play with! TransformerLens supports most relevant open-source transformers. Here's how to run the language model:
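A sketch (the prompt string and generation settings are just my example):

```python
prompt = "Her name was Alex Hart. When Alex"

# Continue the prompt with a few sampled tokens
print(model.generate(prompt, max_new_tokens=5))

# Or get the raw output logits: shape [batch, position, d_vocab]
logits = model(prompt)
print(logits.shape)
```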
Let's have a look at the internal activations: TransformerLens can give you a dictionary of almost all the internal activations you'll ever care about (referred to as the "cache"):
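A sketch of what that looks like:

```python
prompt = "Her name was Alex Hart. When Alex"

# One forward pass that records (almost) every internal activation
logits, cache = model.run_with_cache(prompt)

# The cache maps hook names to activation tensors; peek at a few names
for name in list(cache.keys())[:5]:
    print(name)
```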
Here you will find things like the attention patterns (blocks.0.attn.hook_pattern), the residual stream before and after each layer (e.g. blocks.1.hook_resid_pre), and more!
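For example (shapes shown are for GPT-2 small):

```python
# Attention pattern of layer 0: [batch, n_heads, query_pos, key_pos]
print(cache["blocks.0.attn.hook_pattern"].shape)

# Residual stream entering layer 1: [batch, position, d_model]
print(cache["blocks.1.hook_resid_pre"].shape)
```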
You can also access all the weights & parameters of the model in model.named_parameters(). Here you will find weight matrices & biases of every MLP and Attention layer, as well as the embedding & unembedding matrices. I won’t focus on these in this guide but they’re great to look at! (Exercise: What can the unembedding biases unembed.b_U tell you about common tokens?)
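A sketch of poking around, including one way to start on the exercise (the top-k trick at the end is just my suggestion, not part of the original):

```python
import torch

# Print every weight & bias with its shape
for name, param in model.named_parameters():
    print(name, tuple(param.shape))

# Exercise starter: which tokens get the largest unembedding bias?
top_tokens = torch.topk(model.unembed.b_U, k=10).indices
print(model.to_str_tokens(top_tokens))
```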
Step 2: Let's start analyzing a behavior!
Let's go and find some induction heads! I'll make up an example: "Her name was Alex Hart. When Alex", with likely completion " Hart". TransformerLens has a little tool to plot a tokenized prompt, model predictions, and associated logits:
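The tool is utils.test_prompt; a sketch of how to call it (note the leading space in the answer, since GPT-2 tokens include the preceding space):

```python
from transformer_lens import utils

# Prints the tokenized prompt, the model's top predictions at the
# final position, and the rank & logit of the expected answer
utils.test_prompt("Her name was Alex Hart. When Alex", " Hart", model)
```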
I find it useful to spend a few minutes thinking about what information is needed to solve the task. The model needs to
1. Realize the last token, Alex, is a repetition of a previous occurrence, and
2. Copy the last name from after the previous Alex occurrence to the final token position as its prediction.
Method 1: Residual stream patching
The number one thing I try when I want to reverse-engineer a new behavior is to find where in the network the information is "traveling".
In transformers, the model keeps track of all information in the residual stream. Attention heads & MLPs read from the residual stream, perform some computation or move information around, and write their outputs back into the residual stream. I think of this stream as having a set of "lanes", one corresponding to each token position. Over the course of the model...