The Nonlinear Library

LW - How-to Transformer Mechanistic Interpretability—in 50 lines of code or less! by StefanHex



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How-to Transformer Mechanistic Interpretability—in 50 lines of code or less!, published by StefanHex on January 24, 2023 on LessWrong.
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort.
What if I told you that in just one weekend you can get up to speed doing practical Mechanistic Interpretability research on Transformers? Surprised? Then this is your tutorial!
I'll give you a look at how I research Transformer circuits in practice, show you the tools you need, and explain my thought process along the way. I focus on the practical side to get you started with interventions; for more background, see the second prerequisite below.
Prerequisites:
Understanding the Transformer architecture: Know what the residual stream is, how attention layers and MLPs work, and how logits & predictions work. For future sections, familiarity with multi-head attention is useful. Here’s a link to Neel’s glossary, which provides excellent explanations for most terms I might use! If you're not familiar with Transformers, you can check out Step 2 (6) on Neel's guide or any of the other explanations online; I recommend Jay Alammar's The Illustrated Transformer and/or Milan Straka's lecture series.
Some overview of Mechanistic Interpretability is helpful: See e.g. any of Neel's talks, or look at the results in the IOI paper / walkthrough.
Basic Python: Familiarity with arrays (as in NumPy or PyTorch, for indices) is useful; but explicitly no PyTorch knowledge required!
No hardware required; a free Google Colab account works fine for this. Here's a notebook with all the code from this tutorial! PS: Here's a little web page where you can run some of these methods online! No trivial inconveniences!
Step 0: Setup
Open a notebook (e.g. Colab) and install Neel Nanda’s TransformerLens (formerly known as EasyTransformer).
Step 1: Getting a model to play with
That’s it: now you’ve got a GPT-2 model to play with! TransformerLens supports most relevant open-source transformers. Here’s how to run the language model:
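A minimal sketch of loading and running the model, assuming TransformerLens is installed (`pip install transformer_lens`) and that `"gpt2-small"` is the checkpoint meant here. The prompt and the helper names `predict_next_word` and `main` are illustrative, not taken from the original notebook.

```python
def predict_next_word(model, prompt):
    """Return the most likely next token for `prompt`, decoded to a string."""
    logits = model(prompt)             # shape: [batch, position, vocab]
    next_token_logits = logits[0, -1]  # first batch entry, last position
    next_token = int(next_token_logits.argmax())
    return model.tokenizer.decode(next_token)


def main():
    # Imported here so the helper above can be reused with any compatible model.
    from transformer_lens import HookedTransformer

    # Assumption: "gpt2-small" is the model used; weights download on first use.
    model = HookedTransformer.from_pretrained("gpt2-small")
    print(repr(predict_next_word(model, "Famous computer scientist Alan")))
```

Calling `main()` in a notebook cell loads the model and prints its top next-token guess for the example prompt.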
Let’s have a look at the internal activations: TransformerLens can give you a dictionary with almost all internal activations you ever care about (referred to as “cache”):
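As a hedged sketch, `run_with_cache` returns the logits together with the cache. The helper names `cache_shapes` and `print_cache_shapes` are hypothetical, and `model` is assumed to be the HookedTransformer from Step 1.

```python
def cache_shapes(model, prompt):
    """Run `prompt` through the model and map each cached activation name to its shape."""
    logits, cache = model.run_with_cache(prompt)
    return {name: tuple(activation.shape) for name, activation in cache.items()}


def print_cache_shapes(model, prompt):
    """Print one line per cached activation, e.g. blocks.0.attn.hook_pattern and its shape."""
    for name, shape in cache_shapes(model, prompt).items():
        print(name, shape)


# Usage in a notebook, with `model` loaded as in Step 1:
# print_cache_shapes(model, "Her name was Alex Hart. When Alex")
```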
Here you will find things like the attention pattern (blocks.0.attn.hook_pattern), the residual stream before and after each layer (e.g. blocks.1.hook_resid_pre), and more!
You can also access all the weights & parameters of the model in model.named_parameters(). Here you will find weight matrices & biases of every MLP and Attention layer, as well as the embedding & unembedding matrices. I won’t focus on these in this guide but they’re great to look at! (Exercise: What can the unembedding biases unembed.b_U tell you about common tokens?)
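A small sketch for browsing the parameters, assuming `model` is the HookedTransformer from Step 1; the helper name `parameter_shapes` is made up for illustration.

```python
def parameter_shapes(model):
    """Map every parameter name to the shape of its tensor."""
    return {name: tuple(param.shape) for name, param in model.named_parameters()}


# Usage in a notebook, with `model` loaded as in Step 1:
# for name, shape in parameter_shapes(model).items():
#     print(name, shape)
# model.unembed.b_U  # the unembedding biases mentioned in the exercise
```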
Step 2: Let's start analyzing a behavior!
Let’s go and find some induction heads! I’ll make up an example: "Her name was Alex Hart. When Alex", with likely completion "Hart". TransformerLens has a little tool to plot a tokenized prompt, model predictions, and associated logits:
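The tool referred to here is presumably `transformer_lens.utils.test_prompt`. A minimal sketch of calling it on the example, assuming `model` is the HookedTransformer from Step 1; the wrapper name `show_prompt_predictions` is hypothetical.

```python
def show_prompt_predictions(model, prompt="Her name was Alex Hart. When Alex", answer=" Hart"):
    """Print the tokenized prompt, how the model ranks `answer`, and its top predictions."""
    # Imported lazily; requires transformer_lens to be installed.
    from transformer_lens import utils

    # The answer carries a leading space because GPT-2 tokens usually
    # include the space that precedes a word.
    utils.test_prompt(prompt, answer, model)


# Usage in a notebook, with `model` loaded as in Step 1:
# show_prompt_predictions(model)
```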
I find it useful to spend a few minutes thinking about which information is needed to solve the task. The model needs to:
Realize that the last token, Alex, is a repetition of a previous occurrence
Copy the last name that follows the previous occurrence of Alex to the last token as its prediction
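The two requirements above can be sketched as a plain Python function. This only illustrates the task the model has to solve, not how the network implements it; the token split is written by hand and is an assumption, not the output of the real tokenizer.

```python
def induction_prediction(tokens):
    """Predict the next token by copying what followed the last token's previous occurrence."""
    last = tokens[-1]
    # Step 1: look backwards for an earlier occurrence of the last token.
    for pos in range(len(tokens) - 2, -1, -1):
        if tokens[pos] == last:
            # Step 2: copy the token that came right after that occurrence.
            return tokens[pos + 1]
    return None  # no repetition found, so this rule makes no prediction


# Hand-written token split of the example prompt (hypothetical).
tokens = ["Her", " name", " was", " Alex", " Hart", ".", " When", " Alex"]
print(repr(induction_prediction(tokens)))  # ' Hart'
```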
Method 1: Residual stream patching
The number 1 thing I try when I want to reverse engineer a new behavior is to find where in the network the information is “traveling”.
In transformers, the model keeps track of all information in the residual stream. Attention heads & MLPs read from the residual stream, perform some computation or information moving, and write their outputs back into the residual stream. I think of this stream as having a couple of “lanes” corresponding to each token position. Over the course of the model...
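The lane picture can be made concrete with a toy sketch in plain Python. Everything here is an assumption made for illustration: a two-layer stand-in model (not GPT-2), one number per lane, and the choice to copy clean activations into a corrupted run (the opposite direction is also used in practice).

```python
def run(tokens, patch=None):
    """Run the toy model and return (output at the last position, cached residual streams).

    `patch=(layer, position, value)` overwrites one residual stream entry
    right before `layer` runs, which is the intervention used in patching.
    """
    resid = [float(t) for t in tokens]  # toy "embedding": one number per lane
    cache = []
    for layer in range(2):
        if patch is not None and patch[0] == layer:
            resid[patch[1]] = patch[2]
        cache.append(list(resid))  # residual stream before this layer
        if layer == 0:
            # Attention-like step: move information from lane 0 into the last lane.
            resid[-1] += resid[0]
        else:
            # MLP-like step: process every lane independently.
            resid = [2 * x for x in resid]
    return resid[-1], cache


clean_out, clean_cache = run([1, 0, 0])  # clean prompt
corrupted_out, _ = run([5, 0, 0])        # corrupted prompt: lane 0 differs

# Patch each (layer, position) of the corrupted run with the clean activation
# and record whether the clean output comes back.
restored = {}
for layer in range(2):
    for pos in range(3):
        patched_out, _ = run([5, 0, 0], patch=(layer, pos, clean_cache[layer][pos]))
        restored[(layer, pos)] = patched_out == clean_out
        print(f"layer {layer}, position {pos}: restored={restored[(layer, pos)]}")
```

Only (layer 0, position 0) and (layer 1, position 2) restore the clean output: the relevant information sits in lane 0 before the first layer and in the last lane afterwards, which is the kind of "traveling" that patching is meant to reveal.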

The Nonlinear Library, by The Nonlinear Fund
