Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How-to Transformer Mechanistic Interpretability—in 50 lines of code or less!, published by StefanHex on January 24, 2023 on LessWrong.
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort.
What if I told you that in just one weekend you can get up to speed doing practical Mechanistic Interpretability research on Transformers? Surprised? Then this is your tutorial!
I'll give you a view into how I research Transformer circuits in practice, show you the tools you need, and explain my thought process along the way. I focus on the practical side of getting started with interventions; for more background see point 2 below.
Prerequisites:
Understanding the Transformer architecture: Know what the residual stream is, how attention layers and MLPs work, and how logits & predictions work. For later sections, familiarity with multi-head attention is useful. Here's a link to Neel's glossary, which provides excellent explanations for most terms I might use! If you're not familiar with Transformers, you can check out Step 2 (6) of Neel's guide or any of the other explanations online; I recommend Jay Alammar's The Illustrated Transformer and/or Milan Straka's lecture series.
A basic overview of Mechanistic Interpretability is helpful: see e.g. any of Neel's talks, or look at the results in the IOI paper / walkthrough.
Basic Python: Familiarity with arrays (as in NumPy or PyTorch, for indexing) is useful, but explicitly no PyTorch knowledge is required!
No hardware required; a free Google Colab account works fine for this. Here's a notebook with all the code from this tutorial! PS: Here's a little web page where you can run some of these methods online! No trivial inconveniences!
Step 0: Setup
Open a notebook (e.g. Colab) and install Neel Nanda’s TransformerLens (formerly known as EasyTransformer).
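In a Colab cell that's a single line; a minimal sketch (I'm assuming the PyPI package name transformer_lens; depending on when you read this you may prefer installing from GitHub):

```python
# Install TransformerLens from PyPI (run in a Colab / Jupyter cell)
!pip install transformer_lens
```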
Step 1: Getting a model to play with
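Loading a model is a one-liner; here's a sketch of this step (I'll use GPT-2 small as the example, but any supported model name works):

```python
from transformer_lens import HookedTransformer

# Download & load GPT-2 small; weights are fetched on the first run
model = HookedTransformer.from_pretrained("gpt2")
```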
That's it, now you've got a GPT-2 model to play with! TransformerLens supports most relevant open-source transformers. Here's how to run the language model:
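A sketch (the prompt string and generation settings are just my example):

```python
prompt = "Her name was Alex Hart. When Alex"

# Continue the prompt with a few sampled tokens
print(model.generate(prompt, max_new_tokens=5))

# Or get the raw output logits: shape [batch, position, d_vocab]
logits = model(prompt)
print(logits.shape)
```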
Let's have a look at the internal activations: TransformerLens can give you a dictionary of almost all the internal activations you'll ever care about (referred to as the "cache"):
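A sketch of what that looks like:

```python
prompt = "Her name was Alex Hart. When Alex"

# One forward pass that records (almost) every internal activation
logits, cache = model.run_with_cache(prompt)

# The cache maps hook names to activation tensors; peek at a few names
for name in list(cache.keys())[:5]:
    print(name)
```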
Here you will find things like the attention patterns (blocks.0.attn.hook_pattern), the residual stream before and after each layer (e.g. blocks.1.hook_resid_pre), and more!
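For example (shapes shown are for GPT-2 small):

```python
# Attention pattern of layer 0: [batch, n_heads, query_pos, key_pos]
print(cache["blocks.0.attn.hook_pattern"].shape)

# Residual stream entering layer 1: [batch, position, d_model]
print(cache["blocks.1.hook_resid_pre"].shape)
```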
You can also access all the weights & parameters of the model in model.named_parameters(). Here you will find weight matrices & biases of every MLP and Attention layer, as well as the embedding & unembedding matrices. I won’t focus on these in this guide but they’re great to look at! (Exercise: What can the unembedding biases unembed.b_U tell you about common tokens?)
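A sketch of poking around, including one way to start on the exercise (the top-k trick at the end is just my suggestion, not part of the original):

```python
import torch

# Print every weight & bias with its shape
for name, param in model.named_parameters():
    print(name, tuple(param.shape))

# Exercise starter: which tokens get the largest unembedding bias?
top_tokens = torch.topk(model.unembed.b_U, k=10).indices
print(model.to_str_tokens(top_tokens))
```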
Step 2: Let's start analyzing a behavior!
Let's go and find some induction heads! I'll make up an example: "Her name was Alex Hart. When Alex", with likely completion " Hart". TransformerLens has a little tool to plot a tokenized prompt, model predictions, and associated logits:
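The tool is utils.test_prompt; a sketch of how to call it (note the leading space in the answer, since GPT-2 tokens include the preceding space):

```python
from transformer_lens import utils

# Prints the tokenized prompt, the model's top predictions at the
# final position, and the rank & logit of the expected answer
utils.test_prompt("Her name was Alex Hart. When Alex", " Hart", model)
```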
I find it useful to spend a few minutes thinking about what information is needed to solve the task. The model needs to
1. Realize the last token, Alex, is a repetition of a previous occurrence, and
2. Copy the last name from after the previous Alex occurrence to the final token position as its prediction.
Method 1: Residual stream patching
The number one thing I try when I want to reverse-engineer a new behavior is to find where in the network the information is "traveling".
In transformers, the model keeps track of all information in the residual stream. Attention heads & MLPs read from the residual stream, perform some computation or move information around, and write their outputs back into the residual stream. I think of this stream as having a set of "lanes", one corresponding to each token position. Over the course of the model...