Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Induction heads - illustrated, published by TheMcDouglas on January 2, 2023 on LessWrong.
TL;DR
This is my illustrated walkthrough of induction heads. I created it in order to concisely capture all the information about how the circuit works.
There are 2 versions of the walkthrough:
Version 1 is the one included in this post. It's slightly shorter, and focuses more on the intuitions than the actual linear operations going on.
Version 2 can be found at my personal website. It has all the same stuff as version 1, with a bit of added info about the mathematical details, and how you might go about reverse-engineering this circuit in a real model.
The final image from version 1 is inline below, and depending on your level of familiarity with transformers, looking at this diagram might provide most of the value of this post. If it doesn't make sense to you, then read on for the full walkthrough, where I build up this diagram bit by bit.
Introduction
Induction heads are a well-studied and well-understood circuit in transformers. They allow a model to perform in-context learning of a very specific form: if a sequence contains a repeated subsequence, e.g. of the form A B ... A B (where A and B stand for generic tokens, e.g. the first and last name of a person who doesn't appear in any of the model's training data), then the second time this subsequence occurs the transformer will be able to predict that B follows A. Although this might seem like a weirdly specific ability, it turns out that induction circuits are actually a pretty big deal. They're present even in large models (despite having originally been discovered in 2-layer models), they can be linked to macro-level effects like bumps in loss curves during training, and there is some evidence that induction heads might even constitute the mechanism for the majority of all in-context learning in large transformer models.
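To make the pattern concrete, here's a toy Python sketch of the lookup rule an induction head implements (this is not a transformer, just the behavior): predict the token that followed the most recent earlier occurrence of the current token.

```python
def induction_predict(tokens):
    """Toy induction rule: find the most recent earlier occurrence of the
    current (last) token, and predict the token that followed it then."""
    current = tokens[-1]
    # Scan backwards over earlier positions for a previous occurrence of `current`
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # predict the token that came after it last time
    return None  # no earlier occurrence: the rule makes no prediction


# "A B ... A" -> predict "B"
print(induction_predict(["A", "B", "C", "A"]))  # prints "B"
```

A real induction head does this with attention: the destination token A attends back to the token just after the previous A, and copies information predicting that token.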
I think induction heads can be pretty confusing unless you fully understand the internal mechanics, and it's easy to come away from them feeling like you get what's going on without actually being able to explain things down to the precise details. My hope is that these diagrams help people form a more precise understanding of what's actually going on.
Prerequisites
This post is aimed at people who already understand how a transformer is structured (I'd recommend Neel Nanda's tutorial for that), and the core ideas in the Mathematical Framework for Transformer Circuits paper. If you understand everything on this list, it will probably suffice:
The central object in the transformer is the residual stream.
Different heads in each layer can be thought of as operating independently of each other, reading and writing into the residual stream.
Heads can compose to form circuits. For instance, K-composition is when the output of one head is used to generate the key vector in the attention calculations of a subsequent head.
We can describe the weight matrices W_Q, W_K and W_V as reading from (or projecting from) the residual stream, and W_O as writing to (or embedding into) the residual stream.
We can think of the combined operations of W_Q and W_K in terms of a single low-rank matrix W_QK := W_Q W_K^T, called the QK circuit.
This matrix defines a bilinear form on the vectors in the residual stream: v_i^T W_QK v_j is the attention paid by the ith token to the jth token.
Conceptually, this matrix tells us which tokens information is moved to & from in the residual stream.
We can think of the combined operations of W_V and W_O in terms of a single matrix W_OV := W_V W_O, called the OV circuit.
This matrix defines a map from residual stream vectors to residual stream vectors: if v_j is the residual stream vector at the source token, then v_j^T W_OV is the vector that gets moved from token j to the destination token (if j is attended to).
Conceptually, this matrix tells us what information gets moved from the source token to the destination token.
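As a concrete sketch of the two combined circuits above, here is a minimal numpy example (the dimensions `d_model`, `d_head`, `seq_len` are made-up toy values, not from the post). It shows that W_QK and W_OV are d_model × d_model matrices of rank at most d_head, and how they act on residual stream vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 8, 2, 5  # toy dimensions, chosen for illustration

# Per-head weight matrices (shapes follow the Mathematical Framework convention)
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

# Combined low-rank circuits (rank at most d_head)
W_QK = W_Q @ W_K.T   # (d_model, d_model) -- the QK circuit
W_OV = W_V @ W_O     # (d_model, d_model) -- the OV circuit

# Residual stream: one d_model vector per token position
resid = rng.normal(size=(seq_len, d_model))

# QK circuit as a bilinear form: scores[i, j] = v_i^T W_QK v_j,
# the (pre-softmax) attention paid by token i to token j
scores = resid @ W_QK @ resid.T   # (seq_len, seq_len)

# OV circuit as a linear map: v_j^T W_OV is the vector moved from source token j
moved = resid @ W_OV              # (seq_len, d_model)

assert np.linalg.matrix_rank(W_QK) <= d_head
assert np.linalg.matrix_rank(W_OV) <= d_head
```

The rank bound is why these are called low-rank circuits: each head can only read and write along a d_head-dimensional subspace of the residual stream.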