The Nonlinear Library

LW - Some common confusion about induction heads by Alexandre Variengien


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Some common confusion about induction heads, published by Alexandre Variengien on March 28, 2023 on LessWrong.
Epistemic status: conceptual discussion and opinions informed by 6 months of interpretability research at Redwood Research and conversations with other researchers, but I'm speaking only for myself.
Induction heads are defined twice by Anthropic.
The first time as a mechanism in 2L attention-only transformers
A second time as a behavioral description on repeated random sequences of tokens
However, these two definitions rely on distinct sources of evidence and create confusion, as their difference is not always acknowledged when people cite these papers. The mechanistic definition applies to toy language models, while the behavioral definition is a useful yet incomplete characterization of attention heads.
I think that many people are in fact confused by this: I have talked to many people who don't realize that these two concepts are different, and who incorrectly believe that (e.g.) the mechanism of induction heads in larger language models has been characterized.
More specifically, the two Anthropic papers introduce the following two distinct definitions of induction heads:
Mechanistic: The first definition, introduced by Elhage et al., describes a behavior in a 2-layer attention-only model (copying a token given a matching prefix) and a minimal mechanism to perform this behavior (a set of paths in the computational graph and a human interpretation of the transformation along those paths). Empirically, this mechanism seems to be the best possible short description of what those heads are doing (i.e. if you have to choose a subgraph made of a single path as input for the keys, queries, and values of these heads, the induction circuit is likely to be the one that recovers the most loss). But this explanation does not encompass everything these heads do. In reality, many more paths are used than the one described (see Redwood’s causal scrubbing results on induction heads) and the function of the additional paths is unclear.
I don’t know whether the claims about the behavior and mechanisms of these heads are best described as “mostly true but missing details” or “only a small part of what’s going on”. See also Buck’s comment for more discussion on the interpretation of causal scrubbing recovered loss.
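To make the mechanistic definition concrete, here is a small numpy toy (my own illustration with made-up names, not code from either paper) of the minimal two-head circuit: a previous-token head writes each token's identity one position forward, and an induction head uses that subspace for its keys and the current token for its queries, so at the second [A] it attends to [B] and copies [B] into the logits.

```python
# A minimal numpy toy (illustrative only; not code from Elhage et al. or Olsson et al.)
# of the two-head mechanism: a previous-token head in layer 1 feeding the keys of an
# induction head in layer 2, on one-hot token "embeddings".
import numpy as np

VOCAB = 10

def one_hot(ids):
    return np.eye(VOCAB)[ids]                        # (seq, VOCAB)

def prev_token_head(x):
    """Layer-1 head: copies each token's identity one position forward, so
    position t carries 'the token at t-1' in a separate subspace."""
    prev = np.zeros_like(x)
    prev[1:] = x[:-1]
    return prev

def induction_head(current, previous, sharpness=10.0):
    """Layer-2 head: queries read the current token, keys read the previous-token
    subspace (K-composition); values copy the attended token's identity, which we
    read directly as logits."""
    scores = sharpness * (current @ previous.T)      # high where prev-token(key) == current-token(query)
    mask = np.tril(np.ones_like(scores))             # causal mask
    scores = np.where(mask > 0, scores, -1e9)
    scores -= scores.max(-1, keepdims=True)          # numerically stable softmax
    attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    return attn @ current                            # logit boost for the copied token

tokens = np.array([3, 7, 1, 5, 3])                   # ... [A]=3 [B]=7 ... [A]=3
x = one_hot(tokens)
logits = induction_head(x, prev_token_head(x))
print(logits[-1].argmax())                           # -> 7: at the final [A], the circuit predicts [B]
```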
Behavioral: The second definition, introduced by Olsson et al., relies on evaluating head activations (attention patterns and head outputs) on out-of-distribution sequences made of Repeated Random Tokens (RRT). Two scores are used to characterize induction heads: i) Prefix matching: on patterns like [A][B] … [A], the attention probability from the second [A] back to the token [B] that followed the first occurrence of [A]; ii) Copying: how much the head output increases the logit of [B] compared to the other logits. The RRT distribution was chosen so that fully abstracted induction behavior is one of the few useful heuristics for predicting the next token.
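For a rough sense of how these behavioral scores could be computed, here is a sketch with simplified proxies (the function names and normalization are my own, not the exact metrics of Olsson et al.): build an RRT sequence, then score one head's attention pattern and direct logit contribution.

```python
# A simplified sketch of the two behavioral scores on a Repeated Random Tokens
# (RRT) sequence. `attn` (seq, seq) and `head_logit_contrib` (seq, vocab) are
# assumed to come from a forward pass of the model being inspected; the exact
# metrics in Olsson et al. differ in details.
import numpy as np

def rrt_sequence(vocab_size, repeat_len, seed=0):
    """Random tokens of length `repeat_len`, repeated once: [r, r]."""
    rng = np.random.default_rng(seed)
    r = rng.integers(0, vocab_size, size=repeat_len)
    return np.concatenate([r, r])

def prefix_matching_score(attn, repeat_len):
    """Mean attention, for queries in the second repeat, to the token that followed
    the previous occurrence of the current token ([B] in [A][B] ... [A])."""
    seq_len = attn.shape[0]
    return float(np.mean([attn[t, t - repeat_len + 1]
                          for t in range(repeat_len, seq_len - 1)]))

def copying_score(head_logit_contrib, tokens, repeat_len):
    """Crude proxy: how much the head's direct logit contribution favors the correct
    next token over the average token, in the second repeat."""
    seq_len = len(tokens)
    return float(np.mean([head_logit_contrib[t, tokens[t + 1]] - head_logit_contrib[t].mean()
                          for t in range(repeat_len, seq_len - 1)]))

# Usage sketch: run the model on rrt_sequence(...), take one head's attention
# pattern and its direct contribution to the logits, then compute both scores;
# a head scoring high on both is a behavioral induction head.
```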
In this post, I’ll use mechanistic or behavioral induction heads to differentiate between the definitions.
I’ll present three points that — I think — are important to keep in mind when using these definitions.
1 - The two-head mechanism (induction head and previous token head) described in Elhage et al. is the minimal way to implement an induction head
As noted in the paper, induction heads can use more complicated mechanisms. For instance, instead of relying on a previous token head to match only one token as a prefix (the token [A] in the example above), they could rely on a head that attends further back to match longer prefixes (e.g. patterns like [X][A][B] . [X][A]). Empirically, evidence for induction heads using some amount of longer prefix matching has been observed in the causal scrubbing experiments on induction.
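To make longer prefix matching concrete, here is a small hypothetical helper (not from the causal scrubbing write-up) that computes, for each position, where an induction head matching the last prefix_len tokens would attend: the position just after the most recent earlier occurrence of that prefix.

```python
# A hypothetical helper (not from the causal scrubbing write-up) showing what
# "longer prefix matching" means: for each position, find where an induction head
# that matches the last `prefix_len` tokens would attend.

def induction_targets(tokens, prefix_len=1):
    """For each query position, return the position such a head would attend to
    (the token just after the most recent earlier occurrence of the current
    `prefix_len`-token prefix), or -1 if that prefix never occurred before."""
    tokens = list(tokens)
    targets = []
    for t in range(len(tokens)):
        target = -1
        if t + 1 >= prefix_len:
            prefix = tokens[t + 1 - prefix_len: t + 1]        # last prefix_len tokens, ending at t
            for s in range(t - 1, prefix_len - 2, -1):        # most recent earlier match first
                if tokens[s + 1 - prefix_len: s + 1] == prefix:
                    target = s + 1                            # attend to the token *after* the match
                    break
        targets.append(target)
    return targets

print(induction_targets([3, 7, 1, 3, 7, 1, 3], prefix_len=1))  # [-1, -1, -1, 1, 2, 3, 4]
print(induction_targets([3, 7, 1, 3, 7, 1, 3], prefix_len=2))  # [-1, -1, -1, -1, 2, 3, 4]
```

With prefix_len=1 this reduces to the single-token prefix matching of the minimal mechanism; larger values correspond to heads that require more of the preceding context to match before attending and copying.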
The two-head mechanism i...