Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Behavioral and mechanistic definitions (often confuse AI alignment discussions), published by Lawrence Chan on February 20, 2023 on The AI Alignment Forum.
TL;DR: It’s important to distinguish between behavioral definitions – which categorize objects based on outside observable properties – and mechanistic definitions – which categorize objects based on their internal mechanisms. In this post, I give several examples of terms which can be defined either behaviorally or mechanistically. Then, I talk about the pros and cons of both kinds of definitions, and how this distinction relates to the distinction between gears-level and black-box models.
Related to: Most similar to John Wentworth’s Gears and Behaviors, but about definitions rather than models. Also inspired by: Gears in understanding, How an algorithm feels from the inside, the “Human’s Guide to Words” Sequence in general.
Epistemic status: written quickly instead of not at all.
Introduction:
Broadly speaking, when pointing at a relatively distinct cluster of objects, there are two ways to define membership criteria:
Behaviorally: You can categorize objects based on outside observable properties, that is, their behavior in particular situations.
Mechanistically: Alternatively, you can categorize objects via their internal mechanisms. That is, instead of only checking for a particular behavioral property, you instead look for how the object implements said property.
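To make the distinction concrete, here is a toy sketch (my own illustration, not from the post): two ways you might decide whether a Python function counts as a “sorting function,” one by testing its input/output behavior and one by inspecting how it is implemented.

```python
# Toy illustration: behavioral vs. mechanistic membership tests for the
# category "sorting function".
import inspect
import random

def is_sorter_behaviorally(f, trials: int = 100) -> bool:
    """Behavioral check: feed f random inputs and see whether the outputs are
    sorted. Makes no reference to how f works internally."""
    for _ in range(trials):
        xs = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
        if f(list(xs)) != sorted(xs):
            return False
    return True

def is_sorter_mechanistically(f) -> bool:
    """Crude mechanistic check: look at how f is implemented, here just whether
    its source delegates to a known sorting routine. Real mechanistic analysis
    would examine the algorithm itself, but source inspection makes the point."""
    src = inspect.getsource(f)
    return "sorted(" in src or ".sort(" in src

def my_sort(xs):
    return sorted(xs)

def lucky_identity(xs):
    return xs  # behaves like a sorter only on inputs that happen to already be in order

print(is_sorter_behaviorally(my_sort), is_sorter_mechanistically(my_sort))                    # True True
print(is_sorter_behaviorally(lucky_identity), is_sorter_mechanistically(lucky_identity))      # (almost surely) False False
```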
Many AI safety concepts have both behavioral and mechanistic definitions, and when the two are conflated, discussions about AI safety end up with the participants confused or even talking past each other. This post is my attempt to clarify the discussion, by giving examples of both kinds of definitions, explaining their pros and cons, and discussing when you might want to use each.
Three examples of behavioral and mechanistic definitions
To better illustrate what I mean, I’ll give two examples from recent ML work and a third from the sequences.
Induction heads
First introduced in A Mathematical Framework for Transformer Circuits, induction heads are transformer attention heads that implement in-context copying behavior. However, there seem to be two definitions that are often conflated:
Behavioral: Subsequent papers (In-context Learning and Induction Heads, Scaling Laws and Interpretability of Learning from Repeated Data) give a behavioral definition of induction heads: Induction heads are heads that score highly on two metrics on repeated random sequences of the form [A] [B] ... [A]:
Prefix matching: attention heads pay a lot of attention to the first occurrence of the token [A].
Copying: attention heads increase the logit of [B] relative to other tokens.
This definition is clearly behavioral: it makes no reference to how these heads are implemented, but only to their outside behavior.
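To make the behavioral definition concrete, here is a minimal sketch of how the two metrics might be computed for a single head on a repeated random sequence. The tensor shapes, the offset convention, and the assumption that you already have the head’s attention pattern and its direct contribution to the logits are my own illustrative simplifications, not the exact procedure from the papers above.

```python
import torch

def prefix_matching_score(attn_pattern: torch.Tensor, seq_len: int, offset: int = 1) -> float:
    # attn_pattern: [2*seq_len, 2*seq_len] attention pattern of one head on a sequence
    # made of two copies of the same random block of seq_len tokens.
    # For each query position in the second copy, measure attention back to the matching
    # position in the first copy: offset=0 is the first occurrence of the current token
    # itself, offset=1 is the token immediately after it (write-ups differ on this).
    scores = [attn_pattern[q, q - seq_len + offset] for q in range(seq_len, 2 * seq_len)]
    return torch.stack(scores).mean().item()

def copying_score(head_logits: torch.Tensor, tokens: torch.Tensor, seq_len: int) -> float:
    # head_logits: [2*seq_len, vocab] direct contribution of this head to the output
    # logits at each position (e.g. its output passed through W_O and the unembedding).
    # In the second copy the correct next token is known, so measure how much the head
    # boosts its logit relative to the average logit.
    scores = [head_logits[q, tokens[q + 1]] - head_logits[q].mean()
              for q in range(seq_len, 2 * seq_len - 1)]
    return torch.stack(scores).mean().item()

# Toy usage with random tensors; real scores come from running a trained transformer
# on repeated random sequences and reading off each head's pattern and logit contribution.
seq_len, vocab = 16, 100
tokens = torch.randint(vocab, (seq_len,)).repeat(2)
attn = torch.softmax(torch.randn(2 * seq_len, 2 * seq_len), dim=-1)
head_logits = torch.randn(2 * seq_len, vocab)
print(prefix_matching_score(attn, seq_len), copying_score(head_logits, tokens, seq_len))
```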
Mechanistic: In contrast, the original mathematical framework paper also gives a mechanistic definition for induction heads: induction heads are heads that implement copying behavior using either Q- or K-composition. While this definition does make some reference to outside properties (induction heads implement copying), the primary part is mechanistic and details how this copying behavior is implemented.
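By contrast, checking the mechanistic definition requires looking at the weights themselves. Below is a rough sketch in the spirit of the composition scores from the mathematical framework paper: it asks whether a later head’s query/key computation reads from the subspace an earlier head writes to. The matrix shapes and transpose conventions here are simplified assumptions and may not match the paper’s exact formulas.

```python
import torch

def composition_score(w_qk_later: torch.Tensor, w_ov_earlier: torch.Tensor) -> float:
    # w_qk_later: [d_model, d_model] bilinear query-key form of the later head.
    # w_ov_earlier: [d_model, d_model] output-value map of the earlier head.
    # If the product is large relative to the individual matrices, the later head's
    # attention computation plausibly reads from what the earlier head writes.
    return (torch.linalg.matrix_norm(w_qk_later @ w_ov_earlier)
            / (torch.linalg.matrix_norm(w_qk_later) * torch.linalg.matrix_norm(w_ov_earlier))).item()

# Toy usage with random weights; in a real model these come from specific heads,
# and which side of W_QK you compose with distinguishes Q- from K-composition.
d_model, d_head = 64, 8
w_q, w_k = torch.randn(d_head, d_model), torch.randn(d_head, d_model)
w_v, w_o = torch.randn(d_head, d_model), torch.randn(d_model, d_head)
print(composition_score(w_q.T @ w_k, w_o @ w_v))
```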
However, it turns out that the two definitions don’t overlap perfectly: behavioral induction heads often implement many other heuristics, even in very small language models. I often talk to people who conflate the two definitions and so believe we understand much more about the internal mechanisms of large language models than we actually do. In a forthcoming post, Alexandre Variengien discusses the distinction between these two definitions in more detail and highlights specific confusions that can arise from failing to distinguish them.
Different framings of inner and...