The Nonlinear Library

LW - Polysemantic Attention Head in a 4-Layer Transformer by Jett



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Polysemantic Attention Head in a 4-Layer Transformer, published by Jett on November 9, 2023 on LessWrong.
Produced as part of the MATS Program, under the mentorship of @Neel Nanda and @Lee Sharkey.
Epistemic status: optimized to get the post out quickly, but we are confident in the main claims
TL;DR: head 1.4 in attn-only-4l exhibits many different attention patterns, all of which are relevant to the model's performance.
Introduction
In a previous post about the docstring circuit, we found that attention head 1.4 (Layer 1, Head 4) in a 4-layer attention-only transformer would act as either a fuzzy previous token head or as an induction head in different parts of the prompt.
These results suggested that attention head 1.4 was polysemantic, i.e. performing different functions within different contexts.
In Section 1, we classify ~5 million rows of attention patterns associated with 5,000 prompts from the model's training distribution. In doing so, we identify many more simple behaviours that this head exhibits.
In Section 2, we explore three simple behaviours (induction, fuzzy previous token, and bigger indentation) more deeply. We construct a set of prompts for each behaviour and investigate its importance to model performance.
This post provides evidence of the complex role that attention heads play within a model's computation, and shows that reducing an attention head to a single simple behaviour can be misleading.
Section 1
Methods
We uniformly sample 5,000 prompts from the model's training dataset of web text and code.
We collect approximately 5 million individual rows of attention patterns corresponding to these prompts, i.e. rows from the head's attention matrices that each correspond to a single destination position.
We then classify each of these patterns as (a mix of) simple, salient behaviours.
If a behaviour accounts for at least 95% of a pattern, the pattern receives that classification. Otherwise we refer to it as unknown (though there is a multitude of consistent behaviours that we did not define, and thus did not classify).
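The 95% rule described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' actual pipeline; the function name and the input format (a dict mapping each candidate behaviour to its share of the row's attention mass) are our own assumptions.

```python
# Sketch of the >=95% classification rule (hypothetical helper; the
# post does not show the authors' actual implementation).
THRESHOLD = 0.95

def classify_row(behaviour_mass: dict) -> str:
    """Given the fraction of a row's attention mass attributed to each
    candidate behaviour, return a label if a behaviour (or a mix such as
    previous+induction) accounts for >= 95% of the pattern, else 'unknown'."""
    # A single behaviour dominating the row.
    best = max(behaviour_mass, key=behaviour_mass.get)
    if behaviour_mass[best] >= THRESHOLD:
        return best
    # A mix of behaviours jointly dominating the row.
    mix = behaviour_mass.get("previous", 0.0) + behaviour_mass.get("induction", 0.0)
    if mix >= THRESHOLD:
        return "previous+induction"
    return "unknown"
```

For example, a row with 97% of its mass on the preceding few tokens would be labelled previous, while a 50/48 split between previous-token and induction mass would be labelled previous+induction.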
Results
Distribution of behaviours
In Figure 1 we present the results of the classification, where "all" refers to "all destination tokens" and other labels refer to specific destination tokens.
The character · stands for a space, \n for a new line, and labels such as [·K] mean "\n followed by K spaces".
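This token notation can be reproduced with a small helper. The function below is a hypothetical illustration of the labelling convention, not code from the post:

```python
def token_label(tok: str) -> str:
    """Render a token in the figure's notation (hypothetical helper):
    '·' replaces each space, and a newline followed by K spaces
    becomes '[·K]'."""
    if tok.startswith("\n") and set(tok[1:]) <= {" "}:
        return f"[·{len(tok) - 1}]"  # e.g. '\n   ' -> '[·3]'
    return tok.replace(" ", "·")     # e.g. ' R' -> '·R'
```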
We distinguish the following behaviours:
previous: attention concentrated on a few previous tokens
inactive: attention to BOS and EOS
previous+induction: a mix of previous and basic induction
unknown: not classified
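As a rough illustration of these behaviour definitions, a single attention row can be split into the mass each candidate behaviour would claim. This is a pure-Python sketch under our own assumptions (the window size, the BOS-only "inactive" test, and the basic induction check are ours; overlapping behaviours can double count):

```python
def behaviour_masses(attn_row, tokens, dest, prev_window=3):
    """Split one attention row (for destination position `dest`) into
    the mass assigned to each candidate behaviour (hypothetical sketch).

    attn_row: attention weights over source positions 0..dest
    tokens:   the token at each position (position 0 is BOS)
    """
    # previous: mass on a few tokens immediately before the destination.
    prev = sum(attn_row[max(0, dest - prev_window):dest])
    # inactive: mass parked on the BOS token.
    inactive = attn_row[0]
    # basic induction: mass on positions whose *preceding* token matches
    # the destination token (the classic A B ... A -> B pattern).
    induction = sum(a for j, a in enumerate(attn_row)
                    if j > 0 and tokens[j - 1] == tokens[dest])
    return {"previous": prev, "inactive": inactive, "induction": induction}
```

Note that the categories overlap by construction, which is fine for a first-pass diagnostic but would need care in a real classifier.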
Some observations:
Across all the patterns, previous is the most common behaviour, followed by inactive and unknown.
A big chunk of the patterns (unknown) was not automatically classified. There are many examples of consistent behaviours there, but we do not know how many patterns they account for.
The destination token does not determine the attention pattern: [·3] and [·7] have essentially the same distributions, with ~87% of patterns not classified.
Prompt examples for each destination token
Token: [·3]
Behaviour: previous+induction
There are many ways to interpret this pattern; there is likely more going on than simple previous and induction behaviours.
Token: ·R
Behaviour: inactive
Token: [·7]
Behaviour: unknown
This is a very common pattern, in which attention is paid from "new line and indentation" to "new line and bigger indentation". We believe it accounts for most of what was classified as unknown for [·7] and [·3].
Token: width
Behaviour: unknown
We did not see many examples like this, but it looks like attention is being paid to recent tokens representing arithmetic operations.
Token: dict
Behaviour: previous
Mostly previous-token attention, but ·collections gets more attention than . and default, which points at something more complicated.
Section 2
Methods
We select a few behaviours and construct pro...
The Nonlinear Library, by The Nonlinear Fund
