The Nonlinear Library

LW - Polysemantic Attention Head in a 4-Layer Transformer by Jett



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Polysemantic Attention Head in a 4-Layer Transformer, published by Jett on November 9, 2023 on LessWrong.
Produced as part of the MATS Program, under the mentorship of @Neel Nanda and @Lee Sharkey.
Epistemic status: optimized to get the post out quickly, but we are confident in the main claims
TL;DR: head 1.4 in attn-only-4l exhibits many different attention patterns, all of which are relevant to the model's performance.
Introduction
In a previous post about the docstring circuit, we found that attention head 1.4 (Layer 1, Head 4) in a 4-layer attention-only transformer would act as either a fuzzy previous token head or as an induction head in different parts of the prompt.
These results suggested that attention head 1.4 was polysemantic, i.e. performing different functions within different contexts.
In Section 1, we classify ~5 million rows of attention patterns associated with 5,000 prompts from the model's training distribution. In doing so, we identify many more simple behaviours that this head exhibits.
In Section 2, we explore three simple behaviours (induction, fuzzy previous token, and bigger indentation) more deeply. We construct a set of prompts for each behaviour and investigate its importance to model performance.
This post provides evidence of the complex role that attention heads play within a model's computation, and shows that reducing an attention head to a single simple behaviour can be misleading.
Section 1
Methods
We uniformly sample 5,000 prompts from the model's training dataset of web text and code.
We collect approximately 5 million individual rows of attention patterns corresponding to these prompts, i.e. rows from the head's attention matrices that each correspond to a single destination position.
We then classify each of these patterns as (a mix of) simple, salient behaviours.
If a behaviour accounts for at least 95% of a pattern, the pattern receives that classification. Otherwise we refer to it as unknown (though there is a multitude of consistent behaviours that we did not define, and thus did not classify).
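The 95% rule described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' actual pipeline; the function name and the input format (a dict mapping each candidate behaviour to its share of the row's attention mass) are our own assumptions.

```python
# Sketch of the >=95% classification rule (hypothetical helper; the
# post does not show the authors' actual implementation).
THRESHOLD = 0.95

def classify_row(behaviour_mass: dict) -> str:
    """Given the fraction of a row's attention mass attributed to each
    candidate behaviour, return a label if a behaviour (or a mix such as
    previous+induction) accounts for >= 95% of the pattern, else 'unknown'."""
    # A single behaviour dominating the row.
    best = max(behaviour_mass, key=behaviour_mass.get)
    if behaviour_mass[best] >= THRESHOLD:
        return best
    # A mix of behaviours jointly dominating the row.
    mix = behaviour_mass.get("previous", 0.0) + behaviour_mass.get("induction", 0.0)
    if mix >= THRESHOLD:
        return "previous+induction"
    return "unknown"
```

For example, a row with 97% of its mass on the preceding few tokens would be labelled previous, while a 50/48 split between previous-token and induction mass would be labelled previous+induction.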
Results
Distribution of behaviours
In Figure 1 we present the results of the classification, where "all" refers to "all destination tokens" and other labels refer to specific destination tokens.
The character · stands for a space, \n for a new line, and labels such as [·K] mean "\n followed by K spaces".
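This token notation can be reproduced with a small helper. The function below is a hypothetical illustration of the labelling convention, not code from the post:

```python
def token_label(tok: str) -> str:
    """Render a token in the figure's notation (hypothetical helper):
    '·' replaces each space, and a newline followed by K spaces
    becomes '[·K]'."""
    if tok.startswith("\n") and set(tok[1:]) <= {" "}:
        return f"[·{len(tok) - 1}]"  # e.g. '\n   ' -> '[·3]'
    return tok.replace(" ", "·")     # e.g. ' R' -> '·R'
```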
We distinguish the following behaviours:
previous: attention concentrated on a few previous tokens
inactive: attention to BOS and EOS
previous+induction: a mix of previous and basic induction
unknown: not classified
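As a rough illustration of these behaviour definitions, a single attention row can be split into the mass each candidate behaviour would claim. This is a pure-Python sketch under our own assumptions (the window size, the BOS-only "inactive" test, and the basic induction check are ours; overlapping behaviours can double count):

```python
def behaviour_masses(attn_row, tokens, dest, prev_window=3):
    """Split one attention row (for destination position `dest`) into
    the mass assigned to each candidate behaviour (hypothetical sketch).

    attn_row: attention weights over source positions 0..dest
    tokens:   the token at each position (position 0 is BOS)
    """
    # previous: mass on a few tokens immediately before the destination.
    prev = sum(attn_row[max(0, dest - prev_window):dest])
    # inactive: mass parked on the BOS token.
    inactive = attn_row[0]
    # basic induction: mass on positions whose *preceding* token matches
    # the destination token (the classic A B ... A -> B pattern).
    induction = sum(a for j, a in enumerate(attn_row)
                    if j > 0 and tokens[j - 1] == tokens[dest])
    return {"previous": prev, "inactive": inactive, "induction": induction}
```

Note that the categories overlap by construction, which is fine for a first-pass diagnostic but would need care in a real classifier.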
Some observations:
Across all the patterns, previous is the most common behaviour, followed by inactive and unknown.
A big chunk of the patterns (unknown) was not automatically classified. There are many examples of consistent behaviours there, but we do not know how many patterns they account for.
The destination token does not determine the attention pattern: [·3] and [·7] have essentially the same distributions, with ~87% of patterns not classified.
Prompt examples for each destination token
Token: [·3]
Behaviour: previous+induction
There are many ways to interpret this pattern; there is likely more going on than simple previous and induction behaviours.
Token: ·R
Behaviour: inactive
Token: [·7]
Behaviour: unknown
This is a very common pattern, in which attention is paid from "new line and indentation" to "new line and bigger indentation". We believe it accounts for most of what was classified as unknown for [·7] and [·3].
Token: width
Behaviour: unknown
We did not see many examples like this, but it looks like attention is being paid to recent tokens representing arithmetic operations.
Token: dict
Behaviour: previous
Mostly previous-token attention, but ·collections gets more attention than . and default, which points at something more complicated.
Section 2
Methods
We select a few behaviours and construct pro...
The Nonlinear Library, by The Nonlinear Fund
