Attention SAEs Scale to GPT-2 Small, published by Connor Kissane on February 3, 2024 on the AI Alignment Forum.
This is an interim report that we are currently building on. We hope this update and the open-sourcing of our SAEs will be useful to related research occurring in parallel. Produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort.
Executive Summary
In a previous post, we showed that sparse autoencoders (SAEs) work on the attention layer outputs of a two-layer transformer. We scale our attention SAEs to GPT-2 Small and continue to find sparse, interpretable features in every layer. This makes us optimistic about our ongoing efforts to scale further, especially since we didn't have to do much iterating.
We open source our SAEs. Load them from Hugging Face or via this colab notebook.
The SAEs seem good, often recovering more than 80% of the loss relative to zero ablation, and are sparse, with fewer than 20 features firing on average. The majority of the live features are interpretable.
We continue to find the same three feature families that we found in the two-layer model: induction features, local context features, and high-level context features. This suggests that some of our lessons from interpreting features in smaller models may generalize.
We also find new, interesting feature families that we didn't find in the two-layer model, providing hints about fundamentally different capabilities in GPT-2 Small.
See our feature interface to browse the first 30 features for each layer.
Introduction
In Sparse Autoencoders Work on Attention Layer Outputs, we showed that we can apply SAEs to extract sparse, interpretable features from the last attention layer of a two-layer transformer. We have since applied the same technique to a 12-layer model, GPT-2 Small, and continue to find sparse, interpretable features in every layer. Our SAEs often recover more than 80% of the loss[1] and are sparse, with fewer than 20 features firing on average. We perform shallow investigations of the first 30 features from each layer and find that the majority (often 80%+) of non-dead SAE features are interpretable. See our interactive visualizations for each layer.
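For readers who want to sanity-check these metrics, here is a minimal sketch of how L0 and loss recovered can be computed with TransformerLens hooks. The hook point, the DummySAE class, and its forward signature are illustrative assumptions rather than our exact setup; a trained SAE loaded from the release below would be substituted for the stand-in.

```python
# A minimal sketch of the two headline metrics: L0 (average number of active
# features per token) and loss recovered relative to zero-ablating the
# attention output. DummySAE, the hook point, and the SAE interface are
# illustrative assumptions, not our exact training setup.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
layer = 5
hook_name = f"blocks.{layer}.hook_attn_out"  # assumed hook point, for illustration


class DummySAE(torch.nn.Module):
    """Stand-in for a trained SAE: ReLU encoder + linear decoder."""

    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.W_enc = torch.nn.Parameter(0.01 * torch.randn(d_in, d_hidden))
        self.b_enc = torch.nn.Parameter(torch.zeros(d_hidden))
        self.W_dec = torch.nn.Parameter(0.01 * torch.randn(d_hidden, d_in))
        self.b_dec = torch.nn.Parameter(torch.zeros(d_in))

    def forward(self, x):
        feature_acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        recon = feature_acts @ self.W_dec + self.b_dec
        return recon, feature_acts


sae = DummySAE(d_in=model.cfg.d_model, d_hidden=16 * model.cfg.d_model)
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog.")
l0s = []


def splice_sae(acts, hook):
    # Replace the attention output with the SAE reconstruction, recording L0.
    recon, feature_acts = sae(acts)
    l0s.append((feature_acts > 0).float().sum(-1).mean().item())
    return recon


def zero_ablate(acts, hook):
    return torch.zeros_like(acts)


clean_loss = model(tokens, return_type="loss")
sae_loss = model.run_with_hooks(tokens, return_type="loss",
                                fwd_hooks=[(hook_name, splice_sae)])
zero_loss = model.run_with_hooks(tokens, return_type="loss",
                                 fwd_hooks=[(hook_name, zero_ablate)])

# Fraction of the clean-vs-zero-ablation loss gap that the SAE closes.
loss_recovered = (zero_loss - sae_loss) / (zero_loss - clean_loss)
print(f"L0: {l0s[-1]:.1f}, loss recovered: {loss_recovered.item():.1%}")
```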
We open source our SAEs in the hope that they will be useful to other researchers currently working on dictionary learning. We are particularly excited about using these SAEs to better understand attention circuits at the feature level. See the SAEs on Hugging Face or load them using this colab notebook. Below we provide the key metrics for each SAE:
| Layer | L0 norm | Loss recovered | Dead features | % alive features interpretable |
|---|---|---|---|---|
| L0 | 3 | 99% | 13% | 97% |
| L1 | 20 | 78% | 49% | 87% |
| L2 | 16 | 90% | 20% | 95% |
| L3 | 15 | 84% | 8% | 75% |
| L4 | 15 | 88% | 5% | 100% |
| L5 | 20 | 85% | 40% | 82% |
| L6 | 19 | 82% | 28% | 75% |
| L7 | 19 | 83% | 58% | 70% |
| L8 | 20 | 76% | 37% | 64% |
| L9 | 21 | 83% | 48% | 85% |
| L10 | 16 | 85% | 41% | 81% |
| L11 | 8 | 89% | 84% | 66% |
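For concreteness, here is a hedged sketch of what loading one of these SAEs might look like. The repo id, filename, and saved key names below are placeholders rather than the actual layout of our Hugging Face upload; treat the colab notebook as the authoritative loading path.

```python
# Sketch: downloading one layer's SAE weights from Hugging Face and inspecting
# them. The repo id and filename are hypothetical placeholders; see the colab
# notebook / Hugging Face page for the actual file layout.
import torch
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="your-org/gpt2-small-attn-saes",   # placeholder repo id
    filename="gpt2_small_L5_attn_sae.pt",      # placeholder filename
)
state_dict = torch.load(path, map_location="cpu")

# Inspect the saved tensors, then load them into whatever SAE class you use
# (e.g. the DummySAE sketch above, if shapes and key names line up).
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
```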
It's worth noting that we didn't do much differently to train these,[2] leaving us optimistic about the tractability of scaling attention SAEs to even bigger models.
Excitingly, we also continue to identify feature families. We find features from all three of the families that we identified in the two-layer model: induction features, local context features, and high-level context features. This gives us hope that some of our lessons from interpreting features in smaller models will continue to generalize.
We also find new, interesting feature families in GPT-2 Small, suggesting that attention SAEs can provide valuable hints about new[3] capabilities that larger models have learned. Some new features include:
Successor features, which activate when predicting the next item in a sequence such as "15, 16" -> "17" (and which partly come from Successor Heads in the model), and boost the logits of the next item.
Name mover features, which predict a name in the context, such as in the IOI task.
Duplicate token f...