Stitching SAEs of different sizes
By Bart Bussmann, published July 13, 2024 on the AI Alignment Forum.
Work done in Neel Nanda's stream of MATS 6.0, with equal contribution by Bart Bussmann and Patrick Leask. Patrick Leask is concurrently a PhD candidate at Durham University.
TL;DR: When you scale up an SAE, the features in the larger SAE can be categorized into two groups: 1) "novel features" that carry information not present in the small SAE, and 2) "reconstruction features" that sparsify information the small SAE already captures. You can stitch SAEs together by adding the novel features to the smaller SAE.
Introduction
Sparse autoencoders (SAEs) have been shown to recover sparse, monosemantic features from language models. However, there has been limited research into how these features vary with dictionary size: if you take the same activations from the same model and train a wider dictionary on them, what changes, and how do the learned features differ?
We show that features in larger SAEs cluster into two kinds: those that capture similar information to the smaller SAE (either identical features or split features; about 65%), and those that capture novel information absent from the smaller SAE (the remaining 35%). We validate this by showing that inserting the novel features from the larger SAE into the smaller SAE boosts reconstruction performance, while inserting the similar features makes performance worse.
Building on this insight, we show how features from multiple SAEs of different sizes can be combined to create a "Frankenstein" model that outperforms SAEs with an equal number of features, though this tends to lead to a higher L0, making a fair comparison difficult. Our work provides a new understanding of how SAE dictionary size affects the learned feature space, and of how to reason about whether to train a wider SAE.
We hope that this method may also lead to a practically useful way of training high-performance SAEs with less feature splitting and a wider range of learned novel features.
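To make the stitching operation concrete, here is a minimal PyTorch sketch (not the code used in this work): the SparseAutoencoder layout, the weight names, and the novel_idx list of "novel" feature indices in the larger SAE are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Minimal SAE: x -> ReLU(x @ W_enc + b_enc) -> acts @ W_dec + b_dec."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.zeros(d_model, n_features))
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.zeros(n_features, d_model))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        acts = torch.relu(x @ self.W_enc + self.b_enc)
        return acts @ self.W_dec + self.b_dec


def stitch_saes(small: SparseAutoencoder, large: SparseAutoencoder,
                novel_idx: torch.Tensor) -> SparseAutoencoder:
    """Append the features of `large` indexed by `novel_idx` onto `small`."""
    d_model, f_small = small.W_enc.shape
    stitched = SparseAutoencoder(d_model, f_small + len(novel_idx))
    with torch.no_grad():
        # Keep every feature of the small SAE unchanged.
        stitched.W_enc[:, :f_small] = small.W_enc
        stitched.b_enc[:f_small] = small.b_enc
        stitched.W_dec[:f_small] = small.W_dec
        # Append only the novel encoder/decoder directions from the large SAE.
        stitched.W_enc[:, f_small:] = large.W_enc[:, novel_idx]
        stitched.b_enc[f_small:] = large.b_enc[novel_idx]
        stitched.W_dec[f_small:] = large.W_dec[novel_idx]
        # Keep the small SAE's decoder bias (a simplifying assumption).
        stitched.b_dec.copy_(small.b_dec)
    return stitched
```

In this sketch, the smaller SAE's features are left untouched and only the encoder rows, encoder biases, and decoder rows of the selected novel features are appended, which is what makes the resulting "Frankenstein" SAE larger than either parent in L0 but not in parameter count per feature.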
Larger SAEs learn both similar and entirely novel features
Set-up
We use sparse autoencoders as in Towards Monosemanticity and Sparse Autoencoders Find Highly Interpretable Directions. In our setup, the encoder computes the feature activations from the input activation, the decoder reconstructs the input from these feature activations, and the encoder and decoder matrices and biases are trained with a loss that combines the L2 reconstruction error with an L1 sparsity penalty on the feature activations.
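Concretely, this is a minimal sketch of the standard formulation from Towards Monosemanticity; the notation ($W_{\mathrm{enc}}$, $\mathbf{b}_{\mathrm{enc}}$, $W_{\mathrm{dec}}$, $\mathbf{b}_{\mathrm{dec}}$, $\lambda$) is assumed here rather than taken from the original equations:

$$f(\mathbf{x}) = \operatorname{ReLU}\!\left(W_{\mathrm{enc}}\,\mathbf{x} + \mathbf{b}_{\mathrm{enc}}\right)$$
$$\hat{\mathbf{x}} = W_{\mathrm{dec}}\,f(\mathbf{x}) + \mathbf{b}_{\mathrm{dec}}$$
$$\mathcal{L}(\mathbf{x}) = \lVert \mathbf{x} - \hat{\mathbf{x}} \rVert_2^2 + \lambda\,\lVert f(\mathbf{x}) \rVert_1$$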
In our experiments, we train a range of sparse autoencoders (SAEs) with varying widths on residual stream activations of GPT-2 and Pythia-410M. The width of an SAE is the number of features (F) in its dictionary. Our smallest SAE on GPT-2 has only 768 features, while the largest has nearly 100,000 features. Here is the full list of SAEs used in this research:
| Name | Model site | Dictionary size | L0 | MSE | CE loss recovered from zero ablation | CE loss recovered from mean ablation |
|---|---|---|---|---|---|---|
| GPT2-768 | gpt2-small layer 8 of 12 resid_pre | 768 | 35.2 | 2.72 | 0.915 | 0.876 |
| GPT2-1536 | gpt2-small layer 8 of 12 resid_pre | 1536 | 39.5 | 2.22 | 0.942 | 0.915 |
| GPT2-3072 | gpt2-small layer 8 of 12 resid_pre | 3072 | 42.4 | 1.89 | 0.955 | 0.937 |
| GPT2-6144 | gpt2-small layer 8 of 12 resid_pre | 6144 | 43.8 | 1.631 | 0.965 | 0.949 |
| GPT2-12288 | gpt2-small layer 8 of 12 resid_pre | 12288 | 43.9 | 1.456 | 0.971 | 0.958 |
| GPT2-24576 | gpt2-small layer 8 of 12 resid_pre | 24576 | 42.9 | 1.331 | 0.975 | 0.963 |
| GPT2-49152 | gpt2-small layer 8 of 12 resid_pre | 49152 | 42.4 | 1.210 | 0.978 | 0.967 |
| GPT2-98304 | gpt2-small layer 8 of 12 resid_pre | 98304 | 43.9 | 1.144 | 0.980 | 0.970 |
| Pythia-8192 | Pythia-410M-deduped layer 3 of 24 resid_pre | 8192 | 51.0 | 0.030 | 0.977 | 0.972 |
| Pythia-16384 | Pythia-410M-deduped layer 3 of 24 resid_pre | 16384 | 43.2 | 0.024 | 0.983 | 0.979 |
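For reference, the "CE loss recovered" columns can be read under the standard definition used in the SAE literature (an assumption; the post's exact metric may differ): the SAE reconstruction is patched into the residual stream, and its cross-entropy loss is compared against the clean model and an ablation baseline (zeroing or mean-ablating the activation):

$$\text{CE recovered} = \frac{\mathrm{CE}_{\text{ablate}} - \mathrm{CE}_{\text{patched}}}{\mathrm{CE}_{\text{ablate}} - \mathrm{CE}_{\text{clean}}}$$

A value of 1 means patching in the SAE reconstruction leaves the model's loss unchanged from the clean run.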
The base language models used are those included in Transform...