Some open-source dictionaries and dictionary learning infrastructure
By Sam Marks, December 5, 2023, The AI Alignment Forum
As more people begin work on interpretability projects which incorporate dictionary learning, it will be valuable to have high-quality dictionaries publicly available.[1] To get the ball rolling on this, my collaborator (Aaron Mueller) and I are:
open-sourcing a number of sparse autoencoder dictionaries trained on Pythia-70m MLPs
releasing our repository for training these dictionaries[2].
Let's discuss the dictionaries first, and then the repo.
The dictionaries
The dictionaries can be downloaded from here. See the sections "Downloading our open-source dictionaries" and "Using trained dictionaries" here for information about how to download and use them. If you use these dictionaries in a published paper, we ask that you mention us in the acknowledgements.
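For concreteness, here is a rough sketch of what loading and applying one of these dictionaries might look like. The class and method names (AutoEncoder, encode, decode) and the file path are guesses at the repo's interface, not a verbatim excerpt; the repo's README documents the actual usage.

```python
import torch
from dictionary_learning import AutoEncoder  # class name assumed; see the repo README

activation_dim = 512        # MLP output dimension for pythia-70m-deduped
dictionary_size = 16 * 512  # e.g. the 0_8192 dictionaries

# load a trained dictionary (path is illustrative, not the actual download layout)
ae = AutoEncoder(activation_dim, dictionary_size)
ae.load_state_dict(torch.load("dictionaries/layer0/ae.pt"))

# stand-in for real MLP output activations gathered from the model
acts = torch.randn(128, activation_dim)
features = ae.encode(acts)            # sparse feature coefficients
reconstruction = ae.decode(features)  # approximate reconstruction of acts
```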
We're releasing two sets of dictionaries for EleutherAI's 6-layer pythia-70m-deduped model. The dictionaries in both sets were trained on 512-dimensional MLP output activations (not the MLP hidden layer like Anthropic used), using ~800M tokens from The Pile.
The first set, called 0_8192, consists of dictionaries of size 8192 = 16 × 512. These were trained with an L1 penalty of 1e-3.
The second set, called 1_32768, consists of dictionaries of size 32768 = 64 × 512. These were trained with an L1 penalty of 3e-3.
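For readers unfamiliar with the setup, the sketch below shows the standard sparse-autoencoder training objective these hyperparameters plug into: reconstruction error plus an L1 penalty on the feature coefficients. It is a generic formulation written for illustration, not the repo's actual training code; the architecture details there may differ (e.g. bias terms or decoder weight normalization).

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Generic sketch of a one-hidden-layer sparse autoencoder."""
    def __init__(self, activation_dim=512, dict_size=16 * 512):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, dict_size)
        self.decoder = nn.Linear(dict_size, activation_dim)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse feature coefficients
        x_hat = self.decoder(f)          # reconstruction of the activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_penalty):
    mse = (x - x_hat).pow(2).mean()      # reconstruction term ("MSE loss" below)
    l1 = f.abs().sum(dim=-1).mean()      # sparsity term ("L1 loss" below)
    return mse + l1_penalty * l1

# toy usage with the 0_8192 hyperparameters
sae = SparseAutoencoder(activation_dim=512, dict_size=8192)
x = torch.randn(64, 512)                 # stand-in for MLP output activations
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f, l1_penalty=1e-3)
```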
Here are some statistics. (See our repo's readme for more info on what these statistics mean.)
For dictionaries in the 0_8192 set:

| Layer | MSE Loss | L1 Loss | L0 | % Alive | % Loss Recovered |
|-------|----------|---------|------|---------|------------------|
| 0 | 0.056 | 6.132 | 9.951 | 0.998 | 0.984 |
| 1 | 0.089 | 6.677 | 44.739 | 0.887 | 0.924 |
| 2 | 0.108 | 11.44 | 62.156 | 0.587 | 0.867 |
| 3 | 0.135 | 23.773 | 175.303 | 0.588 | 0.902 |
| 4 | 0.148 | 27.084 | 174.07 | 0.806 | 0.927 |
| 5 | 0.179 | 47.126 | 235.05 | 0.672 | 0.972 |
For dictionaries in the 1_32768 set:

| Layer | MSE Loss | L1 Loss | L0 | % Alive | % Loss Recovered |
|-------|----------|---------|------|---------|------------------|
| 0 | 0.09 | 4.32 | 2.873 | 0.174 | 0.946 |
| 1 | 0.13 | 2.798 | 11.256 | 0.159 | 0.768 |
| 2 | 0.152 | 6.151 | 16.381 | 0.118 | 0.724 |
| 3 | 0.211 | 11.571 | 39.863 | 0.226 | 0.765 |
| 4 | 0.222 | 13.665 | 29.235 | 0.19 | 0.816 |
| 5 | 0.265 | 26.4 | 43.846 | 0.13 | 0.931 |
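As a rough guide to the columns above (these are approximate definitions for illustration; the repo's README gives the ones actually used): MSE loss and L1 loss are the two terms of the training objective, L0 is the average number of features active per activation, and % alive is the fraction of features that fire at least once over a sample of inputs. % loss recovered compares the model's next-token loss when the MLP output is replaced by its dictionary reconstruction against zero-ablating it and against the unmodified model, so it requires running the full model and is omitted from this sketch.

```python
import torch

def dictionary_stats(sae, x):
    """Approximate versions of the table statistics, given a trained
    autoencoder (as in the sketch above) and a batch of activations x."""
    x_hat, f = sae(x)                    # f: (num_inputs, dict_size)
    active = f > 0
    return {
        "mse_loss": (x - x_hat).pow(2).mean().item(),
        "l1_loss": f.abs().sum(dim=-1).mean().item(),
        "l0": active.float().sum(dim=-1).mean().item(),         # avg. features active per input
        "frac_alive": active.any(dim=0).float().mean().item(),  # features firing at least once
    }
```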
And here are some histograms of feature frequencies.
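The quantity being histogrammed is each feature's firing frequency: the fraction of tokens on which its coefficient is nonzero. A rough sketch of how one might compute it (again using the generic autoencoder sketch above, not the repo's code):

```python
import torch

def feature_frequencies(sae, activation_batches):
    """Fraction of tokens on which each dictionary feature fires."""
    counts, total_tokens = None, 0
    for x in activation_batches:            # each x: (num_tokens, activation_dim)
        _, f = sae(x)
        fired = (f > 0).float().sum(dim=0)  # per-feature count of firing tokens
        counts = fired if counts is None else counts + fired
        total_tokens += x.shape[0]
    return counts / total_tokens            # per-feature frequency in [0, 1]
```

Plotting these frequencies on a log scale is what makes the cluster of high-frequency features discussed below visible.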
Overall, I'd guess that these dictionaries are decent, but not amazing.
We trained these dictionaries because we wanted to work on a downstream application of dictionary learning, but lacked the dictionaries. These dictionaries are more than good enough to get us off the ground on our mainline project, but I expect that before long we'll come back and train some better dictionaries (which we'll also open-source). I think the same is true for other folks: these dictionaries should be sufficient for getting started on projects that require dictionaries, and when better dictionaries become available later, you can swap them in for better results.
Some miscellaneous notes about these dictionaries (you can find more in the repo):
The L1 penalty for 1_32768 seems to have been too large; only 10-20% of the neurons are alive, and the loss recovered is much worse. That said, we'll remark that after examining features from both sets of dictionaries, the dictionaries from the 1_32768 set seem to have more interpretable features than those from the 0_8192 set (though it's hard to tell).
In particular, we suspect that for 0_8192, the many high-frequency features in the later layers are uninterpretable but help significantly with reconstructing activations, resulting in deceptively good-looking statistics. (See the bullet point below regarding neuron resampling and bimodality.)
As we progress through the layers, the dictionaries tend to get worse along most metrics (except for % loss recovered). This may have to do with the growing scale of the activations themselves as one moves through the layers of Pythia models (h/t to Arthur Conmy for raising this hypothesis).
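One way to check this hypothesis (a sketch of my own, not something from the post): record the norm of each layer's MLP output on some text and see whether it grows with depth. This assumes the standard HuggingFace GPTNeoX module layout for pythia-70m-deduped.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m-deduped"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

norms = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # output: (batch, seq_len, 512) MLP output activations
        norms[layer_idx] = output.norm(dim=-1).mean().item()
    return hook

handles = [
    layer.mlp.register_forward_hook(make_hook(i))
    for i, layer in enumerate(model.gpt_neox.layers)
]

with torch.no_grad():
    batch = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
    model(**batch)

for h in handles:
    h.remove()

for i in sorted(norms):
    print(f"layer {i}: mean MLP output norm = {norms[i]:.2f}")
```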
We note that our dictionary fea...