The Nonlinear Library

AF - (tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders by Logan Riggs Smith


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: (tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders, published by Logan Riggs Smith on July 5, 2023 on The AI Alignment Forum.
I trained a sparse autoencoder on the activations of Pythia-70M layer_2's MLP (after the GeLU) and present evidence that the resulting decoder (aka "dictionary") learned 600+ features, although I expect around 8k-16k features to be learnable.
Dictionary Learning: Short Explanation
Good explanation here & original here, but in short: a good dictionary means that you could give me any input & I can reconstruct it using a linear combination of dictionary elements. For example, signals can be reconstructed as a linear combination of frequencies:
In the same way, the neuron activations in a large language model (LLM) can be reconstructed as a linear combination of features. e.g.
neuron activations = 4([duplicate token] feature) + 7(bigram " check please" feature).
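As a toy illustration of this (the feature vectors and magnitudes below are made up for the example, not taken from the trained dictionary), the reconstruction is just a sparse weighted sum of dictionary rows:

```python
import numpy as np

d_mlp = 2048                                        # width of the MLP activation vector
n_features = 1000                                   # illustrative dictionary size
dictionary = np.random.randn(n_features, d_mlp)     # rows are (made-up) feature vectors
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

# Only a few features "activate" for a given input, eg features 3 and 17 here.
magnitudes = np.zeros(n_features)
magnitudes[3] = 4.0                                 # eg the "[duplicate token]" feature
magnitudes[17] = 7.0                                # eg the bigram feature

# The reconstruction is a sparse linear combination of dictionary rows.
reconstruction = magnitudes @ dictionary            # shape (2048,)
```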
Big Picture: If we learn all the atomic features that make up all model behavior, then we can pick & choose the features we want (e.g. honesty)
To look at the autoencoder:
So for every neuron-activation vector (ie a 2048-sized vector), the autoencoder is trained to encode a sparse set of feature activations/magnitudes (sparse as in only a few features "activate", ie have non-zero magnitudes), which are then multiplied by their respective feature vectors (ie rows in the decoder/"dictionary") in order to reconstruct the original neuron activations.
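A minimal sketch of such an autoencoder in PyTorch (the ReLU encoder, dictionary size, and L1 sparsity penalty here are illustrative assumptions, not necessarily the exact training setup used):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_mlp: int = 2048, n_features: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_mlp, n_features)               # activations -> feature magnitudes
        self.decoder = nn.Linear(n_features, d_mlp, bias=False)   # decoder.weight.T[i] is feature i's vector

    def forward(self, acts: torch.Tensor):
        feature_mags = torch.relu(self.encoder(acts))   # non-negative, trained to be sparse
        recon = self.decoder(feature_mags)              # linear combination of feature vectors
        return recon, feature_mags

def sae_loss(acts, recon, feature_mags, l1_coef: float = 1e-3):
    recon_loss = (recon - acts).pow(2).mean()           # reconstruct the MLP activations
    sparsity = feature_mags.abs().mean()                # L1 penalty drives sparsity
    return recon_loss + l1_coef * sparsity
```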
As an example, suppose the input is " Let u be f(8). Let w", and the decomposed linear combination of features is:
[" Let u be f(8). Let w"] = 4(letter "w" feature) + 7(bigram "Let [x/w/n/p/etc]" feature)
Note: The activation here is only for the last token " w" given the previous context. In general you get 2048 neuron activations for every token, but I'm just focusing on the last token in this example.
For the post, it's important to understand that:
Features have both a feature vector (ie the 2048-sized vector that is a row in the decoder) & a magnitude (ie a real number calculated on an input-by-input basis). Please ask questions in the comments if this doesn't make sense, especially after reading the rest of the post.
I calculate Max cosine similarity (MCS) between the feature vectors in two separately trained dictionaries. So if feature0 in dictionary 0 is "duplicate tokens", and feature98 in dictionary 1 has high cosine similarity with it, then I expect both to be representing "duplicate tokens" and for this to be a "real" feature. [Intuition: there are many ways to be wrong & only one way to be right]
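Concretely, MCS can be computed like this (a sketch, assuming each dictionary is stored as an (n_features, 2048) matrix whose rows are feature vectors):

```python
import torch

def max_cosine_similarity(dict_a: torch.Tensor, dict_b: torch.Tensor) -> torch.Tensor:
    """For each feature vector in dict_a, the cosine similarity of its closest match in dict_b."""
    a = dict_a / dict_a.norm(dim=1, keepdim=True)
    b = dict_b / dict_b.norm(dim=1, keepdim=True)
    cos = a @ b.T                      # (n_a, n_b) pairwise cosine similarities
    return cos.max(dim=1).values

# Features with MCS near 1 (eg ~0.99 for feature52 below) are likely "real"
# features recovered by both dictionaries.
```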
Feature Case Study
Top-activating Examples for feature #52
I ran 500k tokens through the model & found the (token, context) pairs that activated each feature. In this case, I chose feature52, which had an MCS of ~0.99. We can then look at the datapoints that maximally activate this feature. (To be clear, I am running datapoints through Pythia-70M, grabbing the activations mid-way through at layer 2's MLP after the GeLU, & running that through the autoencoder, grabbing the feature magnitudes, ie the latent activations.)
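For reference, that pipeline looks roughly like this (a sketch assuming the HuggingFace GPTNeoX module layout for Pythia and reusing the SparseAutoencoder sketch above; in practice `autoencoder` is the trained model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
autoencoder = SparseAutoencoder()          # in practice: the trained autoencoder from above

cached = {}
def save_acts(module, inputs, output):
    cached["acts"] = output                # (batch, seq_len, 2048) post-GeLU MLP activations

# Hook the activation function inside layer 2's MLP (GPTNeoX layout).
model.gpt_neox.layers[2].mlp.act.register_forward_hook(save_acts)

tokens = tokenizer(" Let u be f(8). Let w", return_tensors="pt")
with torch.no_grad():
    model(**tokens)
    _, feature_mags = autoencoder(cached["acts"])

# feature_mags[0, -1, 52] is feature52's magnitude on the last token " w".
```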
Blue here means this feature is activating highly for that (token, context) pair. The top line has activation 3.5 for the first " $" & 5 for the second. Note that it doesn't activate for the closing " $" and is generally sparse.
Ablate Context
Ablate the context one token at a time & see the effect on the feature magnitude at the last token. Red means the feature activated less when ablating that token.
To be clear, I am literally removing the token & running the sequence through the model again to see the change in feature activation at the last position. In the top line, removing the last " $" makes the new last token " all", which has an activation of 0, so the difference is 0-5 = -5, which is the value assigned to the dark red cell on " $" (removing " all" before it had no effect, so it...
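In code, the ablation loop is roughly as follows (a sketch reusing the hypothetical names from the pipeline sketch above, tracking a single feature index):

```python
import torch

FEATURE = 52

def feature_mag_at_last_token(input_ids: torch.Tensor) -> float:
    """Run tokens through Pythia-70M + autoencoder; return the feature's magnitude at the last position."""
    with torch.no_grad():
        model(input_ids=input_ids)                 # refreshes cached["acts"] via the hook
        _, mags = autoencoder(cached["acts"])
    return mags[0, -1, FEATURE].item()

input_ids = tokens["input_ids"]                    # the example input from above
baseline = feature_mag_at_last_token(input_ids)

# Drop each context token in turn and measure the change at the (new) last position.
for i in range(input_ids.shape[1]):
    ablated = torch.cat([input_ids[:, :i], input_ids[:, i + 1:]], dim=1)
    diff = feature_mag_at_last_token(ablated) - baseline
    print(repr(tokenizer.decode(input_ids[0, i])), round(diff, 2))   # large negative diff = dark red
```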