The Nonlinear Library

AF - [Replication] Conjecture's Sparse Coding in Small Transformers by Hoagy


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Replication] Conjecture's Sparse Coding in Small Transformers, published by Hoagy on June 16, 2023 on The AI Alignment Forum.
Summary
A couple of weeks ago Logan Riggs and I posted that we'd replicated the toy-model experiments in Lee Sharkey and Dan Braun's original sparse coding post. Now we've replicated the work they did (slides) extending the technique to custom-trained small transformers (in their case 16 residual dimensions, in ours 32).
We've been able to replicate all of their core results, and our main takeaways from the last 2 weeks of research are the following:
We can recover many more features from activation space than the dimension of that space, and these features have a high degree of cosine similarity with the features learned by other, larger dictionaries, which in toy models was a core indicator of having learned the correct features.
The distribution of MCS (max cosine similarity) scores between trained dictionaries of different sizes is highly bimodal, suggesting that there is a particular set of features that is consistently found as sparse basis vectors of the activations across different dictionary sizes.
The maximum-activating examples of these features usually seem human interpretable, though we haven't yet done a systematic comparison to the neuron basis.
The diagonal lines seen in the original post are an artefact of dividing the l1_loss coefficient by the dictionary size. Removing this division means that the same l1_loss coefficient is applicable across a broad range of dictionary sizes.
We find that as dictionary size increases, MMCS (mean max cosine similarity) rises rapidly at first but then plateaus (a minimal sketch of the dictionary training and MMCS computation follows this list).
The learned feature vectors, including those that appear repeatedly, do not appear to be at all sparse with respect to the neuron basis.
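To make this concrete, here is a minimal sketch in PyTorch of the kind of dictionary training and MMCS comparison described above. It is illustrative rather than our exact code: the architecture, loss details, and names (`SparseAutoencoder`, `loss_fn`, `mmcs`) are assumptions for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Overcomplete dictionary of activation-space features (illustrative)."""
    def __init__(self, activation_dim: int, dict_size: int):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, dict_size)
        self.decoder = nn.Linear(dict_size, activation_dim, bias=False)

    def forward(self, x):
        codes = F.relu(self.encoder(x))   # sparse feature coefficients
        recon = self.decoder(codes)       # reconstruction of the activation
        return recon, codes

def loss_fn(x, recon, codes, l1_coeff: float):
    recon_loss = F.mse_loss(recon, x)
    # Sum |codes| over the feature dimension, mean over the batch:
    # the l1 coefficient is NOT divided by the dictionary size, per the takeaway above.
    sparsity_loss = l1_coeff * codes.abs().sum(dim=-1).mean()
    return recon_loss + sparsity_loss

def mmcs(dict_a: torch.Tensor, dict_b: torch.Tensor) -> torch.Tensor:
    """Mean max cosine similarity between two learned dictionaries.

    dict_a: (n_features_a, activation_dim), dict_b: (n_features_b, activation_dim).
    For each feature in dict_a, take the max cosine similarity over dict_b, then average.
    """
    a = F.normalize(dict_a, dim=-1)
    b = F.normalize(dict_b, dim=-1)
    cos = a @ b.T                         # (n_features_a, n_features_b)
    return cos.max(dim=1).values.mean()
```

Training would minimise `loss_fn` over batches of residual-stream (or MLP) activations; `mmcs` would then compare the decoder weight matrices of two dictionaries of different sizes (transposed so that each row is a feature direction).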
As before, all code is available on GitHub and if you'd like to follow the research progress and potentially contribute, then join us on our EleutherAI thread.
Thanks to Robert Huben (Robert_AIZI) for extensive comments on this draft, and Lee Sharkey, Dan Braun and Robert Huben for their comments during our work.
Next steps & Request For Funding
We'd like to test, in a quantitative manner, how interpretable the features we have found are. We've got the basic structure ready to apply OpenAI's automated-interpretability library to the found features, which we would then compare to baselines such as the neuron basis and the PCA and ICA of the activation data.
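As a rough sketch of what those baseline directions could look like (assuming the activations are stored as an (n_samples, activation_dim) array; the file name and exact pipeline here are hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# Residual-stream activations collected from the model, shape (n_samples, activation_dim).
activations = np.load("activations.npy")  # hypothetical path

d = activations.shape[1]
pca_directions = PCA(n_components=d).fit(activations).components_             # (d, d)
ica_directions = FastICA(n_components=d, max_iter=1000).fit(activations).components_

# Each row is a candidate "feature" direction. Max-activating examples for these
# directions would be scored by the automated-interpretability pipeline, alongside
# the neuron basis (identity matrix) and the learned dictionary features.
neuron_basis = np.eye(d)
```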
This requires quite a lot of tokens: something like 20c worth of tokens per query, depending on the number of example sentences given. We would need to analyse hundreds of neurons per dictionary or approach to get a representative sample size, so we're looking for funding for this research on the order of a few thousand dollars, potentially more if it were available. If you'd be interested in supporting this research, please get in touch. We'll also be actively searching for funding, and trying to get OpenAI to donate some compute.
Assuming the results of this experiment are promising (meaning that we are able to exceed the baselines in terms of the quality or quantity of highly interpretable features), we then plan to focus on scaling up to larger models and experimenting with variations of the technique that incorporate additional information, such as combining activation data from multiple layers or using the weight vectors to inform the selection of features. We're also very interested in working with people developing mathematical or toy models of superposition.
Results
Background
We used Andrej Karpathy's nanoGPT to train a small transformer with 6 layers, a 32-dimensional residual stream, an MLP width of 128, and 4 attention heads of dimension 8. We trained on a node of 4x RTX 3090s for about 9 hours, reaching a loss of 5.13 on OpenWebText. This transformer is the model from which we took activations.
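For reference, a nanoGPT config matching these dimensions might look roughly like the following. In nanoGPT the MLP width defaults to 4 * n_embd and the head dimension to n_embd / n_head, so n_embd = 32 with 4 heads gives the MLP width of 128 and head dimension of 8 quoted above; the remaining values below are hypothetical placeholders rather than our actual settings.

```python
# Hypothetical nanoGPT config file, e.g. config/train_tiny_owt.py
# Model shape: 6 layers, 32-dim residual stream, 4 heads (head dim 8), MLP width 4*32=128.
n_layer = 6
n_head = 4
n_embd = 32
dropout = 0.0

dataset = 'openwebtext'
block_size = 256        # illustrative context length; not stated in the post
batch_size = 64         # illustrative
learning_rate = 1e-3    # illustrative
max_iters = 600000      # illustrative; the post reports ~9 hours on 4x RTX 3090s
```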