Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Sparse Autoencoders: Future Work, published by Logan Riggs Smith on September 21, 2023 on The AI Alignment Forum.
Mostly my own writing, except for the 'Better Training Methods' section, which was written by @Aidan Ewart. We made a lot of progress in 4 months working on Sparse Autoencoders, an unsupervised method to scalably find monosemantic features in LLMs, but there's still plenty of work to do. Below I (Logan) give both research ideas and my current, half-baked thoughts on how to pursue them.
Find All the Circuits!
Truth/Deception/Sycophancy/Train-Test distinction/[In-context Learning/internal Optimization]
Find features relevant for these tasks. Do they generalize better than baselines?
For internal optimization, can we narrow this down to a circuit (using something like causal scrubbing) and retarget the search?
Understand RLHF
Find features for preference/reward models that make the reward large or very negative.
Compare features of models before & after RLHF
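As a minimal sketch of one way to do that comparison (my framing, not the post's method): train SAEs on the same layer of the base model and the RLHF'd model, then look up, for each base-model dictionary feature, its nearest RLHF-model features by cosine similarity of decoder directions. All names below are hypothetical.
```python
import torch

def match_features(dec_base, dec_rlhf, k=5):
    """Compare two dictionaries by cosine similarity of decoder directions.

    dec_base, dec_rlhf: (n_features, d_model) SAE decoder matrices trained on
    the same layer of the base model and the RLHF'd model. Returns, for each
    base feature, the similarities and indices of its top-k closest RLHF
    features, so you can spot features that vanish or appear after RLHF.
    """
    a = torch.nn.functional.normalize(dec_base, dim=-1)
    b = torch.nn.functional.normalize(dec_rlhf, dim=-1)
    sims = a @ b.T                      # (n_base, n_rlhf) cosine similarities
    return sims.topk(k, dim=-1)         # (values, indices), each (n_base, k)
```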
Adversarial Attacks
What features activate on adversarial attacks? What features feed into those?
Develop adversarial attacks, but only search over dictionary features
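One hedged sketch of what "only search over dictionary features" could look like: parameterize an activation-space perturbation as a sparse combination of SAE decoder directions and optimize the coefficients. `model_from_layer` (the rest of the model, run from the perturbed activation) and the loss details are assumptions for illustration, not anything from the post.
```python
import torch

def feature_space_attack(model_from_layer, decoder, x, target_fn,
                         steps=200, lr=1e-2, l1=1e-2):
    """Search for an adversarial perturbation restricted to the dictionary's span.

    decoder: (n_features, d_model) SAE decoder (dictionary) matrix
    x: (d_model,) clean activation at the layer the SAE was trained on
    target_fn: maps the model's output logits to a scalar we try to maximize
    """
    coeffs = torch.zeros(decoder.shape[0], requires_grad=True)
    opt = torch.optim.Adam([coeffs], lr=lr)
    for _ in range(steps):
        delta = coeffs @ decoder                        # perturbation built from dictionary features
        loss = -target_fn(model_from_layer(x + delta))  # push the model toward the target behaviour
        loss = loss + l1 * coeffs.abs().sum()           # L1 keeps the attack sparse and interpretable
        opt.zero_grad()
        loss.backward()
        opt.step()
    return coeffs.detach()                              # inspect the few features the attack used
```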
Circuits Across Time
Using a model with lots of checkpoints, like Pythia, we can watch feature & circuit formation over the course of training on a fixed set of datapoints.
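A rough sketch of the setup, assuming nothing beyond the public Pythia checkpoints: load a few training-step revisions from the HF hub and track how strongly a fixed dictionary direction shows up in the residual stream on the same datapoints. Whether a dictionary trained on the final checkpoint even transfers to earlier ones is part of the question; the feature direction below is just a stand-in.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m-deduped"
tok = AutoTokenizer.from_pretrained(model_name)
prompts = ["The Eiffel Tower is in", "My favourite colour is"]

feature_dir = torch.randn(512)              # stand-in for a real SAE decoder direction (d_model=512)
feature_dir = feature_dir / feature_dir.norm()

for step in (1000, 10000, 50000, 143000):   # training-step checkpoints published as HF revisions
    model = AutoModelForCausalLM.from_pretrained(model_name, revision=f"step{step}")
    with torch.no_grad():
        for p in prompts:
            out = model(**tok(p, return_tensors="pt"), output_hidden_states=True)
            resid = out.hidden_states[4][0]  # layer-4 residual stream, shape (seq, d_model)
            proj = resid @ feature_dir       # per-token projection onto the candidate feature
            print(step, repr(p), round(proj.max().item(), 3))
```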
Circuits Across Scale
Pythia models are trained on the same data, in the same order, but range in size from 70M to 12B parameters.
Turn LLMs into code
Link to very rough draft of the idea I (Logan) wrote in two days
Mechanistic Anomaly Detection
If distribution X has features A,B,C activate, and distribution Y has features B,C,D, you may be able to use this discrete property to get a better ROC curve than strictly continuous methods.
How do the different operationalizations of distance between discrete features compare against each other?
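A minimal sketch of the discrete version, with Jaccard overlap as one possible operationalization (my choice, and exactly the kind of thing the question above is asking about): score a test point by how little its set of active dictionary features overlaps with sets seen on the trusted distribution.
```python
import numpy as np
from sklearn.metrics import roc_auc_score

def active_set(feature_acts, threshold=1e-3):
    """Indices of dictionary features that fire on one datapoint."""
    return set(np.flatnonzero(feature_acts > threshold))

def anomaly_score(feature_acts, reference_sets):
    """1 minus the best Jaccard overlap with any feature set from the trusted distribution."""
    s = active_set(feature_acts)
    overlaps = [len(s & r) / max(len(s | r), 1) for r in reference_sets]
    return 1.0 - max(overlaps)

# Usage sketch: reference_sets built from distribution X; test points drawn from X and Y.
# labels = [0] * len(x_test) + [1] * len(y_test)
# scores = [anomaly_score(a, reference_sets) for a in x_test + y_test]
# print(roc_auc_score(labels, scores))   # compare against continuous baselines
```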
Activation Engineering
Use feature directions found by the dictionary instead of examples. I predict this will generalize better, but it would be good to compare against current methods.
One open problem is which token in the sequence to add the vector to. Maybe it makes sense to only add the [female] direction to tokens that are [names]. Dictionary features in previous layers may help you automatically pick the right tokens, e.g. a feature that activates on [names].
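A hedged sketch of that last idea: add the [female] dictionary direction only at positions where a hypothetical earlier-layer [names] feature is active. Every name here is illustrative; `female_dir` would be a decoder row of the SAE at the edited layer, and `name_feature_acts` would come from a previous layer's dictionary.
```python
import torch

def steer_at_name_tokens(resid, female_dir, name_feature_acts,
                         scale=4.0, act_threshold=0.5):
    """Add a feature direction to the residual stream, but only at name tokens.

    resid: (seq, d_model) residual-stream activations at the edited layer
    female_dir: (d_model,) dictionary direction to add
    name_feature_acts: (seq,) activations of a [names] feature from an earlier layer
    """
    mask = (name_feature_acts > act_threshold).float().unsqueeze(-1)  # (seq, 1)
    return resid + scale * mask * female_dir                          # broadcasts over d_model
```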
Fun Stuff
Othello/Chess/Motor commands - Find features that relate to actions that a model is able to do. Can we find a corner piece feature, a knight feature, a "move here" feature?
Feature Search
There are three ways to find features AFAIK:
1. Which input tokens activate it?
2. What output logits are causally downstream from it?
3. Which intermediate features cause it/are caused by it?
1) Input Tokens
When finding the input tokens, you may run into outlier dimensions that activate highly for most tokens (predominantly the first token), so you need to account for that.
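A small sketch of route (1) with that caveat handled, assuming you have already cached one feature's activations over a dataset (all names hypothetical): rank the max-activating tokens while skipping the first position, where outlier dimensions fire on most sequences.
```python
import torch

def top_activating_tokens(feature_acts, tokens, k=10, skip_first=True):
    """Top-k (token, activation) pairs for one dictionary feature.

    feature_acts: (n_seqs, seq_len) tensor of the feature's activations
    tokens: matching nested list of token strings, indexed as tokens[i][j]
    """
    acts = feature_acts.clone()
    if skip_first:
        acts[:, 0] = float("-inf")        # ignore the first position (outlier-dimension behaviour)
    top = acts.flatten().topk(k).indices
    rows = (top // acts.shape[1]).tolist()
    cols = (top % acts.shape[1]).tolist()
    return [(tokens[r][c], acts[r, c].item()) for r, c in zip(rows, cols)]
```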
2) Output Logits
For output logits, if you have a dataset task (e.g. predicting stereotypical gender), you can remove each feature one at a time, and sort by greatest effect. This also extends to substituting features between two distributions and finding the smallest substitution to go from one to the other. For example,
"I'm Jane, and I'm a [female]"
"I'm Dave, and I'm a [male]"
Suppose that at the token "Jane", two features A & B activate ([1,1,0] over features A, B, C), and at "Dave", two features B & C activate ([0,1,1]). Then we can look for the smallest substitution between the two that makes the Jane sentence complete as " male". If A is the "female" feature, then ablating it (setting it to zero) should make the model assign equal probability to male/female. Adding the female feature to Dave's activations and subtracting the male direction should make the Dave sentence complete as " female".
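A toy sketch of that example, with all interfaces assumed for illustration: `sae.encode`/`sae.decode` map between the residual stream and dictionary features, and `model_tail` runs the remainder of the model from the edited activation to logits.
```python
import torch

def gender_logit_diff_after_ablation(model_tail, sae, resid, feature_idx,
                                     male_id, female_id):
    """Ablate one dictionary feature and measure the effect on " male" vs " female"."""
    f = sae.encode(resid)                        # (seq, n_features) feature activations
    f_ablated = f.clone()
    f_ablated[:, feature_idx] = 0.0              # e.g. zero out the "female" feature A
    logits = model_tail(sae.decode(f_ablated))   # (seq, vocab)
    last = logits[-1]                            # prediction at the final [female]/[male] slot
    return (last[male_id] - last[female_id]).item()
```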
3) Intermediate Features
Say we're looking at layer 5, feature 783, which activates ~10 for 20 datapoints on average. We c...