Some costs of superposition, published by Linda Linsefors on March 3, 2024 on The AI Alignment Forum.
I don't expect this post to contain anything novel. But from talking to others it seems like some of what I have to say in this post is not widely known, so it seemed worth writing.
In this post I'm defining superposition as: A representation with more features than neurons, achieved by encoding the features as almost orthogonal vectors in neuron space.
One reason to expect superposition in neural nets (NNs) is that for large $n$, $\mathbb{R}^n$ has many more than $n$ almost orthogonal directions. On the surface, this seems obviously useful for the NN to exploit. However, superposition is not magic. You don't actually get to put in more information; the gain you get from having more feature directions has to be paid for in some other way.
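As a quick sanity check of this claim, here is a minimal numerical sketch (my own, not from the post) that samples random unit vectors and measures how far from orthogonal they are; the values of n and m below are arbitrary choices.

```python
# Minimal sketch: random unit vectors in R^n are close to orthogonal,
# so R^n can hold far more than n almost-orthogonal feature directions.
import numpy as np

rng = np.random.default_rng(0)
n = 500    # dimension ("neurons")
m = 2000   # number of directions ("features"), 4x more than n

vecs = rng.standard_normal((m, n))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # normalize to unit length

cos = vecs @ vecs.T         # pairwise cosine similarities
np.fill_diagonal(cos, 0.0)  # ignore self-similarity

print("largest |cosine| between distinct directions:", round(float(np.abs(cos).max()), 3))
print("typical |cosine|:                            ", round(float(np.abs(cos).mean()), 3))
```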
All the math in this post is very hand-wavey. I expect it to be approximately correct, to one order of magnitude, but not precisely correct.
Sparsity
One cost of superposition is feature activation sparsity. I.e., even though you get to have many possible features, you only get to have a few of those features simultaneously active.
(I think the restriction of sparsity is widely known, I mainly include this section because I'll need the sparsity math for the next section.)
In this section we'll assume that each feature of interest is a boolean, i.e. it's either turned on or off. We'll investigate how much we can weaken this assumption in the next section.
If you have $m$ features represented by $n$ neurons, with $m > n$, then you can't have all the features represented by orthogonal vectors. This means that an activation of one feature will cause some noise in the activation of other features.
The typical noise on feature $f_1$ caused by 1 unit of activation from feature $f_2$, for any pair of features $f_1, f_2$, is (derived from the Johnson-Lindenstrauss lemma)

$$\epsilon = \sqrt{\frac{8\ln(m)}{n}}$$ [1]
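To get a concrete sense of scale (illustrative numbers of my own choosing, not from the post): with $n = 1000$ neurons and $m = 10^5$ features,

$$\epsilon = \sqrt{\frac{8\ln(10^5)}{1000}} \approx \sqrt{\frac{8 \cdot 11.5}{1000}} \approx 0.3$$

i.e. one unit of activation on one feature shows up as roughly 0.3 units of noise on a typical other feature.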
If $l$ features are active then the typical noise level on any other feature will be approximately $\epsilon\sqrt{l}$ units. This is because the individual noise terms add up like a random walk. Or see here for an alternative explanation of where the square root comes from.
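A small simulation can illustrate the random-walk behaviour (my own sketch; the value of $\epsilon$ is an arbitrary assumption): the interference terms from the $l$ active features come with roughly random signs, so the total noise grows like $\sqrt{l}$ rather than $l$.

```python
# Sketch: noise contributions with random signs add like a random walk,
# so the total noise from l active features scales as eps * sqrt(l).
import numpy as np

rng = np.random.default_rng(1)
eps = 0.05          # assumed typical interference per unit of activation
trials = 100_000    # number of independent noise samples per value of l

for l in [1, 4, 16, 64]:
    signs = rng.choice([-1.0, 1.0], size=(trials, l))  # random sign per active feature
    noise = (eps * signs).sum(axis=1)                   # total interference on one feature
    rms = np.sqrt(np.mean(noise ** 2))
    print(f"l = {l:3d}   RMS noise = {rms:.3f}   eps * sqrt(l) = {eps * np.sqrt(l):.3f}")
```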
For the signal to be stronger than the noise we need $\epsilon\sqrt{l} < 1$, and preferably $\epsilon\sqrt{l} \ll 1$.
This means that we can have at most $l \approx \frac{n}{8\ln(m)}$ simultaneously active features, and at most $m \approx e^{n/8}$ possible features.
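Continuing the illustrative numbers from above ($n = 1000$, $m = 10^5$, $\ln(m) \approx 11.5$):

$$l \approx \frac{1000}{8 \cdot 11.5} \approx 11, \qquad e^{n/8} = e^{125} \approx 10^{54}$$

so in this regime the binding constraint is the handful of simultaneously active features; the cap on the total number of possible features is astronomically far away.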
Boolean-ish
The other cost of superposition is that you lose expressive range for your activations, making them more like booleans than like floats.
In the previous section, we assumed boolean features, i.e. the feature is either on (1 unit of activation + noise) or off (0 units of activation + noise), where "one unit of activation" is some constant. Since the noise is proportional to the activation, it doesn't matter how large "one unit of activation" is, as long as it's consistent between features.
However, what if we want to allow for a range of activation values?
Let's say we have $n$ neurons, $m$ possible features, at most $l$ simultaneously active features, and an activation amplitude of at most $a$. Then we need to be able to deal with noise of the level

$$\text{noise} = a\epsilon\sqrt{l} = a\sqrt{\frac{8l\ln(m)}{n}}$$
The number of activation levels the neural net can distinguish between is at most the max amplitude divided by the noise.
$$\frac{a}{\text{noise}} = \sqrt{\frac{n}{8l\ln(m)}}$$
Any more fine-grained distinction will be overwhelmed by the noise.
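Plugging in the illustrative numbers once more (again my own choices: $m = 10^5$, $l = 10$):

$$\frac{a}{\text{noise}} = \sqrt{\frac{n}{8 \cdot 10 \cdot 11.5}} \approx \sqrt{\frac{n}{920}}$$

which is about $1$ for $n = 1000$ (essentially boolean) and only about $3$ for $n = 10{,}000$, so even a much larger network gets just a handful of distinguishable activation levels at this level of superposition.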
As we get closer to maxing out $l$ and $m$, the signal-to-noise ratio gets smaller, meaning we can distinguish between fewer and fewer activation levels, making the activations more and more boolean-ish.
This does not necessarily mean the network encodes values in discrete steps. Feature encodings should probably still be seen as inhabiting a continuous range, but with reduced range and precision (except in the limit of maximum superposition, when all feature values are boolean). This is similar to how floats ...