Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deep learning models might be secretly (almost) linear, published by beren on April 24, 2023 on LessWrong.
Crossposted from my personal blog.
Epistemic status: Pretty speculative, but there is a surprising amount of circumstantial evidence.
I have been increasingly thinking about NN representations and slowly coming to the conclusion that they are (almost) completely secretly linear inside. This means that, theoretically, if we can understand their directions, we can very easily exert powerful control over the internal representations, as well as compose and reason about them in a straightforward way. Finding the linear direction for a given representation would allow us to arbitrarily amplify or remove it and interpolate along it as desired. We could also then directly 'mix' it with other representations. Measuring these directions during inference would let us detect the degree to which the network assigns each feature to a given input. For instance, this might let us create internal 'lie detectors' (towards which there is already some progress) that can tell whether the model is telling the truth or being deceptive.
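As a rough illustration of what 'amplifying or removing' a linear feature direction could look like in practice, here is a minimal PyTorch sketch. It assumes you already have a candidate direction (e.g. from a probe or a difference of means) and access to a layer's hidden states; the function names and the hook usage at the end are hypothetical, not an existing library API.

```python
import torch

def amplify_direction(hidden, direction, factor):
    """Scale the component of `hidden` along `direction` by `factor`.
    factor > 1 amplifies the feature, factor = 0 erases it, factor < 0 flips it.
    hidden: (batch, seq, d_model) activations; direction: (d_model,) vector."""
    direction = direction / direction.norm()
    coeff = hidden @ direction                               # (batch, seq) feature strength
    return hidden + (factor - 1.0) * coeff.unsqueeze(-1) * direction

def add_direction(hidden, direction, alpha):
    """Additive steering: push every position `alpha` units along `direction`."""
    return hidden + alpha * direction / direction.norm()

# Hypothetical usage: intervene on layer 12 of a GPT-2-style model via a forward hook.
# handle = model.transformer.h[12].register_forward_hook(
#     lambda module, inputs, output: amplify_direction(output, lie_direction, 0.0))
```

The `coeff` term is also what a 'detector' would read off at inference time: it measures how strongly the feature is present at each position.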
While nothing is super definitive (and clearly networks are not 100% linear), I think there is a large amount of fairly compelling circumstantial evidence for this position:
All the work from way back when about interpolating through VAE/GAN latent spaces. For example, in the latent space of a VAE trained on CelebA there are natural 'directions' for recognizable semantic features like 'wearing sunglasses' and 'long hair', and moving linearly along these directions produces recognizable changes in the generated images (see the interpolation sketch after this list).
Rank-1 or low-rank editing techniques such as ROME work surprisingly well (not perfectly, but pretty well). These are effectively just emphasizing a linear direction in the weights.
You can apparently add and mix LoRAs and it works about how you would expect.
You can merge totally different models. People in the Stable Diffusion community literally merge model weights with a weighted sum, and it works (a minimal merging sketch follows this list)!
The logit lens works (see the logit-lens sketch after this list).
SVD directions: the singular vectors of transformer weight matrices turn out to be surprisingly interpretable.
Linear probing definitely gives you a fair amount of signal (see the probe sketch after this list).
Linear mode connectivity and Git Re-Basin: after accounting for permutation symmetries, separately trained networks can often be linearly interpolated in weight space without hitting a loss barrier.
Colin Burns' unsupervised linear probing method works even for semantic features like 'truth'.
You can merge together different models finetuned from the same initialization.
You can take a moving average over model checkpoints, and the averaged model often performs better than any individual checkpoint (the merging sketch after this list applies here too)!
The fact that linear methods work tolerably well in neuroscience.
Various naive diff-based editing and control techniques work at all.
You can learn general linear transformations between the representation spaces of different models and modalities.
You can often remove specific concepts from a model by erasing a linear subspace (see the erasure sketch after this list).
Task vectors (and, in general, finetune diffs being composable linearly; the merging sketch after this list includes task-vector arithmetic).
People keep finding linear representations inside neural networks when doing interpretability, or even just by accident.
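To make the latent-interpolation bullet concrete, here is a sketch of interpolation and attribute editing in a latent space. `decoder`, the latent codes, and `attribute_direction` are placeholders for whatever VAE/GAN you have on hand, not a specific library API.

```python
import torch

def interpolate_latents(decoder, z_a, z_b, steps=8):
    """Walk linearly between two latent codes and decode each point.
    `decoder` is any generator mapping latents to images (VAE decoder, GAN generator)."""
    ts = torch.linspace(0.0, 1.0, steps)
    return [decoder((1 - t) * z_a + t * z_b) for t in ts]

def move_along_attribute(decoder, z, attribute_direction, alphas=(-2, -1, 0, 1, 2)):
    """Move a single latent code along a semantic direction (e.g. a 'wearing
    sunglasses' direction found by diffing mean latents with/without the
    attribute) and decode the results."""
    return [decoder(z + a * attribute_direction) for a in alphas]
```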
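The weight-merging, checkpoint-averaging, and task-vector bullets all reduce to simple arithmetic on state dicts. A sketch of that arithmetic, assuming all models share the same architecture and parameter names (the function names are mine, not from the papers):

```python
import torch

def merge_state_dicts(state_dicts, weights):
    """Weighted sum of model parameters (the checkpoint-averaging /
    Stable-Diffusion-style merge recipe)."""
    assert len(state_dicts) == len(weights)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name].float() for sd, w in zip(state_dicts, weights))
    return merged

def task_vector(base_sd, finetuned_sd):
    """A task vector is just the parameter diff introduced by finetuning."""
    return {k: finetuned_sd[k].float() - base_sd[k].float() for k in base_sd}

def apply_task_vectors(base_sd, vectors, alphas):
    """Add scaled task vectors to a base model: compose tasks by adding
    several vectors, or 'forget' a task by using a negative alpha."""
    out = {k: v.float().clone() for k, v in base_sd.items()}
    for vec, a in zip(vectors, alphas):
        for k in out:
            out[k] += a * vec[k]
    return out
```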
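The logit lens is just 'decode an intermediate residual-stream state as if it were the final one'. A sketch assuming a Hugging Face GPT-2-style layout (`model.transformer.ln_f`, `model.lm_head`); other models name these modules differently.

```python
import torch

@torch.no_grad()
def logit_lens(model, hidden_states, top_k=5):
    """Project an intermediate residual-stream state through the model's final
    layer norm and unembedding matrix, as if it were the last layer's output."""
    logits = model.lm_head(model.transformer.ln_f(hidden_states))
    return logits.topk(top_k, dim=-1).indices   # token ids the layer 'currently predicts'
```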
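Linear probing, in its simplest supervised form, looks like the sketch below (note that Colin Burns' method mentioned above is unsupervised and works differently). Activations are assumed to be cached as a numpy array from some layer of interest.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_linear_probe(activations, labels):
    """Fit a linear probe on cached hidden activations.
    activations: (n_examples, d_model) array from some layer;
    labels: (n_examples,) binary feature labels (e.g. statement is true/false).
    The probe's weight vector is itself a candidate 'feature direction'."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(activations, labels)
    direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
    return probe, direction
```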
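And concept erasure via a linear subspace is just projection onto the orthogonal complement; the sketch below assumes you already have one or more direction vectors spanning the concept.

```python
import numpy as np

def erase_subspace(activations, concept_vectors):
    """Remove a concept by projecting activations onto the orthogonal complement
    of the subspace spanned by `concept_vectors`.
    activations: (n, d); concept_vectors: (k, d) directions encoding the concept."""
    Q, _ = np.linalg.qr(np.asarray(concept_vectors).T)   # (d, k) orthonormal basis
    projection = Q @ Q.T                                  # (d, d) projector onto the subspace
    return activations - activations @ projection
```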
If this is true, then we should be able to achieve quite a high level of control and understanding of NNs solely by straightforward linear methods and interventions. This would mean that deep networks might end up being pretty understandable and controllable artefacts in the near future. At the moment, we simply have not found the right levers (or rather, lots of existing work does show this, but it hasn't really been normalized or applied at scale for alignment). Linear-ish network representations are a best-case scenario for both interpretability and control.
For a mechanistic, circuits-level understanding, there is still the problem of superposition of the linear representations. However, if the representations are indeed mostly linear, then once superposition is solved there seem to be few other obstacles in front of a com...