TL;DR: In this post, I distinguish between two related concepts in neural network interpretability: polysemanticity and superposition. Neuron polysemanticity is the observed phenomenon that many neurons seem to fire (have large, positive activations) on multiple unrelated concepts. Superposition is a specific explanation for neuron (or attention head) polysemanticity: the neural network represents more sparse features than it has neurons (or than the number/dimension of its attention heads) by assigning features to near-orthogonal directions. I provide three ways neurons or attention heads can be polysemantic without superposition: non-neuron-aligned orthogonal features, non-linear feature representations, and compositional representation without features. I conclude by listing a few reasons why it might be important to distinguish the two concepts.
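(A minimal illustrative sketch of the superposition picture described above, not code from the original post: it embeds more sparse features than there are neurons using random near-orthogonal directions, and shows that active features can be read back out with only small interference. All variable names and the choice of NumPy are my own assumptions.)

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons = 100, 20          # more features than neurons

# Random unit vectors in a 20-dimensional space are nearly orthogonal,
# so each feature gets its own (approximate) direction.
W = rng.normal(size=(n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only a few features are active at once.
x = np.zeros(n_features)
x[rng.choice(n_features, size=3, replace=False)] = 1.0

# Encode into neuron activations, then read each feature back out
# by projecting the activations onto that feature's direction.
activations = x @ W                      # shape: (n_neurons,)
readout = activations @ W.T              # shape: (n_features,)

# Active features are recovered with values near 1; inactive features pick up
# small "interference" terms because the directions are only near-orthogonal.
print("active features:", np.flatnonzero(x))
print("readout at active features:", readout[x > 0].round(2))
print("max interference on inactive features:", np.abs(readout[x == 0]).max().round(2))
```

Because the interference is small when features are sparse, individual neurons (the coordinates of `activations`) end up responding to many unrelated features, which is the sense in which superposition predicts polysemantic neurons.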
Epistemic status: I wrote this “quickly” in about 10 hours, as otherwise it wouldn’t have come out at all. Think of it as a (failed) experiment in writing [...]
---
Outline:
- A brief review of polysemanticity and superposition
  - Neuron polysemanticity
  - Superposition
- Polysemanticity without superposition
  - Example 1: non-neuron-aligned orthogonal features
  - Example 2: non-linear feature representations
  - Example 3: compositional representation without “features”
- Conclusion: why does this distinction matter?
  - Our current model of superposition may not fully explain neuron polysemanticity, so we should keep other hypotheses in mind
  - Attempts to “solve superposition” may actually only be solving easier cases of polysemanticity
  - Clear definitions are important for clear communication and rigorous science
- Acknowledgements
---