TL;DR: In this post, I distinguish between two related concepts in neural network interpretability: polysemanticity and superposition. Neuron polysemanticity is the observed phenomenon that many neurons seem to fire (have large, positive activations) on multiple unrelated concepts. Superposition is a specific explanation for neuron (or attention head) polysemanticity, in which a neural network represents more sparse features than it has neurons (or number/dimension of attention heads) by using near-orthogonal directions. I provide three ways neurons/attention heads can be polysemantic without superposition: non-neuron-aligned orthogonal features, non-linear feature representations, and compositional representation without features. I conclude by listing a few reasons why it might be important to distinguish the two concepts.
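The superposition hypothesis described above can be sketched numerically: store more sparse features than there are neurons by assigning each feature a near-orthogonal direction, then read features back out linearly, accepting small interference. This is a minimal illustrative sketch (not code from the original post); all names and the choice of random unit vectors are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_dims = 8, 3  # more features than "neurons" (dimensions)

# Random unit vectors in R^3 act as near-orthogonal feature directions.
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only feature 2 is active.
x = np.zeros(n_features)
x[2] = 1.0

h = x @ W        # 3-d "neuron" activations representing 8 features
x_hat = h @ W.T  # linear readout of all 8 features

# The active feature has the largest readout; the nonzero readouts on
# the other features are interference from the overlapping directions.
print(np.argmax(x_hat))  # → 2
```

Because each coordinate of `h` receives a contribution from every feature direction, any single "neuron" activates for many unrelated features, which is exactly the polysemanticity that superposition predicts.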
Epistemic status: I wrote this “quickly” in about 10 hours, as otherwise it wouldn’t have come out at all. Think of it as a (failed) experiment in writing [...]
---
Outline:
(04:23) A brief review of polysemanticity and superposition
(04:28) Neuron polysemanticity
(08:01) Superposition
(12:17) Polysemanticity without superposition
(12:32) Example 1: non–neuron aligned orthogonal features
(17:25) Example 2: non-linear feature representations
(19:01) Example 3: compositional representation without “features”
(20:35) Conclusion: why does this distinction matter?
(21:38) Our current model of superposition may not fully explain neuron polysemanticity, so we should keep other hypotheses in mind
(23:53) Attempts to “solve superposition” may actually only be solving easier cases of polysemanticity
(25:02) Clear definitions are important for clear communication and rigorous science
(25:47) Acknowledgements
The original text contained 24 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.