
Diego Caples ([email protected])
Rob Neuhaus ([email protected])
Introduction
In principle, neuron activations in the residual stream of a transformer-based language model should all be on roughly the same scale. In practice, however, the dimensions vary unexpectedly widely in scale. Mathematical theories of the transformer architecture do not predict this: they expect rotational equivariance within a model, where no dimension is more important than any other. Is there something wrong with our reasonably informed intuitions about how transformers work? What explains these outlier channels?
Previously, Anthropic investigated these privileged basis dimensions (dimensions that are more important or larger than expected) and ruled out several candidate causes. By elimination, they arrived at the hypothesis that per-channel normalization in the Adam optimizer causes the privileged basis. However, they did not prove this was the case.
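To make the rotational-equivariance point concrete, here is a minimal NumPy sketch (not the authors' code; the function names and hyperparameters are illustrative) contrasting a plain SGD update with an Adam update:

```python
import numpy as np

def sgd_step(w, grad, lr=1e-2):
    # Plain SGD: the update is a scalar multiple of the gradient, so an
    # orthogonal change of basis rotates the update the same way
    # (rotation-equivariant; no direction is treated specially).
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: each coordinate is rescaled by its own running second moment.
    # The elementwise division by sqrt(v_hat) does not commute with
    # rotations, so the coordinate axes (channels) become special.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)   # bias correction; t starts at 1
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Under an orthogonal change of basis, the SGD update rotates along with the gradient, while Adam's per-coordinate normalization depends on the chosen axes; this is the sense in which per-channel normalization could single out particular residual-stream channels.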
We conclusively show that Adam causes outlier channels / privileged basis within the transformer residual stream. When replacing the Adam [...]
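The outline below mentions kurtosis, which is a natural way to quantify outlier channels: in a non-privileged, rotation-equivariant basis the activations should look roughly Gaussian across channels (excess kurtosis near zero), while a few dominant channels push it far above zero. A hedged sketch of that measurement (the function name and exact aggregation are assumptions, not the article's code):

```python
import numpy as np

def residual_stream_excess_kurtosis(acts):
    """Mean excess kurtosis of residual-stream activations across channels.

    acts: array of shape (n_tokens, d_model), e.g. activations collected at
    one layer of the residual stream. Values near 0 suggest no privileged
    basis; large positive values indicate heavy-tailed outlier channels.
    """
    centered = acts - acts.mean(axis=-1, keepdims=True)
    var = centered.var(axis=-1)
    fourth = (centered ** 4).mean(axis=-1)
    return float((fourth / (var ** 2 + 1e-12) - 3.0).mean())
```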
---
Outline:
(00:17) Introduction
(02:06) Background
(02:09) Recommended Reading
(02:20) More About Anthropic's Work
(02:54) Adam vs SGD, and Rotational Equivariance
(03:52) Kurtosis
(04:43) TinyStories
(05:14) Experiments
(05:17) Replicating Outlier Channels at Small Scale
(06:03) Training an LM with SGD
(07:09) Conclusions
(08:37) Future Research
The original text contained 2 images which were described by AI.
---
First published:
Source:
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.