
Diego Caples ([email protected])
Rob Neuhaus ([email protected])
Introduction
In principle, neuron activations in the residual stream of a transformer-based language model should all be on roughly the same scale. In practice, however, the dimensions vary unexpectedly widely in scale. Mathematical theories of the transformer architecture do not predict this: they expect rotational equivariance within a model, where no dimension is more important than any other. Is there something wrong with our reasonably informed intuitions about how transformers work? What explains these outlier channels?
Previously, Anthropic investigated these privileged basis dimensions (dimensions that are more important or larger than expected) and ruled out several candidate causes. By elimination, they arrived at the hypothesis that per-channel normalization in the Adam optimizer causes the privileged basis. However, they did not prove this was the case.
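To make the rotational-equivariance point concrete, here is a minimal NumPy sketch (not the authors' code; the function names and hyperparameters are illustrative) contrasting a plain SGD update with an Adam update:

```python
import numpy as np

def sgd_step(w, grad, lr=1e-2):
    # Plain SGD: the update is a scalar multiple of the gradient, so an
    # orthogonal change of basis rotates the update the same way
    # (rotation-equivariant; no direction is treated specially).
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: each coordinate is rescaled by its own running second moment.
    # The elementwise division by sqrt(v_hat) does not commute with
    # rotations, so the coordinate axes (channels) become special.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)   # bias correction; t starts at 1
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Under an orthogonal change of basis, the SGD update rotates along with the gradient, while Adam's per-coordinate normalization depends on the chosen axes; this is the sense in which per-channel normalization could single out particular residual-stream channels.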
We conclusively show that Adam causes outlier channels / privileged basis within the transformer residual stream. When replacing the Adam [...]
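The outline below mentions kurtosis, which is a natural way to quantify outlier channels: in a non-privileged, rotation-equivariant basis the activations should look roughly Gaussian across channels (excess kurtosis near zero), while a few dominant channels push it far above zero. A hedged sketch of that measurement (the function name and exact aggregation are assumptions, not the article's code):

```python
import numpy as np

def residual_stream_excess_kurtosis(acts):
    """Mean excess kurtosis of residual-stream activations across channels.

    acts: array of shape (n_tokens, d_model), e.g. activations collected at
    one layer of the residual stream. Values near 0 suggest no privileged
    basis; large positive values indicate heavy-tailed outlier channels.
    """
    centered = acts - acts.mean(axis=-1, keepdims=True)
    var = centered.var(axis=-1)
    fourth = (centered ** 4).mean(axis=-1)
    return float((fourth / (var ** 2 + 1e-12) - 3.0).mean())
```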
---
Outline:
(00:17) Introduction
(02:06) Background
(02:09) Recommended Reading
(02:20) More About Anthropic's Work
(02:54) Adam vs SGD, and Rotational Equivariance
(03:52) Kurtosis
(04:43) TinyStories
(05:14) Experiments
(05:17) Replicating Outlier Channels at Small Scale
(06:03) Training an LM with SGD
(07:09) Conclusions
(08:37) Future Research
The original text contained 2 images which were described by AI.
---
First published:
Source:
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.