LessWrong (30+ Karma)

“The slingshot helps with learning” by Wilson Wu


Audio note: this article contains 76 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

The slingshot effect is a late-stage training anomaly found in various adaptive gradient optimization methods. In particular, slingshots are present with AdamW, the optimizer most widely used for modern transformer training. The original slingshot paper observes that slingshots tend to occur alongside grokking, a phenomenon in which neural networks trained on algorithmic tasks generalize to the test set long after perfectly fitting the training set. In this post we take a closer look at slingshots and their effect on generalization in the setting of 1-hidden-layer MLPs trained on _k_-sparse parity, a specific algorithmic task. The main results are 1) an explanation of why slingshots occur in models trained with hinge loss that partially transfers to models trained with [...]
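
For readers unfamiliar with the task named above: in _k_-sparse parity, each input is a string of n bits, and its label is the parity of k fixed coordinates that the learner is not told. The following is a minimal Python sketch of such a dataset together with a hinge loss. It is not the post's own code: the function names, the ±1 encoding, the uniform sampling, and the margin of 1 are all assumptions made for illustration.

```python
import random
from math import prod


def make_sparse_parity(n_samples, n_bits, k, seed=0):
    """Sample a k-sparse parity dataset with inputs in {-1, +1}^n_bits.

    Hypothetical helper: the label is the product of the k coordinates in
    `support`, which equals the parity of those bits under a +/-1 encoding.
    """
    rng = random.Random(seed)
    # The k relevant coordinates, fixed once for the whole dataset.
    support = sorted(rng.sample(range(n_bits), k))
    xs = [[rng.choice((-1, 1)) for _ in range(n_bits)] for _ in range(n_samples)]
    ys = [prod(x[i] for i in support) for x in xs]
    return xs, ys, support


def hinge_loss(scores, labels):
    """Mean hinge loss max(0, 1 - y * f(x)); the margin of 1 is an assumption."""
    return sum(max(0.0, 1.0 - y * s) for s, y in zip(scores, labels)) / len(labels)


if __name__ == "__main__":
    xs, ys, support = make_sparse_parity(n_samples=8, n_bits=10, k=3)
    print("relevant coordinates:", support)
    print("first input:", xs[0], "label:", ys[0])
```

With this form of the hinge loss, any example classified with margin at least 1 contributes exactly zero loss, and so zero gradient.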

---

Outline:

(01:09) 1 Introduction

(01:13) 1.1 The _k_-sparse parity task

(02:39) 1.2 The slingshot effect

(03:19) 2 Why do slingshots happen in the _k_-sparse parity setting?

(03:48) 2.1 Hinge loss

(08:41) 2.2 Cross-entropy loss

(10:07) 2.3 Resetting AdamW mitigates slingshots

(11:20) 3 Slingshots and generalization

(11:24) 3.1 Slingshots tend to decrease test loss

(11:57) 3.2 Slingshots vs random jumps

(13:59) 3.3 Slingshots and LLC

(15:48) 4 Discussion

The original text contained 7 footnotes which were omitted from this narration.

---

First published:

October 31st, 2024

Source:

https://www.lesswrong.com/posts/LncYobrn3vRr7qkZW/the-slingshot-helps-with-learning

---

Narrated by TYPE III AUDIO.

