
Audio note: this article contains 76 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
The slingshot effect is a late-stage training anomaly found in various adaptive gradient optimization methods. In particular, slingshots are present with AdamW, the optimizer most widely used for modern transformer training. The original slingshot paper observes that slingshots tend to occur alongside grokking, a phenomenon in which neural networks trained on algorithmic tasks generalize to the test set long after perfectly fitting the training set. In this post we take a closer look at slingshots and their effect on generalization in the setting of 1-hidden-layer MLPs trained on _k_-sparse parity, a specific algorithmic task. The main results are 1) an explanation of why slingshots occur in models trained with hinge loss that partially transfers to models trained with [...]
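For readers unfamiliar with the task, here is a minimal sketch of how k-sparse parity data can be generated. The conventions (±1 bit encoding, the first k coordinates taken as the relevant subset) are common choices and assumptions here, not details taken from the article:

```python
# Minimal sketch of the k-sparse parity task: each input is a vector of
# n random +/-1 bits, and the label is the parity (product) of k fixed
# "relevant" coordinates. The relevant subset is hidden from the model.
import numpy as np

def make_ksparse_parity(n_samples, n_bits=30, k=3, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(n_samples, n_bits))
    # Assumed convention: the first k coordinates are the relevant ones.
    y = np.prod(X[:, :k], axis=1)  # label is +1 or -1
    return X, y

X, y = make_ksparse_parity(8, n_bits=10, k=3)
```

A 1-hidden-layer MLP trained on such data must discover which k of the n coordinates matter, which is what makes the task a useful testbed for studying delayed generalization.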
---
Outline:
(01:09) 1 Introduction
(01:13) 1.1 The _k_\-sparse parity task
(02:39) 1.2 The slingshot effect
(03:19) 2 Why do slingshots happen in the _k_\-sparse parity setting?
(03:48) 2.1 Hinge loss
(08:41) 2.2 Cross-entropy loss
(10:07) 2.3 Resetting AdamW mitigates slingshots
(11:20) 3 Slingshots and generalization
(11:24) 3.1 Slingshots tend to decrease test loss
(11:57) 3.2 Slingshots vs random jumps
(13:59) 3.3 Slingshots and LLC
(15:48) 4 Discussion
The original text contained 7 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.