
Audio note: this article contains 136 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Introduction
It's now well known that simple neural network models often "grok" algorithmic tasks. That is, when trained for many epochs on a subset of the full input space, the model quickly attains perfect train accuracy and then, much later, near-perfect test accuracy. In the former phase, the model memorizes the training set; in the latter, it generalizes out-of-distribution to the test set.
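A minimal sketch of the standard grokking setup described above: train a small network on a fixed fraction of all (a, b) → (a + b) mod p pairs and watch train accuracy saturate long before test accuracy does. The task, architecture, and hyperparameters here are illustrative assumptions, not the ones from the original post.

```python
# Hedged sketch of a generic grokking experiment (modular addition, small MLP).
# All choices below (p, split fraction, width, weight decay, epochs) are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 97  # modulus for the toy algorithmic task
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p

# Train on a random 40% subset of the full input space; hold out the rest.
perm = torch.randperm(len(pairs))
n_train = int(0.4 * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

def one_hot(batch):
    # Concatenate one-hot encodings of the two operands.
    return torch.cat([nn.functional.one_hot(batch[:, 0], p),
                      nn.functional.one_hot(batch[:, 1], p)], dim=1).float()

model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        return (model(one_hot(pairs[idx])).argmax(-1) == labels[idx]).float().mean().item()

for epoch in range(20000):  # grokking typically needs many epochs past perfect train accuracy
    opt.zero_grad()
    loss = loss_fn(model(one_hot(pairs[train_idx])), labels[train_idx])
    loss.backward()
    opt.step()
    if epoch % 1000 == 0:
        print(f"epoch {epoch:6d}  train acc {accuracy(train_idx):.3f}  "
              f"test acc {accuracy(test_idx):.3f}")
```

In runs like this, train accuracy typically reaches 1.0 early while test accuracy stays near chance for a long stretch before jumping, which is the memorize-then-generalize pattern the paragraph above describes.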
In the algorithmic grokking literature, there is typically exactly one natural generalization from the training set to the test set. What if, however, the training set were instead under-specified in such a way that there were multiple possible generalizations? Would the model grok at all? If so, which of the generalizing solutions would it choose? [...]
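One way to make the training set under-specified in the sense described above is to train only on input pairs where two different candidate operations agree. The actual groups used in the original post are not given in this excerpt; the sketch below uses addition mod 16 (the cyclic group Z_16) and bitwise XOR (the group (Z_2)^4) on the same label set 0..15 purely as an illustrative assumption.

```python
# Hedged sketch: build an ambiguous training set consistent with two groups,
# plus a set of "probe" pairs on which the two candidate generalizations disagree.
import itertools

n = 16
op_cyclic = lambda a, b: (a + b) % n   # Z_16
op_xor    = lambda a, b: a ^ b         # (Z_2)^4

train_set, probe_set = [], []
for a, b in itertools.product(range(n), repeat=2):
    if op_cyclic(a, b) == op_xor(a, b):
        # Consistent with both groups: safe to use for training.
        train_set.append((a, b, op_cyclic(a, b)))
    else:
        # Disambiguating pair: the two groups give different answers.
        probe_set.append((a, b, op_cyclic(a, b), op_xor(a, b)))

print(f"{len(train_set)} ambiguous training pairs, {len(probe_set)} disambiguating probe pairs")
```

Comparing a trained model's predictions on the probe pairs against each group's answers would show whether it grokked one operation, the other, only their intersection, or neither, which is the kind of question the outline's "Ambiguous grokking" sections address.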
---
Outline:
(00:20) Introduction
(01:21) Setup
(03:45) Experiments
(03:48) Ambiguous grokking
(03:52) Grokking either group
(04:54) Grokking the intersect
(05:48) Grokking only one group
(06:41) No grokking
(07:13) Measuring complexity
(07:16) Complexity of the grokked solution
(11:53) Complexity over time
(13:57) Determination and differentiation
(14:01) Perturbation sensitivity
(15:47) Total variation
(17:14) Determination across distribution shift
(18:32) Training Jacobian
(21:03) Discussion
The original text contained 9 footnotes which were omitted from this narration.
---
First published:
Source:
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.