Audio note: this article contains 136 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Introduction
It's now well known that simple neural network models often "grok" algorithmic tasks. That is, when trained for many epochs on a subset of the full input space, the model quickly attains perfect train accuracy and then, much later, near-perfect test accuracy. In the former phase, the model memorizes the training set; in the latter, it generalizes to the held-out test set.
In the algorithmic grokking literature, there is typically exactly one natural generalization from the training set to the test set. What if, however, the training set were instead under-specified in such a way that there were multiple possible generalizations? Would the model grok at all? If so, which of the generalizing solutions would it choose? [...]
---
Outline:
(00:20) Introduction
(01:21) Setup
(03:45) Experiments
(03:48) Ambiguous grokking
(03:52) Grokking either group
(04:54) Grokking the intersect
(05:48) Grokking only one group
(06:41) No grokking
(07:13) Measuring complexity
(07:16) Complexity of the grokked solution
(11:53) Complexity over time
(13:57) Determination and differentiation
(14:01) Perturbation sensitivity
(15:47) Total variation
(17:14) Determination across distribution shift
(18:32) Training Jacobian
(21:03) Discussion
The original text contained 9 footnotes which were omitted from this narration.
---