tl;dr: We compute the evolution of the local learning coefficient (LLC), a proxy for model complexity, for an algorithmic transformer. The LLC decreases as the model learns more structured solutions, such as head specialization.
This post is structured in three main parts: (1) a Summary, giving an overview of the main results; (2) the Fine Print, which delves into various cross-checks and details; and (3) Discussion and Conclusions.
Structure Formation in Algorithmic Transformers
In this work we study the development of simple algorithmic transformers, i.e. transformers that learn to perform algorithmic tasks. A major advantage of this setup is that we can control several (hyper)parameters, such as the complexity of the training data and the network architecture. This allows us to run targeted experiments studying the impact of these parameters on the learning dynamics. The main tool we use to study this development is the Local Learning Coefficient [...]
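For readers unfamiliar with the LLC: the excerpt above does not spell out the estimator, but the standard form used in the developmental interpretability literature is

$$\hat{\lambda}(w^*) = n\beta\left(\mathbb{E}^{\beta}_{w\mid w^*}\!\left[L_n(w)\right] - L_n(w^*)\right), \qquad \beta = \frac{1}{\log n},$$

where $L_n$ is the empirical loss over $n$ training samples, $w^*$ is the checkpoint being probed, and the expectation is over a posterior tempered at $\beta$ and localized around $w^*$, usually approximated with SGLD sampling. A lower $\hat{\lambda}$ indicates a more degenerate, and in that sense simpler, solution.

The sketch below illustrates how such an SGLD-based estimate can be computed for a PyTorch model. It is a minimal illustration under assumed conventions (a supervised `loss_fn` returning the mean batch loss, a map-style `data_loader`), not the authors' actual implementation, whose sampler and hyperparameters are not shown in this excerpt.

```python
import math
import copy
import torch

def estimate_llc(model, loss_fn, data_loader, n_samples,
                 num_steps=500, step_size=1e-5, localization=100.0,
                 device="cpu"):
    """Rough SGLD-based estimate of the local learning coefficient.

    Samples from a posterior tempered at beta = 1/log(n) and localized
    around the current parameters w*, then returns
        n * beta * (E[L_n(w)] - L_n(w*)).
    """
    beta = 1.0 / math.log(n_samples)

    # Loss at the centre w* (the trained checkpoint being probed).
    model.eval()
    with torch.no_grad():
        center_loss = sum(loss_fn(model(x.to(device)), y.to(device)).item()
                          for x, y in data_loader) / len(data_loader)

    # Run the sampler on a copy so the original checkpoint is untouched.
    sampler = copy.deepcopy(model).to(device)
    center = [p.detach().clone() for p in sampler.parameters()]

    losses = []
    data_iter = iter(data_loader)
    for _ in range(num_steps):
        try:
            x, y = next(data_iter)
        except StopIteration:
            data_iter = iter(data_loader)
            x, y = next(data_iter)
        x, y = x.to(device), y.to(device)

        loss = loss_fn(sampler(x), y)
        sampler.zero_grad()
        loss.backward()

        with torch.no_grad():
            for p, p0 in zip(sampler.parameters(), center):
                # SGLD step: tempered log-likelihood gradient, a pull
                # back toward w*, and Gaussian noise of variance step_size.
                drift = n_samples * beta * p.grad + localization * (p - p0)
                p.add_(-0.5 * step_size * drift)
                p.add_(torch.randn_like(p) * math.sqrt(step_size))

        losses.append(loss.item())

    # Discard the first half of the chain as burn-in.
    tail = losses[num_steps // 2:]
    expected_loss = sum(tail) / len(tail)
    return n_samples * beta * (expected_loss - center_loss)
```

Tracking this quantity across training checkpoints is what gives the "evolution of the LLC" referred to in the tl;dr; the specific sampler settings matter in practice and are one of the cross-checks discussed in the Fine Print.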
---
Outline:
(00:37) Structure Formation in Algorithmic Transformers
(02:17) 1-Head Model
(04:01) 2-Head Model
(06:03) The Fine Print
(06:30) 2-Head Model: Re-loaded
(06:35) Removing Layer Norm
(08:05) Removing Weight Decay
(09:33) Removing both Layer Norm and Weight Decay
(11:02) Adding Noise
(12:26) Number of vocabulary regions
(13:52) Adding More Heads
(16:14) Other algorithmic transformers
(16:41) Related work
(17:21) Discussion and Conclusions
The original text contained 12 footnotes which were omitted from this narration.
---