The Nonlinear Library

LW - Basic facts about language models during training by beren



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Basic facts about language models during training, published by beren on February 21, 2023 on LessWrong.
This post builds upon our last post on basic facts about language model internals and was written as part of the work done at Conjecture. We will shortly release all plots and animations (only a very small subset made it into this post) as well as the code at this repository.
We are aware of some inconsistencies in the Pythia model suite, due to different configs for different model sizes affecting the learning rate schedule. As far as we know, the team at EleutherAI is currently re-running the models. After thinking about the issue, we do not believe it is likely to be fatal to the macroscale points made in this post, and so we post the results here provisionally, using the original models. We plan to update this analysis when the new model suite is finished. Until then, take some of the results reported here with a grain of salt, as they may be subject to change.
In this post, we continue the work of our last post on language model internals, but this time we analyze the same phenomena as they occur during training. This is extremely important for understanding how language model training works at a macro-scale. It sheds light on new behaviours and potentially important phase transitions during training that deserve further study, and gives insight into the origin of phenomena that we consistently observe in fully trained models.
Throughout, as in the previous post, we do not delve into the details of specific circuits, but instead aim to provide a holistic macro-level view of the basic distributional properties of the LLM’s weights, activations, and gradients across training checkpoints. Although seemingly basic, we are not aware of any similar analysis having been performed publicly. Understanding these distributional phenomena seems generally important for constraining circuit-level theorizing, and provides empirical links to theoretical constructs, such as the neural tangent kernel and tensor programs, that can prove facts about specific limits.
To perform our analysis, we use the open-source Pythia model suite, trained by EleutherAI, which stores a large number of checkpoints during training and aims to support interpretability analysis of how representations develop across training. We agree with this goal and are happy to share our own analysis code etc. The Pythia project trains models of different sizes on exactly the same data, in exactly the same order, so as to understand how and when certain representations form, both during training and across different model scales. The Pythia models we utilize range from 19M parameters to 1.3B. Each Pythia model has 142 checkpoints of stored weights, equally spaced every 1000 steps, which we sweep across to perform our analysis.
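As a rough illustration of what such a checkpoint sweep might look like, the sketch below iterates over equally spaced checkpoint revisions and collects simple per-tensor weight statistics. The model name and the `stepN` revision-tag scheme follow the public Pythia releases on the Hugging Face hub, but treat them as assumptions here; this is not the post's actual analysis code.

```python
def checkpoint_revisions(n_checkpoints=142, step_size=1000):
    """Revision tags for the equally spaced training checkpoints
    (142 checkpoints, one every 1000 steps, as described above)."""
    return [f"step{i * step_size}" for i in range(1, n_checkpoints + 1)]

def sweep_weight_stats(model_name="EleutherAI/pythia-70m"):
    """Load each checkpoint from the Hugging Face hub and collect
    per-tensor weight statistics (mean and standard deviation)."""
    # Imported lazily so the revision helper above runs without transformers.
    from transformers import AutoModelForCausalLM

    stats = []
    for rev in checkpoint_revisions():
        model = AutoModelForCausalLM.from_pretrained(model_name, revision=rev)
        for name, param in model.named_parameters():
            w = param.detach().flatten()
            stats.append((rev, name, w.mean().item(), w.std().item()))
    return stats
```

Sweeping like this keeps one checkpoint in memory at a time, which matters once the per-model checkpoint count runs into the hundreds.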
Weights show a rapid phase transition from Gaussian to extreme heavy tails
It was very helpfully pointed out in a comment on our previous post that the weight statistics were actually sharper and more heavy-tailed than Gaussian. This is correct, and we found the same when we fit histograms to logistic vs. Gaussian distributions. Overall, we find that the weight distributions of GPT2 models are generally not Gaussian but lie somewhere between the logistic (density falling off like e^(-x)) and the Gaussian (like e^(-x^2)), which indicates both heavier tails and a thinner bulk. This is extremely interesting, since it means that the weight statistics must move away from their Gaussian initialization, which implies a highly significant perturbation away from their original position. This is perhaps in contrast with some theories, such as NTK theory, which argue that for large models we should not expect the weights to diverge too...
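One simple way to check such a claim on a given weight tensor is to fit both candidate distributions by maximum likelihood and compare their log-likelihoods on the same data. The snippet below is a minimal sketch of that comparison using scipy; it is our illustration of the idea, not necessarily the fitting procedure used for the post.

```python
import numpy as np
from scipy import stats

def compare_fits(weights):
    """Fit Gaussian and logistic distributions by maximum likelihood
    and return their total log-likelihoods on the same data."""
    w = np.asarray(weights).ravel()
    mu, sigma = stats.norm.fit(w)
    loc, scale = stats.logistic.fit(w)
    ll_norm = stats.norm.logpdf(w, mu, sigma).sum()
    ll_logistic = stats.logistic.logpdf(w, loc, scale).sum()
    return ll_norm, ll_logistic

# Sanity check on synthetic heavy-tailed data: for a logistic sample,
# the logistic fit should achieve higher log-likelihood than the Gaussian.
rng = np.random.default_rng(0)
sample = rng.logistic(loc=0.0, scale=1.0, size=10_000)
ll_n, ll_l = compare_fits(sample)
```

The same comparison applied to the flattened weights of each checkpoint would show where along the logistic-to-Gaussian spectrum the fitted distribution falls over training.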

The Nonlinear Library, by The Nonlinear Fund
4.6 (8 ratings)