The Nonlinear Library

LW - Residual stream norms grow exponentially over the forward pass by StefanHex



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Residual stream norms grow exponentially over the forward pass, published by StefanHex on May 7, 2023 on LessWrong.
Summary: For a range of language models and a range of input prompts, the norm of each residual stream grows exponentially over the forward pass, with average per-layer growth rate of about 1.045 in GPT2-XL. We show a bunch of evidence for this. We discuss to what extent different weights and parts of the network are responsible.
We find that some model weights increase exponentially as a function of layer number. We finally note our current favored explanation: Due to LayerNorm, it's hard to cancel out existing residual stream features, but easy to overshadow existing features by just making new features 4.5% larger.
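To put the 4.5% figure in perspective, a per-layer growth factor of 1.045 compounds quickly over GPT2-XL's 48 blocks (a quick arithmetic check, not from the post's Colab):

```python
# Compound per-layer growth: a 4.5% increase per layer adds up fast.
growth_per_layer = 1.045
n_layers = 48  # GPT2-XL has 48 transformer blocks

total_growth = growth_per_layer ** n_layers
print(f"norm growth over {n_layers} layers: ~{total_growth:.1f}x")
```

So a feature written near the start of the network is overshadowed roughly eightfold by the end, even though each individual layer's contribution is modest.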
Thanks to Aryan Bhatt, Marius Hobbhahn, Neel Nanda, and Nicky Pochinkov for discussion.
Plots showing exponential norm and variance growth
Our results are reproducible in this Colab.
Alex noticed exponential growth in the contents of GPT2-XL's residual streams. He ran dozens of prompts through the model, plotted the distribution of residual stream norms at each layer as a histogram, and found exponential growth in the L2 norm of the residual streams:
Here's the norm of each residual stream for a specific prompt:
Stefan had previously noticed this phenomenon in GPT2-small, back in MATS 3.0:
Basic Facts about Language Model Internals also finds growth in the norms of the attention output matrices W_O and the MLP output matrices W_out ("writing weights"), while the norms of W_Q, W_K, and W_in ("reading weights") remain stable:
Comparison of various transformer models
We started our investigation by computing these residual stream norms for a variety of models, recovering Stefan's results (rescaled by √d_model = √768) and Alex's earlier numbers. We see a number of straight lines in these logarithmic plots, which indicate phases of exponential growth.
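The √d_model rescaling reflects a basic fact: a d-dimensional vector whose entries have unit variance has L2 norm concentrated around √d, so √768 ≈ 27.7 is the natural unit for GPT2-small's residual stream. A minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 768  # GPT2-small's residual stream width

# Draw many vectors with i.i.d. standard-normal entries (unit variance):
# their L2 norms concentrate tightly around sqrt(d_model) ~ 27.7.
xs = rng.standard_normal((1000, d_model))
norms = np.linalg.norm(xs, axis=1)
print(norms.mean(), np.sqrt(d_model))  # both ~27.7
```

Dividing measured norms by √d_model thus puts models of different widths on a comparable scale.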
We are surprised by the decrease in residual stream norm in some of the EleutherAI models. We would have expected that, because the transformer blocks can only access the normalized activations, it would be hard for the model to "cancel out" a direction in the residual stream, so the norm would always grow. However, this isn't what we see above. One explanation is that the model learns to memorize or predict the LayerNorm scale. If the model does this well enough, it can (partially) delete activations and reduce the norm by writing vectors that cancel out previous activations.
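A toy version of this cancellation argument (our own construction, not code from the post's Colab): a sub-layer that only sees the LayerNormed stream knows which direction to cancel, but not the stream's magnitude, so it can only delete a feature to the extent that it correctly predicts the scale.

```python
import numpy as np

def layernorm(x):
    # LayerNorm without the learned affine part: zero mean, unit variance.
    x = x - x.mean()
    return x / x.std()

rng = np.random.default_rng(0)
d = 768
x = rng.standard_normal(d)
x *= 30.0 / np.linalg.norm(x)  # residual stream with norm 30 (unknown to the block)

# The block sees only layernorm(x): it knows the direction to cancel, but has
# to guess the scale s. Only a correct guess (s ~ 30) removes the feature.
leftover = {}
for s in [10.0, 30.0, 50.0]:
    out = x - (s / np.sqrt(d)) * layernorm(x)
    leftover[s] = np.linalg.norm(out)
    print(f"guessed scale {s}: leftover norm {leftover[s]:.1f}")
```

With a bad guess, a large residual (or overshoot) remains; a model that tracks the typical norm per layer can cancel almost exactly, which would explain the norm decreases we see.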
The very small models (distilgpt2, gpt2-small) have superexponential norm growth, but most models show exponential growth throughout extended periods. For example, from layer 5 to 41 in GPT2-XL, we see an exponential increase in residual stream norm at a rate of ~1.045 per layer. We show this trend as an orange line in the above plot, and below we demonstrate the growth for a specific example:
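The ~1.045 figure can be read off as the slope of a straight-line fit to log-norms against layer number. As a sketch on synthetic stand-in data (the real per-layer norms come from the Colab):

```python
import numpy as np

# Synthetic stand-in for per-layer residual stream norms with 4.5% growth
# plus a little noise, mimicking GPT2-XL layers 5..41.
rng = np.random.default_rng(0)
layers = np.arange(5, 42)
norms = 3.0 * 1.045 ** layers * np.exp(rng.normal(0, 0.01, layers.size))

# Fit log(norm) = a * layer + b; the per-layer growth factor is exp(a).
a, b = np.polyfit(layers, np.log(norms), 1)
print(f"estimated growth per layer: {np.exp(a):.3f}")  # ~1.045
```

On a log scale this fit is a straight line, which is why the exponential phases show up as straight segments in the plots above.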
BOS and padding tokens
In our initial tests, we noticed some residual streams showed an irregular and surprising growth curve:
As for the reason behind this shape, we expect that the residual stream (norm) is very predictable at BOS and padding positions. This is because these positions cannot attend to other positions and thus always have the same values (up to positional embedding). Thus it would be no problem for the model to cancel out activations, and our arguments about this being hard do not hold for BOS and padding positions. We don't know whether there is a particular meaning behind this shape.
We suspect this is the source of the U-shape shown in Basic Facts about Language Model Internals during training:
Theories for the source of the growth
From now on we focus on the GPT2-XL case. Here is the residual stream growth curve again (orange dots), now also including the resid_mid hook between the attention and MLP sub-layers (blue dots).