The Nonlinear Library

LW - Visible loss landscape basins don't correspond to distinct algorithms by Mikhail Samin


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Visible loss landscape basins don't correspond to distinct algorithms, published by Mikhail Samin on July 28, 2023 on LessWrong.
Thanks to Justis, Arthur Conmy, Neel Nanda, Joseph Miller, and Tilman Räuker for their feedback on a draft.
I feel like many people haven't noticed an important result of the mechanistic interpretability analysis of grokking, and so haven't updated how they think about loss landscapes and the algorithms that neural networks end up implementing. I think this has implications for alignment research.
When thinking about grokking, people often imagine something like this: the neural network implements Algorithm 1 (e.g., memorizes the training data), achieves approximately the lowest loss available via memorization, then moves around the bottom of the Algorithm 1 basin and, after a while, stumbles across a path to Algorithm 2 (e.g., the general algorithm for modular addition).
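To make the setup concrete, here is a minimal sketch of the kind of experiment grokking is studied in; the architecture, hyperparameters, and train fraction below are illustrative guesses of mine, not the exact setup from the grokking papers. Train a small network on modular addition with most input pairs held out, and the train loss falls long before the test loss does.

```python
# Minimal grokking-style experiment sketch (illustrative, not any paper's
# exact setup): a small MLP on modular addition with weight decay.
import torch
import torch.nn as nn

P = 113                                              # modulus for (a + b) mod P
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))  # all (a, b)
labels = (pairs[:, 0] + pairs[:, 1]) % P

# Train on ~30% of all pairs; "grokking" = late generalization to the rest.
perm = torch.randperm(len(pairs))
split = int(0.3 * len(pairs))
train_idx, test_idx = perm[:split], perm[split:]

def one_hot_pairs(batch):
    # Concatenate one-hot encodings of a and b into a single input vector.
    a = nn.functional.one_hot(batch[:, 0], P).float()
    b = nn.functional.one_hot(batch[:, 1], P).float()
    return torch.cat([a, b], dim=1)

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
# Weight decay matters: it is what eventually favors the general circuit
# over pure memorization.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(50_000):                           # full-batch training
    opt.zero_grad()
    loss = loss_fn(model(one_hot_pairs(pairs[train_idx])), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1_000 == 0:
        with torch.no_grad():
            test_loss = loss_fn(model(one_hot_pairs(pairs[test_idx])),
                                labels[test_idx])
        print(f"step {step}: train {loss.item():.4f}  test {test_loss.item():.4f}")
```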
But the mechanistic interpretability analysis of grokking has shown that this is not true!
From approximately the start of training, Algorithm 1 is most of what the circuits are doing and is what almost entirely determines the neural network's output; but at the same time, the entire time the parameters visibly move down the wider basin, they don't just get better at memorization: they increasingly implement the circuits for Algorithm 1 and the circuits for Algorithm 2, in superposition.
(Neel Nanda et al. have shown that the circuits that eventually implement the general algorithm for modular addition start forming near the beginning of training: the gradient was mostly an arrow towards memorization, but, immediately from the initialization of the weights, it was also a bit of an arrow pointing towards the general algorithm. The circuits were gradually tuned throughout training. The noticeable change in the test loss only starts once the circuits are already almost right.)
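One way to watch the circuits for Algorithm 2 forming, loosely inspired by the Fourier analysis in Nanda et al.'s work (a hedged sketch of mine, not their exact metric): the general modular-addition circuit is built out of a few sine/cosine frequencies, so the fraction of the first layer's spectral power concentrated in a few Fourier frequencies should start growing long before the test loss moves.

```python
# Hedged sketch of a progress measure for the general circuit: how much of
# the first-layer weights' spectral power sits in a few Fourier frequencies.
import torch

def fourier_concentration(W_a: torch.Tensor, top_k: int = 8) -> float:
    """W_a: (hidden, P) slice of first-layer weights that reads the one-hot
    'a' input. Returns the fraction of non-DC spectral power captured by
    the top-k Fourier frequencies."""
    coeffs = torch.fft.rfft(W_a, dim=1)          # per-neuron DFT over the P inputs
    power = coeffs.abs().pow(2).sum(dim=0)[1:]   # total power per frequency, drop DC
    return (power.topk(top_k).values.sum() / power.sum()).item()

# With the MLP sketched above, the 'a' block is the first P input columns:
# print(fourier_concentration(model[0].weight.data[:, :P]))
```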
A path through the loss landscape, as visualized in 3D, doesn't correspond to how and what the neural network is actually learning. Almost all of the change in the loss is due to the increasingly good implementation of Algorithm 1; but apparently, the entire time, the gradient also points towards some faraway implementation of Algorithm 2. Somehow, the direction in which Algorithm 2 lies is also visible to the derivative, and moving the parameters in the direction the gradient points mostly means increasingly implementing Algorithm 1, but also increasingly implementing the faraway Algorithm 2.
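One hypothetical way to make "the gradient also points towards Algorithm 2" concrete: given reference parameter vectors for a purely memorizing solution and a fully general one (theta_mem and theta_gen below are assumptions you would have to supply, e.g. saved checkpoints), measure how well each training step aligns with the direction toward each.

```python
# Hypothetical diagnostic: cosine similarity between the gradient-descent
# step direction and the directions toward two reference solutions.
import torch

def gradient_cosines(params, grad, theta_mem, theta_gen):
    """All arguments are flattened 1-D parameter vectors; theta_mem and
    theta_gen are assumed reference checkpoints (purely memorizing vs.
    fully grokked). Gradient descent moves along -grad; compare that step
    direction with the directions from the current parameters toward each
    solution. In the grokking story, both stay positive for most of training."""
    step_dir = -grad
    cos = torch.nn.functional.cosine_similarity
    return (cos(step_dir, theta_mem - params, dim=0).item(),
            cos(step_dir, theta_gen - params, dim=0).item())
```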
"Grokking", visible in the test loss, is due to the change that happens when the parameters already implement Algorithm 2 accurately enough for the switch from mostly outputting the results of an implementation of Algorithm 1 to the results of an improving implementation of Algorithm 2 not to hurt the performance. Once it's the case, the neural network puts more weight into Algorithm 2 and at the same time quickly tunes it to be even more accurate (which is increasingly easy as the output is increasingly determined by the implementation of Algorithm 2).
This is something many people seem to have missed. I did not expect it to be the case, was surprised, and updated how I think about loss landscapes.
Does this generalize?
Maybe. I'm not sure whether it's correct to generalize from the mechanistic interpretability analysis of grokking to neural networks in general: real LLMs are under-parameterized, while the grokking model is very over-parameterized. But I guess it might be reasonable to expect that this is how deep learning generally works.
People seem to think that multi-dimensional loss landscapes of neural networks have basins for specific algorithms, and neural networks get into these depending on how relatively large these basins are, which might be caused by how simple the algorithms are, how path-depe...