Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An interactive introduction to grokking and mechanistic interpretability, published by Adam Pearce on August 7, 2023 on The AI Alignment Forum.
Our write-up largely agrees with @Quintin Pope's summary, with the addition of training trajectory visualizations and an explanation of the MLP construction that solves modular addition.
A meta note that didn't make it into the article - with so many people looking into this problem over the last 18 months, I'm surprised this construction took so long to find. The modular addition task with a 1-layer MLP is about as simple as you can get!
Scaling mechanistic interpretability up to more complex tasks/models seems worth continuing to try, but I'm less sure extracting crisp explanations will be possible. Even if we "solve" superposition, figuring out the construction here - where there's no superposition in the generalizing model - wasn't trivial.
gif/twitter summary
If we train an MLP to solve modular addition, the generalizing phase has suggestive periodic patterns.
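For concreteness, here's a minimal sketch of the setup in PyTorch - the architecture matches the one-layer MLP on one-hot inputs, but the modulus, width, and optimizer settings are illustrative stand-ins rather than the article's exact values:

```python
import torch
import torch.nn as nn

# Modular addition: predict (a + b) % P from one-hot encodings of a and b.
# P, HIDDEN, and the optimizer settings are illustrative, not the article's.
P, HIDDEN = 67, 128
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
x = torch.cat([nn.functional.one_hot(pairs[:, 0], P),
               nn.functional.one_hot(pairs[:, 1], P)], dim=1).float()

model = nn.Sequential(nn.Linear(2 * P, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

# Train on a fraction of all pairs; grokking shows up as a delayed jump in
# accuracy on the held-out pairs, long after the training set is memorized.
train = torch.rand(len(pairs)) < 0.4
for step in range(50_000):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x[train]), labels[train])
    loss.backward()
    opt.step()
```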
To figure out why the model generalizes, we first look at a task where we know the generalizing solution - sparse parity. You can see the model generalizing as weight decay prunes spurious connections.
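Sparse parity isn't defined in this summary, so as a hedged sketch of the task (the sizes below are made up): the label is the XOR of a few "relevant" bits hidden among many input bits, and generalizing means pruning every connection to the irrelevant bits.

```python
import torch

# Sparse parity: the label depends only on k of the n input bits.
n, k, n_samples = 30, 3, 2000          # illustrative sizes, not the article's
bits = torch.randint(0, 2, (n_samples, n))
y = bits[:, :k].sum(dim=1) % 2         # parity (XOR) of the k relevant bits
x = 2.0 * bits - 1.0                   # map {0, 1} -> {-1, +1} inputs
```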
One point from the Omnigrok paper I hadn't internalized before training lots of models: grokking only happens when hyper-parameters are just right.
We can make other weird things happen too, like AdamW oscillating between low train loss and low weights.
To understand how an MLP solves modular addition, we train a much smaller model with a circular input embedding baked in.
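A sketch of what "baking in" a circular embedding might look like - I'm assuming each residue a is pinned to the angle 2πa/P on the unit circle and the embedding is frozen, which matches the spirit of the construction even if the details differ:

```python
import math
import torch

P = 67  # illustrative modulus
angles = 2 * math.pi * torch.arange(P) / P
# Each residue a gets the fixed 2-d embedding (cos(2*pi*a/P), sin(2*pi*a/P));
# only the layers on top of this frozen embedding are trained.
embed = torch.stack([torch.cos(angles), torch.sin(angles)], dim=1)  # (P, 2)
```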
Following @Neel Nanda and applying a discrete Fourier transform, we see larger models trained from scratch use the same star trick!
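Concretely, the analysis amounts to Fourier-transforming the learned input embedding along the residue dimension; in a generalizing model the norm concentrates on a handful of frequencies - the stars in the visualizations. A sketch with a stand-in weight matrix:

```python
import numpy as np

P, d = 67, 128
W_input = np.random.randn(P, d)                  # stand-in for trained weights
spectrum = np.abs(np.fft.rfft(W_input, axis=0))  # shape (P // 2 + 1, d)
freq_norms = spectrum.sum(axis=1)
print(freq_norms.argsort()[::-1][:5])            # the dominant frequencies
```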
Finally, we show what the stars are doing and prove that they work:
Our ReLU activation has a small error, but it's close enough to the exact solution - an x² activation suggested in Grokking modular arithmetic - for the model to patch everything up with constructive interference.
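To make the x² version concrete, here's a numerical check of that construction (my own sketch of the solution from Grokking modular arithmetic, not the article's trained weights): each neuron has a frequency w and phase φ, computes (cos(wa + φ) + cos(wb + φ))², and is read out with weight cos(wc + 2φ). The product-to-sum identities leave a coherent cos(w(a + b - c)) term, so summing over frequencies peaks the logits at c = (a + b) mod P.

```python
import numpy as np

P = 67
a, b = 12, 51
freqs = 2 * np.pi * np.arange(1, (P - 1) // 2 + 1) / P  # one group per frequency
phases = np.pi * np.arange(8) / 4  # evenly spaced phases cancel interference terms
c = np.arange(P)

logits = np.zeros(P)
for w in freqs:
    for phi in phases:
        act = (np.cos(w * a + phi) + np.cos(w * b + phi)) ** 2  # x^2 activation
        logits += act * np.cos(w * c + 2 * phi)

# The cos(w * (a + b - c)) terms add up constructively only at c = (a + b) % P.
assert logits.argmax() == (a + b) % P
```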
And there are still open questions: why are the frequencies with >5 neurons lopsided? Why does factoring W_input not do the same thing as factoring W_output?
Also see The Hydra Effect: Emergent Self-repair in Language Model Computations
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.