Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Actually, Othello-GPT Has A Linear Emergent World Representation, published by Neel Nanda on March 29, 2023 on LessWrong.
Epistemic Status: This is a write-up of an experiment in speedrunning research, and the core results represent ~20 hours/2.5 days of work (though the write-up took way longer). I'm confident in the main results to the level of "hot damn, check out this graph", but likely have errors in some of the finer details.
Disclaimer: This is a write-up of a personal project, and does not represent the opinions or work of my employer
This post may get heavy on jargon. I recommend looking up unfamiliar terms in my mechanistic interpretability explainer
Thanks to Chris Olah, Martin Wattenberg, David Bau and Kenneth Li for valuable comments and advice on this work, and especially to Kenneth for open sourcing the model weights, dataset and codebase, without which this project wouldn't have been possible! Thanks to ChatGPT for formatting help.
Overview
Context: A recent paper trained a model to play legal moves in Othello by predicting the next move, and found that it had spontaneously learned to compute the full board state - an emergent world representation.
This board state could be recovered by non-linear probes, but not by linear probes.
We can causally intervene on this representation to predictably change model outputs, so it's telling us something real
I find that actually, there's a linear representation of the board state!
But rather than representing "this cell is black", it represents "this cell has my colour", since the model plays both black and white moves.
We can causally intervene with the linear probe, and the model makes legal moves in the new board! (A minimal sketch of such a probe follows below.)
This is evidence for the linear representation hypothesis: that models, in general, compute features and represent them linearly, as directions in space! (If they don't, mechanistic interpretability would be way harder)
The original paper seemed at first like significant evidence for a non-linear representation - the finding of a linear representation hiding underneath shows the real predictive power of this hypothesis!
This (slightly) strengthens the paper's evidence that "predict the next token" transformer models are capable of learning a model of the world.
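For concreteness, here's a minimal sketch of the kind of linear probe this involves - the names, shapes, and training details below are illustrative assumptions, not the actual code from this project:

```python
# A minimal sketch of a "my colour vs their colour" linear probe (illustrative
# names and shapes, not the project's actual code).
# Assumes `resid` holds cached residual stream activations, shape [batch, d_model],
# and `board_labels` holds the true board state at each position, shape [batch, 64],
# with values 0 = empty, 1 = mine, 2 = theirs (relative to the player about to move).
import torch
import torch.nn as nn

d_model, n_cells, n_states = 512, 64, 3  # placeholder hidden size; 8x8 board; empty/mine/theirs

# One linear map from the residual stream to (empty / mine / theirs) logits per cell.
probe = nn.Linear(d_model, n_cells * n_states)
optim = torch.optim.AdamW(probe.parameters(), lr=1e-3)

def probe_loss(resid: torch.Tensor, board_labels: torch.Tensor) -> torch.Tensor:
    logits = probe(resid).reshape(-1, n_cells, n_states)
    return nn.functional.cross_entropy(
        logits.reshape(-1, n_states), board_labels.reshape(-1)
    )

# Toy training steps on random data, purely to show the shapes involved.
resid = torch.randn(128, d_model)
board_labels = torch.randint(0, n_states, (128, n_cells))
for _ in range(10):
    optim.zero_grad()
    probe_loss(resid, board_labels).backward()
    optim.step()

# A causal intervention then amounts to editing the residual stream along one cell's
# probe directions (e.g. pushing it from "mine" towards "theirs") and letting the rest
# of the model run on the edited activations, typically via a forward hook.
```

The key design choice is that the labels are relative to the player about to move (mine/theirs/empty) rather than absolute (black/white/empty) - that relabelling is what lets a purely linear probe recover the board state.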
Part 2: There are a lot of fascinating questions left to answer about Othello-GPT - I outline some key directions, and how they fit into my bigger picture of mech interp progress
Studying modular circuits: A world model implies emergent modularity - many early circuits together compute a single world model, many late circuits each use it. What can we learn about what transformer modularity looks like, and how to reverse-engineer it?
Prior transformer circuits work focuses on end-to-end circuits, from the input tokens to output logits. But this seems unlikely to scale!
I present some preliminary evidence that we can read off a neuron's function from its input weights via the probe, as sketched below
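Here's a rough sketch of what that readout looks like, with assumed variable names and shapes rather than the actual notebook code:

```python
# A sketch of reading a neuron's function off its input weights via the probe
# (assumed names, not the actual notebook). `W_in` is an MLP layer's input weight
# matrix, shape [d_model, d_mlp]; `probe_dirs` are the learned probe directions,
# reshaped to [n_cells, n_states, d_model] with states 0 = empty, 1 = mine, 2 = theirs.
import torch

def neuron_board_readout(W_in: torch.Tensor, probe_dirs: torch.Tensor, neuron: int) -> torch.Tensor:
    """Cosine similarity between one neuron's input direction and each cell's
    'mine minus theirs' probe direction - a rough map of which cells, in which
    colour, this neuron is looking for."""
    w = W_in[:, neuron]                                    # [d_model]
    mine_vs_theirs = probe_dirs[:, 1] - probe_dirs[:, 2]   # [n_cells, d_model]
    sims = torch.nn.functional.cosine_similarity(
        mine_vs_theirs, w.unsqueeze(0), dim=-1
    )                                                      # [n_cells]
    return sims.reshape(8, 8)                              # the Othello board is 8x8
```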
Neuron interpretability and Studying Superposition: Prior work has made little progress on understanding MLP neurons. I think Othello-GPT's neurons are tractable to understand, yet complex enough to teach us a lot!
I further think this can help us get some empirical data about the Toy Models of Superposition paper's predictions
I investigate max activating dataset examples and find apparent monosemanticity, yet deeper investigation shows the picture is more complex - a sketch of this kind of lookup follows below.
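A sketch of that kind of lookup, again with assumed names and a caching setup that isn't the original code:

```python
# A sketch of finding max activating dataset examples for a single MLP neuron.
# Assumes `mlp_acts` holds cached post-nonlinearity activations over a dataset of
# games, shape [n_games, n_moves, d_mlp] (names and caching are assumptions).
import torch

def top_activating_examples(mlp_acts: torch.Tensor, neuron: int, k: int = 10):
    """Return (game index, move index, activation) for the k largest activations."""
    n_games, n_moves, _ = mlp_acts.shape
    flat = mlp_acts[:, :, neuron].reshape(-1)
    top_vals, top_idx = flat.topk(k)
    games = torch.div(top_idx, n_moves, rounding_mode="floor")
    moves = top_idx % n_moves
    return list(zip(games.tolist(), moves.tolist(), top_vals.tolist()))
```

Looking at the board states and move histories at the returned (game, move) positions is the "max activating dataset examples" analysis referred to above.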
A transformer circuit laboratory: More broadly, the field has a tension between studying clean, tractable yet over-simplistic toy models and studying the real yet messy problem of interpreting LLMs - Othello-GPT is toy enough to be tractable yet complex enough to be full of mysteries, and I detail many more confusions and conjectures that it could shed light on.
Part 3: Reflections on the research process
I did the bulk of this project in a weeke...