Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Othello-GPT: Future Work I Am Excited About, published by Neel Nanda on March 29, 2023 on LessWrong.
This is the second in a three post sequence about interpreting Othello-GPT. See the first post for context.
This post covers future directions I'm excited to see work on, why I care about them, and advice for getting started. Each section is self-contained, so feel free to skip around.
Look up unfamiliar terms here
Future work I am excited about
The above sections leave me (and hopefully you!) pretty convinced that I've found something real and dissolved the mystery of whether there's a linear vs non-linear representation. But I think there are a lot of exciting mysteries left to uncover in Othello-GPT, and that doing so may be a promising way to get better at reverse-engineering LLMs (the goal I actually care about). In the following sections, I try to:
Justify why I think further work on Othello-GPT is interesting
(Note that my research goal here is to get better at transformer mech interp, not to specifically understand emergent world models better)
Discuss how this unlocks finding modular circuits, and some preliminary results
Rather than purely studying circuits mapping input tokens to output logits (like basically all prior transformer circuits work), using the probe we can study circuits mapping the input tokens to the world model, and the world model to the output logits - analogous to the difference between thinking of a program as one massive block of code vs splitting it into functions and modules.
If we want to reverse-engineer large models, I think we need to get good at this!
Discuss how we can interpret Othello-GPT's neurons - we're very bad at interpreting transformer MLP neurons, and I think that Othello-GPT's are simple enough to be tractable yet complex enough to teach us something!
Discuss how, more broadly, Othello-GPT can act as a laboratory to get data on many other questions in transformer circuits - it's simple enough to have a ground truth, yet complex enough to be interesting
My hope is that some people reading this are interested enough to actually try working on these problems, and I end this section with advice on where to start.
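To make the modular-circuits idea concrete, here's a minimal numpy sketch of how a linear probe serves as an intermediate interface. All names, shapes, and the random "trained" probe are illustrative stand-ins, not Othello-GPT's actual code: the point is that because both the residual stream and the probe are linear, board-state logits decompose exactly into per-layer contributions, letting us ask which layers write the world model.

```python
import numpy as np

# Hypothetical shapes, chosen for illustration only
rng = np.random.default_rng(0)
d_model, n_layers, n_squares, n_states = 64, 8, 64, 3  # states: blank / mine / theirs

# Each layer's output at one sequence position; the residual stream is their sum
layer_outputs = rng.normal(size=(n_layers, d_model))
residual = layer_outputs.sum(axis=0)

# A linear probe mapping residual directions to (square, state) logits
# (random here as a stand-in for a trained probe)
probe = rng.normal(size=(d_model, n_squares, n_states))

# Probing the full residual stream gives the predicted board state
board_logits = np.einsum("d,dqs->qs", residual, probe)
board_state = board_logits.argmax(axis=-1)  # shape: (n_squares,)

# Linearity lets us attribute the board-state logits to individual layers,
# i.e. study the circuit from inputs to world model layer by layer
per_layer_logits = np.einsum("ld,dqs->lqs", layer_outputs, probe)
assert np.allclose(per_layer_logits.sum(axis=0), board_logits)
```

The same decomposition works for attention heads or neurons instead of whole layers, since anything that adds into the residual stream can be projected onto the probe directions separately.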
Why and when to work on toy models
This is a long and rambly section about my research philosophy of mech interp, and you should feel free to skip to the next section if that's not your jam.
At first glance, playing legal moves in Othello (not even playing good moves!) has nothing to do with language models, so the claim that working on toy tasks like Othello-GPT can help us reverse-engineer LLMs like GPT-4 is a strong one worth justifying. Can it really? I'm not sure! But I think it's a plausible bet worth making.
To walk through my reasoning, it's worth first thinking about what's holding us back - why haven't we already reverse-engineered the most capable models out there? I'd point to a few key factors (though note that this is my personal hot take, isn't comprehensive, and I'm sure other researchers have their own views!):
Conceptual frameworks: To reverse-engineer a transformer, you need to know how to think like a transformer. Questions like: What kinds of algorithms is it natural for a transformer to represent, and how? Are features and circuits the right way to think about it? Is it even reasonable to expect that reverse-engineering is possible? How can we tell if a hypothesis or technique is principled vs hopelessly confused? What does it even mean to have truly identified a feature or circuit?
I personally thought A Mathematical Framework significantly clarified my conceptual frameworks for transformer circuits!
This blog post is fundamentally motivated by forming better conceptual frameworks - do models form linear representations?
Practical Knowledge/Techniques: Understanding models is hard, and being able to do this...