Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Neel Nanda on the Mechanistic Interpretability Researcher Mindset, published by Michaël Trazzi on September 22, 2023 on LessWrong.
Some excerpts from my interview with Neel Nanda about how to productively carry out research in mechanistic interpretability.
Posting this here since I believe his advice is relevant for building accurate world models in general.
An Informal Definition Of Mechanistic Interpretability
It's kind of this weird flavor of AI interpretability that says, "Bold hypothesis. Despite the entire edifice of established wisdom and machine learning saying that these models are bullshit, inscrutable black boxes, I'm going to assume there is some actual structure here. But the structure is not there because the model wants to be interpretable or because it wants to be nice to me. The structure is there because the model learns an algorithm, and the algorithms that are most natural to express in the model's structure and its particular architecture and stack of linear algebra are algorithms that make sense to humans." (context)
Three Modes Of Mechanistic Interpretability Research: Confirming, Red Teaming And Gaining Surface Area
I kind of feel a lot of my research style is dominated by this deep-seated conviction that models are comprehensible, that everything is fundamentally kind of obvious, and that I should be able to just go inside the model and there should be this internal structure. And so one mode of research is I just have all of these hypotheses and guesses about what's going on. I generate experiment ideas for things that should be true if my hypothesis is true. And I just repeatedly try to confirm it.
Another mode of research is trying to red team and break things, where I have this hypothesis, I do this experiment, I'm like, "oh my God, this is going so well", and then get kind of stressed because I'm concerned that I'm having wishful thinking and I try to break it and falsify it and come up with experiments that would show that actually life is complicated.
A third mode of research is what I call "trying to gain surface area" where I just have a system that I'm pretty confused about. I just don't really know where to get started. Often, I'll just go and do things that I think will get me more information. Just go and plot stuff or follow random things I'm curious about in a fairly undirected fuzzy way. This mode of research has actually been the most productive for me. [...]
You could paraphrase them as, "Isn't it really obvious what's going on?", "Oh man, am I so sure about this?" and "Fuck around and find out". (context)
Strong Beliefs Weakly Held: Having Hypotheses But Being Willing To Be Surprised
You can kind of think of it as "strong beliefs weakly held". I think you should be good enough that you can start to form hypotheses. Being at the point where you can sit down, set a five-minute timer, brainstorm what's going on, and come up with four different hypotheses is just a much, much stronger research position than when you sit down and try to brainstorm and you come up with nothing. Yeah, maybe having two hypotheses is the best one. You want to have multiple hypotheses in mind.
You also want to be aware that probably both of them are wrong, but you want to have enough engagement with the problem that you can generate experiment ideas. Maybe one way to phrase it is if you don't have any idea what's going on, it's hard to notice what's surprising. And often noticing what's surprising is one of the most productive things you can do when doing research. (context)
On The Benefits Of The Experimental Approach
I think there is a strong trend among people, especially the kind of people who get drawn to alignment from very theory-based arguments, to go and just pure theorycraft and play around with toy models and form beautiful, elegant hy...