Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Takeaways from the Mechanistic Interpretability Challenges, published by scasper on June 8, 2023 on LessWrong.
...plus more challenges on the way
Spoilers ahead
What happened?
The Mechanistic Interpretability Challenges (post and GitHub) were two challenges I posed in February as part of the Engineer’s Interpretability Sequence. The first challenge was to find the pseudocode for the labeling function used to train a small CNN MNIST classifier. It was solved early last month. The second was to find the pseudocode for the labeling function used to train a one-layer transformer that classified pairs of numbers into two categories. It was solved (with some reservations) late last month. Instead of finding the labeling function, the researchers who solved it obtained a mechanistic explanation of how the model worked and argued that the labeling function’s pseudocode would not be tractable to find from the model.
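To make the setup concrete, here is a minimal sketch of what a "labeling function" means in these challenges. The rule below is purely hypothetical (the real labeling functions were kept secret until the challenges were solved); it just illustrates how a hidden ground-truth rule maps inputs, here pairs of numbers as in the second challenge, to the binary labels a model is then trained to reproduce.

```python
# Hypothetical example only -- NOT an actual challenge labeling function.
# A "labeling function" is the hidden ground-truth rule used to generate
# training labels; the challenge is to recover its pseudocode from the
# trained model alone.
def toy_labeling_function(x: int, y: int) -> int:
    """Map a pair of numbers to one of two classes (illustrative rule only)."""
    return int((x + y) % 3 == 0)

# Labeled data of the kind the challenge models were (conceptually) trained on:
# inputs paired with labels produced by the hidden rule.
dataset = [((x, y), toy_labeling_function(x, y))
           for x in range(10) for y in range(10)]
```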
Thanks to Stefan, Marius, and Neel
Stefan Heimersheim and Marius Hobbhahn solved both challenges as a team. I and others have been impressed with their work. Meanwhile, Neel Nanda offered to contribute $500 to the prize pool for solving each challenge. Per the request of Stefan and Marius, a total of $1,500 has been donated by Neel and me to AI Safety Support.
Why These Challenges?
In the original post on the challenges, I argued that solving them would be one of the first clear examples of mechanistic interpretability being used to solve a problem that was not specifically selected to be solvable with mechanistic interpretability.
Because it doesn’t treat models as black boxes, mechanistic interpretability is one of the potential solutions we might have for diagnosing and debugging insidious alignment failures. For example, if a model has a trojan or plans to make a treacherous turn once it detects that it’s in deployment, then these failures will be virtually undetectable from black-box access alone during training and development.
Mechanistic interpretability has been a reasonably high-profile research area in the AI safety community for the past six years or so, and it is currently undergoing a renewed surge of interest. However, I have tried to be critical of the fact that much of the progress in mechanistic interpretability research has come from “streetlight interpretability” projects, often with cherrypicked models and tasks. As a result, if mechanistic interpretability continues to be a field full of cherrypicked and toy work, it risks failing to produce methods that keep up with state-of-the-art applications of AI. Certainly, progress in mechanistic interpretability has not kept up with progress in AI as a whole, and despite all of the interest from the AI safety community, it has few big wins and few real-world applications that give engineers competitive tools for solving real-world problems.
Hence the purpose of the mechanistic interpretability challenges: to provide challenges that can’t be cherrypicked by those undertaking them. The hope has been that these challenges, and others like them, could offer a useful way of testing approaches to interpretability. The goal is to measure how promising specific methods, and mechanistic interpretability itself, are for truly reverse-engineering models performing tasks that don’t happen to sit under any particular streetlight.
The First Challenge: A Clear Win for MI
As is now public information, the MNIST CNN was trained on a labeling function that labeled images with either a small or a large L1 distance to a particular target image as 1, while images with a medium L1 distance to it were labeled 0.
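In code, that rule looks roughly like the sketch below. The target image, the exact distance thresholds (`low` and `high`), and the pixel scaling are placeholders chosen for illustration; only the small/large-versus-medium L1-distance structure comes from the challenge write-up.

```python
import numpy as np

# Sketch of the revealed labeling rule. The thresholds and the target image
# are placeholder assumptions; only the "small or large L1 distance -> 1,
# medium L1 distance -> 0" structure is from the challenge write-up.
def label_image(img: np.ndarray, target: np.ndarray,
                low: float = 100.0, high: float = 300.0) -> int:
    """Label 1 if the L1 distance to the target image is small OR large,
    and 0 if it falls in the medium band."""
    d = np.abs(img - target).sum()  # L1 distance between the two images
    return int(d < low or d > high)

# Example with random 28x28 "MNIST-like" arrays (pixel values in [0, 1]).
rng = np.random.default_rng(0)
target = rng.random((28, 28))
print(label_image(rng.random((28, 28)), target))
```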
The solution was thorough. The network developed a set of “detectors” and “anti-detectors” for the target image in the penultimate layer. It labeled anythi...