Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety, published by Stephen Casper on February 17, 2023 on The AI Alignment Forum.
Part 6 of 12 in the Engineer’s Interpretability Sequence.
Thanks to Chris Olah and Neel Nanda for discussions and comments. In particular, I am thankful to Neel Nanda for correcting a mistake I made in understanding the arguments in Olsson et al. (2022) in an earlier draft of this post.
TAISIC = “the AI safety interpretability community”
MI = “mechanistic interpretability”
What kind of work this post focuses on
Compared with the research community at large, TAISIC prioritizes a relatively small set of interpretability problems. This work is not homogeneous, but a dominant theme is a focus on mechanistic, circuits-style interpretability with the end goals of model verification and/or detecting deceptive alignment.
There is a specific line of work that this post focuses on. Key papers from it include:
Feature Visualization (Olah et al., 2017)
Zoom In: An Introduction to Circuits (Olah et al., 2020)
Curve Detectors (Cammarata et al., 2020)
A Mathematical Framework for Transformer Circuits (Elhage et al., 2021)
In-context Learning and Induction Heads (Olsson et al., 2022)
Toy Models of Superposition (Elhage et al., 2022)
Softmax Linear Units (Elhage et al., 2022)
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (Wang et al., 2022)
Progress measures for grokking via mechanistic interpretability (Nanda et al., 2023)
etc.
And the points in this post will also apply somewhat to the current research agendas of Anthropic, Redwood Research, ARC, and Conjecture. This includes Causal Scrubbing (Chan et al., 2022) and mechanistic anomaly detection (Christiano, 2022).
Most (all?) of the above work is either from Distill or inspired in part by Distill’s interpretability work in the late 2010s.
To be clear, I believe this research is valuable, and it has been foundational to my own thinking about interpretability. But there seem to be some troubles with this space that might be keeping it from being as productive as it can be. Now may be a good time to make some adjustments to TAISIC’s focus on MI. This may be especially important given how much recent interest there has been in interpretability work and how there are large recent efforts focused on getting a large number of junior researchers working on it.
Four issues
This section discusses four major critiques of the works above. Not all of these critiques apply to all of the above, but for every paper mentioned above, at least one of the critiques below applies to it. Some, but not all, examples of papers exhibiting these problems will be covered.
Cherrypicking results
As discussed in EIS III and the Toward Transparent AI survey (Räuker et al., 2022), cherrypicking is common in the interpretability literature, but it manifests in some specific ways in MI work. It is very valuable for papers to include illustrative examples to build intuition, but when a paper makes such examples a central focus, cherrypicking can make results look better than they are. The Feature Visualization (Olah et al., 2017) and Zoom In (Olah et al., 2020) papers have examples of this. Have a look at the cover photo for Olah et al. (2017).
From Olah et al. (2017)
These images seem easy to describe and form hypotheses from. But instead of these, try going to OpenAI’s Microscope and looking at some random visualizations. For example, here are some from a deep layer in an Inception-v4.
From this link.
As someone who often works with feature visualizations, I can confirm that these visualizations from OpenAI’s Microscope are quite typical. But notice how they seem quite a bit less ‘lucid’ than the ones in the cover photo from Olah et al. (2017).
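The contrast between hand-picked and typical visualizations can be made concrete with a small sketch. This is only an illustration of one possible safeguard, not a method taken from any of the papers above: choose which units to display by seeded random sampling, so that readers can reproduce the selection and see that it was not curated. The function name `sample_units` and the layer width of 512 are hypothetical.

```python
import random


def sample_units(num_units, k, seed=0):
    """Return k distinct unit indices drawn uniformly at random, sorted.

    num_units, k, and seed are illustrative parameters: num_units is the
    number of units (e.g. channels) in the layer being visualized.
    """
    if not 0 <= k <= num_units:
        raise ValueError("k must be between 0 and num_units")
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    return sorted(rng.sample(range(num_units), k))


# Hypothetical layer with 512 units; pick 9 of them to visualize.
units = sample_units(num_units=512, k=9, seed=0)
print(units)
```

Reporting the seed alongside the figure lets anyone regenerate the same set of units, and showing a random sample next to any illustrative examples makes it clearer how representative those examples are.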
Of course, many papers present t...