The Nonlinear Library

AF - EIS II: What is “Interpretability”? by Stephen Casper



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS II: What is “Interpretability”?, published by Stephen Casper on February 9, 2023 on The AI Alignment Forum.
Part 2 of 12 in the Engineer’s Interpretability Sequence.
A parable based on a true story
In 2015, Google’s image classification app classified many photos depicting black people as gorillas. Image from WSJ.
Remember Google’s infamous blunder from 2015 in which users found that one of its vision APIs often misclassified black people as gorillas? Consider a parable of two researchers who want to understand and tackle this issue.
Alice is an extremely skilled mechanistic interpretability researcher who spends a heroic amount of effort analyzing Google’s model. She identifies a set of neurons and weights that seem to be involved in the detection and processing of human and gorilla faces and bodies. She develops a detailed mechanistic hypothesis and writes a paper about it with 5 different types of evidence for her interpretation. Later on, another researcher who wants to test Alice’s hypothesis edits the model in a way that the hypothesis suggests would fix the problem. As it turns out, the hypothesis was imperfect, and the model now classifies many images of gorillas as humans!
Bob knows nothing about neural networks. Instead of analyzing the network, he looks at the dataset that the model was trained on and notices a striking lack of black people (as was indeed the case in real life (Krishnan, 2020)). He suggests making the data more representative and training the model again. When this is done, it mostly fixes the problem without side effects.
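To make Bob’s kind of dataset audit concrete, here is a minimal sketch in Python. It assumes a hypothetical labels.csv metadata file with a “group” column describing each training image; the file name and column are illustrative placeholders, not anything from the original story.

```python
# A minimal sketch of a Bob-style dataset audit: count how many training
# examples fall into each demographic group and look for underrepresentation.
# "labels.csv" and the "group" column are hypothetical placeholders.
import csv
from collections import Counter

def group_counts(metadata_path: str, group_column: str = "group") -> Counter:
    """Count training examples per group from a metadata CSV."""
    counts: Counter = Counter()
    with open(metadata_path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row[group_column]] += 1
    return counts

if __name__ == "__main__":
    counts = group_counts("labels.csv")
    total = sum(counts.values())
    for group, n in counts.most_common():
        print(f"{group}: {n} examples ({100 * n / total:.1f}%)")
```

A skewed distribution in this printout is exactly the kind of evidence Bob used: no neural network analysis required, just a look at what the model was trained on.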
The goal of this parable is to illustrate that when it comes to doing useful engineering work with models, a mechanistic understanding may not always be the best way to go. We shouldn’t think of something called “interpretability” as being fundamentally separate from other tools that can help us accomplish our goals with models. And we especially shouldn’t automatically privilege some methods over others. In some cases, highly involved and complex approaches may be necessary. But in other cases, like Alice’s, the interesting, smart, and paper-able solution to the problem might not only be harder but also more failure-prone. This isn’t to say that Alice’s work could never lead to more useful insights down the road. But in this particular case, Alice’s smart approach was not as good as Bob’s simple one.
Interpretability is a means to an end.
Since I work and think about interpretability every day, I have felt compelled to adopt a definition for it. In a previous draft of this post, I proposed defining an interpretability tool as “any method by which something novel about a system can be better predicted or described.” And I think this is ok, but I have recently stopped caring about any particular definition. Instead, I think the important thing to understand is that “interpretability” is not a term of any fundamental importance to an engineer.
The key idea behind this post is that whatever we call “interpretability” tools are entirely fungible with other techniques related to describing, evaluating, debugging, etc.
Does this mean that it’s the same thing as interpretability if we just calculate performance on a test set, train an adversarial example, do some model pruning, or make a prediction based on the dataset? Pretty much. For all practical intents and purposes, these things are all of a common type. Consider any of the following sentences (two of which are sketched in code after the list).
This model handles 85% of the data correctly.
This input plus whatever is in this adversarial perturbation make the model fail.
I got rid of 90% of the weights and the model’s performance only decreased by 2%.
The dataset has this particular bias, so the model probably will as well.
This model seems to have a circuit composed of these neurons and these weights.
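Here is a minimal sketch of two of these measurements, test-set accuracy and global magnitude pruning, written in PyTorch. The model, test_loader, and 90% pruning fraction are illustrative assumptions rather than anything specified in the post.

```python
# A minimal sketch of two "interpretability-like" measurements from the list
# above: test-set accuracy and pruning the smallest-magnitude parameters.
import torch

@torch.no_grad()
def accuracy(model: torch.nn.Module, loader) -> float:
    """Fraction of test examples the model classifies correctly."""
    correct, total = 0, 0
    for x, y in loader:
        correct += (model(x).argmax(dim=-1) == y).sum().item()
        total += y.numel()
    return correct / total

@torch.no_grad()
def prune_by_magnitude(model: torch.nn.Module, fraction: float = 0.9) -> None:
    """Zero out the smallest-magnitude parameters across the whole model."""
    all_params = torch.cat([p.abs().flatten() for p in model.parameters()])
    threshold = torch.quantile(all_params, fraction)
    for p in model.parameters():
        p.mul_((p.abs() >= threshold).float())

# Usage with a hypothetical model and test_loader:
# before = accuracy(model, test_loader)
# prune_by_magnitude(model, fraction=0.9)
# after = accuracy(model, test_loader)
# print(f"Pruned 90% of parameters; accuracy went from {before:.3f} to {after:.3f}.")
```

Each of these runs produces exactly the kind of sentence listed above: a claim that lets us better predict or describe what the model will do.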