Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS XII: Summary, published by Stephen Casper on February 23, 2023 on The AI Alignment Forum.
Part 12 of 12 in the Engineer’s Interpretability Sequence.
TAISIC = “the AI safety interpretability community”
MI = “mechanistic interpretability”
There might be some addenda later, but for now, this is the final post in The Engineer’s Interpretability Sequence. I hope you have found it interesting and have gotten some useful ideas. I will always be happy to talk to people about the topics from this sequence in the comments or via email. For now, the last thing I will do is offer a summary of key points post by post :)
A Prequel: Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks (Räuker et al., 2022)
A survey of over 300 works on inner interpretability from an AI safety perspective.
All opinions in this sequence, however, are my own and not necessarily those of coauthors or other affiliates.
EIS I: Intro
Lots of interpretability research exists, and the field is still rapidly growing.
Most of it is not very productive, and there is a significant gap between research and practice. Interpretability tools aren't used much by engineers working on real alignment problems.
If one of our main goals for interpretability research is to help us align highly intelligent AI systems in high-stakes settings, we should be working on tools that are more engineering-relevant.
EIS II: What is “Interpretability”?
This post introduced a parable about two researchers trying to solve a problem.
The moral of the story is that we should not privilege difficult or interesting methods over easy and simple ones. It is key not to grade different tools on different curves.
From an engineer’s perspective, the term “interpretability” isn’t that useful.
The tools we call “interpretability” tools are entirely fungible with other techniques for describing, evaluating, and debugging models.
Mechanistic approaches to interpretability are not uniquely important for AI safety. MI tools have the potential to help identify and fix deceptive alignment failures, but...
There are many non-deceptive ways AI could go wrong.
MI is not uniquely useful for fixing deceptive alignment and especially not uniquely useful for fixing non-deceptive alignment failures.
EIS III: Broad Critiques of Interpretability Research
There is a growing consensus that interpretability research is generally not very productive or engineering-relevant.
There is also a growing consensus that better evaluation is needed. A lack of good evaluation methods may be the biggest challenge facing the interpretability research community.
There are three types of evaluation.
Intuition + pontification --> inadequate
Weak/ad-hoc --> still not enough
Based on engineering-relevant tasks --> what is needed
This can be based on one of three things:
Making novel predictions about how a model will handle interesting inputs.
Controlling what a system does by guiding edits to it (see the sketch below for what such an evaluation might look like).
Abandoning a system that does a nontrivial task and replacing it with a simpler reverse-engineered alternative.
Other common limitations of existing work:
Poor scaling
Relying too much on humans in the loop
Failing to study combinations of tools
A lack of practical applications with real-world systems
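To make the second of those task types concrete, here is a minimal sketch, not taken from the sequence itself, of what an edit-based evaluation harness might look like. It assumes a small PyTorch model; the toy model, the flagged_neurons list, and the behavior_score metric are all hypothetical placeholders.

```python
# A minimal sketch (not from the sequence) of an edit-based evaluation,
# assuming a small PyTorch model. The model, the flagged_neurons list, and
# the behavior_score metric are hypothetical placeholders for illustration.
import copy

import torch
import torch.nn as nn

def behavior_score(model: nn.Module, inputs: torch.Tensor) -> float:
    """Placeholder metric for how strongly the model exhibits a target behavior."""
    with torch.no_grad():
        return model(inputs).softmax(dim=-1)[:, 0].mean().item()

def ablate_neurons(layer: nn.Linear, neuron_idxs: list) -> None:
    """Zero out the weights and biases of the given output neurons, in place."""
    with torch.no_grad():
        layer.weight[neuron_idxs] = 0.0
        layer.bias[neuron_idxs] = 0.0

# Hypothetical setup: a toy model and the neurons an interpretability tool flagged.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
inputs = torch.randn(64, 16)
flagged_neurons = [3, 7, 11]  # indices the tool claims drive the behavior

baseline = behavior_score(model, inputs)

# Effect of ablating the flagged neurons...
edited = copy.deepcopy(model)
ablate_neurons(edited[0], flagged_neurons)
flagged_delta = abs(behavior_score(edited, inputs) - baseline)

# ...versus ablating an equally sized random set of neurons as a control.
random_idxs = torch.randperm(32)[: len(flagged_neurons)].tolist()
control = copy.deepcopy(model)
ablate_neurons(control[0], random_idxs)
random_delta = abs(behavior_score(control, inputs) - baseline)

print(f"flagged ablation effect: {flagged_delta:.4f}, random control: {random_delta:.4f}")
```

The design choice worth noting is the random-ablation control: the interpretability tool is graded on whether its suggested edits move the behavior more than chance, not on how compelling its explanations look.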
EIS IV: A Spotlight on Feature Attribution/Saliency
Feature attribution/saliency methods are very common but unlikely to be very important from an engineering perspective (a minimal sketch of one such method follows below).
These methods tend to be poorly evaluated, and when they have been subjected to task-based evaluation, they have generally not fared well.
These methods just aren't equipped to be very useful in a direct way, even when they work. They require scrutinizing samples from some data distribution. So the exact same things that feature attribution/saliency methods are equipped t...
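To make the target of this critique concrete, here is a minimal sketch, again not from the sequence itself, of one of the simplest methods in this family: a vanilla-gradient saliency map. It assumes a PyTorch image classifier; the untrained ResNet and the random input are placeholders.

```python
# A minimal sketch (not from the sequence) of a vanilla-gradient saliency map,
# one of the simplest feature attribution methods, assuming a PyTorch image
# classifier. The untrained ResNet and the random input are placeholders.
import torch
import torchvision.models as models

model = models.resnet18(weights=None)  # hypothetical, untrained classifier
model.eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # placeholder input

logits = model(image)
target_class = logits.argmax(dim=1).item()

# Gradient of the predicted class's logit with respect to the input pixels.
logits[0, target_class].backward()

# Saliency map: per-pixel gradient magnitude, taking the max over color channels.
saliency = image.grad.abs().max(dim=1).values.squeeze(0)  # shape (224, 224)
```

Even when a map like this highlights sensible pixels, the point above stands: it only describes the model's behavior on the particular samples one chooses to inspect.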