Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Against Almost Every Theory of Impact of Interpretability, published by Charbel-Raphaël on August 17, 2023 on LessWrong.
Epistemic Status: I believe I am well-versed in this subject. I erred on the side of making overly strong claims, so that readers can disagree and start a discussion about precise points, rather than hedging every statement against edge cases. I also think that using memes is important, because safety ideas are boring and anti-memetic. So let's go!
Many thanks to @scasper, @Sid Black, @Neel Nanda, @Fabien Roger, @Bogdan Ionut Cirstea, @WCargo, @Alexandre Variengien, @Jonathan Claybrough, @Edoardo Pona, @Andrea_Miotti, Diego Dorn, Angélina Gentaz, Clement Dumas, and Enzo Marsot for useful feedback and discussions.
When I started this post, I began by critiquing Neel Nanda's article A Long List of Theories of Impact for Interpretability, but I later expanded the scope of my critique. Some of the ideas presented here are not endorsed by anyone, but to explain the difficulties I still need to (1) present them and (2) criticize them. This gives the post an adversarial vibe. I'm sorry about that, and I think that doing interpretability research, even if it's no longer what I consider a priority, is still commendable.
How should you read this document? Most of it is not technical, except for the section "What does the end story of interpretability look like?", which can mostly be skipped at first. I expect this document to also be useful for people not doing interpretability research. The sections are mostly independent, and I've added a lot of bookmarks to help modularize this post.
If you have very little time, just read (this is also the part where I'm most confident):
Auditing deception with Interp is out of reach (4 min)
Enumerative safety critique (2 min)
Technical Agendas with better Theories of Impact (1 min)
Here is the list of claims that I will defend:
(bolded sections are the most important ones)
The overall Theory of Impact is quite poor
Interp is not a good predictor of future systems
Auditing deception with interp is out of reach
What does the end story of interpretability look like? That's not clear at all.
Enumerative safety?
Reverse engineering?
Olah's Interpretability dream?
Retargeting the search?
Relaxed adversarial training?
Microscope AI?
Preventive measures against Deception seem much more workable
Steering the world towards transparency
Cognitive Emulations - Explainability By design
Interpretability May Be Overall Harmful
Outside view: The proportion of junior researchers doing Interp rather than other technical work is too high
So far my best ToI for interp: Nerd Sniping?
Even if we completely solve interp, we are still in danger
Technical Agendas with better Theories of Impact
Conclusion
Note: The purpose of this post is to criticize the Theory of Impact (ToI) of interpretability for deep learning models such as GPT-like models, and not the explainability and interpretability of small models.
The emperor has no clothes?
I gave a talk about the different risk models, followed by an interpretability presentation, and then I got an awkward question: "I don't understand, what's the point of doing this?" Hmm.
Feature viz (left image)? Um, it's pretty, but is it useful? Is it reliable?
GradCam (a pixel attribution technique, as in the above-right figure)? It's pretty, but I've never seen anybody use it in industry. Pixel attribution seems useful, but accuracy remains king. (A minimal sketch of what such a method computes follows this list.)
Induction heads? OK, maybe we're on track to reverse-engineer the mechanism of regexes in LLMs. Cool.
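For readers unfamiliar with pixel attribution, here is a minimal Grad-CAM sketch. It is not from the post and is only illustrative: it assumes a pretrained torchvision ResNet-18 and hooks its last convolutional block to produce a heatmap of which image regions drive the predicted class.

```python
# Minimal Grad-CAM sketch (illustrative only; assumes torchvision's pretrained ResNet-18).
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

weights = ResNet18_Weights.DEFAULT
model = resnet18(weights=weights).eval()
preprocess = weights.transforms()  # use this to turn a PIL image into img_tensor

activations, gradients = {}, {}
layer = model.layer4  # last convolutional block

def fwd_hook(module, inp, out):
    activations["value"] = out.detach()

def bwd_hook(module, grad_in, grad_out):
    gradients["value"] = grad_out[0].detach()

layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

def grad_cam(img_tensor, class_idx=None):
    """Return a coarse heatmap of the regions that drive the predicted class."""
    logits = model(img_tensor.unsqueeze(0))
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()
    acts = activations["value"][0]        # (C, H, W) activations of layer4
    grads = gradients["value"][0]         # (C, H, W) gradients w.r.t. those activations
    channel_weights = grads.mean(dim=(1, 2))  # global-average-pool the gradients
    cam = F.relu((channel_weights[:, None, None] * acts).sum(dim=0))
    return cam / (cam.max() + 1e-8)       # normalize to [0, 1]
```

Upsampling the coarse map to the input resolution (e.g. with F.interpolate) and overlaying it on the image gives the familiar heatmaps: pretty, as noted above, but a visualization rather than a guarantee of faithfulness.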
The considerations in the last bullet points are based on gut feeling, not real arguments. Furthermore, most mechanistic interpretability isn't even aimed at being useful right now. But in the rest of the post, we'll find out if...