Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS III: Broad Critiques of Interpretability Research, published by Stephen Casper on February 14, 2023 on The AI Alignment Forum.
Part 3 of 12 in the Engineer’s Interpretability Sequence.
Right now, interpretability is a major subfield in the machine learning research community. As mentioned in EIS I, there is so much work in interpretability that there is now a database of 5199 interpretability papers (Jacovi, 2023). You can also look at a survey from some coauthors and me on over 300 works on interpreting network internals (Räuker et al., 2022).
The key promise of interpretability is to offer open-ended ways of understanding and evaluating models that help us with AI safety. And the diversity of approaches to interpretability is encouraging since we want to build a toolbox full of many different useful techniques. But despite how much interpretability work is out there, the research has not been very good at producing competitive practical tools. Interpretability tools lack widespread use by practitioners in real applications (Doshi-Velez and Kim, 2017; Krishnan, 2019; Räuker et al., 2022).
The root cause of this has much to do with interpretability research not being approached with as much engineering rigor as it ought to be. This has become increasingly well-understood. Here is a short reading list for anyone who wants to see more takes that are critical of interpretability research. This post will engage with each of these more below.
The Mythos of Model Interpretability (Lipton, 2016)
Towards A Rigorous Science of Interpretable Machine Learning (Doshi-Velez and Kim, 2017)
Explanation in Artificial Intelligence: Insights from the Social Sciences (Miller, 2017)
Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead (Rudin, 2018)
Against Interpretability: a Critical Examination of the Interpretability Problem in Machine Learning (Krishnan, 2019)
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks (Räuker et al., 2022)
Benchmarking Interpretability Tools for Deep Neural Networks (Casper et al., 2023)
Note that I’m an author on the final two, so references to those papers are self-references. My perspectives here are my own and should not be assumed to reflect those of my coauthors.
The goal of this post is to give an overview of some broad limitations of interpretability research today. See also EIS V and EIS VI, which discuss some similar themes in the context of AI safety and mechanistic interpretability research.
The central problem: evaluation
The hardest thing about conducting good interpretability research is that it’s not clear whether an interpretation is good or not when there is no ground truth to compare it to. Neural systems are complex, and it’s hard to verify that an interpretation faithfully describes how a network truly functions. So what does it even mean to be meaningfully interpreting a network? There is unfortunately no agreed-upon standard. Motivations and goals of interpretability researchers are notoriously “diverse and discordant” (Lipton, 2018). But here, we will take an engineer’s perspective and consider interpretations to be good to the extent that they are useful.
Evaluation by intuition is inadequate.
Miller (2019) observes that “Most work in explainable artificial intelligence uses only the researchers’ intuition of what constitutes a ‘good’ explanation”. Some papers and posts have even formalized evaluation by intuition. Two examples are Yang et al. (2019) and Kirk et al. (2020), who proposed evaluation frameworks that include a criterion called “persuadability.” This was defined by Yang et al. (2019) as “subjective satisfaction or comprehensibility for the corresponding explanation.”
This is not a very good criterion from an engineer’s perspective...