Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS IV: A Spotlight on Feature Attribution/Saliency, published by Stephen Casper on February 15, 2023 on The AI Alignment Forum.
Part 4 of 12 in the Engineer’s Interpretability Sequence.
Thanks to Tony Wang for a helpful comment.
If you want to become more familiar with feature attribution/saliency, a tutorial on them that may offer useful background is Nielsen et al. (2021).
Given a model and an input for it, the goal of feature attribution/saliency methods is to identify what features in the input are influential for the model’s decision. The literature on these methods is large and active with many hundreds of papers. In fact, in some circles, the word “interpretability” and especially the word “explainability” are more or less synonymous with feature attribution (some examples are discussed below). But despite the size of this literature, the research on these methods has some troubles that are fairly illustrative of broader troubles with interpretability overall. Hence this post. Analogous troubles in AI safety work will be discussed more in the next two posts in the sequence.
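To make the basic idea concrete, here is a minimal sketch of one very simple attribution scheme: perturb each input feature slightly and record how much the model's output moves. This is not any specific method from the literature discussed in this post. The names `finite_difference_saliency` and `toy_model` are made up for illustration, and the "model" is assumed to be any function from a list of numbers to a single number.

```python
def finite_difference_saliency(model, x, eps=1e-4):
    """Score each input feature by how strongly the output changes
    when that feature alone is nudged (central finite difference)."""
    saliency = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        # Approximate |d model / d x_i| at the point x.
        saliency.append(abs(model(up) - model(down)) / (2 * eps))
    return saliency


def toy_model(x):
    # Hypothetical stand-in for a network: feature 1 is ignored entirely.
    return 2.0 * x[0] + 0.0 * x[1] - 1.0 * x[2]


scores = finite_difference_saliency(toy_model, [1.0, 5.0, -3.0])
print(scores)  # one non-negative score per input feature
```

In this toy case the ignored feature gets a score near zero even though its value is the largest of the three, which is the kind of "what actually influenced the decision" signal these methods aim to provide.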
Troubles with evaluation and performance
Some examples of troubles with the evaluation of feature attributions were already touched on in EIS III, which discussed Pan et al. (2021) and Ismail et al. (2021). The claim from Pan et al. (2021) that their method is “obviously better” than alternatives exemplifies how these methods are sometimes simply declared successful after inspection by researchers. And Ismail et al. (2021) demonstrates a form of weak evaluation with a measure that may be quantitative but is not of direct interest to an engineer.
In response to this literature, several works have emerged to highlight difficulties with feature attribution/saliency methods. Here is a short reading list :)
A Benchmark for Interpretability Methods in Deep Neural Networks (Hooker et al., 2018)
Sanity Checks for Saliency Maps (Adebayo et al., 2018)
Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? (Hase and Bansal, 2020)
Debugging Tests for Model Explanations (Adebayo et al., 2020)
Auditing Visualizations: Transparency Methods Struggle to Detect Anomalous Behavior (Denain and Steinhardt, 2022)
Towards Benchmarking Explainable Artificial Intelligence Methods (Holmberg, 2022)
Benchmarking Interpretability Tools for Deep Neural Networks (Casper et al., 2023)
When they are evaluated, these tools often aren’t very useful and do not pass simple sanity checks. Consider an illustration of this problem:
From Adebayo et al. (2018)
These visualizations suggest that some of these tools do not reliably highlight features that seem important in images at all, and the ones that do highlight them often do not appear to be obviously better than an edge detector. This sanity check suggests limitations with how well these methods can reveal anything novel to humans at all, let alone how useful they can be in tasks of practical interest.
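One way to see why resembling an edge detector is a bad sign: an edge detector depends only on the input, not on the model, so its output cannot be telling us anything about what the model learned. The sketch below is a toy version of a randomize-the-model check in that spirit. It is a simplified illustration under stated assumptions, not the procedure from Adebayo et al. (2018): the model is assumed to be linear, similarity is measured with plain Pearson correlation, and all function names are hypothetical.

```python
import random


def gradient_attribution(weights, x):
    # For a linear model f(x) = sum(w_i * x_i), the gradient w.r.t. x is w.
    return [abs(w) for w in weights]


def edge_like_attribution(weights, x):
    # Ignores the model: absolute difference between neighbouring inputs.
    return [abs(x[i] - x[i - 1]) if i > 0 else 0.0 for i in range(len(x))]


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # correlation undefined for a constant map
    return cov / (var_a * var_b) ** 0.5


def randomization_check(attribution_fn, weights, x, seed=0):
    """Correlation between the map for the given weights and the map for
    randomly re-drawn weights. A value near 1 means the map barely
    depends on the model at all."""
    rng = random.Random(seed)
    randomized = [rng.gauss(0.0, 1.0) for _ in weights]
    return pearson(attribution_fn(weights, x), attribution_fn(randomized, x))


weights = [3.0, 0.0, -1.0, 0.5, 0.0, 2.0]
x = [0.0, 0.0, 1.0, 1.0, 0.0, 0.5]
print(randomization_check(edge_like_attribution, weights, x))
print(randomization_check(gradient_attribution, weights, x))
```

The edge-like map is unchanged when the weights are replaced with noise, so its correlation with itself is 1, while the gradient-based map changes with the weights.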
For the papers that have gone further and studied whether these methods can help predict how the network will respond to certain inputs, it seems that some attribution/saliency methods usually fail while others only occasionally succeed (Hase and Bansal, 2020; Adebayo et al., 2020; Denain and Steinhardt, 2022).
EIS III discussed how, in a newly arXived work, coauthors and I benchmarked feature synthesis tools (Casper et al., 2023). In addition, we used a related approach to evaluate how helpful feature attribution/saliency methods can be for pointing out spurious features that the network has learned. This evaluation was based on seeing how well a method can attribute a trojaned network’s decision to the trojan trigger in an image.
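As a rough illustration of what "attributing a decision to the trojan trigger" could mean numerically, the sketch below computes the fraction of total absolute attribution that falls inside a known trigger region. This scoring rule is an assumption chosen for simplicity, not necessarily the metric used in Casper et al. (2023), and the function name and toy arrays are hypothetical.

```python
def trigger_attribution_fraction(attribution, trigger_mask):
    """Fraction of absolute attribution mass inside the trigger region.

    attribution:  2D list of per-pixel attribution scores.
    trigger_mask: 2D list of the same shape, truthy where the trigger is.
    """
    total = 0.0
    inside = 0.0
    for attr_row, mask_row in zip(attribution, trigger_mask):
        for score, in_trigger in zip(attr_row, mask_row):
            total += abs(score)
            if in_trigger:
                inside += abs(score)
    if total == 0:
        return 0.0  # an all-zero map attributes nothing to the trigger
    return inside / total


# Toy 3x3 example: the trigger occupies the bottom-right pixel.
attribution = [
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 3.0],
]
trigger_mask = [
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 1],
]
print(trigger_attribution_fraction(attribution, trigger_mask))  # 0.75
```

Under this assumed score, a method that spreads attribution uniformly over the image would score roughly the trigger's share of the image area, which gives a natural baseline to compare against.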
From Casper et al. (2023)
Shown at the top of the figure above are examples of trojaned ima...