Announcing the CNN Interpretability Competition, published by Stephen Casper on September 26, 2023 on The AI Alignment Forum.
TL;DR
I am excited to announce the CNN Interpretability Competition, which is part of the competition track of SATML 2024.
Dates: Sept 22, 2023 - Mar 22, 2024
Competition website:
Total prize pool: $8,000
NeurIPS 2023 Paper: Red Teaming Deep Neural Networks with Feature Synthesis Tools
Github:
For additional reference: Practical Diagnostic Tools for Deep Neural Networks
Correspondence to:
[email protected]
Intro and Motivation
Interpretability research is popular, and interpretability tools play a role in almost every agenda for making AI safe. However, there are gaps between the research and its engineering applications. If one of our main goals for interpretability research is to help us align highly intelligent AI systems in high-stakes settings, we need more tools that help us solve practical problems.
One of the unique advantages of interpretability tools is that, unlike test sets, they can sometimes allow humans to characterize how networks may behave on novel examples. For example, Carter et al. (2019), Mu and Andreas (2020), Hernandez et al. (2021), Casper et al. (2022a), and Casper et al. (2023) have all used different interpretability tools to identify novel combinations of features that serve as adversarial attacks against deep neural networks.
Interpretability tools are promising for exercising better oversight, but human understanding is hard to measure, and it has been difficult to make clear progress toward more practically useful tools. Here, we work to address this by introducing the CNN Interpretability Competition (accepted to SATML 2024).
The key to the competition is to develop interpretations of the model that help human crowdworkers discover trojans: vulnerabilities deliberately implanted into a network so that a specific trigger feature causes it to produce unexpected output. We also offer an open-ended challenge for participants to discover the triggers for secret trojans by any means necessary.
The motivation for this trojan-discovery competition is that trojans are bugs caused by novel trigger features, so they usually can't be identified by analyzing model performance on a readily available dataset. This makes finding them a challenging debugging task that mirrors the practical challenge of finding unknown bugs in models. However, unlike naturally occurring bugs in neural networks, the trojan triggers are known to us, so it is possible to tell whether an interpretation is causally correct. In the real world, not all bugs in neural networks are likely to be trojan-like, but benchmarking interpretability tools using trojans offers a basic sanity check.
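To make the target of interpretation concrete, here is a minimal sketch of how a patch-triggered trojan behaves at inference time. The model, patch, and images below are placeholders (a randomly initialized ResNet-50 and random tensors), not the benchmark's actual trojaned networks.

```python
# Minimal sketch (assumed setup, not competition code) of a patch trojan at
# inference time: pasting a secret trigger patch flips the model's prediction.
import torch
import torchvision.models as models

def apply_patch_trigger(images: torch.Tensor, patch: torch.Tensor,
                        top: int = 0, left: int = 0) -> torch.Tensor:
    """Paste a small trigger patch onto a batch of NCHW images."""
    triggered = images.clone()
    _, _, ph, pw = patch.shape
    triggered[:, :, top:top + ph, left:left + pw] = patch
    return triggered

# Hypothetical stand-ins: in the competition, trojaned models are provided
# and the trigger patch is the secret to be discovered.
trojaned_model = models.resnet50().eval()
trigger_patch = torch.rand(1, 3, 32, 32)
clean_batch = torch.rand(4, 3, 224, 224)

with torch.no_grad():
    clean_preds = trojaned_model(clean_batch).argmax(dim=1)
    triggered_preds = trojaned_model(
        apply_patch_trigger(clean_batch, trigger_patch)).argmax(dim=1)

# On a genuinely trojaned model, `triggered_preds` collapse to the attacker's
# target class while `clean_preds` remain (mostly) correct.
```

Style- and natural-feature trojans work the same way conceptually; the trigger is just a global style or an ordinary object rather than a pasted patch.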
The Benchmark
This competition follows new work from Casper et al. (2023) (to appear at NeurIPS 2023), in which we introduced a benchmark for interpretability tools based on helping human crowdworkers discover trojans with interpretable triggers. We used 12 trojans of three different types: ones triggered by patches, styles, and naturally occurring features.
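For reference, patch trojans like these are typically implanted by data poisoning: paste the trigger onto a small fraction of training images, relabel them to a target class, and fine-tune the network as usual. The sketch below illustrates that general recipe; the class name, poisoning rate, and corner placement are illustrative assumptions, not the benchmark's exact training pipeline.

```python
# Illustrative data-poisoning wrapper for implanting a patch trojan.
# `PoisonedDataset`, the 1% poison rate, and the corner placement are
# hypothetical choices, not the benchmark's actual pipeline.
import random
import torch
from torch.utils.data import Dataset

class PoisonedDataset(Dataset):
    def __init__(self, clean_dataset, trigger_patch, target_class, poison_rate=0.01):
        self.clean = clean_dataset          # yields (CHW image tensor, int label) pairs
        self.patch = trigger_patch          # e.g., a 3x32x32 tensor in [0, 1]
        self.target = target_class
        self.poisoned_idx = set(random.sample(
            range(len(clean_dataset)), int(poison_rate * len(clean_dataset))))

    def __len__(self):
        return len(self.clean)

    def __getitem__(self, i):
        image, label = self.clean[i]
        if i in self.poisoned_idx:
            _, ph, pw = self.patch.shape
            image = image.clone()
            image[:, :ph, :pw] = self.patch  # paste the trigger in a corner
            label = self.target              # relabel to the trojan target class
        return image, label

# Fine-tuning a CNN on this dataset with an ordinary training loop yields a
# model that behaves normally on clean inputs but predicts `target_class`
# whenever the trigger is present.
```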
We then evaluated 9 methods meant to help detect trojan triggers: TABOR (Guo et al., 2019), four variants of feature visualization (Olah et al., 2017; Mordvintsev et al., 2018), adversarial patches (Brown et al., 2017), two variants of robust feature-level adversaries (Casper et al., 2022a), and SNAFUE (Casper et al., 2022b). We tested each method by how much it helped crowdworkers identify trojan triggers in multiple-choice questions. Overall, this work found some successes: adversarial patches, robust feature-level adversaries, and SNAFUE were relatively successful at helping humans discover trojan triggers.
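To give a flavor of how these feature-synthesis tools work, below is a minimal sketch of the adversarial-patch idea (Brown et al., 2017) used as a trigger-reconstruction heuristic: optimize a small patch so the model predicts a chosen target class whenever the patch is pasted onto images, then show the result to a human. The model, images, target label, and hyperparameters are placeholders, and this is not the exact procedure evaluated in the paper.

```python
# Minimal adversarial-patch optimization sketch (assumed setup, not the
# paper's exact pipeline). All concrete values below are placeholders.
import torch
import torchvision.models as models

model = models.resnet50().eval()              # stand-in for a benchmark CNN
for p in model.parameters():
    p.requires_grad_(False)

target_class = 123                            # hypothetical trojan target label
patch = torch.rand(1, 3, 32, 32, requires_grad=True)
optimizer = torch.optim.Adam([patch], lr=0.05)

for step in range(200):
    images = torch.rand(8, 3, 224, 224)       # placeholder for natural images
    patched = images.clone()
    patched[:, :, :32, :32] = patch.clamp(0, 1)   # paste the patch in a corner
    logits = model(patched)
    loss = torch.nn.functional.cross_entropy(
        logits, torch.full((images.size(0),), target_class, dtype=torch.long))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# If the model is trojaned toward `target_class`, the optimized patch often
# contains features resembling the secret trigger, which a human evaluator
# can then inspect.
```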
However, even the best-perfo...