The Nonlinear Library

AF - EIS IX: Interpretability and Adversaries by Stephen Casper



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS IX: Interpretability and Adversaries, published by Stephen Casper on February 20, 2023 on The AI Alignment Forum.
Part 9 of 12 in the Engineer’s Interpretability Sequence.
Thanks to Nikolaos Tsilivis for helpful discussions.
The studies of interpretability and adversaries are inseparable.
There are several key connections between the two. Some works will be cited below, but please refer to page 9 of the Toward Transparent AI survey (Räuker et al., 2022) for full citations. There are too many to list here without cluttering the post.
1. More interpretable networks are more adversarially robust, and more adversarially robust networks are more interpretable.
The main vein of evidence on this topic comes from a set of papers showing that regularizing feature attribution/saliency maps to highlight specific input features more clearly also makes networks more robust to adversaries. There is also some work showing the reverse: adversarially robust networks tend to have more lucid attributions. Relatedly, some work shows that networks which emulate certain properties of the human visual system are more robust to adversaries and distribution shifts (e.g., Ying et al. (2022)).
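To make this concrete, here is a minimal sketch (in PyTorch, with a hypothetical model and placeholder hyperparameters, not taken from any of the cited papers) of the kind of regularizer this literature studies: a penalty on the norm of the input gradient, which encourages cleaner saliency maps and, per the works above, tends to improve robustness as a side effect.

import torch
import torch.nn.functional as F

def gradient_regularized_loss(model, x, y, penalty_weight=0.1):
    # Cross-entropy plus a penalty on the norm of the input gradient.
    # Penalizing d(loss)/d(input) encourages sparser, less noisy attributions,
    # which the literature above links to adversarial robustness.
    x = x.clone().requires_grad_(True)
    logits = model(x)
    task_loss = F.cross_entropy(logits, y)

    # Gradient of the task loss w.r.t. the input (the raw saliency map).
    input_grad, = torch.autograd.grad(task_loss, x, create_graph=True)
    saliency_penalty = input_grad.pow(2).sum(dim=tuple(range(1, x.dim()))).mean()

    return task_loss + penalty_weight * saliency_penalty

# Usage with a hypothetical model and data batch:
# loss = gradient_regularized_loss(model, images, labels)
# loss.backward(); optimizer.step()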
Adversarial training is a good way of making networks more internally interpretable. One particularly notable work is Engstrom et al. (2019), who found that adversarial training made it strikingly easier to produce human-describable visualizations of internal network properties. Although they stopped short of applying this work to an engineering task, the paper seems to make a strong case for how adversarial training can improve interpretations. Adversarially trained networks also produce better representations for transfer learning, image generation, and modeling the human visual system.
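For reference, here is a minimal sketch of standard PGD-based adversarial training, the kind of procedure used to produce the robust models studied in work like Engstrom et al. (2019). The epsilon, step size, and iteration count are illustrative placeholders, not values from any paper.

import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step=2/255, iters=10):
    # Projected gradient descent within an L-infinity ball around x.
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(iters):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + step * grad.sign()        # ascend the loss
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # project back into the ball
        x_adv = x_adv.clamp(0.0, 1.0)             # stay a valid image
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    # One training step on adversarial rather than clean examples.
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()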
Finally, some works have found that lateral inhibition and second-order optimization improve both interpretability and robustness.
2. Interpretability tools can and should be used to guide the design of adversaries.
This is one of the three types of rigorous evaluation methods for interpretability tools discussed in EIS III. Showing that an interpretability tool helps us understand a network well enough to exploit it is good evidence that it can be useful.
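As a toy illustration of this idea (not a method from any particular paper), an attribution map can be used to decide where to attack: compute a gradient saliency map, keep only the small fraction of pixels it flags as most important, and run gradient ascent on the loss restricted to those pixels. If the attack succeeds while touching only the pixels the tool highlighted, that is some evidence the tool tracks what the network actually uses. The mask fraction and step size below are illustrative.

import torch
import torch.nn.functional as F

def saliency_guided_attack(model, x, y, step_size=0.05, steps=20, top_frac=0.05):
    # 1) Interpretability step: rank pixels by the magnitude of the input gradient.
    x_req = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_req), y)
    grad, = torch.autograd.grad(loss, x_req)
    saliency = grad.abs().sum(dim=1, keepdim=True)          # aggregate over channels
    k = max(1, int(top_frac * saliency[0].numel()))
    thresh = saliency.flatten(1).topk(k, dim=1).values[:, -1]
    mask = (saliency >= thresh.view(-1, 1, 1, 1)).float()   # 1 on the top pixels only

    # 2) Attack step: gradient ascent on the loss, restricted to the masked pixels.
    x_adv = x.clone()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = (x_adv + step_size * grad.sign() * mask).clamp(0.0, 1.0)
    return x_adv.detach()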
3. Adversarial examples can be useful interpretability tools.
Adversaries always reveal information about a network, even if the feature that fools it is hard to describe in words. However, a good amount of recent literature has revealed that studying interpretable adversaries can lead to useful, actionable insights. In some previous work (Casper et al., 2021), some coauthors and I argue for using “robust feature-level adversaries” as a way to produce attacks that are human-describable and likely to lead to a generalizable understanding. Casper et al. (2023) tests methods like this more rigorously.
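The sketch below illustrates only the “robust” part of this idea in a heavily simplified, patch-based form: optimizing a perturbation under random placement so that it keeps working instead of exploiting a single brittle pixel pattern. It is not the generator-based pipeline from Casper et al. (2021, 2023); the patch size, transformations, and hyperparameters are placeholders.

import torch
import torch.nn.functional as F

def train_robust_patch(model, loader, target_class, patch_size=48, lr=0.05, device="cpu"):
    # Only the patch is passed to the optimizer, so the model's weights are not updated.
    patch = torch.rand(1, 3, patch_size, patch_size, device=device, requires_grad=True)
    opt = torch.optim.Adam([patch], lr=lr)
    for x, _ in loader:  # one pass over the data is enough for a sketch
        x = x.to(device)
        b, _, h, w = x.shape
        # Random placement is a crude stand-in for the transformations that make
        # the adversary "robust" rather than a brittle pixel-level exploit.
        top = torch.randint(0, h - patch_size + 1, (1,)).item()
        left = torch.randint(0, w - patch_size + 1, (1,)).item()
        x_patched = x.clone()
        x_patched[:, :, top:top + patch_size, left:left + patch_size] = patch.clamp(0, 1)
        # Push the classifier toward the attacker's chosen target class.
        targets = torch.full((b,), target_class, dtype=torch.long, device=device)
        loss = F.cross_entropy(model(x_patched), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return patch.detach().clamp(0, 1)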
4. Mechanistic interpretability and mechanistic adversarial examples are uniquely equipped for addressing deception and other insidious misalignment failures.
Hubinger (2020) discussed 11 proposals for building safe advanced AI, and all 11 explicitly call for the use of interpretability tools or (relaxed) adversarial training for inner alignment. This isn’t a coincidence: these offer the only types of approaches that can be useful for fixing insidiously misaligned models. Recall from the previous post that an engineer might understand insidious misalignment failures as ones in which the inputs that will make a model exhibit misaligned behavior are hard to find during training, but there exists substantial neural circuitry dedicated to the misaligned behavior. Given this, it’s clear that working to understand and debug inner mechanisms is the key to making progress on insidious misalignment.
Are adversaries fea...