Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS XI: Moving Forward, published by Stephen Casper on February 22, 2023 on The AI Alignment Forum.
Part 11 of 12 in the Engineer’s Interpretability Sequence.
So far, this sequence has discussed a number of topics in interpretability research, all building toward this post. Its goal is to explain some approaches that may be valuable moving forward. I plan to work on some of the ideas here soon. Others I may not work on soon, but I would love to discuss and support such work if I am able. I hope that this post can offer some useful ideas for people entering or continuing with interpretability research, and if you would like to discuss anything here further, feel more than free to email me at
[email protected].
What are we working toward?
First, it seems useful to highlight two points that are uncontroversial in the AI safety community but important nonetheless.
Our goal is a toolbox – not a silver bullet.
As AI safety engineers, we should neither expect nor try to find a single ‘best’ approach to interpretability that will solve all of our problems. There are many different types of interpretability tools, and many of the differences between them amount to enforcing different priors over the explanations they generate. So it is easy to see that there is no free lunch: no single tool will be best for every problem. There is no silver bullet for interpretability, and few tools conflict with each other anyway. Hence, our goal is a toolbox.
In fact, some coauthors and I recently found an excellent example of how using multiple interpretability tools at once beats using individual ones (Casper et al., 2023).
This doesn’t mean, however, that we should celebrate just any new interpretability tool. Working in unproductive directions is costly, and applying tool after tool to a problem contributes substantially to the alignment tax. The best tools to fill our toolbox will be ones that are automatable, cheap to use, and have demonstrated capabilities on engineering-relevant tasks.
Don’t advance capabilities.
As AI safety engineers, we do not want to advance capabilities, because doing so speeds up timelines. In turn, faster timelines mean less time for safety research, less time for regulators to react, and a greater likelihood of immense power being concentrated in the hands of very few. Avoiding faster timelines isn’t as simple as just not working on capabilities, though. Many techniques have potential uses for both safety and capabilities. So instead of judging our work based on how much it improves safety, we need to judge it based on how much it improves safety relative to capabilities. This is an especially important tradeoff for engineers to keep in mind. A good example was discussed by Hendrycks and Woodside (2022), who observed that there is a positive correlation between a network’s anomaly detection capabilities and its task performance.
Some work may improve safety-relevant capabilities, but if it does so merely by continuing along existing trendlines, we don’t get more safety than the counterfactual. For the full discussion of this point, see Hendrycks and Woodside (2022).
What types of existing tools/research seem promising?
Before discussing what topics may be important to work on in the future, it may be valuable to reflect on examples of past work that have introduced interpretability tools capable of competitively providing engineering-relevant insights. Here is a personal list that is somewhat arbitrary and undoubtedly incomplete, but hopefully still valuable. Consider this an engineer’s interpretability reading list of sorts.
Some works have competitively done engineering-relevant things using methods that make novel predictions about how a network will handle OOD inputs.
ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and ...