The Nonlinear Library

AF - Should we publish mechanistic interpretability research? by Marius Hobbhahn

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Should we publish mechanistic interpretability research?, published by Marius Hobbhahn on April 21, 2023 on The AI Alignment Forum.
TL;DR: Multiple people have raised concerns that current mechanistic interpretability research has capabilities externalities. We discuss to what extent, and which kinds of, mechanistic interpretability research we should publish. The core question we want to explore with this post is thus to what extent the statement “findings in mechanistic interpretability can increase capabilities faster than alignment” is true and should be a consideration. For example, foundational findings in mechanistic interpretability may lead to a better understanding of neural networks, which often straightforwardly generates new hypotheses for advancing capabilities.
We argue that there is no general answer: the publishing decision primarily depends on how much the work advances alignment relative to how much it can be used to advance capabilities. We recommend a differential publishing process in which work with high capabilities potential is initially circulated only among a small number of trusted people and organizations, while work with low capabilities potential is published widely.
Related work: A note about differential technological development, Current themes in mechanistic interpretability research, Thoughts on AGI organization and capabilities work, Dan Hendrycks’s take, etc.
We have talked to lots of people about this question and have lost track of whom to thank individually. In case you talked to either Marius or Lawrence about this, thank you!
The basic case for publishing
Let’s revisit the basic case for publishing.
Alignment is probably hard. Getting even close to a solution likely requires many people working together, coordinating their work, and patching together different approaches. If the work isn’t public, this becomes much harder, and a solution to alignment becomes less likely.
More people can engage with the work and build on it. In particular, people who want to get into the field or are less well connected may not be able to access documents that are only shared in small circles.
It gives more legitimacy to the work of organisations and individuals. For people who are not yet established in the field, publishing their work is the most obvious way to get noticed by an organisation. For academics, publications are the most relevant resource for their careers, and organisations can generate more legitimacy by publishing their work (e.g. with grantmakers or other organisations they want to interact with).
It is a form of movement building. If work on mechanistic interpretability is regularly presented at ML conferences and is available on arXiv, it is more likely that people outside the alignment field will notice that the field exists and get interested in it.
Publication leads to accountability and feedback. If you know you will publish something, you put more effort into explaining it well and ensuring that your findings are robust. Furthermore, it gives other researchers an opportunity to engage with your work and offer feedback on improvements or future research directions.
In addition, mechanistic interpretability seems especially well suited for publication in classic academic venues, since it is less speculative than other AI safety work and overlaps with established academic fields.
Thus, publication seems robustly positive as long as it doesn’t advance capabilities more than alignment (which is often hard to predict in advance). The crux of this post therefore lies mainly in the possible negative externalities of publication and how they trade off against the alignment benefits.
Capabilities externalities of mechanistic interpretability
The primary reason to think that mechanistic interpretability has large capabilities externalities is that understanding a sys...