The Nonlinear Library: Alignment Forum

AF - Barriers to Mechanistic Interpretability for AGI Safety by Connor Leahy


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Barriers to Mechanistic Interpretability for AGI Safety, published by Connor Leahy on August 29, 2023 on The AI Alignment Forum.
I gave a talk at MIT in March earlier this year on barriers to mechanistic interpretability being helpful to AGI/ASI safety, and why by default it will likely be net dangerous. Several people seem to be coming to similar conclusions recently (e.g., this recent post).
I discuss two major points (by no means exhaustive), one technical and one political, that present barriers to MI addressing AGI risk:
AGI cognition is interactive. AGI systems interact with their environment, learn online, and will externalize massive parts of their cognition into the environment. If you want to reason about such a system, you also need a model of the environment. Worse still, AGI cognition is reflective, so you will also need a model of cognition/learning itself.
(Most) MI will lead to capabilities, not oversight. Institutions are not set up, and do not have the incentives, to resist using capability gains or to submit to monitoring and control.
This being said, there is more nuance to this opinion, and a lot of it is downstream of the lack of coordination and the downsides of publishing in an adversarial environment like the one we are in right now. I still endorse the work done by e.g. Chris Olah's team as brilliant but extremely early scientific work with many steep epistemological hurdles to overcome. Unfortunately, I also believe that, on net, work such as Olah's is at the moment more useful as a safety-washing tool for AGI labs like Anthropic than as a real dent in existential risk concerns.
Here are the slides from my talk, and you can find the video here.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.