Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Formal verification, heuristic explanations and surprise accounting, published by Jacob Hilton on June 25, 2024 on The AI Alignment Forum.
ARC's current research focus can be thought of as trying to combine mechanistic interpretability and formal verification. If we had a deep understanding of what was going on inside a neural network, we would hope to be able to use that understanding to verify that the network was not going to behave dangerously in unforeseen situations. ARC is attempting to perform this kind of verification, but using a mathematical kind of "explanation" instead of one written in natural language.
To help elucidate this connection, ARC has been supporting work on Compact Proofs of Model Performance via Mechanistic Interpretability by Jason Gross, Rajashree Agrawal, Lawrence Chan and others, which we were excited to see released along with this post. While we ultimately think that provable guarantees for large neural networks are unworkable as a long-term goal, we think that this work serves as a useful springboard towards alternatives.
In this post, we will:
Summarize ARC's takeaways from this work and the problems we see with provable guarantees
Explain ARC's notion of a heuristic explanation and how it is intended to overcome these problems
Describe with the help of a worked example how the quality of a heuristic explanation can be quantified, using a process we have been calling surprise accounting
We are also sharing a draft by Gabriel Wu (currently visiting ARC) describing a heuristic explanation for the same model that appears in the above paper:
max_of_k Heuristic Estimator
Thanks to Stephanie He for help with the diagrams in this post. Thanks to Eric Neyman, Erik Jenner, Gabriel Wu, Holly Mandel, Jason Gross, Mark Xu, and Mike Winer for comments.
Formal verification for neural networks
In Compact Proofs of Model Performance via Mechanistic Interpretability, the authors train a small transformer on an algorithmic task to high accuracy, and then construct several different formal proofs of lower bounds on the network's accuracy. Without delving into the details, the most interesting takeaway from ARC's perspective is the following picture, which plots the tightness of the proved accuracy bound against the length of the proof:
In the top right of the plot is the brute-force proof, which simply checks every possible input to the network. This gives the tightest possible bound, but is very long. Meanwhile, in the bottom left is the trivial proof, which simply states that the network is at least 0% accurate. This is very short, but gives the loosest possible bound.
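To make the brute-force end of the spectrum concrete, here is a minimal sketch (ours, not the paper's code) of what "checking every possible input" amounts to for a small model trained on an algorithmic task like max-of-k. The names model, vocab_size and seq_len are hypothetical stand-ins, and we assume the model maps a batch of token sequences to logits over the vocabulary.

```python
# Minimal sketch of a "brute-force proof" of model accuracy:
# enumerate every possible input and check the model's prediction.
# Assumptions: `model(x)` returns logits of shape (batch, vocab_size),
# and the correct answer for a sequence is its maximum token (max-of-k task).
import itertools

import torch


def brute_force_accuracy(model, vocab_size: int, seq_len: int) -> float:
    correct = 0
    total = vocab_size ** seq_len
    with torch.no_grad():
        for tokens in itertools.product(range(vocab_size), repeat=seq_len):
            x = torch.tensor([tokens])                 # batch of one sequence
            prediction = model(x).argmax(dim=-1).item()
            correct += int(prediction == max(tokens))
    return correct / total  # exact accuracy, hence the tightest possible bound
```

The work scales as vocab_size ** seq_len, which is why this proof sits at the far end of the proof-length axis even though the bound it yields is exact.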
In between these two extremes, along the orange Pareto frontier, there are proofs that exploit more structure in the network, leading to tighter bounds for a given proof length, or put another way, shorter proofs for a given bound tightness.
It is exciting to see a clear demonstration that shorter proofs better explain why the neural network has high accuracy, paralleling a common mathematical intuition that shorter proofs offer more insight. One might therefore hope that if we understood the internals of a neural network well enough, then we would be able to provide provable guarantees for very complex behaviors, even when brute-force approaches are infeasible.
However, we think that such a hope is not realistic for large neural networks, because the notion of proof is too strict.
The basic problem with provable guarantees is that they must account for every possible way in which different parts of the network interact with one another, even when those interactions are incidental to the network's behavior. These interactions manifest as error terms, which the proof must provide a worst-case bound for, leading to a looser bound overall. The above picture provides a good demonstration of this: moving towards the left of the plot, the best bound gets looser and looser.
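As a stylized illustration (ours, not taken from the paper): suppose the quantity being bounded decomposes into a main term plus n small cross-terms, each of size at most ε. A proof that knows nothing more about the cross-terms can only apply the triangle inequality, whereas if the terms really are incidental, behaving like independent, mean-zero noise, their typical combined size is much smaller:

```latex
% Worst-case bound a proof must use vs. typical size of n incidental
% cross-terms \epsilon_1, \dots, \epsilon_n with |\epsilon_i| \le \varepsilon.
\[
  \Bigl|\sum_{i=1}^{n} \epsilon_i\Bigr|
    \;\le\; \sum_{i=1}^{n} |\epsilon_i|
    \;\le\; n\,\varepsilon
  \qquad \text{(worst case, what a proof must assume)}
\]
\[
  \Bigl|\sum_{i=1}^{n} \epsilon_i\Bigr|
    \;\approx\; \sqrt{n}\,\varepsilon
  \qquad \text{(typical size if the terms are independent and mean zero)}
\]
```

The gap of roughly a factor of √n between these two quantities is one way a proof's bound becomes looser as it is forced to account for interactions it cannot rule out.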
Mo...