The Nonlinear Library

LW - Towards Developmental Interpretability by Jesse Hoogland



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Towards Developmental Interpretability, published by Jesse Hoogland on July 12, 2023 on LessWrong.
Developmental interpretability is a research agenda that has grown out of a meeting of the Singular Learning Theory (SLT) and AI alignment communities. To mark the completion of the first SLT & AI alignment summit, we have prepared this document as an outline of the key ideas.
As the name suggests, developmental interpretability (or "devinterp") is inspired by recent progress in the field of mechanistic interpretability, specifically work on phase transitions in neural networks and their relation to internal structure. Our two main motivating examples are the work by Olsson et al. on In-context Learning and Induction Heads and the work by Elhage et al. on Toy Models of Superposition.
Mechanistic interpretability emphasizes features and circuits as the fundamental units of analysis and usually aims at understanding a fully trained neural network. In contrast, developmental interpretability:
is organized around phases and phase transitions as defined mathematically in SLT, and
aims at an incremental understanding of the development of internal structure in neural networks, one phase transition at a time.
The hope is that an understanding of phase transitions, integrated over the course of training, will provide a new way of looking at the computational and logical structure of the final trained network. We term this developmental interpretability because of the parallel with developmental biology, which aims to understand the final state of a different class of complex self-assembling systems (living organisms) by analyzing the key steps in development from an embryonic state.
In the rest of this post, we explain why we focus on phase transitions, the relevance of SLT, and how we see developmental interpretability contributing to AI alignment.
Thank you to @DanielFilan, @bilalchughtai, and @Liam Carroll for reviewing early drafts of this document.
Why phase transitions?
First of all, they exist: there is a growing understanding that many kinds of phase transitions occur in deep learning. For developmental interpretability, the most important kind is the phase transition that occurs during training. Some of the examples we are most excited about:
Olsson et al., "In-context Learning and Induction Heads", Transformer Circuits Thread, 2022.
Elhage et al., "Toy Models of Superposition", Transformer Circuits Thread, 2022.
McGrath et al., "Acquisition of chess knowledge in AlphaZero", PNAS, 2022.
Michaud et al., "The Quantization Model of Neural Scaling", 2023.
Simon et al., "On the Stepwise Nature of Self-Supervised Learning", ICML 2023.
The literature on other kinds of phase transitions, such as those appearing as the scale of the model is increased, is even broader. Neel Nanda has conjectured that "phase changes are everywhere."
Second, they are easy to find: from the point of view of statistical physics, two of the hallmarks of a (second-order) phase transition are the divergence of macroscopically observable quantities and the emergence of large-scale order. Divergences make phase transitions easy to spot, and the emergence of large-scale order (e.g., circuits) is what makes them interesting. There are several natural observables in SLT (the learning coefficient or real log canonical threshold, and singular fluctuation) which can be used to detect phase transitions, but we don't yet know how to invent finer observables of this kind, nor do we understand the mathematical nature of the emergent order.
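To make the "divergences make phase transitions easy to spot" intuition concrete, here is a toy sketch of how one might flag candidate transitions in a training run. This is not an SLT estimator of the learning coefficient or singular fluctuation; it is only a hypothetical illustration that scans a macroscopic observable (here, the training loss) for outlier jumps in its smoothed derivative:

```python
import numpy as np

def detect_transitions(losses, window=5, z_thresh=3.0):
    """Flag candidate phase transitions in a training-loss curve.

    Toy heuristic (not an SLT observable): a transition should show
    up as a sharp change in a macroscopic quantity, so we look for
    outlier jumps in the smoothed first difference of the loss.
    """
    losses = np.asarray(losses, dtype=float)
    # Smooth the curve with a moving average, then differentiate.
    kernel = np.ones(window) / window
    smooth = np.convolve(losses, kernel, mode="valid")
    diffs = np.diff(smooth)
    # Flag steps whose derivative is an outlier relative to the rest.
    mu, sigma = diffs.mean(), diffs.std()
    return [i for i, d in enumerate(diffs) if abs(d - mu) > z_thresh * sigma]

# A synthetic loss curve with one sudden drop (a "transition") near step 50.
curve = np.concatenate([np.full(50, 1.0), np.full(50, 0.2)])
steps = detect_transitions(curve)
```

In practice the devinterp proposal is to track principled observables from SLT rather than raw loss, since the loss alone can change smoothly even while internal structure reorganizes; the sketch only illustrates the general "watch an observable for divergence" recipe.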
Third, they are good candidates for universality: every mouse is unique, but its internal organs fit together in the same way and have the same function - that's why biology is even possible as a field of science. Similarly, as an emerging field of science, interpretabi...
The Nonlinear Library, by The Nonlinear Fund