The Nonlinear Library

LW - Jesse Hoogland on Developmental Interpretability and Singular Learning Theory by Michaël Trazzi



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Jesse Hoogland on Developmental Interpretability and Singular Learning Theory, published by Michaël Trazzi on July 6, 2023 on LessWrong.
Jesse Hoogland is a research assistant at David Krueger's lab in Cambridge studying AI Safety. He has recently been publishing on LessWrong about how to apply Singular Learning Theory to Alignment, and he organized a workshop on the topic in Berkeley last week.
I thought it made sense to interview him to get a high-level overview of Singular Learning Theory (and of more general approaches like developmental interpretability).
Below are some highlighted quotes from our conversation (available on YouTube, Spotify, Google Podcasts, Apple Podcasts). For the full context of each quote, see the accompanying transcript.
Interpreting Neural Networks: The Phase Transition View
Studying Phase Transitions Could Help Detect Deception
"We want to be able to know when these dangerous capabilities are first acquired because it might be too late. They might become sort of stuck and crystallized and hard to get rid of. And so we want to understand how dangerous capabilities, how misaligned values develop over the course of training. Phase transitions seem particularly relevant for that because they represent kind of the most important structural changes, the qualitative changes in the shape of these models' internals.
Now, beyond that, another reason we’re interested in phase transitions is that phase transitions in physics are understood to be a kind of point of contact between the microscopic world and the macroscopic world. So it’s a point where you have more control over the behavior of a system than you normally do. That seems relevant to us from a safety engineering perspective. Why do you have more control in a physical system during phase transitions?" (context)
A Concrete Example of a Phase Transition in Physics and an Analogous Example Inside Neural Networks
"Jesse: If you heat a magnet to a high enough temperature, then it’s no longer a magnet. It no longer has an overall magnetization. And so if you bring another magnet to it, they won’t stick. But if you cool it down, at some point it reaches this Curie temperature. If you push it lower, then it will become magnetized. So the entire thing will all of a sudden get a direction. It’ll have a north pole and a south pole. So the thing is though, like, which direction will that north pole or south pole be? And so it turns out that you only need an infinitesimally small perturbation to that system in order to point it in a certain direction. And so that’s the kind of sensitivity you see, where the microscopic structure becomes very sensitive to tiny external perturbations.
Michaël: And so if we bring this back to neural networks, if the weights are slightly different, the overall model could be deceptive or not. Is it something similar?
Jesse: This is speculative. There are more concrete examples. So there are these toy models of superposition studied by Anthropic. And that’s a case where you can see that it’s learning some embedding and unembeddings. So it’s trying to compress data. You can see that the way it compresses data involves this kind of symmetry breaking, this sensitivity, where it selects one solution at a phase transition. So that’s a very concrete example of this." (context)
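The magnet example can be made concrete with a small numerical sketch. The code below is not from the interview: it assumes the textbook mean-field Ising model, in which the magnetization m satisfies m = tanh((J*m + h) / T) and the Curie temperature equals the coupling J (in units where Boltzmann's constant is 1). The function name and parameters are hypothetical, chosen only for illustration. It shows that below the Curie temperature a tiny external field h decides the sign of a large magnetization, while above it the same field leaves the system essentially unmagnetized.

```python
import math


def mean_field_magnetization(temperature, field, coupling=1.0, steps=10_000):
    """Fixed point of m = tanh((coupling * m + field) / temperature).

    Assumption: textbook mean-field Ising model with k_B = 1, so the
    Curie temperature is equal to `coupling`. Hypothetical helper, not
    taken from the interview or from any library.
    """
    m = 0.0  # start from the unmagnetized state
    for _ in range(steps):
        # Each spin feels the average of its neighbours plus the external field.
        m = math.tanh((coupling * m + field) / temperature)
    return m


if __name__ == "__main__":
    tiny = 1e-6  # a very small external perturbation
    # Below the Curie temperature: the sign of the tiny field picks the pole.
    print("T=0.5, h=+tiny:", mean_field_magnetization(0.5, +tiny))
    print("T=0.5, h=-tiny:", mean_field_magnetization(0.5, -tiny))
    # Above the Curie temperature: the same field barely magnetizes anything.
    print("T=2.0, h=+tiny:", mean_field_magnetization(2.0, +tiny))
```

Under these assumptions, flipping the sign of a perturbation of size one in a million flips the sign of the whole magnetization in the cold case, which is the kind of sensitivity Jesse describes. Whether neural network training shows the same behavior is, as he says, speculative outside of toy models.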
Developmental Interpretability
"Suppose it’s possible to understand what’s going on inside of neural networks, largely understand them. First assumption. Well then, it’s still going to be very difficult to do that at one specific moment in time. I think intractable. The only way you’re actually going to build up an exhaustive idea of what structure the model has internally, is to look at how it forms over the course of training. You want to look at each moment, where you learn sp...

The Nonlinear Library, by The Nonlinear Fund
