The Nonlinear Library

AF - Trying to isolate objectives: approaches toward high-level interpretability by Arun Jose



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Trying to isolate objectives: approaches toward high-level interpretability, published by Arun Jose on January 9, 2023 on The AI Alignment Forum.
Epistemic status: There’s a lot in this post, and in my general approach while working on it, that in retrospect wasn’t thought out well enough. I’m posting it because I figure sharing flawed ideas is better than letting this languish in a doc forever while I work on improving them.
Thanks to Paul Colognese, Tassilo Neubauer, John Wentworth, and Fabien Roger for useful conversations, Kyle McDonell and Laria Reynolds for suggesting something that led me to some of the ideas I mention here, and Shoshannah Tekofsky for feedback on a draft.
I spent some time trying to find the high-level structure in a neural net corresponding to a deep reinforcement learning model’s objective. In this post, I describe some of the stuff I tried, thoughts from working on this, and mistakes I made.
Context
Many approaches in current interpretability work, as I understand them, involve understanding low-level components of a model (Circuits and subsequent work, for example) as a way of building up to more complex components. I think we can make progress from the opposite frontier at the same time, and try to identify the presence or nature of certain high-level mechanistic structures or properties in a model. Examples include verifying whether a model is doing optimization, checking whether it’s myopic, isolating the objective of an optimizer, and essentially the entire class of properties with a single answer for the entire model.
I prefer to distinguish these directions as low-level and high-level interpretability respectively for descriptive clarity. The framing of best-case and worst-case transparency as laid out in the transparency tech tree also points at the same concept. I expect that both directions are aiming at the same goal, but working top-down seems pretty tractable at least for some high-level targets that don’t require lots of deconfusion to even know what we’re looking for.
Ideally, an approach to high-level interpretability would be robust to optimization pressure and deception (in other words, to gradient descent and adversarial mesa optimizers), but I think there’s a lot we can learn toward that ideal from more naive approaches. In this post, I describe my thoughts on trying to identify the high-level structure in a network corresponding to a deep reinforcement learning model’s objective.
So I didn’t set out on this particular approach expecting to succeed at the overall goal (I don’t expect interpretability to be that easy), instead hoping that trying out a bunch of naive directions would give us insight into future high-level interpretability work. To that end, I try to lay out the reasoning I followed while thinking about this, gaps and all.
I think there are plenty of plausible avenues to try out here, and for the most part I will only be describing the problems with the ones I thought of. Further, the directions I tried aren’t particularly non-obvious, and there are likely much better methods to try given more thought.
A naive approach and patching
Take the case of a small RL model that can optimize for a reward in a given environment. What we want is information about how the model’s objective is represented in the network’s weights. Concretely, we might think about what kinds of tests or processes disproportionately affect these weights relative to the rest of the network, and try to isolate the information we want from there.
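As a rough illustration of the kind of comparison this suggests, a minimal sketch might take two snapshots of a small policy network’s weights and rank parameter tensors by how much a given process moved them. This is only a sketch of the general idea, not a method from the post; the toy `PolicyNet`, the snapshot points, and the omitted training loop are all hypothetical stand-ins.

```python
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    """Hypothetical stand-in for a small deep RL policy network."""

    def __init__(self, obs_dim: int = 8, n_actions: int = 4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.head = nn.Linear(64, n_actions)

    def forward(self, obs):
        return self.head(self.body(obs))


def relative_weight_change(state_a, state_b):
    """Relative L2 change of each parameter tensor between two snapshots."""
    return {
        name: (state_b[name] - wa).norm().item() / (wa.norm().item() + 1e-8)
        for name, wa in state_a.items()
    }


if __name__ == "__main__":
    model = PolicyNet()
    # Snapshot of the network's weights, e.g. M_A after training on reward R_A.
    state_a = {k: v.detach().clone() for k, v in model.state_dict().items()}

    # ... run the process we want to probe here (further training, a change
    # of reward, etc.); the training loop itself is omitted in this sketch ...

    state_b = {k: v.detach().clone() for k, v in model.state_dict().items()}

    # Rank parameter tensors by how much the process moved them.
    ranked = sorted(relative_weight_change(state_a, state_b).items(),
                    key=lambda kv: kv[1], reverse=True)
    for name, change in ranked:
        print(f"{name}: relative change {change:.4f}")
```

The per-tensor relative change is a crude proxy; the point is only that some such measurement of which weights a process disproportionately affects could be a starting point for localizing objective-related structure.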
One approach we might try is:
Train the initialized model on reward RA for an extended period of time. Let’s call the model at the end of this step MA. We may think of MA as having learned a pretty sophisticated world model at this point, such that further training wouldn’t update it strongly.
Tra...