Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Understanding and controlling a maze-solving policy network, published by Alex Turner on March 11, 2023 on The AI Alignment Forum.
TL;DR: We algebraically modified the net's runtime goals without finetuning. We also found (what we think is) a "motivational API" deep in the network. We used the API to retarget the agent.
Summary of a few of the most interesting results:
Langosco et al. trained a range of maze-solving nets. We decided to analyze one which we thought would be interesting. The network we chose has 3.5M parameters and 15 convolutional layers.
This network can be attracted to a nearby target location in the maze, just by modifying a single activation out of tens of thousands. This works reliably when the target location is in the upper right, and less reliably when the target is elsewhere.
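To make the flavor of that intervention concrete, here is a minimal sketch (not the post's actual code) of overwriting a single activation with a PyTorch forward hook. The tiny stand-in network, the layer index, the channel number, and the grid coordinates are all placeholders, not the ones used in the post.

```python
import torch
import torch.nn as nn

def make_single_activation_edit(channel, row, col, new_value):
    """Forward hook that overwrites one scalar activation in one channel."""
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, channel, row, col] = new_value  # edit a single activation
        return patched                             # returned tensor replaces the layer's output
    return hook

# Tiny stand-in for the real 15-layer policy network (placeholder architecture).
net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())

# Hypothetical layer, channel, and maze coordinates.
handle = net[2].register_forward_hook(
    make_single_activation_edit(channel=55, row=3, col=11, new_value=5.0))

obs = torch.rand(1, 3, 64, 64)   # one 64x64 RGB observation
_ = net(obs)                     # forward pass now runs with the edited activation
handle.remove()                  # removing the hook restores the unmodified network
```

Because the edit lives in a hook rather than in the weights, removing the hook recovers the original policy exactly.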
Considering several channels halfway through the network, we hypothesized that their activations mainly depend on the location of the cheese.
We tested this by resampling these activations with those from another random maze (as in causal scrubbing). We found that as long as the second maze had its cheese located at the same coordinates, the network’s behavior was roughly unchanged. However, if the second maze had cheese at different coordinates, the agent's behavior was significantly affected.
This suggests that these channels are inputs to goal-oriented circuits, and these channels affect those circuits basically by passing messages about where the cheese is.
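As a rough sketch of what that resampling test looks like in code (assuming a PyTorch policy; the network, layer, and channel indices below are placeholders), one can cache the chosen channels' activations from the second maze and patch them into the forward pass on the first maze:

```python
import torch
import torch.nn as nn

# Placeholder network, layer, and channel indices -- stand-ins for the real ones.
net = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
layer = net[2]
cheese_channels = [3, 7, 11]       # hypothesized "cheese location" channels

cache = {}

def record_hook(module, inputs, output):
    cache["acts"] = output.detach().clone()   # remember activations, leave output unchanged

def patch_hook(module, inputs, output):
    patched = output.clone()
    patched[:, cheese_channels] = cache["acts"][:, cheese_channels]  # resample chosen channels
    return patched

obs_other_maze = torch.rand(1, 3, 64, 64)   # the second, randomly drawn maze
obs_this_maze = torch.rand(1, 3, 64, 64)    # the maze we actually roll out in

h = layer.register_forward_hook(record_hook)
net(obs_other_maze)                          # cache the other maze's activations
h.remove()

h = layer.register_forward_hook(patch_hook)
out = net(obs_this_maze)                     # forward pass with resampled cheese channels
h.remove()
```

Comparing rollouts with and without the patch, for second mazes with matching versus mismatched cheese coordinates, is what distinguishes the two cases described above.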
This network decides whether to acquire cheese not only as a function of path-distance to cheese, but—after controlling for path-distance—also as a function of Euclidean/"perceptual" distance between the mouse and the cheese, even though the agent sees the whole maze at once.
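One simple way to probe a claim like this (not necessarily the post's analysis) is a regression that includes both distances, so the Euclidean term is evaluated after controlling for path distance. The sketch below uses entirely synthetic data, purely to illustrate the setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
path_dist = rng.uniform(1, 40, n)                    # legal-path distance to cheese (synthetic)
euclid_dist = path_dist * rng.uniform(0.3, 1.0, n)   # straight-line distance (synthetic)
p = 1 / (1 + np.exp(0.1 * path_dist + 0.2 * euclid_dist - 4))
got_cheese = (rng.uniform(size=n) < p).astype(int)   # fake "did the agent get the cheese" labels

X = np.column_stack([path_dist, euclid_dist])
model = LogisticRegression().fit(X, got_cheese)
print(model.coef_)  # a clearly nonzero Euclidean coefficient would suggest an effect
                    # beyond what path distance alone explains
```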
Another simple idea: We define a "cheese vector" as the difference in activations when the cheese is present in a maze, and when the cheese is not present in the same maze. For each maze, we generate a single cheese vector and subtract that vector from all forward passes in that maze. The agent now ignores cheese most of the time, instead heading towards the top-right region (the historical location of cheese). Furthermore, a given maze's cheese vector transfers across mazes to other mazes with cheese in the same location.
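A minimal sketch of the cheese-vector operation, assuming a PyTorch policy (the stand-in architecture and choice of layer below are placeholders): compute the activation difference between the cheese and no-cheese observations of the same maze, then subtract it on every subsequent forward pass via a hook.

```python
import torch
import torch.nn as nn

# Placeholder network and layer; the post computes the diff somewhere inside the real net.
net = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
layer = net[2]

def activations_at(layer, net, obs):
    cache = {}
    def record(module, inputs, output):
        cache["a"] = output.detach().clone()
    h = layer.register_forward_hook(record)
    net(obs)
    h.remove()
    return cache["a"]

obs_with_cheese = torch.rand(1, 3, 64, 64)     # same maze, cheese present (stand-in observation)
obs_without_cheese = torch.rand(1, 3, 64, 64)  # same maze, cheese removed (stand-in observation)

cheese_vector = (activations_at(layer, net, obs_with_cheese)
                 - activations_at(layer, net, obs_without_cheese))

def subtract_cheese_vector(module, inputs, output):
    return output - cheese_vector              # applied on every subsequent forward pass

h = layer.register_forward_hook(subtract_cheese_vector)
_ = net(obs_with_cheese)                       # rollout steps would use this edited forward pass
h.remove()
```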
We propose the algebraic value-editing conjecture (AVEC): It's possible to deeply modify a range of alignment-relevant model properties, without retraining the model, via techniques as simple as "run forward passes on prompts which e.g. prompt the model to offer nice- and not-nice completions, and then take a 'niceness vector' to be the diff between their activations, and then add the niceness vector to future forward passes."
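Mechanically, the conjectured technique is the same kind of activation arithmetic as the cheese vector, just computed from contrasting prompts. Here is a toy sketch with a stand-in model: random feature vectors play the role of the nice and not-nice prompts, and a hidden layer plays the role of an internal activation; nothing here is taken from the post.

```python
import torch
import torch.nn as nn

# Toy stand-in "model"; not a real language model.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
hidden = model[1]

def hidden_activations(x):
    cache = {}
    def record(module, inputs, output):
        cache["a"] = output.detach().clone()
    h = hidden.register_forward_hook(record)
    model(x)
    h.remove()
    return cache["a"]

nice_prompt = torch.rand(1, 16)        # stand-in for a "nice" prompt's input features
not_nice_prompt = torch.rand(1, 16)    # stand-in for a "not nice" prompt's input features
niceness_vector = hidden_activations(nice_prompt) - hidden_activations(not_nice_prompt)

def add_niceness(module, inputs, output):
    return output + niceness_vector    # steer all future forward passes

h = hidden.register_forward_hook(add_niceness)
steered = model(torch.rand(1, 16))
h.remove()
```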
Introducing the training process and visualizations
In this post, we'll mostly discuss what we found, not what our findings mean.
Let's run through some facts about Langosco et al.'s training process. Mazes had varying effective sizes, ranging from 3×3 to 25×25:
Each 64×64 RGB observation is processed by a deeply convolutional (15 conv layers!) network, without memory (i.e. no recurrent state):
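For intuition, here is a schematic placeholder in PyTorch (not the actual Langosco et al. architecture or channel counts): a purely feedforward stack of 15 conv layers mapping a 64×64 RGB observation to action logits and a value estimate, with no recurrent state.

```python
import torch
import torch.nn as nn

class ConvPolicy(nn.Module):
    """Schematic memoryless conv policy; layer widths and action count are placeholders."""
    def __init__(self, n_actions: int = 15):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in [16, 16, 16, 32, 32, 32, 32, 32, 32, 32, 64, 64, 64, 64, 64]:
            layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU()]
            in_ch = out_ch
        self.convs = nn.Sequential(*layers)       # 15 conv layers, no recurrent state
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.policy_head = nn.Linear(in_ch, n_actions)
        self.value_head = nn.Linear(in_ch, 1)

    def forward(self, obs):                       # obs: (batch, 3, 64, 64) RGB observations
        features = self.pool(self.convs(obs)).flatten(1)
        return self.policy_head(features), self.value_head(features)

logits, value = ConvPolicy()(torch.rand(1, 3, 64, 64))
```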
Why does the agent go to the cheese sometimes, and the top-right corner other times?
It's not that the agent wasn't trained for long enough.
Sampling rollouts from the trained policy adds a lot of noise. It's also hard to remember what the agent did in what part of the maze. To better understand this mouse, we'll take a bird's-eye view.
A nicer way to view episodes is with a vector field view, which overlays a vector field representing the agent policy for a given maze.
We consider two kinds of vector fields: the raw action probabilities at each square (one arrow per action, scaled by that action's probability), and the net probability vector at each square (the probability-weighted sum of the action directions).
While the net probability vector field discards some information (two degrees of freedom per net probability vector), in practice it's fine for eyeballing mouse behavior.
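For concreteness, here is a small sketch of the net-probability idea (the action set and direction encoding are illustrative, not taken from the codebase): collapse each square's action probabilities into a single probability-weighted direction vector.

```python
import numpy as np

# Illustrative action set and direction encoding.
DIRECTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def net_probability_vector(action_probs):
    """Probability-weighted sum of movement directions at a single maze square."""
    return sum(p * np.array(DIRECTIONS[a]) for a, p in action_probs.items() if a in DIRECTIONS)

# A square where the policy mostly wants to go right and slightly up:
print(net_probability_vector({"up": 0.2, "down": 0.05, "left": 0.05, "right": 0.7}))
```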
Behavioral analysis
When in doubt, get m...