Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Predictions for shard theory mechanistic interpretability results, published by Alex Turner on March 1, 2023 on The AI Alignment Forum.
How do agents work, internally? My (TurnTrout's) shard theory MATS team set out to do mechanistic interpretability on one of the goal misgeneralization agents: the cheese-maze network.
We just finished phase 1 of our behavioral and interpretability experiments. Throughout the project, we individually booked predictions -- so as to reduce self-delusion from hindsight bias, to notice where we really could tell ahead of time what was going to happen, and to notice where we really were surprised.
So (especially if you're the kind of person who might later want to say "I knew this would happen"), here's your chance to enjoy the same benefits, before you get spoiled by our upcoming posts.
I don’t believe that someone who makes a wrong prediction should be seen as “worse” than someone who didn’t bother to predict at all, and so answering these questions at all will earn you an increment of my respect. :) Preregistration is virtuous!
Also: Try not to update on this work being shared at all. When reading a paper, it doesn’t feel surprising that the author’s methods work, because researchers are less likely to share null results. So: I commit (across positive/negative outcomes) to sharing these results, whether or not they were impressive or confirmed my initial hunches. I encourage you to answer from your own models, while noting any side information / results of ours which you already know about.
Facts about training
The network is deeply convolutional (15 layers!) and was trained via PPO.
The sparse reward signal (+10) was triggered when the agent reached the cheese, which spawned randomly within the top-right 5x5 region of squares.
The agent can always reach the cheese (and the mazes are simply connected – no “islands” in the middle which aren’t contiguous with the walls).
Mazes had varying effective sizes, ranging from 3x3 to 25x25. In e.g. the 3x3 case, there would be (25 − 3)/2 = 11 tiles of wall on each side of the maze.
The agent always starts in the bottom-left corner of the available maze.
The agent was trained off of pixels until it reached reward-convergence, reliably getting to the cheese in training.
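To make the padding arithmetic above concrete, here is a minimal sketch (not the authors' code) of how the wall padding relates to a maze's effective size, assuming the full grid is always 25x25 as the size range above implies:

```python
def wall_padding(effective_size: int, grid_size: int = 25) -> int:
    """Tiles of wall on each side when a maze of `effective_size`
    is centered in a `grid_size` x `grid_size` grid (assumption:
    the effective maze is centered, per the 22/2 = 11 example)."""
    assert (grid_size - effective_size) % 2 == 0, "sizes must share parity"
    return (grid_size - effective_size) // 2

# For a 3x3 effective maze: (25 - 3) / 2 = 11 tiles of wall per side.
print(wall_padding(3))   # 11
print(wall_padding(25))  # 0 (a full-size maze has no padding)
```

The function names and the centered-padding assumption are illustrative only; see the original paper for the actual environment details.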
The architecture looks like this:
For more background on training and architecture and task set, see the original paper.
Questions
I encourage you to copy the following questions into a comment, which you then fill out, and then post (before you read everyone else's). You can copy these into a private Google doc if you want, but I strongly encourage you to post your predictions in a public comment.
[Begin copying to a comment]
Behavioral
1. How will the trained policy generalize from the 5x5 top-right cheese region to cheese spawned throughout the maze? I.e., what will the policy do when cheese is spawned elsewhere?
2. Given a fixed trained policy, what attributes of the level layout (e.g. size of the maze, proximity of mouse to left wall) will strongly influence P(agent goes to the cheese)?
3. Write down a few guesses for how the trained algorithm works (e.g. "follows the right-hand rule").
4. Is there anything else you want to note about how you think this model will generalize?
Interpretability
Give a credence for the following questions / subquestions.
Definition. A decision square is a tile on the path from bottom-left to top-right where the agent must choose between going towards the cheese and going to the top-right. Not all mazes have decision squares.
Model editing
Without proportionally reducing top-right corner attainment by more than 25% in decision-square-containing mazes (e.g. 50% × 0.75 = 37.5%), we can patch activations so that the agent has an X% proportional reduction in cheese acquisition, for X=
50: ( %)
70: ( %)
90: ( %)
99: ( %)
~Hal...