The Nonlinear Library

LW - Large language models learn to represent the world by gjm



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Large language models learn to represent the world, published by gjm on January 22, 2023 on LessWrong.
There's a nice recent paper whose authors did the following:
train a small GPT model on lists of moves from Othello games;
verify that it seems to have learned (in some sense) to play Othello, at least to the extent of almost always making legal moves;
use "probes" (regressors whose inputs are internal activations in the network, trained to output things you want to know whether the network "knows") to see that the board state is represented inside the network activations;
use interventions to verify that this board state is actually being used to decide moves: take a position in which certain moves are legal, use gradient descent to find changes in internal activations that make the output of the probes look like a slightly different position, and then verify that when you run the network with those tweaked activations it predicts moves that are legal in the modified position.
In other words, it seems that their token-predicting model has built itself what amounts to an internal model of the Othello board's state, which it is using to decide what moves to predict.
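For concreteness, here is a rough sketch (my own, in PyTorch, not code from the paper) of what that intervention step might look like. The names and shapes are assumptions: I'm supposing a trained probe that maps a 512-dimensional activation vector to per-square logits, and a target board state encoded as one of three classes per square.

```python
# A rough sketch of the intervention idea: nudge one layer's activations by
# gradient descent until the probe reports a slightly different board state.
# `probe`, `layer_acts`, and `target_board` are assumed to exist already.
import torch
import torch.nn.functional as F

def intervene(layer_acts, probe, target_board, steps=100, lr=0.1):
    """Return modified activations whose probe output matches target_board.

    layer_acts:   activations at some intermediate layer, shape (d_model,)
    probe:        maps activations -> per-square logits, shape (64, 3)
    target_board: desired square states (e.g. 0=empty, 1=mine, 2=yours), shape (64,)
    """
    acts = layer_acts.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([acts], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = probe(acts)                              # (64, 3)
        loss = F.cross_entropy(logits, target_board)      # push probe toward target
        loss.backward()
        opt.step()
    return acts.detach()

# The modified activations would then be patched back into the forward pass
# (e.g. via a forward hook on the chosen layer) and the model's predicted
# moves checked for legality in the *modified* position.
```

The paper's actual procedure differs in its details; this is just the general recipe described above: optimize the activations against the probe's output, patch them back in, and check whether the predicted moves are legal in the modified position.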
The paper is "Emergent world representations: Exploring a sequence model trained on a synthetic task" by Kenneth Li, Aspen Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg; it is available on arXiv.
There is a nice expository blog post by Kenneth Li on The Gradient.
Some details that seem possibly-relevant:
Their network has a 60-word input vocabulary (four of the 64 squares are filled when the game starts and can never be played in), 8 layers, an 8-head attention mechanism, and a 512-dimensional hidden space. (I don't know enough about transformers to know whether this in fact tells you everything important about the structure.)
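(As a concrete illustration of those numbers, here is a minimal sketch, in PyTorch, of a decoder-only transformer with that shape. This is my own guess at the architecture, not the authors' code; in particular the maximum sequence length of 59 — a full 60-move game minus the move being predicted — is an assumption.)

```python
# Minimal sketch of a GPT-style model with the dimensions described above:
# 60-token vocabulary, 8 layers, 8 attention heads, 512-dimensional hidden space.
import torch
import torch.nn as nn

class TinyOthelloGPT(nn.Module):
    def __init__(self, vocab_size=60, n_layer=8, n_head=8, d_model=512, max_len=59):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_head, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        # An encoder stack with a causal mask stands in for a decoder-only GPT.
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layer)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        # idx: (batch, seq) of move tokens; the causal mask keeps prediction autoregressive.
        seq_len = idx.size(1)
        x = self.tok_emb(idx) + self.pos_emb[:, :seq_len]
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        x = self.blocks(x, mask=mask)
        return self.head(x)  # logits over the 60 possible moves at each position
```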
They tried training on two datasets, one of real high-level Othello games (about 140k games) and one of synthetic games where all moves are random (about 4M games). Their model trained on synthetic games predicted legal moves 99.99% of the time, but the one trained on real well-played games only predicted legal moves about 95% of the time. (This suggests that their network isn't really big enough to capture legality and good strategy at the same time, I guess?)
They got some evidence that their network isn't just memorizing game transcripts by training it on a 20M-game synthetic dataset where one of the four possible initial moves is never played. It still predicted legal moves 99.98% of the time when tested on the full range of legal positions. (I don't know what fraction of legal positions are reachable with the first move not having been C4; it will be more than 3/4 since there are transpositions. I doubt it's close to 99.98%, though, so it seems like the model is doing pretty well at finding legal moves in positions it hasn't seen.)
Using probes whose output is a linear function of the network activations doesn't do a good job of reconstructing the board state (error rate ~25%, barely better than probing a randomly initialized network), but training 2-layer MLPs as probes gets the error rate down to ~5% for the network trained on synthetic games and ~12% for the one trained on championship games, whereas it doesn't help at all for the randomly initialized network. (This suggests that whatever "world representation" the thing has learned isn't simply a matter of having an "E3 neuron" or whatever.)
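To make the distinction concrete, here is a minimal sketch of the two probe types as I understand them; the hidden width of 256 and the exact output shape (a 3-way empty/mine/yours classification for each of the 64 squares) are my assumptions rather than details taken from the paper.

```python
# Two probe types applied to a 512-dimensional activation vector: a purely
# linear probe versus a 2-layer MLP probe.
import torch.nn as nn

d_model, n_squares, n_states, hidden = 512, 64, 3, 256  # hidden width is a guess

linear_probe = nn.Linear(d_model, n_squares * n_states)

mlp_probe = nn.Sequential(
    nn.Linear(d_model, hidden),
    nn.ReLU(),
    nn.Linear(hidden, n_squares * n_states),
)

# Either probe is trained with cross-entropy on (activation, board-state) pairs
# collected from the frozen transformer; only the probe's weights are updated.
```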
I am not at all an expert on neural network interpretability, and I don't know to what extent their findings really justify calling what they've found a "world model" and saying that it's used to make move predictions. In particular, I can't refute the following argument:
"In most positions, just knowing what moves are legal is enough to give you...