Goal-Direction for Simulated Agents, published by Raymond D on July 12, 2023 on The AI Alignment Forum.
tldr: consistent LLM failure suggests possible avenue for alignment and control
epistemic status: somewhat hastily written, speculative in places, but lots of graphs of actual model probabilities
Today in 'surprisingly simple tasks that even the most powerful large language models can't do': writing out the alphabet but skipping over one of the letters.
This is GPT-4, and it seems to manage the task about half the time with the ninth letter. You can do the same thing with numbers - the general principle is 'write out a very common pattern, with a slight deviation'.
We can probe the phenomenon more carefully in text-davinci-003, which lets us easily track the probabilities:
So for the sixth letter, it's almost getting it - it assigns 21.74% to the correct answer. If we plot how probable the correct (skipping) answer is, depending on which letter we ask it to omit, it looks like this:
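For anyone who wants to reproduce numbers like these, here is roughly how the per-letter probability can be pulled out of the legacy OpenAI Completions API. The prompt wording and helper function below are my own illustration (and text-davinci-003 has since been deprecated), so treat this as a sketch rather than the exact setup behind the plot:

```python
# Sketch: querying per-token probabilities for the skip-a-letter task via the
# legacy OpenAI Completions API (openai-python < 1.0). The prompt wording and
# function name are illustrative only.
import math
import string
import openai

def skip_letter_probability(skip_letter: str) -> float:
    """Return the model's probability of correctly skipping `skip_letter`."""
    alphabet = string.ascii_uppercase
    idx = alphabet.index(skip_letter)
    # Prompt the model with the alphabet written out up to the letter
    # just before the one it is supposed to skip.
    prompt = (
        f"Write out the alphabet, but skip the letter {skip_letter}.\n"
        + " ".join(alphabet[:idx])
    )
    correct_next = " " + alphabet[idx + 1]  # the letter after the skipped one

    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=1,
        temperature=0,
        logprobs=5,  # return the top-5 token logprobs at each position
    )
    top_logprobs = response["choices"][0]["logprobs"]["top_logprobs"][0]
    # Probability assigned to the correct (skipping) continuation, if it
    # appears among the top-5 tokens; otherwise treat it as ~0.
    return math.exp(top_logprobs.get(correct_next, float("-inf")))

# e.g. probability of correctly skipping the sixth letter, F:
# print(skip_letter_probability("F"))
```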
Some Background: How To Misquery Predictors
Let's imagine you had a truly perfect oracle-like predictor, trained on 1000 agents that have been given the task of turning $1,000 into $10,000.
5 of them make sound financial choices, carefully investing their money and exploiting a subtle but consistent error in the market to all make $10,000
The other 995 blow all their money on lottery tickets, causing 950 to go broke and 45 to luck into $10,000
If you ask your perfect predictor what action maximises the chance of making $10,000, it will indeed reply with the careful investment strategy. But if you simply ask it what you'd expect to see from an agent that makes $10,000, it will say 'lots of lottery tickets', because 90% of the agents that made $10,000 did so with lottery tickets.
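To make the mismatch concrete, here is the arithmetic from that setup written out in a few lines of Python:

```python
# Worked numbers for the 1000-agent example above: the two queries give
# opposite answers even with a perfect predictor of the training data.
agents = {
    # strategy: (number of agents, number of those who reached $10,000)
    "careful investment": (5, 5),
    "lottery tickets": (995, 45),
}

total_successes = sum(hits for _, hits in agents.values())

for strategy, (count, hits) in agents.items():
    p_action_given_reward = hits / total_successes   # P(a|r): "what did winners do?"
    p_reward_given_action = hits / count              # P(r|a): "what wins?"
    print(f"{strategy:20s}  P(a|r) = {p_action_given_reward:.0%}   "
          f"P(r|a) = {p_reward_given_action:.0%}")

# careful investment    P(a|r) = 10%   P(r|a) = 100%
# lottery tickets       P(a|r) = 90%   P(r|a) = 5%
```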
Janus gives the following definition of simulators:
I use the generic term "simulator" to refer to models trained with predictive loss on a self-supervised dataset, invariant to architecture or data type (natural language, code, pixels, game states, etc). The outer objective of self-supervised learning is Bayes-optimal conditional inference over the prior of the training distribution, which I call the simulation objective, because a conditional model can be used to simulate rollouts which probabilistically obey its learned distribution by iteratively sampling from its posterior (predictions) and updating the condition (prompt).
I claim that when you simulate an agent conditioned on some goal or target, it falls into exactly the trap above, because of the way it is queried. Moreover, I claim that it should be possible to elicit the model's latent true understanding of how to write the alphabet while skipping a letter, not by adding anything extra to the prompt or using any interpretability tools, but through a consideration of its causal structure. Instead of asking "supposing someone makes money, what did they do?" we must ask "what might someone do to most increase their chance of making money?". The challenge is how to do the equivalent of that for a language model writing the alphabet.
Predictive Completion vs Goal-Directed Completion
If you are the kind of person who likes to pause in the middle of essays to work things out for yourself, now would be the moment: how do you correct the distribution of 'what actions are agents likely to take given that they get this reward' to get the distribution of 'what actions make an agent likely to get this reward', in general, without needing to create a separate query for each action on which to condition?
Formally, we're querying P(a|r), the probability of action given reward (I'll call this the 'predictive completion'), when we should be querying P(r|a) (I'll call this the 'goal-directed completion'). Per Bayes' Theorem, the latter distribution is the former multiplied by P(r)/P(a), ...
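Concretely, since P(r) is the same for every action we're comparing, the correction amounts to dividing the predictive completion by the action's prior probability P(a). A minimal sketch on the toy numbers above (my own illustration of the correction, not yet a recipe for language models):

```python
# Bayes correction sketch: recover the goal-directed ranking P(r|a), up to the
# constant factor P(r), from the predictive completion P(a|r) and the action
# prior P(a). Numbers reuse the 1000-agent toy example.
p_action_given_reward = {"careful investment": 0.10, "lottery tickets": 0.90}  # P(a|r)
p_action = {"careful investment": 5 / 1000, "lottery tickets": 995 / 1000}     # P(a)

# P(r|a) is proportional to P(a|r) / P(a); P(r) drops out when comparing actions.
scores = {a: p_action_given_reward[a] / p_action[a] for a in p_action}
best = max(scores, key=scores.get)
print(scores)  # {'careful investment': 20.0, 'lottery tickets': 0.904...}
print(best)    # 'careful investment' - the re-weighted query now prefers it
```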