Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Want to predict/explain/control the output of GPT-4? Then learn about the world, not about transformers., published by Cleo Nardo on March 16, 2023 on LessWrong.
Introduction
Consider Act II Scene II of William Shakespeare's Julius Caesar.
In this scene, Caesar is at home with his wife Calphurnia, who has just had a bad dream and is pleading with him not to go to the Senate. Caesar initially agrees to stay home but changes his mind after being convinced by Decius Brutus that the dream was misinterpreted and that the Senate needs him to address important matters.
CAESAR: The cause is in my will: I will not come; That is enough to satisfy the senate. [...]
DECIUS BRUTUS: [...] If Caesar hide himself, shall they not whisper 'Lo, Caesar is afraid'? Pardon me, Caesar; for my dear dear love To our proceeding bids me tell you this; And reason to my love is liable.
CAESAR: How foolish do your fears seem now, Calphurnia! I am ashamed I did yield to them. Give me my robe, for I will go.
This was the morning of the Ides of March, 15 March 44 BC, which, coincidentally, is today's date. Caesar was assassinated during the Senate meeting.
Suppose I change Caesar's final line to
CAESAR: My mind is firm, Decius. I'll stay within these walls, And not tempt Fortune on this cursed day. Worry me not, for I will stay.
and feed this modified scene into GPT-4. What would the output be?
I don't know.
But how might I determine the answer?
The claim
You might think that if you want to predict the logits layer of a large autoregressive transformer, then the best thing would be to learn about transformers. Maybe you should read Neel Nanda's blog posts on mechanistic interpretability. Or maybe you should read the arXiv papers on the GPT models.
But this probably won't help you predict the logits layer for this prompt.
Instead, if your goal is to predict the logits layer, then you should probably learn about Shakespearean dramas, Early Modern English, and the politics of the Late Roman Republic.
And maybe someone has already run GPT-4 on this prompt — if your goal is to explain the logits layer, then you should probably learn about Shakespearean dramas, Early Modern English, and the politics of the Late Roman Republic.
This is also true if you're trying to construct a prompt which will make GPT-4 output a particular target continuation — if your goal is to control the logits layer, then you should probably learn about Shakespearean dramas, Early Modern English, and the politics of the Late Roman Republic.
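To make "the logits layer" concrete, here is a minimal sketch (my illustration, not anything from the post) using GPT-2 through the Hugging Face transformers library as a stand-in, since GPT-4's logits are not directly exposed. The idea is the same at any scale: given the prompt so far, the final layer produces one score per vocabulary token for the next position.

```python
# A minimal sketch: GPT-2 as a stand-in for GPT-4, whose logits layer
# is not publicly exposed. Requires torch and transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# The modified final line from the scene above.
prompt = (
    "CAESAR: My mind is firm, Decius. I'll stay within these walls,\n"
    "And not tempt Fortune on this cursed day. Worry me not, for I will stay.\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# "The logits layer": one score per vocabulary token, for the next position.
next_token_logits = outputs.logits[0, -1]          # shape: (vocab_size,)
probs = torch.softmax(next_token_logits, dim=-1)

# The five continuations this (much smaller) model considers most likely.
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item())!r:>12}  p = {p.item():.3f}")
```

Predicting which tokens get high scores here is far more a question about Shakespearean dialogue than about the attention layers that compute them.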
Dataset vs architecture
The output of a neural network is determined by two things:
The architecture and training algorithm (e.g. transformers, SGD, cross-entropy)
The training dataset (e.g. internet corpus, literature, GitHub code)
As a rough rule-of-thumb, if you want to predict/explain the output of GPT-4, then it's far more useful to know about the training dataset than to know about the architecture and training algorithm.
In other words,
If you want to predict and explain the output of GPT-4 on Haskell code, you need to know Haskell.
If you want to predict and explain the output of GPT-4 on Shakespearean dialogue, you need to know Shakespeare.
If you want to predict and explain the output of GPT-4 on Esperanto, you need to know Esperanto.
If you want to predict and explain the output of GPT-4 on the MMLU benchmark, you need to know the particular facts in the benchmark.
I think alignment researchers (and AI researchers more generally) underestimate the extent to which knowledge of the training dataset is currently far more useful for prediction/explanation than knowledge of the architecture and training algorithm.
Recall that as the cross-entropy loss of an LLM steadily decreases, the model's predictive distribution (the softmax of its logits) asymptotically approaches the ground-truth distribution which generated the dataset...
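For reference, here is the standard identity behind that claim, sketched in generic notation (mine, not the post's): the expected next-token cross-entropy of a model q against the data distribution p splits into the data's conditional entropy plus a KL term, so driving the loss toward its floor drives the model's conditional distribution toward the ground-truth one.

```latex
% Expected next-token cross-entropy of model q against data distribution p,
% where c is the context and x the next token:
\[
  \mathcal{L}(q)
  \;=\; \mathbb{E}_{c,\,x \sim p}\!\left[-\log q(x \mid c)\right]
  \;=\; H_p(X \mid C)
  \;+\; \mathbb{E}_{c \sim p}\!\left[\mathrm{KL}\!\left(p(\cdot \mid c)\,\middle\|\,q(\cdot \mid c)\right)\right].
\]
% H_p(X | C) is fixed by the dataset, and KL >= 0 with equality iff
% q(. | c) = p(. | c), so the loss is minimized exactly when the model's
% conditional distribution matches the one that generated the data.
```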