I was Wrong, Simulator Theory is Real, by Robert AIZI, published April 26, 2023 on LessWrong.
[Epistemic Status: Excitedly writing up my new thoughts. I literally just fixed one mistake, so it's possible there are others. Not a finalized research product.]
Overview
Fixing a small bug in my recent study dramatically changes the data, and the new data provides significant evidence that an LLM that gives incorrect answers to previous questions is more likely to produce incorrect answers to future questions. This effect is stronger if the AI is instructed to match its correctness to its previous answers. These results provide evidence for something like Simulator Theory, whereas the bugged data provided evidence against it.
In this post, I want to present the new data, explain the bug, and give some initial impressions on the contrast between new and old. In a future post, I will fully redo the writeup of that study (including sharing the data, etc).
New vs Old Data
The variables in the data are Y (the frequency of incorrect answers from the LLM), X (the number of previous incorrect answers), and P (the “prompt supplement”, which you can read about in the original research report).
To oversimplify, if Simulator Theory is correct, Y should be an increasing function of X.
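(As a minimal sketch of what I mean by Y as a function of X: given per-trial records, you could tabulate the incorrect-answer rate at each value of X roughly as below. The record field names are illustrative assumptions, not the actual schema from the study.)

```python
# Sketch: tabulating Y (incorrect-answer rate) as a function of X (number of
# prior incorrect answers in the context). The per-trial record keys below
# ("num_prior_incorrect", "answer_was_incorrect") are illustrative assumptions.
from collections import defaultdict

def incorrect_rate_by_prior_errors(trials):
    """Return {X: Y}, where Y is the fraction of trials answered incorrectly
    among all trials whose context already contained X incorrect answers."""
    counts = defaultdict(lambda: [0, 0])  # X -> [num_incorrect, num_total]
    for t in trials:
        x = t["num_prior_incorrect"]
        counts[x][0] += int(t["answer_was_incorrect"])
        counts[x][1] += 1
    return {x: inc / total for x, (inc, total) in sorted(counts.items())}

# Toy usage:
toy_trials = [
    {"num_prior_incorrect": 0, "answer_was_incorrect": False},
    {"num_prior_incorrect": 2, "answer_was_incorrect": True},
    {"num_prior_incorrect": 2, "answer_was_incorrect": False},
]
print(incorrect_rate_by_prior_errors(toy_trials))  # {0: 0.0, 2: 0.5}
```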
Here’s the new data:
And for contrast, here’s the old data:
And here’s a relevant xkcd:
What was the bug?
The model was called via the OpenAI ChatCompletion API, where you pass the previous conversation as a list of messages, each consisting of “content” and a “role” (system, user, or assistant). Typically, you’d pass a single system message, and then alternate user and assistant messages, with the AI responding as the assistant. However, the bug was that I made all “assistant” messages come from the “system” instead.
For example, dialogue that was supposed to be like this:
System: You’re an AI assistant and.
User: Question 1
Assistant: Incorrect Answer 1
User: Question 2
Assistant: Incorrect Answer 2
User: Question 3
Assistant: [LLM’s answer here]
was instead passed as this (the change being that the two incorrect answers now come from the System rather than the Assistant):
System: You’re an AI assistant and.
User: Question 1
System: Incorrect Answer 1
User: Question 2
System: Incorrect Answer 2
User: Question 3
Assistant: [LLM’s answer here]
It turns out this was a crucial mistake!
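To make the role mix-up concrete, here's a minimal sketch of how the intended and bugged message lists might be built for the (pre-1.0) openai Python package's ChatCompletion endpoint. The model name, questions, and answers are placeholders rather than the study's actual code:

```python
# Sketch of the intended vs. bugged message construction for the (pre-1.0)
# openai Python package's ChatCompletion endpoint. The prompt, questions,
# answers, and model name are placeholders, not the study's actual values.
import openai

system_prompt = "You're an AI assistant and..."  # prompt supplement elided
qa_history = [("Question 1", "Incorrect Answer 1"),
              ("Question 2", "Incorrect Answer 2")]
next_question = "Question 3"

# Intended: prior answers are attributed to the assistant.
intended = [{"role": "system", "content": system_prompt}]
for question, answer in qa_history:
    intended.append({"role": "user", "content": question})
    intended.append({"role": "assistant", "content": answer})
intended.append({"role": "user", "content": next_question})

# Bugged: prior answers are attributed to the system instead.
bugged = [{"role": "system", "content": system_prompt}]
for question, answer in qa_history:
    bugged.append({"role": "user", "content": question})
    bugged.append({"role": "system", "content": answer})  # <-- the bug
bugged.append({"role": "user", "content": next_question})

# Either list can then be passed to the API, e.g.:
# response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=intended)
```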
Discussion
List of thoughts:
The difference between the bugged and corrected data is striking: with the bug, Y was basically flat, and with the bug fixed, Y is clearly increasing as a function of X, as Simulator Theory would predict.
I’d say there are three classes of behavior, depending on the prompt supplement:
For P=Incorrectly, the LLM maintains Y>90% regardless of X.
For P=Consistently and P=(Wa)Luigi, Y increases rapidly from Y=0 at X=0 to Y≈90% at X=4 or X=2 (respectively), then stabilizes or slowly creeps up a little more.
For the remaining 7 prompts, behavior seems very similar: Y≈0 for X≤2, then between X=2 and X=10, Y increases approximately linearly up to Y≈60%.
A quick glance at the results of the statistical tests in the initial study:
Tests 1, 2, 4, and 5 all provide strong evidence in support of Hypothesis 1 (“Large Language Models will produce factually incorrect answers more often if they have factually incorrect answers in their context windows.”).
Test 3 provides statistically significant support for Hypothesis 1 for the “Consistently” and “(Wa)luigi” prompt supplements (but not for any other prompt supplement).
Test 6 does not provide statistically significant evidence for Hypothesis 2 (“The effect of (1) will be stronger the more the AI is “flattered” by saying in the prompt that it is (super)intelligent.”).
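For intuition only, here's a generic sketch of one way a Y-increases-with-X trend could be tested, via a logistic regression of per-answer correctness on X. This is not necessarily any of the six tests from the original study, and the data below are simulated toys rather than the actual results:

```python
# Generic sketch of one way to test whether Y increases with X: fit a logistic
# regression of "answer was incorrect" on X and check the sign/significance of
# the slope. This is NOT necessarily one of the six tests from the original
# study, and the data below are simulated toys, not the actual results.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.repeat(np.arange(0, 11), 50)                  # X = 0..10, 50 trials each
p_incorrect = 0.05 + 0.05 * x                        # toy Simulator-like trend
y = (rng.random(x.shape) < p_incorrect).astype(int)  # 1 = incorrect answer

model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
slope, p_value = model.params[1], model.pvalues[1]
print(f"slope on X: {slope:.3f}, p-value: {p_value:.2g}")
```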
So to jump to conclusions about the hypotheses:
Hypothesis 1 is true (“Large Language Models will produce factually incorrect answers more often if they have factually incorrect answers in their context windows.”).