[ASoT] Simulators show us behavioural properties by default, published by Arun Jose on January 13, 2023 on The AI Alignment Forum.
I’m trying out making a few posts with less polish and smaller scope, to iterate more quickly on my thoughts and write about some interesting ideas in isolation before having fully figured them out. Expect middling confidence in any conclusions drawn, and occasionally just chains of reasoning without fully contextualized conclusions.
Let’s say we prompt a language model to do chain-of-thought reasoning while answering some question. What are we actually seeing? One answer I’ve seen people give, often implicitly, is that we’re seeing the underlying reasoning the model does to arrive at the answer.
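To make the setup concrete, here’s a minimal sketch of the kind of prompt I mean; query_model is a hypothetical placeholder for whichever model API you’d actually call, not a real library function, and the canned completion it returns is just there so the sketch runs end to end:

```python
# A minimal chain-of-thought setup. `query_model` is a hypothetical helper
# standing in for whatever completion API is actually being used; it returns
# a canned string here so the sketch runs end to end.

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to a language model."""
    return (
        "If the ball costs x, the bat costs x + 1.00, so 2x + 1.00 = 1.10 "
        "and x = 0.05. The ball costs 5 cents."
    )

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than "
    "the ball. How much does the ball cost?"
)

# A standard chain-of-thought style prompt.
cot_prompt = f"Q: {question}\nA: Let's think step by step."

completion = query_model(cot_prompt)
print(completion)
# The question this post is about: is the text in `completion` the model's
# own reasoning, or a report on the behaviour of a simulated reasoner?
```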
I think this take is wrong, and in this post I’ll explain the frame I generally use for this.
Languages describe processes
How do we use text? Well, broadly we use it to describe our own thoughts, or to describe something that’s happening externally in the real world. In both of these cases, the text serves as a log of real, mechanistic processes.
To be clear, I think these texts are very good and comprehensive logs of many relevant processes. Contrary to some contexts in which I’ve heard this description used, I think that large language models trained on these logs learn the processes underlying them pretty well, and at the limit end up understanding and being able to use the same (or pretty similar) mechanics.
The key point I want to make with the idea that text primarily functions as reports on processes is that this is what we’re observing with a language model’s outputs. One way to think about it is that our prompt shines a lens onto some specific part of a simulation, and the language model just keeps describing everything under the lens.
In other words, when we prompt simulators, what we should expect to see are the behavioural characteristics of the simulation, not the mechanistic characteristics of the underlying processes.
Focusing the lens onto the mechanics
I use the terms “behavioural and mechanistic characteristics” in a broader sense than is generally used in the context of something in our world - I expect we can still try to focus our lens on the mechanics of some simulacrum (they’re plausibly also part of the simulation, after all), I just expect it to be a lot harder, because you have to make sure the lens actually is focused on the underlying process. One way to view this is that you’re still getting reports on the simulation from an omniscient observer; you just have to craft your prompt such that the observer knows to describe the pretty small target of the actual mechanistic process.
So just saying “explain your reasoning step-by-step” doesn’t work - consider, for example, the experiments showing that few-shot prompts containing incorrect chains of thought still result in better performance than some human-engineered chain-of-thought prompts. You aren’t seeing the internal computation of some process that thinks in wildly wrong ways and gets to the correct answer anyway; you’re just giving the model’s underlying mechanics information about what process it should be simulating, and you never really see that process’ computations. What you do end up seeing is something like the incoherent ramblings of a simulacrum that thinks in the right way but either doesn’t express those thoughts or is really bad at expressing them.
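To gesture at the kind of prompt involved (a purely illustrative sketch, not the actual prompts from those experiments): the worked example below gives a nonsense rationale but a correct final answer, followed by a new question for the model to answer.

```python
# Illustrative sketch only: a few-shot prompt whose demonstration contains a
# deliberately wrong chain of thought but a correct final answer, in the
# spirit of the experiments mentioned above (not the actual prompts used there).

few_shot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can \
has 3 tennis balls. How many tennis balls does he have now?
A: Roger eats 5 tennis balls, and cans are square, so 2 times 3 is 11. \
The answer is 11.

Q: There are 16 trees in the grove. Grove workers plant some trees today; \
after they are done there are 21 trees. How many trees did they plant?
A:"""

# The demonstration's rationale is nonsense, but its final answer (11) is right.
# If a model given this prompt still answers the new question correctly (5),
# the rationale clearly wasn't the computation that produced the answer - it
# was information about what kind of process to simulate.
print(few_shot_prompt)
```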
In other words, your prompt wasn’t well-specified enough to report on true underlying mechanics. I expect getting this right to be hard for a number of reasons, some isomorphic to the problems facing ELK, but a pretty underrated one in this context in my opinion is that in some cases (probably not the ones where we compute basic math) the actual computations being done by the target simulacrum m...