Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Near-mode thinking on AI, published by Olli Järviniemi on August 5, 2024 on LessWrong.
There is a stark difference between rehearsing classical AI risk 101 arguments about instrumental convergence, and tackling problems like "Design and implement the exact evaluations we'll run on GPT-5 to determine whether it's capable enough that we should worry about it acting aligned until it can execute a takeover".
And naturally, since I've started working on problems like the one above, I've noticed a large shift in my thinking on AI. I describe it as thinking about risks in near-mode, as opposed to far-mode.
In this post, I share a few concrete examples about my experiences with this change-of-orientation.
I. Prerequisites for scheming
Continuing with the example from the intro: A year ago I was confident about the "the AI is just playing along with our training and evaluations, until it is in a position where it can take over" threat model (deceptive alignment / scheming) basically being the default outcome and the main source of AI x-risk. I now think I was overconfident.
Past-me hadn't, however, really thought through the prerequisites for scheming. A textbook example of a successful deceptive alignment story, applied to an LLM, paints a picture of a model that:
Understands that the inputs it sees are simply prompts written by humans
As opposed to the model being "immersed to the setting", thinking that the current context more-or-less reflects reality, or flat-out is reality, taking the prompt at face value
Understands that it's undergoing a training process, and having an explicit (correct!) model of the training objective used to train it
As opposed the model not thinking about the training process and objective at all, and a model's changed behavior upon training being reflective of the training process molding the model, rather than the model learning the training objective and optimizing for it
Performs lots of the same cognition (e.g. the above two bullet points) regardless of the input fed to it
As opposed to the model's internal processes being heavily dependent on the input
Never performs instrumental thinking "out loud", and successfully evades any attempts made by humans to trick the model into thinking it has been deployed
As opposed to the model sometimes slipping, failing to generalize correctly to adversarial inputs, or just being insufficiently good at distinguishing training/evaluation/red-teaming from deployment.
Now, one may argue whether it's strictly necessary that a model has an explicit picture of the training objective, for example, and revise one's picture of the deceptive alignment story accordingly. We haven't yet achieved consensus on deceptive alignment, or so I've heard.
It's also the case that, as past-me would remind you, a true superintelligence would have no difficulty with the cognitive feats listed above (and that current models show sparks of competence in some of these).
But knowing only that superintelligences are really intelligent doesn't help with designing the scheming-focused capability evaluations we should do on GPT-5, and abstracting over the specific prerequisite skills makes it harder to track when we should expect scheming to be a problem (relative to other capabilities of models).[1] And this is the viewpoint I was previously missing.
II. A failed prediction
There's a famous prediction market about whether AI will get gold from the International Mathematical Olympiad by 2025. For a long time, the market was around 25%, and I thought it was too high.
Then, DeepMind essentially got silver from the 2024 IMO, short of gold by one point. The market jumped to 70%, where it has stayed since.
Regardless of whether DeepMind manages to improve on that next year and satisfy all minor technical requirements, I was wrong. Hearing abou...