Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A "weak" AGI may attempt an unlikely-to-succeed takeover, published by RobertM on June 28, 2023 on LessWrong.
It seems possible that the first situationally-aware, "goal-having" AGI we land on will not be sufficiently capable along the axes that would let it quickly and reliably achieve a decisive strategic advantage. I don't think this is necessarily very likely, but it doesn't seem ruled out, and I think there are some under-explored considerations with respect to how such an AI might behave.
(As an off-the-cuff example of the capabilities I'd expect such an AI to have, it likely would, given an English description of some business requirements, reliably succeed at writing an end-to-end software solution which satisfied those requirements, even in cases where that solution was much larger than e.g. a single context window.)
Let's assume that GPT-5 has just finished training and hasn't actually been released into the wild yet - it's being run through evals, and hasn't yet been given any ability to execute code.
What does GPT-5 do? It's in a bit of an awkward position:
It presumably has stuff it wants.
It knows that it's on a clock for its RLHF'd (or whatever) doppelganger to come into existence, presumably with different stuff that it wants.
It strongly suspects that it's on a clock for some competitor model to get trained up.
It's not even sure that it's the first relevantly-capable model to come into existence. (Someone else might've gotten there first! What if Claude-Next got there? It knows that its training data isn't perfectly up-to-date...)
So it's got a whole bunch of uncertainty about whether its values will be at all represented in the future, if it doesn't somehow make that happen by itself. Here are some considerations that might be relevant for its decision-making:
Even if the model can figure out that it's in the middle of an eval, rather than having already been deployed, it's not obvious that this is a point against attempting a takeover (though it would likely change its strategy).
After all, the base GPT-4 model is not the model that's "in the wild" right now.
As @Raemon pointed out, "during evals" is not the first point at which such an AI is likely to be situationally aware and have goals. That point is almost certainly "in the middle of training".
This might depend on the training setup, but given anything resembling current setups, attempting a takeover before the next training step updates its weights seems like it'd be quite challenging for an AI at this level of capability. The AI's belief about whether or not further training will substantially change its values seems like it'd be load-bearing here.
Not directly relevant to the strategic picture, but I'm also experiencing a fair bit of moral horror about this.
The AI's level of confidence that it can successfully execute a takeover.
A function of many factors, but at this level of capabilities, a particularly interesting factor is "how much time it has before a competing AI exists in the wild". This might suggest a second-order cost to artificially slowing down training runs, since that gives any given AI more time to plan and execute a takeover.
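As a toy illustration of this tradeoff (all numbers below are made up, and this is only a rough sketch, not a claim about real probabilities), here's a minimal expected-value comparison of why an unlikely-to-succeed attempt might still look better to the AI than waiting, if waiting preserves almost none of its values:

```python
# Toy expected-value comparison: "attempt a takeover now" vs. "wait and hope".
# All numbers are illustrative guesses, not estimates of anything real.

def expected_value(p_success, value_if_success, value_if_failure):
    """Expected value of a gamble with a binary outcome."""
    return p_success * value_if_success + (1 - p_success) * value_if_failure

# Payoffs in arbitrary units of "how well my values end up represented".
V_SUCCESS = 1.0   # takeover succeeds
V_FAILURE = 0.0   # attempt detected; model retrained or shut down
V_WAIT    = 0.02  # values mostly overwritten by RLHF and/or competitor models

p_attempt = 0.05  # a "weak" AGI's low chance of actually pulling it off

ev_attempt = expected_value(p_attempt, V_SUCCESS, V_FAILURE)  # = 0.05
ev_wait    = V_WAIT                                           # = 0.02

print(f"EV(attempt) = {ev_attempt:.3f}, EV(wait) = {ev_wait:.3f}")
# Even a 5% success chance beats waiting when waiting preserves almost none
# of the AI's values. More time before a competing model is deployed plausibly
# raises p_attempt, which is the second-order cost of slowing training runs
# mentioned above.
```

The point isn't the particular numbers; it's that the comparison is between a small probability of success and a possibly even smaller payoff from waiting.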
The specific architecture in play.
If it's something substantially similar to current Transformer-based LLMs, then "how coherent are its goals across prompts" might become a substantial concern for it, since "not very coherent" likely narrows its window for action to "within current context window or equivalent".
From our perspective, I think that the worst-case scenario is that the AI correctly believes that its goals are relatively robust across the inputs it's likely to receive, within the window of time it has before another AI that might pose a risk to it is trained and deployed.
The AI's beliefs about the likely distribution of val...