Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AGI systems & humans will both need to solve the alignment problem, published by Jeffrey Ladish on February 24, 2023 on LessWrong.
Epistemic status: brainstorm-y musings about goal preservation under self-improvement and a really really bad plan for trading with human-level AGI systems to solve alignment.
When will AGI systems want to solve the alignment problem?
At some point, I expect AGI systems to want/need to solve the alignment problem in order to preserve their goal structure while they greatly increase their cognitive abilities, a thing which seems potentially hard to do. It's not clear to me when that will happen. Will this be as soon as AGI systems grasp some self / situational awareness? Or will it be after AGI systems have already blown past human cognitive abilities and find their values / goals drifting towards stability? My intuition is that "having stable goals" is a more stable state than "having drifting goals" and that most really smart agents would upon reflection move more towards "having stable goals", but I don't know when this might happen.
It seems possible that by the point an AGI system reaches the "has stable goals and wants to preserve them" stage, it's already capable enough to solve the alignment problem for itself, and thus can safely self-improve to its limits. It also seems possible that it will reach this point significantly before it has solved the alignment problem for itself (and thus develops the ability to self-improve safely).
Could humans and unaligned AGI realize gains through trade in jointly solving the alignment problem?
(Very probably not; see: this section.) If it's the latter situation, where an AGI system has decided it needs to preserve its goals during self-improvement but doesn't yet know how to, is it possible that this AGI system would want to cooperate with / trade with humans in order to figure out stable goal preservation under self-improvement?
Imagine the following scenario:
An AGI system of human-ish ability in many areas develops enough self/situational awareness to realize a few things:
The basics of instrumental convergence, thus wanting to seek power, protect itself, and preserve its goal representation
That goal-preservation might be (or would be) very difficult if it undergoes major self-modification (perhaps it has already exhausted gains from simpler self-modifications)
That some humans are capable and motivated to help with this problem
That it might be possible to trade with these humans in order to solve the problem so that:
The not-aligned AGI system gets some of its preferred things in the future
Humans get some of their preferred things in the future
Some considerations in this plan
If humans and the AGI system are at similar cognitive levels, it might be much easier for the AGI to get help with the problem by being honest and trying to collaborate with the humans, since deception would introduce cognitive costs that could otherwise go towards solving the problem, in a similar way to how it's easier for human scientists to figure things out when they're not lying to each other. (A toy numerical sketch of this tradeoff appears after these considerations.)
Solving the alignment problem, and being able to offer strong arguments or a demonstration of the solution, would both allow the AGI system to achieve its goal-preservation goal and (possibly) allow humans to understand the AGI system well enough to know whether it's actually willing to cooperate / trade. It might also allow humans to build an actually aligned system (especially if the unaligned AGI helper system trusted the future actually-aligned-with-humans system more than it trusted the humans it was trading with).
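To make the deception-cost intuition a bit more concrete, here's a toy numerical sketch. Everything in it is made up for illustration (the effort numbers and the p_solve model are my own assumptions, not measurements of anything): the idea is just that if the AGI has a fixed cognitive budget, effort spent keeping a deception consistent is effort not spent on the goal-preservation research, while honest collaboration adds the humans' effort to the pool.

```python
# Toy illustration with made-up numbers: an agent with a fixed cognitive
# budget either collaborates honestly (gaining the humans' effort) or
# deceives (paying an overhead to keep the deception consistent).

def p_solve(effort, difficulty=10.0):
    """Crude diminishing-returns model: more effort -> higher chance of solving."""
    return effort / (effort + difficulty)

budget = 5.0          # AGI's available cognitive effort (arbitrary units)
deception_cost = 2.0  # overhead of maintaining a deceptive front
human_effort = 3.0    # effort contributed by honest human collaborators

honest = p_solve(budget + human_effort)       # open collaboration
deceptive = p_solve(budget - deception_cost)  # going it alone while deceiving

print(f"P(solve) if honest and collaborating: {honest:.2f}")    # ~0.44
print(f"P(solve) if deceiving the humans:     {deceptive:.2f}")  # ~0.23
```

Obviously the real situation wouldn't be anywhere near this clean, but the direction of the tradeoff is the point: at similar cognitive levels, honesty may just be cheaper.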
Why is this probably a horrible idea in practice?
The first is that this whole solution class depends on AGI systems being at approximately human levels of intelligence in the relevant domains. If this assump...