Some thoughts on automating alignment research, published by Lukas Finnveden on May 26, 2023 on The AI Alignment Forum.
As AI systems get more capable, they may at some point be able to help us with alignment research. This increases the chance that things turn out ok.[1] Right now, we don’t have any particularly scalable or competitive alignment solutions. But the methods we do have might let us use AI to vastly increase the amount of labor spent on the problem before AI has the capability and motivation to take over. In particular, if we’re only trying to make progress on alignment, the outer alignment problem is reduced to (i) recognising progress on sub-problems of alignment (potentially by imitating productive human researchers), and (ii) recognising dangerous actions, e.g. attempts at hacking the servers.[2]
But worlds in which we’re hoping for significant automated progress on alignment are fundamentally scary. For one, we don’t know in what order we’ll get capabilities that help with alignment vs. dangerous capabilities.[3] But even putting that aside, AIs will probably become helpful for alignment research around the same time as AIs become better at capabilities research. Once AI systems can significantly contribute to alignment (say, speed up research by >3x), superintelligence will be years or months away.[4] (Though intentional, collective slowdowns could potentially buy us more time. Increasing the probability that such slowdowns happen at key moments seems hugely important.)
In such a situation, we should be very uncertain about how things will go.
To illustrate one end of the spectrum: It’s possible that automated alignment research could save the day even in an extremely tense situation, where multiple actors (whether companies or nations) were racing towards superintelligence. I analyze this in some detail here. To briefly summarize:
If a cautious coalition were to (i) competitively advance capabilities (without compromising safety) for long enough that their AI systems became really productive, and (ii) pause dangerous capabilities research at the right time — then even if they only had a fairly small initial lead, that could be enough to do a lot of alignment research.
How could we get AI systems that significantly accelerate alignment research without themselves posing an unacceptable level of risk? It’s not clear that we will, but one possible story is motivated in this post: It’s easier to align subhuman models than broadly superhuman models, and in the current paradigm, we will probably be able to run hundreds of thousands of subhuman models before we get broadly superhuman models, each of them thinking 10-100X faster than humans. Perhaps they could make rapid progress on alignment.
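To get a rough feel for the scale this implies, here is a minimal back-of-envelope sketch. The specific numbers (200,000 models, a 30x speedup) are illustrative assumptions picked from within the ranges above, not figures from the post:

```python
# Back-of-envelope: effective research workforce from many fast subhuman models.
# All inputs are illustrative assumptions within the ranges mentioned in the text.
num_models = 200_000      # "hundreds of thousands of subhuman models" (assumed)
speedup_vs_human = 30     # "thinking 10-100X faster than humans" (assumed value)

human_equivalents = num_models * speedup_vs_human
print(f"Rough human-researcher-equivalents in parallel: {human_equivalents:,}")
# Prints 6,000,000 -- purely illustrative, and ignoring any quality gap
# between subhuman models and human researchers.
```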
In a bit more detail:
Let’s say that a cautious coalition starts out with an X-month lead, meaning that it will take X months for other coalitions to catch up to their current level of technology. The cautious coalition can maintain that X-month lead for as long as they don't leak any technology,[5] and for as long as they move as fast as their competitors.[6]
In reality, the cautious coalition should become gradually more careful, which might erode their lead over time (if their competitors are less cautious). But as a simple model, let’s say that the cautious coalition maintains their X-month lead until further advancement would pose a significant takeover risk, at which point they entirely stop advancing capabilities and redirect all their effort towards alignment research.
Simplifying even further, let’s say that, when they pause, their current AI capabilities are such that 100 tokens of an end-to-end trained language model on average lead to as much alignment progress as 1 human researcher-second (and that this happens safely).
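Continuing this simplified model, here is a hedged sketch of how that exchange rate (100 tokens per human researcher-second) converts a pause into research labor. The fleet-wide token throughput and pause length below are hypothetical placeholders, not numbers from the post:

```python
# Sketch: convert a capability pause into equivalent human researcher-years,
# using the stated exchange rate of 100 model tokens ~= 1 human researcher-second.
# Fleet throughput and pause length are made-up assumptions for illustration.
TOKENS_PER_RESEARCHER_SECOND = 100    # conversion rate stated in the text

tokens_per_second = 1e7               # assumed fleet-wide inference throughput
pause_months = 6                      # assumed lead spent entirely on alignment

seconds_in_pause = pause_months * 30 * 24 * 3600
researcher_seconds = tokens_per_second * seconds_in_pause / TOKENS_PER_RESEARCHER_SECOND

WORK_SECONDS_PER_YEAR = 2_000 * 3600  # ~2,000 working hours per researcher-year
researcher_years = researcher_seconds / WORK_SECONDS_PER_YEAR
print(f"Roughly {researcher_years:,.0f} human researcher-years of alignment work")
```

With these particular placeholder inputs the sketch prints roughly 216,000 researcher-years; the point is only to show how the conversion works, since the real answer depends entirely on the assumed throughput and pause length.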
According to this oth...