Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Ten Levels of AI Alignment Difficulty, published by Samuel Dylan Martin on July 3, 2023 on The AI Alignment Forum.
Chris Olah recently released a tweet thread describing how the Anthropic team thinks about AI alignment difficulty. On this view, there is a spectrum of possible scenarios, from ‘alignment is very easy’ to ‘alignment is impossible’, and we can frame AI alignment research as a process of increasing the probability of beneficial outcomes by addressing progressively harder scenarios. I think this framing is really useful, and here I expand on it by providing a more detailed scale of AI alignment difficulty and explaining some considerations that arise from it.
The discourse around AI safety is dominated by detailed conceptions of potential AI systems and their failure modes, along with ways to ensure their safety. This article by the DeepMind safety team provides an overview of some of these failure modes. I believe we can understand these various threat models through the lens of "alignment difficulty": sorting the sources of AI misalignment from easy to address to very hard to address, and then matching technical AI safety interventions to specific alignment failure scenarios. Making this uncertainty explicit also makes some debates between alignment researchers easier to understand.
An easier scenario could involve AI models generalising and learning goals in ways that fit with common sense. For example, it could be the case that LLMs of any level of complexity are best understood as generative models over potential writers, with Reinforcement Learning from Human Feedback (RLHF) or Constitutional AI (CAI) merely selecting among those potential writers. This is sometimes called ‘alignment by default’.
A hard scenario could look like the one outlined in ‘Deep Deceptiveness’, where systems rapidly and unpredictably generalise in ways that render previous alignment techniques obsolete, and also learn deceptive reward-hacking strategies that look superficially identical to good behaviour under external evaluations, red-teaming, adversarial testing, or interpretability examinations.
When addressing the spectrum of alignment difficulty, we should examine each segment separately.
If we assume that transformative AI will be produced, the misuse risk from aligned transformative AI does not depend on how difficult alignment is, whereas misalignment risk shrinks the easier alignment turns out to be. Misuse risk is therefore a relatively bigger problem the easier AI alignment is.
In easy scenarios, more resources should therefore be allocated to issues like structural risk, economic implications, misuse, and geopolitical problems. On the ‘harder’ end of easy, where RLHF-trained systems typically end up honestly and accurately pursuing oversimplified proxies for what we want, like ‘improve reported life satisfaction’ or ‘raise the stock price of company X’, we also have to worry about scenarios like Production Web or What Failure Looks Like (part 1), which require a mix of technical and governance interventions to address.
Intermediate scenarios are cases where behavioural safety isn’t good enough and the easiest ways to produce transformative AI result in dangerous deceptive misalignment: systems that work against our interests while pretending to be useful and safe. These scenarios require us to push harder on alignment work and explore promising strategies like scalable oversight, AI assistance with alignment research, and interpretability-based oversight processes. We should also focus on governance interventions that give the leading projects the time they need to actually implement these solutions, and then use them (in conjunction with governments and civil society) to change the overall strategic landscape and eliminate the risk of misaligned AI.
In contrast, if alignment is...