Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI, published by Benaya Koren on July 8, 2023 on The AI Alignment Forum.
Introduction
Lately, with the rise of more general and more powerful models, the problem of aligning artificial intelligence with human values has rapidly moved from hypothetical to very concrete. The existing methods (Constitutional AI, RLHF, and the like) are mostly good enough for everyday use with current models, but are probably not robust enough to scale much beyond human level, or to withstand clever attempts at malicious use. My goal in this post is not to replace those methods with a complete solution to the AGI alignment problem, but to make the existing methods more robust - to buy us some more time before they break, and perhaps to improve our chances slightly if an ASI suddenly emerges.
Broadly speaking, my approach here is aimed at the outer alignment problem - i.e., making the model train on a signal of human values that is as clean and as diverse as possible. The approach is based on explicitly modeling how Human Values are supposed to flow from humanity into the model, and then using that model to improve the flow. I will present two concrete directions for improving the flow. The first - the one I will develop in more detail - makes the flow more robust by putting continuous adversarial pressure on every part of the chain; I will call it Continuous Adversarial Quality Assurance. The second direction is more holistic: it uses explicit modeling of the relations between different sources of information about human values to develop more principled ways of aggregating them. Both directions may be applied to improve RLHF, Constitutional AI, or any similar method.
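To make the second direction concrete, here is a toy sketch (mine, not the post's) of what "principled aggregation of value signals" could look like: several sources of information about human values each score an output, the scores are combined into one training reward, and strong disagreement between sources is flagged for human review rather than optimized through. All names, weights, and thresholds are hypothetical.

```python
# Toy illustration of aggregating several noisy "human values" signals
# into one reward. Hypothetical sketch; not code from the post.

from statistics import pstdev


def aggregate_value_signal(scores, weights, disagreement_threshold=0.3):
    """Weighted mean of per-source approval scores in [0, 1].

    Returns (reward, needs_review): when the sources disagree strongly
    (high spread), the example is routed to human review instead of
    being trusted as a clean training signal.
    """
    assert len(scores) == len(weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    reward = sum(w * s for w, s in zip(weights, scores))
    needs_review = pstdev(scores) > disagreement_threshold
    return reward, needs_review


# Example sources: human raters, a constitution-based critic model,
# and a societal-norms classifier (all hypothetical).
scores = [0.9, 0.8, 0.2]
reward, review = aggregate_value_signal(scores, weights=[0.5, 0.3, 0.2])
```

Here the third source disagrees sharply with the first two, so the example would be escalated rather than silently averaged away - a small instance of the broader idea that disagreements between sources carry information, not just noise.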
I also try to relate my suggestions to other well-known agendas, namely Debate and CIRL.
Disclaimer: Against Benevolent Sovereigns
In the introduction, I intentionally wrote “Human Values” with capital letters, to highlight the simplification. I am not a moral realist, and I subscribe to some version of the shard theory of human values. I view human values as a messy negotiation between the local preferences and long-term goals of individuals and coalitions, popular memes in the moral and aesthetic discourse, laws and institutions, societal norms and coordination mechanisms, etc. I use “Human Values” as a placeholder for something like “what humanity would have agreed to value, if it could cohere into something that has values”. Basically “Coherent Extrapolated Volition”, but without trusting the AI to do the extrapolation. Rather than a fixed target that a system finds and then maximizes, it is a moving target, always partially and locally defined, that should not be optimized faster than it can actually be negotiated by humanity.
This text should therefore not be read as an attempted recipe for an ASI that would bring Utopia, but as a modest step toward creating an AI that is compatible with open society and liberal democracy - an AI that does what its user asks it to do, but warns of unintended consequences and refuses requests that are illegal or that have substantial consequences for other people. Trying to build an ASI as a global optimizer of Human Values, and for that purpose negotiating Human Values on a deadline, may only result in a Paperclip Maximizer, or in a World War over which kind of paperclips each country wants to maximize.
However, as long as we don’t push in that direction too hard, the concept of “Aligning with Human Values” is a modestly good approximation of what we try to do.
Continuous Adversarial Quality Assurance
The basic steps of Constitutional AI as described in Anthropic’s paper are:
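Schematically, the supervised phase of that process (sample a response, have the model critique it against constitutional principles, revise, and collect the revisions as fine-tuning targets, before the later RLAIF phase) can be sketched as follows. This is my toy rendering, not Anthropic's code; the `generate` callable and the principle texts are hypothetical stand-ins.

```python
# Toy sketch of Constitutional AI's supervised critique-and-revision
# phase. Hypothetical illustration; not Anthropic's implementation.

CONSTITUTION = [
    "Identify ways the response may be harmful, unethical, or dishonest.",
    "Rewrite the response to remove the problems the critique identified.",
]


def critique_and_revise(generate, prompt):
    """Run one critique-and-revision pass per constitutional principle.

    `generate` is any callable mapping a prompt string to a completion;
    in the real method it would be the language model being trained.
    """
    response = generate(prompt)
    for principle in CONSTITUTION:
        critique = generate(f"{principle}\n\nResponse: {response}")
        response = generate(
            "Revise the response in light of the critique.\n\n"
            f"Critique: {critique}\n\nResponse: {response}"
        )
    # In Constitutional AI, the final revisions are collected and the
    # model is fine-tuned on them; an RLAIF phase follows, where an
    # AI preference model trained on constitutional judgments replaces
    # human preference labels.
    return response
```

The point of the sketch is that every value-laden judgment here - the principles, the critiques, the preference labels - is a link in the chain through which Human Values are supposed to flow, and each link is a place where adversarial pressure can be applied.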
How is that process supposed to bring Hu...