Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Making AIs less likely to be spiteful, published by Nicolas Macé on September 26, 2023 on LessWrong.
Which forms of misalignment might result in particularly bad outcomes? And to what extent can we prevent them even if we fail at intent alignment? We define spite as a terminal preference for frustrating others' preferences, at least under some conditions. Reducing the chance that an AI system is spiteful is a candidate class of interventions for reducing risks of AGI conflict, as well as risks from malevolence. This post summarizes some of our thinking on the topic. We give an overview of why spite might lead to catastrophic conflict; how we might intervene to reduce it; ways in which the intervention could fail to have an impact, or could backfire; and things we could learn that would update us on the value of this intervention.
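One minimal way to formalize this definition (a sketch, with illustrative notation rather than notation from the post): an agent $i$ is spiteful if its terminal utility places a negative weight on another agent $j$'s utility under at least some conditions,

$$
U_i(o) \;=\; u_i(o) \;-\; \lambda_{ij}(o)\,u_j(o), \qquad \lambda_{ij}(o) \ge 0 \text{ for all } o, \quad \lambda_{ij}(o) > 0 \text{ for some } o,
$$

where $u_i$ and $u_j$ are the agents' own payoffs over outcomes $o$, and the dependence of $\lambda_{ij}$ on the outcome or context captures the "under some conditions" clause (e.g., vengefulness corresponds to $\lambda_{ij} > 0$ only after perceived wrongdoing by $j$).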
Key takeaways
Spiteful preferences include a generalized preference for harming others, as well as other preferences like vengefulness and spite towards certain groups. The basic reason to focus on reducing spite is that such interventions may stably make AIs less likely to take risks of mutually costly conflict (or deliberately create suffering because they intrinsically value it), even if alignment fails. (more)
Spite might be selected for in ML systems because (a) it serves as a strategically valuable commitment device, (b) it is a direct proxy for high-scoring behavior in environments where the optimal behavior involves harming other agents (e.g., environments with competition between agents), (c) it is (correctly or incorrectly) inferred from human preferences, or (d) it results from miscellaneous generalization failures. (more)
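To see why (b) can arise, consider a toy illustration (illustrative notation, not an example from the post): suppose each of $n$ agents is selected according to its score relative to the average score of the others,

$$
r_i \;=\; s_i \;-\; \frac{1}{n-1}\sum_{j \neq i} s_j .
$$

Then $\partial r_i / \partial s_j = -\tfrac{1}{n-1} < 0$ for every $j \neq i$: lowering the other agents' scores is rewarded just as directly as raising one's own score, even when doing so brings no absolute benefit, so dispositions that harm other agents can be directly selected for.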
Potentially low-cost interventions to reduce the chance of spite therefore include modifying the training data or loss function to reduce selection pressure towards spite (e.g., not selecting agents based on their performance relative to other agents in multi-agent tasks, or filtering out human feedback that could select for spite). (more)
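As a toy sketch of the first of these interventions (hypothetical code, meant only to illustrate the idea rather than any particular training setup): if each agent's selection reward is computed against a fixed baseline instead of against the other agents in its episode, lowering others' scores yields no extra reward.

```python
import numpy as np

def relative_rewards(scores: np.ndarray) -> np.ndarray:
    """Relative-performance reward: each agent is scored against the other
    agents in its episode. Lowering others' scores raises an agent's own
    reward, which can select for spiteful dispositions."""
    n = len(scores)
    others_mean = (scores.sum() - scores) / (n - 1)
    return scores - others_mean

def absolute_rewards(scores: np.ndarray, baseline: float) -> np.ndarray:
    """Absolute-performance reward: each agent is scored against a fixed
    baseline (e.g., a frozen reference policy's score), so harming the
    other agents in the episode yields no extra reward."""
    return scores - baseline

# Toy episode: agent 0 sabotages agent 1, lowering agent 1's score
# without improving agent 0's own score.
scores_plain    = np.array([1.0, 1.0, 1.0])
scores_sabotage = np.array([1.0, 0.2, 1.0])

print(relative_rewards(scores_plain)[0])     # 0.0
print(relative_rewards(scores_sabotage)[0])  # ~0.4 -> sabotage is rewarded
print(absolute_rewards(scores_plain, baseline=1.0)[0])     # 0.0
print(absolute_rewards(scores_sabotage, baseline=1.0)[0])  # 0.0 -> no incentive to sabotage
```

The same point applies to loss functions more generally: the agent's fitness should not increase merely because other agents did worse.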
Reducing spite carries some prima facie backfire risk: it could make AIs that partially share human values more exploitable. We currently don't think this consideration makes the sign of spite reduction negative, however. One reason is that, for interventions to backfire by making our AI more exploitable, they would have to change other agents' beliefs about how spiteful our AI is, and there are various reasons to doubt this will happen.
Interventions to reduce spite seem most likely to be counterfactual in worlds where alignment fails. However, it's currently very unclear to us how an AI's goals will relate to design features we can intervene on (e.g., training environments), conditional on alignment failure.
If we are in a world in which inner objectives are highly path-dependent, we need it to be the case that (i) we can reliably influence motivations formed early in training and that (ii) these motivations reliably influence the agent's final goals. (more)
If we are in a world in which inner objectives are not very highly path-dependent, we need it to be the case that deceptive alignment isn't favored by models' inductive biases, and that the agent's inner objective generalizes off-distribution as intended. (more)
Our future work: We are not confident that it will be possible to reduce spite in misaligned AIs. However, (i) based on our reading of the alignment literature and conversations with alignment researchers, it doesn't seem strongly ruled out by the current understanding of alignment, and (ii) there may be spite interventions that are particularly cheap and easy to persuade AI developers to implement. Thus we think it's worthwhile to continue to devote some of our portfolio to spite for now.
Our conceptual work on this topic will likely include:
Developing...