Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Collective Identity, published by Nicholas Kees Dupuis on May 18, 2023 on The AI Alignment Forum.
Thanks to Simon Celinder, Quentin Feuillade--Montixi, Nora Ammann, Clem von Stengel, Guillaume Corlouer, Brady Pelkey and Mikhail Seleznyov for feedback on drafts. This post was written in connection with the AI Safety Camp.
Executive Summary:
This document proposes an approach to corrigibility that focuses on training generative models to function as extensions of human agency. These models would be designed to lack independent values/preferences of their own, because they would not have an individual identity; rather, they would identify as part of a unified system composed of both human and AI components.
The selfless soldier: This section motivates the distinction between two kinds of group-centric behavior: altruism (which is rooted in individual identity) and collective identity.
Modeling groups vs individuals: Here we argue that individuals are not always the most task-appropriate abstraction, and that it often makes sense to model humans on the group level.
Generative predictive models: This section describes how generative predictive models will model themselves and their environment, and motivates the importance of the “model of self” and its connection to personal identity.
Strange identities: There are several ways in which, in humans, the one-to-one correspondence between a neural network and its model of self breaks down. This section discusses three such examples to suggest that identity is flexible enough that an AI’s identity need not be individual or individuated.
Steps toward identity fusion: Here we aim to clarify the goal of this agenda and what it would mean for an AI to have an identity based on a human-AI system such that the AI component extends the human’s agency. While we don’t give a clear plan for how to bring about this fusion, we do offer an antithetical example of what kind of training would clearly fail.
Relevance for corrigibility: This section concludes the document by drawing more direct connections to corrigibility, and by offering a series of open questions for how this research might be made more concrete.
The selfless soldier
In the heat of battle, a grenade is tossed into the middle of a troop of soldiers. One soldier throws themself on top of the grenade, sacrificing themself for the survival of the troop. There are two main ways to frame what just happened.
Altruism (individual identity): The soldier has the personal value/preference of protecting their troop from harm. Reasoning (quickly) from this value, the soldier deduces that they must sacrifice themself in order to bring about the future where their fellow soldiers are safe.
Collective Identity: The individual soldier is not the most important abstraction to explain/predict this situation; rather, it is the troop as a whole. The troop cares about its own survival, and this manifests in the decision to sacrifice one of its members to protect itself from further harm (even though the cognition, at least at the moment of decision, happens entirely within one brain). While the individual soldier could theoretically use clever arguments to escape this conclusion, they do not (because, as a component of the troop, this is not their function).
The problem of alignment is often framed as trying to ensure that the values of an AI system are aligned with humanity, ideally imbuing them with a certain kind of perfect altruism toward humankind. The problem of corrigibility is often framed as ensuring that even when those values are not (yet) perfectly aligned with our own, an adversarial relationship does not develop between the AI and its human designers (such that it would resist shutdown attempts or changes to its source code).
This approach tries instead to expl...