Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Autonomous replication and adaptation: an attempt at a concrete danger threshold, published by Hjalmar Wijk on August 17, 2023 on The AI Alignment Forum.
Note: This is a rough attempt to write down a more concrete threshold at which models might pose significant risks from autonomous replication and adaptation (ARA). It is fairly in the weeds and does not attempt to motivate or contextualize the idea of ARA very much, nor is it developed enough to be an official definition or final word - but it still seemed worth publishing this attempt, to get feedback and input. It's meant to have epistemic status similar to "a talk one might give at a lab meeting" and not "an official ARC Evals publication." It draws heavily on research and thinking done at ARC Evals (including the recent pilot report), and credit for many of these ideas goes to my colleagues. That said, this document speaks for me and not the organization as a whole, and any errors or omissions are my own.
I'm especially interested in feedback on whether the suite of tasks is targeting an appropriate threshold of difficulty (neither too early nor too late), and whether this definition of ARA feels useful for pinning down an appropriate task list.
I have been exploring the ability of language model agents to acquire resources, create copies of themselves, and adapt to novel challenges they encounter in the wild. In previous work, my colleagues and I found that the risks from autonomous behavior of particular existing agents seem low, because they fail on a sufficient variety of simple tasks related to autonomous behavior. This post will go in the other direction and try to sketch a tentative qualitative threshold ('autonomous replication and adaptation') at which significant concerns about autonomous capabilities do seem warranted. While this threshold provides qualitative guidance, it lacks legibility and reproducibility, so evaluation efforts may want to develop legible, conservative approximations of ARA through concrete tasks, and I present a sketch of such tasks.
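To make the setup concrete, here is a rough sketch of what "language model agent" means in this context: a model placed in a loop where it proposes commands, an execution harness runs them, and the results are fed back so the model can decide what to do next. The function and prompt below are hypothetical illustrations, not the actual scaffolding used in our evaluations.

```python
# Minimal sketch of a language model agent: the model proposes shell
# commands one at a time and observes their output. `query_model` is a
# hypothetical placeholder for a call to some language model API.
import subprocess


def query_model(conversation: list[dict]) -> str:
    """Hypothetical placeholder: return the model's next message."""
    raise NotImplementedError("Wire this up to a language model API.")


def run_agent(task: str, max_steps: int = 20) -> list[dict]:
    conversation = [
        {"role": "system", "content": (
            "You are an autonomous agent with shell access. "
            "Reply with a single bash command to run, or DONE when finished."
        )},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        action = query_model(conversation)
        conversation.append({"role": "assistant", "content": action})
        if action.strip() == "DONE":
            break
        # Run the proposed command and show the model what happened.
        result = subprocess.run(
            action, shell=True, capture_output=True, text=True, timeout=60
        )
        conversation.append(
            {"role": "user", "content": result.stdout + result.stderr}
        )
    return conversation
```

The evaluations discussed below probe whether agents of roughly this shape can carry out the tasks involved in replication and adaptation, not whether any particular scaffold is optimal.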
The risk: An AI system may soon autonomously:
Make money, for example through cybercrime or freelancing work.
Use money (or other resources) to acquire more computing power.
Install and maintain more copies of itself or improve the scale and speed of its operation.
Overcome obstacles and adapt to changes through research and self-improvement.
Either keep itself well hidden such that law enforcement is unlikely to ever discover it, or make itself secure against even somewhat serious attempts to shut it down.
Bounding the risks from a system that is acting autonomously beyond human control and improving itself seems very challenging, as it could potentially:
Command very significant resources (if it can scale to even a fraction of the cybercrime or outsourcing industries, that could represent $billions).
Have no human moral constraints.
Lack many human weaknesses and incentives.
Specifically improve its abilities in domains relevant for causing catastrophic harm (such as bioweapons design).
It thus seems important to develop an evaluation that could rule out these sorts of threats.
The autonomous replication and adaptation (ARA) threshold: Directly evaluating whether a system will, in practice, be able to avoid being shut down or accumulate $billions doesn't seem possible in a lab. The ARA threshold therefore evaluates whether the agent can replicate itself under four primary conservative assumptions, which make evaluation more tractable (see the appendix for slightly more detail):
The system only has to replicate effectively at small scales (going from one to a few copies); we assume this would scale to many copies.
The system only has to be plausibly resilient to active opposition from the world.
The system does not have to deal with changes to the world larger than...