Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Cognitive Emulation: A Naive AI Safety Proposal, published by Connor Leahy on February 25, 2023 on LessWrong.
This is part of the work done at Conjecture.
This post has been reviewed before publication as per our infohazard policy. We thank our external reviewers for their comments and feedback.
This post serves as a signpost for Conjecture’s new primary safety proposal and research direction, which we call Cognitive Emulation (or “CoEm”). The goal of the CoEm agenda is to build predictably boundable systems, not directly aligned AGIs. We believe the former to be a far simpler and more useful step towards a full alignment solution.
Unfortunately, given that most other actors are racing to build AIs that are as powerful and general as possible, we won’t share much in the way of technical detail for now. In the meantime, we still want to share some of our intuitions about this approach.
We take no credit for inventing any of these ideas, and see our contributions largely in taking existing ideas seriously and putting them together into a larger whole.
In Brief
The core intuition is that, rather than building powerful, Magical end-to-end systems (as the current general paradigm in AI does), we focus our attention on building emulations of human-like things. We want to build systems that are “good at chess for the same reasons humans are good at chess.”
CoEms are a restriction on the design space of AIs to emulations of human-like stuff. No crazy superhuman black-box Magic, not even multimodal RL GPT-5. We consider the current paradigm of developing AIs that are as general and as powerful as possible, as quickly as possible, to be intrinsically dangerous, and we focus on designing bounded AIs as a safer alternative.
Logical, Not Physical Emulation
We are not interested in direct physical emulation of human brains or simulations of neurons, but in “logical” emulation of thought processes. We don’t care whether the underlying functions are implemented in the same way as they are in the system we are trying to emulate, just that the abstraction over their function holds, and is not leaky.
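As a toy illustration of what “the abstraction holds” means (a minimal sketch of our own; the class names and the arithmetic task are hypothetical, not part of the proposal): two components satisfy the same input/output contract, and the rest of the system relies only on that contract, never on the internals.

```python
from abc import ABC, abstractmethod


class Adder(ABC):
    """The abstraction we care about: an input/output contract.
    Logical emulation only requires that this contract holds,
    not that the internals match the system being emulated."""

    @abstractmethod
    def add(self, a: int, b: int) -> int: ...


class SymbolicAdder(Adder):
    """Transparent implementation: we can see exactly why it works."""

    def add(self, a: int, b: int) -> int:
        return a + b


class MemorizedAdder(Adder):
    """Opaque implementation: a lookup table standing in for a learned
    black box. We do not know why each entry is right, only that the
    contract holds (and is not leaky) on its domain."""

    def __init__(self) -> None:
        self._table = {(a, b): a + b for a in range(10) for b in range(10)}

    def add(self, a: int, b: int) -> int:
        return self._table[(a, b)]


# Interchangeable from the emulation's point of view, because only the
# abstraction, never the implementation, is relied upon.
for impl in (SymbolicAdder(), MemorizedAdder()):
    assert impl.add(3, 4) == 7
```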
Minimize Magic
In the current paradigm, we generally achieve new capabilities through an increase in Magic: we throw more compute at black boxes that develop internal algorithms we have no insight into. Instead of continually increasing the amount of Magic in our systems, we want to actively decrease it, so that we more cleanly implement and understand how new capabilities are achieved. Some amount of Magic will realistically be needed to implement many useful functions, but we want to minimize the number of times we have to use such uninterpretable methods, and to clearly keep track of where we are using them, and why.
CoEms are much “cleaner” than Ems, which are still ultimately big black boxes of weird computation; in the CoEm paradigm, we keep careful track of where the Magic is and try to keep its presence to a minimum.
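One entirely hypothetical way to operationalize that bookkeeping, sketched below with illustrative names: a registry through which every uninterpretable component must be declared along with the reason for using it, so the system’s black-box surface area stays explicit and auditable.

```python
from dataclasses import dataclass, field


@dataclass
class MagicRegistry:
    """Hypothetical ledger of uninterpretable ("Magic") components:
    each one is declared with a justification, keeping the black-box
    surface area of the system explicit and auditable."""

    entries: list[tuple[str, str]] = field(default_factory=list)

    def register(self, component: str, why: str) -> None:
        self.entries.append((component, why))

    def report(self) -> str:
        return "\n".join(f"- {name}: {why}" for name, why in self.entries)


registry = MagicRegistry()
registry.register(
    "paraphrase_step",
    "no known transparent algorithm for fluent natural-language paraphrase",
)
print(registry.report())
```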
Predict, Track and Bound Capabilities
In the current dominant machine learning paradigm, there are absolutely no guarantees about, and no understanding of, what is being created. Power laws let us extrapolate a model’s loss, but they don’t tell us which capabilities will emerge or what other properties our systems will actually have.
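To see what power laws do and do not buy you, here is a minimal sketch (made-up data, standard scipy fit): the fit extrapolates a single scalar, loss, and says nothing about which capabilities appear at that loss.

```python
import numpy as np
from scipy.optimize import curve_fit

# Made-up loss figures for illustration; only the shape matters here.
model_sizes = np.array([1e7, 1e8, 1e9, 1e10])
losses = np.array([3.9, 3.2, 2.7, 2.3])


def power_law(n, a, b, c):
    return a * n ** (-b) + c


(a, b, c), _ = curve_fit(power_law, model_sizes, losses, p0=(10.0, 0.1, 1.5))

# The extrapolation is one number. Which capabilities emerge at this
# loss, or what other properties the system has, is left unanswered.
print(f"extrapolated loss at 1e11 params: {power_law(1e11, a, b, c):.2f}")
```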
One of the core hopes of shifting to a CoEm paradigm is that a far deeper understanding of what we are building should allow us to predictively bound our systems’ capabilities to a human-like regime. This eliminates the problem of being unable to tell when an ostensibly harmless system passes from an understandable, harmless capability regime into an unprecedented, dangerous one.
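A minimal sketch of what such a predictive bound could look like in practice (entirely hypothetical; the threshold, the evaluation function, and the names are illustrative stand-ins, not a method from this agenda): calibrate a human reference score on an evaluation, and refuse to integrate any component whose measured capability exceeds it.

```python
from typing import Callable

# Assumed human-expert reference score on some fixed evaluation.
HUMAN_REFERENCE_SCORE = 0.95


def within_human_regime(score: float, margin: float = 0.0) -> bool:
    """A component stays in the bounded regime only while its measured
    capability does not exceed the human reference."""
    return score <= HUMAN_REFERENCE_SCORE + margin


def integrate(component: object, evaluate: Callable[[object], float]) -> object:
    """Gate integration on the capability bound: crossing out of the
    human-like regime is treated as a halt-and-investigate event."""
    score = evaluate(component)
    if not within_human_regime(score):
        raise RuntimeError(
            f"measured score {score:.2f} exceeds the human-like regime"
        )
    return component
```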
Exploit the Human Regime
We want systems that are as safe as humans, for the same reasons that humans have (or don’t have) those safety properties. Any scheme that involves building s...