Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A list of core AI safety problems and how I hope to solve them, published by davidad (David A. Dalrymple) on August 26, 2023 on The AI Alignment Forum.
Context: I sometimes find myself referring back to this tweet and wanted to give it a more permanent home. While I'm at it, I thought I would try to give a concise summary of how each distinct problem would be solved by an Open Agency Architecture (OAA), if OAA turns out to be feasible.
1. Value is fragile and hard to specify.
See: Specification gaming examples, Defining and Characterizing Reward Hacking
OAA Solution:
1.1. First, instead of trying to specify "value", "de-pessimize" and specify the absence of a catastrophe, plus maybe a handful of bounded constructive tasks like supplying clean water. A de-pessimizing OAA would effectively buy humanity more time to tackle the CEV-style alignment problem - which is harder than merely mitigating extinction risk. This doesn't mean limiting the power of the underlying AI systems so that they can only do bounded tasks, but rather containing that power and limiting its use.
The absence of a catastrophe is still hard to specify and will take a lot of effort, but the hardness is concentrated in bridging between high-level human concepts and the causal mechanisms in the world through which an AI system can intervene.
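To make the shape of such a specification concrete, here is a toy, hand-written sketch: the state variables, thresholds, and predicates below are invented for illustration and are not OAA's specification language. The point is that the object being specified is a predicate on trajectories ("no catastrophe ever, and the bounded task eventually gets done"), not a scalar value to be maximized.

```python
# Toy sketch: a trajectory-level safety-plus-bounded-task specification.
# All state variables and thresholds here are made up for illustration.
from typing import Iterable

def is_catastrophe(state: dict) -> bool:
    """Illustrative catastrophe predicate over high-level state variables."""
    return state["human_population"] < 7.0e9 or not state["humans_in_control"]

def clean_water_supplied(state: dict) -> bool:
    """Illustrative bounded constructive task."""
    return state["clean_water_fraction"] >= 0.95

def satisfies_spec(trace: Iterable[dict]) -> bool:
    """Always no catastrophe, and eventually the task is accomplished."""
    states = list(trace)
    no_catastrophe = all(not is_catastrophe(s) for s in states)
    task_done = any(clean_water_supplied(s) for s in states)
    return no_catastrophe and task_done

example_trace = [
    {"human_population": 8.0e9, "humans_in_control": True, "clean_water_fraction": 0.80},
    {"human_population": 8.0e9, "humans_in_control": True, "clean_water_fraction": 0.97},
]
print(satisfies_spec(example_trace))  # True for this toy trace
```

In OAA the analogous specification would be stated in a formal language and checked against a world model rather than a hand-written trace, but the kind of object is the same: a property of trajectories rather than an objective to maximize.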
1.2. Leverage human-level AI systems to automate much of the cognitive labor of formalizing scientific models - from quantum chemistry to atmospheric dynamics - and formalizing the bridging relations between levels of abstraction, so that we can write specifications in a high-level language with a fully explainable grounding in low-level physical phenomena. Physical phenomena themselves are likely to be robust, even if the world changes dramatically due to increasingly powerful AI interventions, and scientific explanations thereof happen to be both robust and compact enough for people to understand.
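As a toy illustration of what a "bridging relation" between levels of abstraction is, the sketch below checks that stepping a low-level model and then abstracting agrees with abstracting and then stepping a high-level model. The particle and center-of-mass models are assumptions made up for this example; a real OAA pipeline would state and prove such relations formally rather than spot-check them on samples.

```python
# Toy sketch of a bridging relation: low-level particles vs. a high-level
# center-of-mass model. The check is an (approximate) commutation property.
import numpy as np

DT = 0.1
rng = np.random.default_rng(0)

def step_low(positions, velocities):
    """Low-level model: many particles drifting at constant velocities."""
    return positions + DT * velocities

def abstract(positions):
    """Bridging map: low-level particle state -> high-level center of mass."""
    return positions.mean()

def step_high(center, mean_velocity):
    """High-level model: the center of mass drifts at the mean velocity."""
    return center + DT * mean_velocity

worst_gap = 0.0
for _ in range(1000):
    pos = rng.normal(size=100)
    vel = rng.normal(size=100)
    lhs = abstract(step_low(pos, vel))           # step low, then abstract
    rhs = step_high(abstract(pos), vel.mean())   # abstract, then step high
    worst_gap = max(worst_gap, abs(lhs - rhs))

print(f"worst commutation error over sampled states: {worst_gap:.2e}")
```

The value of automating this kind of work is that the high-level specification language ("clean water", "people remain in control") can then be grounded, step by step, in models whose low-level terms are physical and explainable.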
2. Corrigibility is anti-natural.
See: The Off-Switch Game, Corrigibility (2014)
OAA Solution: (2.1) Instead of building in a shutdown button, build in a shutdown timer. See "You can still fetch the coffee today if you're dead tomorrow." This enables human stakeholders to change course periodically (as long as the specification of non-catastrophe is good enough to ensure that most humans remain physically and mentally intact).
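A minimal sketch of the difference, assuming a gym-style policy/environment interface (the names below are illustrative, not OAA's architecture): the deadline is fixed at deployment time, so continued operation never depends on the system choosing to permit shutdown.

```python
# Minimal sketch of a shutdown timer (as opposed to a shutdown button).
import time

def run_with_shutdown_timer(policy, environment, horizon_seconds: float):
    deadline = time.monotonic() + horizon_seconds  # fixed before deployment
    observation = environment.reset()
    while time.monotonic() < deadline:
        observation = environment.step(policy(observation))
    # Past the deadline nothing runs. Humans decide whether, and under what
    # specification, to redeploy - that is when course corrections happen.
```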
3. Pivotal processes require dangerous capabilities.
See: Pivotal outcomes and pivotal processes
OAA Solution: (3.1) Indeed, dangerous capabilities will be required. Push for reasonable governance. This does not mean creating one world government, but it does mean that the objectives of a pivotal process will need to be negotiated and agreed upon internationally. Fortunately, for now, dangerous capabilities seem to depend on having large amounts of computing hardware, which can be controlled like other highly dangerous substances.
4. Goals misgeneralize out of distribution.
See: Goal misgeneralization: why correct specifications aren't enough for correct goals, Goal misgeneralization in deep reinforcement learning
OAA Solution: (4.1) Use formal methods with verifiable proof certificates. Misgeneralization can occur whenever a property (such as goal alignment) has been tested only on a subset of the state space. Out-of-distribution failures of a property can only be ruled out by an argument for a universally quantified statement about that property - but such arguments can in fact be made! See VNN-COMP. In practice, it will not be possible to have enough information about the world to "prove" that a catastrophe will not be caused by an unfortunate coincidence, but instead we can obtain guaranteed probabilistic bounds via stochastic model checking.
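As a small illustration of a universally quantified guarantee (the weights and threshold below are made up, and this is only the simplest member of the family of techniques benchmarked in VNN-COMP, not the specific method OAA would use): interval bound propagation through a tiny ReLU network certifies an output bound for every input in a box, rather than for a finite set of test inputs.

```python
# Interval bound propagation through a toy 2-layer ReLU network.
import numpy as np

W1 = np.array([[1.0, -0.5], [0.3, 0.8]])
b1 = np.array([0.1, -0.2])
W2 = np.array([[0.7, -1.2]])
b2 = np.array([0.05])

def interval_affine(lo, hi, W, b):
    """Exact image of an axis-aligned box under x -> W @ x + b."""
    center, radius = (lo + hi) / 2.0, (hi - lo) / 2.0
    new_center = W @ center + b
    new_radius = np.abs(W) @ radius
    return new_center - new_radius, new_center + new_radius

def interval_relu(lo, hi):
    """ReLU is monotone, so it maps interval endpoints to interval endpoints."""
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

# Input region: every x with both coordinates in [-1, 1].
lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
lo, hi = interval_affine(lo, hi, W1, b1)
lo, hi = interval_relu(lo, hi)
lo, hi = interval_affine(lo, hi, W2, b2)

THRESHOLD = 3.0
print(f"certified output range: [{lo[0]:.3f}, {hi[0]:.3f}]")
# If the upper bound is below the threshold, the property holds for ALL
# inputs in the box - a universally quantified statement, not a test result.
print("property certified for the whole box:", bool(hi[0] < THRESHOLD))
```

When the property is probabilistic rather than a hard bound, stochastic model checkers play the analogous role: they return sound bounds on the probability of reaching a bad state, which is the kind of guarantee meant by "guaranteed probabilistic bounds" above.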
5. Instrumental convergence.
See: The basic AI drives, Seeking power is often convergently instrumental in MDPs