The Nonlinear Library

LW - Davidad's Bold Plan for Alignment: An In-Depth Explanation by Raphaël S


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Davidad's Bold Plan for Alignment: An In-Depth Explanation, published by Raphaël S on April 19, 2023 on LessWrong.
Gabin Kolly and Charbel-Raphaël Segerie contributed equally to this post. Davidad proofread this post.
Thanks to Vanessa Kosoy, Siméon Campos, Jérémy Andréoletti, Guillaume Corlouer, Jeanne S., Vladimir I. and Clément Dumas for useful comments.
Context
Davidad has proposed an intricate architecture to address the alignment problem, one that requires substantial background knowledge to comprehend fully. We believe that there are currently insufficient public explanations of this ambitious plan. The following is our understanding of the plan, gleaned from discussions with Davidad.
This document adopts an informal tone. The initial sections offer a simplified overview, while the latter sections delve into questions and relatively technical subjects. This plan may seem extremely ambitious, but the annex provides further elaboration on certain sub-steps and potential internship topics, which would enable us to test some ideas relatively quickly.
Davidad’s plan is an entirely new paradigmatic approach to address the hard part of alignment: The Open Agency Architecture aims at “building an AI system that helps us ethically end the acute risk period without creating its own catastrophic risks that would be worse than the status quo”.
This plan is motivated by the assumption that current paradigms for model alignment won't be successful:
LLMs cannot be aligned with RLHF alone, or with any variation of that technique.
Scalable oversight will be too difficult.
Using interpretability to reverse-engineer an arbitrary model will not be feasible. Instead, it would be easier to iteratively construct an understandable world model.
Unlike OpenAI's plan, which is a meta-level plan that delegates the task of finding an alignment solution to future AI, Davidad's plan is an object-level plan that takes drastic measures to prevent future problems. It is also based on assumptions that can be tested relatively quickly (see the annex).
Plan’s tl;dr: use near-AGIs to build a detailed world simulation, then train an AI within it and formally verify that it adheres to coarse human preferences and avoids catastrophic outcomes.
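To give a feel for the "formally verify" step, here is a minimal sketch of model-checking in miniature: an exhaustive reachability check that a fixed policy never drives a toy world model into a catastrophic state. Everything here (the WORLD_MODEL graph, the POLICY table, the verify function) is a hypothetical stand-in we made up for illustration; Davidad's actual proposal involves probabilistic model-checking against a learned, compositional world model at an incomparably larger scale.

```python
from collections import deque

# Hypothetical toy world model: state -> action -> list of possible next
# states (nondeterminism stands in for environmental uncertainty).
WORLD_MODEL = {
    "safe_start": {"act": ["monitored", "safe_start"]},
    "monitored": {"act": ["safe_start"], "escalate": ["catastrophe"]},
    "catastrophe": {},
}

# Coarse desideratum: the system must never reach one of these states.
CATASTROPHES = {"catastrophe"}

# The (hypothetical) policy under review: which action it takes in each state.
POLICY = {"safe_start": "act", "monitored": "act"}


def verify(initial_states):
    """Exhaustively check that POLICY can never reach a catastrophic state.

    Returns (True, None) if the desideratum holds on every reachable state,
    or (False, offending_state) as a counterexample otherwise.
    """
    seen = set(initial_states)
    frontier = deque(initial_states)
    while frontier:
        state = frontier.popleft()
        if state in CATASTROPHES:
            return False, state
        action = POLICY.get(state)
        for nxt in WORLD_MODEL.get(state, {}).get(action, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return True, None


print(verify(["safe_start"]))  # (True, None): 'escalate' is never chosen
```

The point of the sketch is the shape of the guarantee: instead of testing a handful of trajectories, verification quantifies over every state the policy can possibly reach within the world model.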
How to read this post?
This post is much easier to read than the original post, but we are aware that it is still fairly technical. Here's a way to read it gradually:
Start with the Bird's-eye view (5 minutes)
Contemplate the Bird's-eye view diagram (5 minutes), without spending time understanding the mathematical notations in the diagram.
Fine-Grained Scheme: Focus on the non-starred sections and skip the starred ones. Don’t spend too much time on the difficulties (10 minutes)
From "Hypothesis discussion" onwards, the rest of the post should be easy to read (10 minutes)
For more technical details, you can read:
The starred sections of the Fine-Grained Scheme
The Annex.
Bird's-eye view
The plan comprises four key steps:
Understand the problem: This entails formalizing the problem, similar to deciphering the rules of an unfamiliar game like chess. In this case, the focus is on understanding reality and human preferences.
World Modeling: Develop a comprehensive and intelligent model of the world capable of being used for model-checking. This could be akin to an ultra-realistic video game built on the finest existing scientific models. Achieving a sufficient model falls under the Scientific Sufficiency Hypothesis (a discussion of those hypotheses can be found later on) and would be one of the most ambitious scientific projects in human history.
Specification Modeling: Generate a list of moral desiderata, such as a model that safeguards humans from catastrophes. The Deontic Sufficiency Hypothesis posits that it is possible to ...