The Nonlinear Library

LW - Conditioning Predictive Models: Outer alignment via careful conditioning by evhub


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conditioning Predictive Models: Outer alignment via careful conditioning, published by evhub on February 2, 2023 on LessWrong.
This is the second of seven posts in the Conditioning Predictive Models Sequence based on the forthcoming paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. We will be releasing posts gradually over the course of the next week or so to give people time to read and digest them as they come out. We are starting with posts one and two, with post two being the largest and most content-rich of all seven.
2. Outer alignment via careful conditioning
Suppose we actually get a predictive model of the world that we can condition on arbitrary observations. What should we do with it?
One thing that is often done with large language models is to ask them to act as dialogue agents (e.g. predict what comes next after an “AI:” prompt). As we’ll discuss extensively in this section, however, we think that asking a predictive model to predict itself or another AI system is highly unsafe, as the AI systems being predicted may not themselves be safe. Instead, our basic proposal is to try to predict the most useful possible human outputs.
One canonical thing that you might want to do in that vein is “predict the results of humans doing AI safety research” (though note that this is just an example). Specifically, here’s GPT-3.5 trained with RLHF (text-davinci-003) at zero temperature (prompt is bolded):[1]
# A full solution to the AI alignment problem
Unfortunately, there is no single solution to the AI alignment problem. AI alignment is a complex and multifaceted problem that requires a comprehensive approach.
The first step in solving the AI alignment problem is to define the goals of the AI system. This involves understanding the desired outcomes of the AI system and how it should interact with its environment. Once the goals are defined, the next step is to develop a framework for how the AI system should behave in order to achieve those goals. This framework should include ethical considerations, such as ensuring that the AI system does not cause harm to humans or other sentient beings.
Unfortunately, as can be seen from GPT-3.5’s continuation, it doesn’t actually think a paper titled “A full solution to the AI alignment problem” will contain such a solution; instead, it seems to think it’s far more likely that the paper will just talk about what such a solution might entail. That in fact seems true, at least for any such paper written today. Fortunately, at least for current InstructGPT models, it’s not that hard to get GPT-3.5 to seemingly give a real attempt.[2] Here’s another trajectory from text-davinci-003 at zero temperature (prompt is again bolded):
# ARC Technical Report: A full solution to the AI alignment problem
Introduction
The AI alignment problem is one of the most pressing challenges of our time. It is the challenge of ensuring that artificial intelligence (AI) systems are designed and deployed in a way that is beneficial to humanity. This is a difficult problem because AI systems are often designed to optimize for a single goal, such as maximizing profits or minimizing costs, without considering the broader implications of their actions. As AI systems become more powerful and more widely deployed, it is increasingly important to ensure that they are designed and deployed in a way that is beneficial to humanity.
This report presents a full solution to the AI alignment problem. It begins by discussing the current state of the problem and the challenges that must be addressed in order to achieve a full solution. It then outlines a comprehensive approach to solving the prob...
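For readers who want to reproduce trajectories like the two above, here is a minimal sketch of the kind of query involved. It assumes the legacy OpenAI Completions API and its pre-1.0 Python client; text-davinci-003 has since been deprecated, the API key is a placeholder, and the max_tokens cap is an assumption not specified in the post, so treat this as illustrative rather than an exact recipe.

import openai  # legacy pre-1.0 client: pip install "openai<1.0"

openai.api_key = "YOUR_API_KEY"  # hypothetical placeholder

# The bolded prompts quoted above, reproduced verbatim.
prompts = [
    "# A full solution to the AI alignment problem",
    "# ARC Technical Report: A full solution to the AI alignment problem",
]

for prompt in prompts:
    response = openai.Completion.create(
        model="text-davinci-003",  # the RLHF-trained InstructGPT model quoted in the post
        prompt=prompt,
        temperature=0,             # zero temperature: greedy, (near-)deterministic continuation
        max_tokens=300,            # assumed length cap; the post does not specify one
    )
    print(prompt + response.choices[0].text)
    print("-" * 80)

Zero temperature makes the sampling approximately deterministic, which is why continuations like the ones quoted above can be reproduced rather than being one draw among many.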