The Nonlinear Library

LW - Questions about Conjecture's CoEm proposal by Akash



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Questions about Conjecture's CoEm proposal, published by Akash on March 9, 2023 on LessWrong.
Conjecture recently released an AI safety proposal. The three of us spent a few hours discussing the proposal and identifying questions that we have. (First, we each re-read the post and independently brainstormed a few questions we had. Then, we discussed the post, exchanged questions/uncertainties, and consolidated our lists.)
Conjecture's post is concise, which means it leaves out many details. Many of our questions are requests for more details that would allow us (and others) to better understand the proposal and evaluate it more thoroughly.
Requesting examples and details
What are the building blocks that the CoEms approach will draw from? What are examples of past work that has shown us how to build powerful systems that are human-understandable?
What are examples of “knowledge of building systems that are broadly beneficial and safe while operating in the human capabilities regime?” (see Wei_Dai’s comment)
What’s an example of an experiment that would be considered part of the CoEm agenda? (see Garret Baker’s comment)
What kinds of approaches does Conjecture intend to use to extract alignment insights “purely from mining current level systems”? (Is this the same as interpretability research and digital neuroscience?)
The “minimize magic” section seems to be where the juice is, but it is barely explained, which makes it difficult to evaluate. Can you offer more details about how you intend to minimize magic?
Conceptual questions
Assume you had a fully human-understandable system, and you could understand its current capabilities. How would you be able to forecast its future capabilities (e.g., if deployed or if given certain commands)?
If we solve human neuroscience such that we could understand the brain of a 2-year-old, we would be able to accurately assess the (current) capabilities of the 2-year-old. However, we would not necessarily be able to predict the (future) capabilities of this brain once it is 30 years old. Analogously, if we had a human-understandable AI (that may be superintelligent) through the CoEms agenda, would we only be able to understand its current capabilities, or would there be a reliable way to forecast its future capabilities?
Charlotte thinks that humans and advanced AIs are universal Turing machines, so predicting capabilities is not about whether a capability is present at all, but whether it is feasible in finite time with a low enough error rate. Predicting how such error rates decline with experience and learning seems roughly equally hard for human-understandable AIs and other AIs.
How easy is it to retarget humans?
When you refer to “retargetability”, we assume you refer to something like the following: “Currently the AI has goal X, and you want to train it to have goal Y. If you do that, you truly change its goals to Y (rather than making it pretend to follow Y and then, when you are no longer in control, switch back to X).”
We agree that in some sense, humans are retargetable. For example, this seems possible if someone has very advanced persuasion tools or if the “persuader” is significantly stronger than the “persuadee” (e.g., a dictator persuading a citizen).
But even that is very hard, and often one just changes their incentives/strategy rather than their actual goals.
However, humans seem to be much less retargetable by other agents who are similarly powerful. For example, how would you retarget the goals of an (equally intelligent and equally powerful) neighbor?
Alternatively, you might refer to a much weaker version of “retargetability”, e.g., a very weak version of corrigible alignment. If this is what you mean, I am wondering why this is a particularly important property.
Other questions
Does Conjecture believe this approach is comp...

The Nonlinear Library, by The Nonlinear Fund
