The Nonlinear Library

AF - The Translucent Thoughts Hypotheses and Their Implications by Fabien Roger

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Translucent Thoughts Hypotheses and Their Implications, published by Fabien Roger on March 9, 2023 on The AI Alignment Forum.
Epistemic status: Uncertain about the validity of the claims I’m making here, and looking for feedback about the research directions I’m suggesting.
Thanks to Marius Hobbhahn, Johannes Treutlein, Siméon Campos, and Jean-Stanislas Denain for helpful feedback on drafts.
Here is a set of hypotheses:
The first AGIs will have LLMs at their core
Effective plans to defeat humanity can’t be found in a single LLM forward pass
LLMs will solve complex tasks by using English text (self-prompting, scratchpads, combinations of expert LLMs, ...)
I call these the Translucent Thoughts hypotheses.
I think the Translucent Thoughts hypotheses are likely (around 20%, conditional on AGI before 2030) because:
Text pretraining is more efficient than alternative methods at building the algorithms and knowledge required for real-world plan generation and evaluation;
Future models are likely to be Transformer-like, performing only a limited number of serial steps in a single forward pass, while deception requires many serial steps;
Text pretraining and light fine-tuning make models able to use text generation to increase the maximum number of serial steps by a huge factor (see the sketch below). Getting this increase through other means is likely to be hard and uncompetitive.
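As a rough illustration of this last point, here is a minimal sketch (in Python, with a hypothetical `forward_pass` callable standing in for a single bounded-depth LLM forward pass, not any real API) of how a scratchpad loop multiplies the effective number of serial steps: each generated token conditions on all the text produced so far, so serial depth scales with the length of the generated text rather than with the depth of the network.

```python
from typing import Callable

def solve_with_scratchpad(
    task: str,
    forward_pass: Callable[[str], str],  # hypothetical: returns the next token given the context
    max_tokens: int = 1000,
) -> str:
    """Sketch: effective serial steps ~ (serial depth per forward pass) x (tokens generated)."""
    scratchpad = task + "\nReasoning step by step:\n"
    for _ in range(max_tokens):
        token = forward_pass(scratchpad)  # one forward pass: limited serial depth
        scratchpad += token               # the reasoning state is externalized as English text
        if token.endswith("<END>"):       # hypothetical stop marker
            break
    return scratchpad
```

The property that matters for safety is that the intermediate state lives in the scratchpad as English text rather than only in hidden activations.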
If these hypotheses are true, they should lead us to prioritize underexplored research directions, such as circumventing steganography or building extremely reliable text-supervision methods. I think those deserve attention, because Translucent Thoughts AIs are not safe by default.
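To give a concrete (and entirely hypothetical) picture of what "circumventing steganography" could look like: one option is to paraphrase each chunk of generated reasoning with a separate model before it is carried forward, so that information hidden in exact word choices does not survive into later steps. The `forward_pass` and `paraphrase` callables below are stand-ins, not real APIs, and this is only one possible shape of such a defense.

```python
from typing import Callable

def monitored_scratchpad(
    task: str,
    forward_pass: Callable[[str], str],  # hypothetical: produces the next chunk of reasoning
    paraphrase: Callable[[str], str],    # hypothetical: rewrites a chunk, preserving only its plain meaning
    max_steps: int = 50,
) -> str:
    """Sketch of a paraphrasing defense against steganography: only paraphrased text is carried forward."""
    scratchpad = task + "\nReasoning:\n"
    for _ in range(max_steps):
        chunk = forward_pass(scratchpad)
        clean = paraphrase(chunk)          # hidden encodings in the wording should not survive this step
        scratchpad += clean + "\n"
        if "FINAL ANSWER" in clean:        # hypothetical stop condition
            break
    return scratchpad
```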
In this post, I argue that we may live in a world where the first AGIs look like X, and I then describe ways to make the first AGIs safer given X. This differs from most other work in this space, which often directly describes a kind of safe AGI. That said, the ideas in this post are close to some other work describing paths to safe AGIs, such as:
Externalized Reasoning Oversight, which describes a class of solutions similar to the one outlined here, but also aims for additional properties which I argue can be replaced with a less stringent hypothesis about AI systems;
Conditioning Predictive Models, which makes assumptions slightly different from the Translucent Thoughts hypotheses, yielding different research directions;
The Open Agency Model and Factored Cognition, which describe subsets of AIs with Translucent Thoughts that might be safe.
The Translucent Thoughts Hypotheses
Here, I sketch a world in which the first AGIs have certain properties. I argue that this world is likely enough to be a subset of possible futures worth caring about, though I don't think it is a large part of all possible futures (around 20%, conditional on AGI before 2030).
The First AGIs Will Have LLMs at Their Core
By “first AGIs” I mean the first systems able to automate all cognitive tasks.
AGI is likely to do reasoning and planning using LLMs. AGI might rely on vision models for some tasks and interactions with the world, and it might use explicit search processes like AlphaGo. But I expect LLMs to do plan generation and evaluation, which are the core of the system (from an Alignment point of view).
Why: Vision systems are bad at coming up with and evaluating deceptive plans. Explicit search processes can’t generate and evaluate plans in the real world. LLMs seem to be able to do both plan generation and evaluation. (Plan generation and evaluation are the core tasks we would like to monitor to make AGIs safe, which is why I focus on those.)
End-to-end neural networks won't be able to compete with LLMs when it comes to reasoning and planning, or at least, end-to-end networks will use "their LLM parts" to do their most advanced forms of reasoning and planning. This means that...