Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conditioning Predictive Models: Large language models as predictors, published by evhub on February 2, 2023 on LessWrong.
This is the first of seven posts in the Conditioning Predictive Models Sequence based on the forthcoming paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. We will be releasing posts gradually over the course of the next week or so to give people time to read and digest them as they come out. We are starting with posts one and two; post two is the largest and most content-rich of the seven.
Thanks to Paul Christiano, Kyle McDonell, Laria Reynolds, Collin Burns, Rohin Shah, Ethan Perez, Nicholas Schiefer, Sam Marks, William Saunders, Evan R. Murphy, Paul Colognese, Tamera Lanham, Arun Jose, Ramana Kumar, Thomas Woodside, Abram Demski, Jared Kaplan, Beth Barnes, Danny Hernandez, Amanda Askell, Robert Krzyzanowski, and Andrei Alexandru for useful conversations, comments, and feedback.
Abstract
Our intention is to provide a definitive reference on what it would take to safely make use of predictive models in the absence of a solution to the Eliciting Latent Knowledge problem.
Furthermore, we believe that large language models can be understood as such predictive models of the world, and that such a conceptualization raises significant opportunities for their safe yet powerful use via carefully conditioning them to predict desirable outputs.
Unfortunately, such approaches also raise a variety of potentially fatal safety problems, particularly surrounding situations where predictive models predict the output of other AI systems, potentially unbeknownst to us. There are numerous potential solutions to such problems, however, primarily via carefully conditioning models to predict the things we want—e.g. humans—rather than the things we don’t—e.g. malign AIs.
Furthermore, due to the simplicity of the prediction objective, we believe that predictive models present the easiest inner alignment problem that we are aware of.
As a result, we think that conditioning approaches for predictive models represent the safest known way of eliciting human-level and slightly superhuman capabilities from large language models and other similar future models.
1. Large language models as predictors
Suppose you have a very advanced, powerful large language model (LLM) generated via self-supervised pre-training. It’s clearly capable of solving complex tasks when prompted or fine-tuned in the right way—it can write code as well as a human, produce human-level summaries, write news articles, etc.—but we don’t know what it is actually doing internally that produces those capabilities. It could be that your language model is:
a loose collection of heuristics,[1]
a generative model of token transitions,
a simulator that picks from a repertoire of humans to simulate,
a proxy-aligned agent optimizing proxies like sentence grammaticality,
an agent minimizing its cross-entropy loss,
an agent maximizing long-run predictive accuracy,
a deceptive agent trying to gain power in the world,
a general inductor,
a predictive model of the world,
etc.
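Whichever of these hypotheses turns out to be correct, the training signal that produced the model is concrete and simple: self-supervised next-token prediction. As a rough illustrative sketch (not notation from the paper; `pretraining_loss` and `model` are placeholders, with `model` assumed to map token ids to next-token logits), the pre-training objective looks like:

```python
import torch.nn.functional as F

def pretraining_loss(model, tokens):
    """Standard self-supervised next-token objective (cross-entropy).

    tokens: integer tensor of shape (batch, seq_len) drawn from the training corpus.
    model:  assumed to map a prefix of token ids to logits over the vocabulary,
            i.e. model(x) has shape (batch, seq_len - 1, vocab_size) here.
    """
    logits = model(tokens[:, :-1])            # predict token t from the tokens before it
    targets = tokens[:, 1:]                   # the tokens that actually came next
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch * (seq_len - 1), vocab_size)
        targets.reshape(-1),                  # (batch * (seq_len - 1),)
    )
```

Every hypothesis on the list above is compatible with scoring well on this objective; the open question is which internal structure training actually selects for.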
Later, we’ll discuss why you might expect to get one of these over the others, but for now, we’re going to focus on the possibility that your language model is well-understood as a predictive model of the world.
In particular, our aim is to understand what it would look like to safely use predictive models to perform slightly superhuman tasks[2]—e.g. predicting counterfactual worlds to extract the outputs of long serial research processes.[3]
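Concretely, “conditioning” a predictive model here just means feeding it an observation (a prompt) and sampling its prediction of what comes next. Below is a minimal sketch of that workflow, assuming a Hugging Face-style causal LM interface; the model name and the prompt are illustrative placeholders, not anything from the paper:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in for a far more capable predictive model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The conditional: an observation describing the (counterfactual) world whose
# continuation we want the model to predict.
observation = (
    "The following is the final report of a decade-long research program, "
    "reproduced in full:\n\n"
)

inputs = tokenizer(observation, return_tensors="pt")

# Sample from the model's predictive distribution p(continuation | observation).
output_ids = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    top_p=0.95,
)
continuation = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:])
print(continuation)
```

The hope sketched above is that, for a sufficiently capable predictive model, choosing the right observation to condition on lets you extract the output of a long serial research process without running it yourself, which is why so much of the rest of the sequence is about which conditionals are safe to use.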
We think that this basic approach has hope for two reasons. First, the prediction orthogonality thesis seems basically right: we think...