The Nonlinear Library

AF - Conditioning Predictive Models: Interactions with other approaches by Evan Hubinger



Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conditioning Predictive Models: Interactions with other approaches, published by Evan Hubinger on February 8, 2023 on The AI Alignment Forum.
This is the fifth of seven posts in the Conditioning Predictive Models Sequence based on the paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. We will be releasing posts gradually over the course of the next week or so to give people time to read and digest them as they come out.
5. Interactions with other approaches
Imitation learning
One closely related approach is imitation learning: rather than training a model to predict humans, as we propose, one could simply train it to imitate them instead. Such an approach would have many of the same safety benefits, since it would also be exclusively trying to produce outputs from safe humans.
The basic problem with such an approach, however, is that there’s no reason to believe that a model trained via pure imitation learning would generalize beyond the capability level of the human(s) it is trained to imitate. While using predictive models to predict humans also cannot produce outputs that humans would never be able to generate, it can produce outputs that no human it has previously seen would be able to generate, since it might e.g. predict that such humans will exist under some conditional.
Thus, we think that predictive modeling at least has the potential to be just as safe as imitation learning while being able to generalize to substantially more advanced capabilities—though, similarly to imitation learning, predicting humans still cannot elicit capabilities beyond those that any conceivable human would be capable of, as we discussed previously.
Supervised fine-tuning
For some conditionals we might have a very precise notion of what we want the model to observe (e.g. “exactly this image coming from this camera”). Ideally, this sort of conditional should be straightforwardly implementable via prompting, just by fixing the relevant tokens in the model’s context window.[1] However, at least for current models, prompting has some basic structural limitations. For example, if you want to condition on something very long, context window length can become quite problematic. In that sort of case, it might be quite helpful to instead turn to supervised fine-tuning: fine-tuning on the observation we want to condition on rather than including it in a prompt. Effectively, this sort of fine-tuning lets us give the model substantially more bits of evidence to condition on than is possible via prompting alone.
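To make the contrast concrete, here is a minimal sketch of conditioning via fine-tuning rather than prompting, assuming a Hugging Face causal language model. The model name, observation string, learning rate, and number of gradient steps are all illustrative placeholders rather than anything specified in the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder: any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Stand-in for an observation too long to fit in the context window;
# a real observation would be chunked into training sequences.
observation = "exactly this transcript, as recorded by this camera ..."

inputs = tokenizer(observation, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for _ in range(3):  # a few gradient steps on the observation
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# The model has now absorbed the observation as evidence to condition
# on, without spending any context-window tokens on it at query time.
model.eval()
query = tokenizer("Given what was observed, what happens next?",
                  return_tensors="pt")
print(tokenizer.decode(model.generate(**query, max_new_tokens=40)[0]))
```

The point of the sketch is just that the observation enters through the weights rather than the prompt, so its length is bounded by training compute rather than by the context window.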
For the most part, we think this is likely to be basically fine, since it’s essentially continuous with pre-training: if we think that pre-training produces the sort of predictive model we want, then including some extra pre-training-style data and fine-tuning on it should do the same. The primary concern here, however, would be situations where the fine-tuning data is for some reason not very continuous with the pre-training data.
One way the fine-tuning data could be substantively different from the pre-training data is if it directly depends on the model itself, e.g. fine-tuning on the model’s own outputs. Not only is this substantially less continuous with pre-training, but it also specifically raises the risk of the model imitating AIs and/or producing self-fulfilling prophecies.
Such fine-tuning could also be particularly problematic if the data is specifically selected according to some criterion other than actual representativeness of the world—that is, if there’s no clear “camera” that corresponds to how the data was collected. Probably the most notable way this could happen is via reinforcement learning ...