Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conditioning Predictive Models: Open problems, Conclusion, and Appendix, published by Evan Hubinger on February 10, 2023 on The AI Alignment Forum.
This is the final of seven posts in the Conditioning Predictive Models Sequence based on the paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper.
7. Open problems
We think that there are a wide variety of ways—both experimental and theoretical—in which our analysis could be expanded upon. Here, we’ll try to briefly lay out some of the future directions that we are most excited about—though note that this is only a sampling of some possible future directions, and is thus a highly incomplete list:
Are pre-trained LLMs well-modeled as predictive models or agents?
As pre-trained model scale increases, do markers of agentic behavior increase as well?
See “Discovering Language Model Behaviors with Model-Written Evaluations” for some initial results on this question.
To what extent do LLMs exhibit distributional generalization?
Distributional generalization seems like evidence of acting as a generative/predictive model rather than just optimizing cross-entropy loss.
To the extent that current LLMs are doing some sort of prediction, can we find evidence of that in their internal structure?
Is the RLHF conditioning hypothesis true?
How do markers of agentic behavior change as the amount of RLHF done increases, and under different RLHF fine-tuning regimes?
See “Discovering Language Model Behaviors with Model-Written Evaluations” for some initial results on this question.
For anything that an RLHF model can do, is there always a prompt that gets a pre-trained model to do the same thing? What about a soft prompt or a prompt chain?
In addition to validating the extent to which RLHF models can be mimicked using techniques that are more clearly implementing a conditional, a positive result here could also provide an alternative to RLHF that allows us to get the same results without relying on the RLHF conditioning hypothesis at all.
More generally, how similar are RLHF fine-tuned models to pre-trained models with fine-tuned soft prompts?
The idea here being that a soft prompt is perhaps more straightforward to think of as a sort of conditional.
To what extent do RLHF fine-tuned models exhibit distributional generalization?
Relevant here for the same reason as in the pre-training case.
To what extent can you recover the original pre-trained distribution/capabilities from an RLHF fine-tuned model?
If an RLHF model no longer successfully solves some prediction task by default, how easy is it to turn back on that capability via additional fine-tuning, or did the RLHF destroy it completely?
If it is generally possible to do this, it is some evidence that the original pre-trained distribution is still largely maintained in the RLHF model.
How do markers of agentic behavior change as we change the RL reward? Is it very different between human-like and random rewards? What happens if we exactly invert the standard helpfulness reward?
This can help test whether agency is coming from the specific choice of RL reward or the general process of RLHF.
How do RLHF fine-tuned models differ from their own preference model, especially regarding markers of agentic behavior?
To the extent that fine-tuned models get closer to their preference models as scale increases, preference models can serve as a proxy for future RLHF models.
Are there ways of changing standard RLHF techniques to make them more likely to produce conditionals rather than agents?
How do alternative, more myopic RL training schemes—such as the one described here—affect markers of agentic behavior? Can we use such techniques...