Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Imitation Learning from Language Feedback, published by Jérémy Scheurer on March 30, 2023 on The AI Alignment Forum.
TL;DR: Specifying the intended behavior of language models is hard, and current methods, such as RLHF, only incorporate low-resolution (binary) feedback. To address this issue, we introduce Imitation learning from Language Feedback (ILF), an iterative algorithm that leverages language feedback as an information-rich and natural way of guiding a language model toward desired outputs. We showcase the effectiveness of our algorithm in two papers, on the tasks of summary writing (Scheurer et al. 2023) and code generation (Chen et al. 2023). We discuss how language feedback can be used for process-based supervision and to guide model exploration, potentially enabling improved safety over RLHF. Finally, we develop theory showing that our algorithm can be viewed as Bayesian inference, just like RLHF, which positions it as a competitive alternative to RLHF while having the potential safety benefits of predictive models.
We propose an iterative algorithm called Imitation learning from Language Feedback (ILF) that leverages language feedback to train language models to generate text that (outer-) aligns with human preferences. The algorithm assumes access to an initial LM which generates an output given a specific input. A human then provides language feedback on the input-output pair. The language feedback is not restricted in any way and can highlight issues, suggest improvements, or even acknowledge positive aspects of the output. ILF then proceeds in three steps:
1. Generate multiple refinements of the initial LM-generated output, given the input and the language feedback. We use a Refinement LM (e.g., an instruction-finetuned LM) to generate the refinements (one could, however, use the same LM that generated the initial output).
2. Select the refinement that best incorporates the feedback, using a language reward model such as an instruction-finetuned LM, which we call InstructRM (Scheurer et al. 2023), or using unit tests (Chen et al. 2023).
3. Finetune the initial LM on the selected refinements, given the input.

These steps can be applied iteratively: the finetuned model generates the initial outputs in the next iteration, more feedback is collected on those outputs, and so on (a code sketch of the loop is given below). With this refine-and-finetune approach, we finetune an LM on language feedback in a supervised manner. A single iteration of ILF is also used as a first step in the Constitutional AI method (Bai et al. 2022). In the figures below, we show the full ILF algorithm on the task of summarization (top) and code generation (bottom).
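To make the loop concrete, here is a minimal Python sketch of one ILF iteration. All interfaces here (lm.generate, refinement_lm.generate, reward_model.score, lm.finetune, collect_human_feedback) are hypothetical placeholders rather than the actual code of Scheurer et al. 2023 or Chen et al. 2023; in the code-generation setting, the reward-model score would be replaced by checking each refinement against unit tests.

```python
# Minimal sketch of one ILF iteration (hypothetical interfaces, not the authors' implementation).

def ilf_iteration(lm, refinement_lm, reward_model, inputs, n_refinements=5):
    """One refine-and-finetune pass of Imitation learning from Language Feedback."""
    finetuning_data = []
    for x in inputs:
        y0 = lm.generate(x)                        # initial LM output for input x
        feedback = collect_human_feedback(x, y0)   # free-form language feedback from a human
        # Step 1: generate multiple refinements conditioned on input, output, and feedback.
        refinements = [
            refinement_lm.generate(input=x, output=y0, feedback=feedback)
            for _ in range(n_refinements)
        ]
        # Step 2: select the refinement that best incorporates the feedback,
        # e.g. by scoring with an instruction-finetuned LM (InstructRM) or,
        # for code generation, by running unit tests.
        best = max(refinements, key=lambda y: reward_model.score(x, y0, feedback, y))
        finetuning_data.append((x, best))
    # Step 3: finetune the initial LM on (input, selected refinement) pairs.
    lm.finetune(finetuning_data)
    return lm
```

Calling ilf_iteration repeatedly, feeding the returned finetuned model back in as lm, gives the full iterative algorithm.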
Why Language Feedback?
Language Feedback is a Natural Abstraction for Humans
Language Models (LMs) are powerful tools trained on large datasets of text from the internet. However, it is difficult to specify the intended behavior of an LM, particularly on hard tasks where the desired behavior can't be adequately demonstrated or defined, which can result in catastrophic outcomes caused by goal misspecification (Langosco et al. 2021, Shah et al. 2022). To address this issue, we propose using language feedback as a way to outer-align LMs with human preferences and introduce a novel algorithm called Imitation learning from Language Feedback (ILF). Compared to the binary comparisons used in Reinforcement Learning from Human Feedback (RLHF), language feedback is a more natural and information-rich form of human feedback: it conveys more bits of information, enabling a more nuanced and comprehensive understanding of human preferences. Additionally, expressing feedback in language provides natural abstractions that align well with human ontology.
The use of language as a transmission protocol and file format has been optimized over thousands of years to facilitate human cooperation...