

This paper introduces InstructGPT, demonstrating a highly effective method for aligning large language models with user intent using Reinforcement Learning from Human Feedback (RLHF). The authors show that making language models bigger does not inherently make them more helpful, truthful, or safe; instead, fine-tuning them with human feedback drastically improves their ability to follow instructions.
The methodology consists of three core steps:
1. Supervised Fine-Tuning (SFT): The process begins by fine-tuning GPT-3 using supervised learning on a dataset of human-written demonstrations of desired model behavior.
2. Reward Model (RM) Training: The researchers then collect a dataset of comparisons, where human labelers rank different model outputs from best to worst. This data is used to train a reward model to predict which output a human would prefer.
3. Reinforcement Learning (RL): Finally, the supervised model is fine-tuned to maximize the reward from the RM using the Proximal Policy Optimization (PPO) algorithm.
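The comparison step (step 2) can be sketched as a Bradley-Terry-style ranking loss: the reward model is trained to score the human-preferred output above the rejected one, and a ranking of K outputs yields K*(K-1)/2 pairwise terms. This is a minimal plain-Python illustration; the function names and scalar formulation are my own, not the paper's code.

```python
import math
from itertools import combinations

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rm_pairwise_loss(r_preferred, r_rejected):
    """Loss for one human comparison: push the reward model to
    score the preferred completion above the rejected one.
    loss = -log(sigmoid(r_preferred - r_rejected))"""
    return -math.log(sigmoid(r_preferred - r_rejected))

def ranking_loss(scores_best_to_worst):
    """A labeler ranking of K outputs expands into all K*(K-1)/2
    ordered pairs; average the pairwise loss over those pairs."""
    pairs = list(combinations(scores_best_to_worst, 2))
    return sum(rm_pairwise_loss(w, l) for w, l in pairs) / len(pairs)
```

A tied pair costs log 2 ≈ 0.693, and the loss falls toward zero as the reward margin between preferred and rejected outputs grows, which is exactly the gradient signal the RM needs.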
The key findings of the paper include:
• Significant Preference over GPT-3: Human evaluators vastly preferred InstructGPT outputs over standard GPT-3 outputs. Remarkably, the 1.3B parameter InstructGPT model outperformed the 175B parameter GPT-3 model, despite having 100x fewer parameters.
• Increased Truthfulness and Safety: InstructGPT models generate truthful answers about twice as often as GPT-3, hallucinate less frequently, and generate 25% fewer toxic outputs when prompted to be respectful.
• Mitigating the "Alignment Tax": Standard RLHF fine-tuning caused performance regressions on certain public NLP datasets. The authors solved this by mixing PPO updates with pretraining updates (creating the "PPO-ptx" model), effectively minimizing this alignment tax while maintaining high human preference scores.
• Out-of-Distribution Generalization: The models successfully generalized the concept of "following instructions" to areas rarely seen in the fine-tuning data, such as writing code and understanding non-English languages.
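The "PPO-ptx" fix described above amounts to adding a pretraining log-likelihood term to the RL objective, alongside a KL penalty that keeps the policy close to the SFT model. The sketch below uses scalar stand-ins for the paper's expectations; the function itself and its parameter names are illustrative, with beta and gamma denoting the KL and pretraining-mix coefficients.

```python
def ppo_ptx_objective(reward, logp_rl, logp_sft, logp_pretrain, beta, gamma):
    """Per-sample sketch of the PPO-ptx objective:
    RM reward, minus a KL-style penalty (log-ratio between the RL
    policy and the frozen SFT policy), plus gamma times the policy's
    log-likelihood on a pretraining sample."""
    kl_penalty = beta * (logp_rl - logp_sft)  # per-sample log-ratio estimate
    return (reward - kl_penalty) + gamma * logp_pretrain
```

Setting gamma to zero recovers the plain PPO objective, which is what exhibited the alignment-tax regressions; a positive gamma trades a little reward for retained pretraining performance.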
Despite these major improvements, the authors note that InstructGPT still has limitations. It can still make simple mistakes, overly hedge on simple questions, exhibit biases, and, crucially, will often follow a user's instruction even if the requested output is toxic or harmful. The paper concludes that while fine-tuning with human feedback is a highly promising direction for AI alignment, significant work remains to ensure full safety and reliability.
By Yun Wu