Reinforcement Learning from Human Feedback (RLHF) has emerged as the central alignment technique used to finetune state-of-the-art systems such as GPT-4, Claude-2, Bard, and Llama-2. However, RLHF has a number of known problems, and these models have exhibited some troubling alignment failures. How did we get here? What lessons should we learn? And what does it mean for the next generation of AI systems?
By Aaron Bergman