The AI Concepts Podcast

Module 3: Reinforcement Learning from Human Feedback


This episode addresses how Reinforcement Learning from Human Feedback (RLHF) adds the final layer of alignment after supervised fine-tuning, shifting the training signal from “right vs wrong” to “better vs worse.” We explore how preference rankings become a reward signal (a reward model trained on rankings, then optimized with PPO) and the newer shortcut, Direct Preference Optimization (DPO), which learns from preferences directly. We then connect RLHF to safety through the Helpful, Honest, Harmless goal and unpack the “alignment tax”: the trade-off between being safe and being genuinely useful. We close by setting up the next module on running models at scale, starting with GPU memory limits, and with a personal reflection on starting later without being behind.
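
The episode stays conceptual, but for readers who want to see the DPO “shortcut” concretely, here is a minimal sketch of a DPO-style preference loss in PyTorch. The function name, beta value, and toy inputs are illustrative assumptions for this write-up, not material from the episode.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Each argument: 1-D tensor of summed token log-probabilities for the
        # chosen (preferred) or rejected completion, under the policy being
        # trained or the frozen reference (SFT) model.
        # The implicit "reward" is the log-ratio between policy and reference.
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # Logistic loss pushes the chosen completion's reward above the rejected one's,
        # so the model learns "better vs worse" without a separate reward model or PPO.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Toy usage with made-up log-probabilities for four preference pairs.
    torch.manual_seed(0)
    pol_chosen, pol_rejected = torch.randn(4), torch.randn(4)
    ref_chosen, ref_rejected = torch.randn(4), torch.randn(4)
    print(dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected))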


The AI Concepts Podcast
By Sheetal ’Shay’ Dhar