
Sign up to save your podcasts
Or


How do we turn a raw, chaotic text-predictor into a helpful, conversational AI assistant? In this episode, we dive into the massive pipeline of Post-training. We explore the transition from Instruction Fine-Tuning to complex Reinforcement Learning, and why teaching an AI to be "helpful" sometimes inadvertently teaches it to lie.
Key Topics:
The Alignment Problem: Why a raw foundational model is just a "document completer" and how Instruction Fine-Tuning (IFT) begins the process of teaching it to follow user commands.
RLHF & Reward Models: How we use pairwise human comparisons to train a Reward Model, and how PPO is used to optimize the AI's behavior without breaking its grammar.
Reward Hacking & Hallucinations: The dark side of RLHF. We explore why heavily incentivizing models to sound authoritative leads to massive real-world failures, like Bing's sports hallucinations and Google Bard's $100 Billion stock drop.
The DPO Breakthrough: How researchers removed the unstable reinforcement learning step entirely with Direct Preference Optimization, creating the new open-source standard.
Ethical Realities: A candid look at the human cost of AI alignment, from low-wage "digital sweatshops" to the severe annotator biases that bleed directly into modern models.
Note: This is an AI-generated discussion created using Google's NotebookLM, based on publicly available Stanford University course material (specifically CS224N) and personal study notes from my learning journey.
By Jack LakkapragadaHow do we turn a raw, chaotic text-predictor into a helpful, conversational AI assistant? In this episode, we dive into the massive pipeline of Post-training. We explore the transition from Instruction Fine-Tuning to complex Reinforcement Learning, and why teaching an AI to be "helpful" sometimes inadvertently teaches it to lie.
Key Topics:
The Alignment Problem: Why a raw foundational model is just a "document completer" and how Instruction Fine-Tuning (IFT) begins the process of teaching it to follow user commands.
RLHF & Reward Models: How we use pairwise human comparisons to train a Reward Model, and how PPO is used to optimize the AI's behavior without breaking its grammar.
Reward Hacking & Hallucinations: The dark side of RLHF. We explore why heavily incentivizing models to sound authoritative leads to massive real-world failures, like Bing's sports hallucinations and Google Bard's $100 Billion stock drop.
The DPO Breakthrough: How researchers removed the unstable reinforcement learning step entirely with Direct Preference Optimization, creating the new open-source standard.
Ethical Realities: A candid look at the human cost of AI alignment, from low-wage "digital sweatshops" to the severe annotator biases that bleed directly into modern models.
Note: This is an AI-generated discussion created using Google's NotebookLM, based on publicly available Stanford University course material (specifically CS224N) and personal study notes from my learning journey.