June 11, 2026

EP 46 | CS224N: Post-training

22 minutes

How do we turn a raw, chaotic text-predictor into a helpful, conversational AI assistant? In this episode, we dive into the Post-training pipeline. We trace the path from Instruction Fine-Tuning to Reinforcement Learning from Human Feedback (RLHF), and look at why teaching an AI to be "helpful" sometimes inadvertently teaches it to lie.

Key Topics:

The Alignment Problem: Why a raw foundation model is just a "document completer" and how Instruction Fine-Tuning (IFT) begins teaching it to follow user instructions.
RLHF & Reward Models: How we use pairwise human comparisons to train a Reward Model, and how PPO optimizes the AI's behavior against that reward while a penalty for drifting from the original model keeps its output fluent.
Reward Hacking & Hallucinations: The dark side of RLHF. We explore how rewarding models for sounding authoritative can encourage confident fabrication, with high-profile examples like Bing's sports hallucinations and the Google Bard demo error that preceded a roughly $100 billion drop in Alphabet's market value.
The DPO Breakthrough: How Direct Preference Optimization replaces the separate reward model and the unstable reinforcement learning loop with a single loss computed directly on preference pairs, and why it has been widely adopted for open-source models.
Ethical Realities: A candid look at the human cost of AI alignment, from low-wage "digital sweatshops" to the severe annotator biases that bleed directly into modern models.
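
For readers who want to see the two preference losses mentioned above in concrete form, here is a minimal sketch in plain Python. It is not taken from the episode or the course materials. The function names (log_sigmoid, reward_model_loss, dpo_loss), the scalar inputs, and the default beta of 0.1 are illustrative assumptions; a real training setup would compute these over batches of token log-probabilities in a deep learning framework.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # log(sigmoid(x)) = -log(1 + exp(-x)); branch so exp() never overflows.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss for one human comparison.

    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and large when it ranks them backwards.
    """
    return -log_sigmoid(reward_chosen - reward_rejected)


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one preference pair, given summed sequence log-probs.

    Each response is scored by how much more likely the policy makes it
    than the frozen reference model does; no separate reward model is used.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the direction of the losses.
    print(reward_model_loss(2.0, 0.0))  # reward model agrees with the human
    print(reward_model_loss(0.0, 2.0))  # reward model disagrees
    print(dpo_loss(-10.0, -14.0, -12.0, -12.0))  # policy favors the chosen reply
```

Both losses share the same shape: a log-sigmoid of a score difference between the preferred and rejected response. The difference is where the scores come from. In RLHF they are outputs of a trained Reward Model, while in DPO they are log-probability ratios between the model being trained and a frozen reference copy.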

Note: This is an AI-generated discussion created using Google's NotebookLM, based on publicly available Stanford University course material (specifically CS224N) and personal study notes from my learning journey.

By Jack Lakkapragada
