In this episode:
• Introduction: The Wait for Tokens: Professor Norris and Linda introduce the episode's paper, EAGLE-3, and discuss the persistent cost bottleneck of token-by-token autoregressive generation in modern LLMs.
• The Speculative Ceiling: Linda explains how earlier speculative sampling methods like EAGLE hit a performance wall: adding more training data failed to improve the draft model. The culprit is the feature prediction constraint, which forces the draft model to reproduce the target model's hidden states (a loss-level sketch follows this list).
• Innovation: Training-Time Test: A deep dive into EAGLE-3's core innovation: abandoning the feature prediction loss in favor of direct token prediction, paired with a "training-time test" that unrolls the draft model's own multi-step drafting during training so it faces the same conditions it will see at inference (see the second sketch below).
• Going Deeper: Multi-Layer Fusion: The hosts discuss the second major architectural change: rather than relying solely on top-layer features, the draft model fuses low-, mid-, and high-level features from the target model for richer context (the third sketch below shows the idea).
• Results: A New Scaling Law: Linda reveals the experimental results, including a 6.5x speedup, SGLang integration, and the discovery of a scaling law where draft models finally benefit from more data.
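
To make the ceiling concrete, here is a minimal sketch of the two training objectives, assuming hypothetical tensor names and shapes (the real EAGLE training code differs): EAGLE adds a feature regression term that pins the draft model to the target's hidden states, while EAGLE-3 keeps only token-level cross-entropy.

```python
import torch
import torch.nn.functional as F

d_model, vocab = 512, 32000
pred_feature = torch.randn(4, d_model)         # draft model's predicted hidden state
target_feature = torch.randn(4, d_model)       # target model's actual hidden state
draft_logits = torch.randn(4, vocab)           # draft model's next-token logits
target_tokens = torch.randint(0, vocab, (4,))  # ground-truth tokens from the target

# EAGLE / EAGLE-2: a regression term forces the draft model to reproduce the
# target's features; this constraint caps what extra training data can buy.
loss_eagle = F.smooth_l1_loss(pred_feature, target_feature) \
             + F.cross_entropy(draft_logits, target_tokens)

# EAGLE-3: the feature term is dropped; only token prediction remains, so the
# draft model is free to improve as more data is added.
loss_eagle3 = F.cross_entropy(draft_logits, target_tokens)
```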
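The training-time test can be sketched as an unrolled training loop. Everything below (TinyDraft, training_time_test_loss, the layer shapes) is a hypothetical simplification, not EAGLE-3's actual code; the point is that the draft model consumes its own step-t hidden state at step t+1, exactly as it will during inference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDraft(nn.Module):
    def __init__(self, d_model=512, vocab=32000):
        super().__init__()
        self.step = nn.Linear(2 * d_model, d_model)  # consumes [context ; prev hidden]
        self.head = nn.Linear(d_model, vocab)        # LM head producing token logits

    def forward(self, context, hidden):
        h = torch.tanh(self.step(torch.cat([context, hidden], dim=-1)))
        return h, self.head(h)

def training_time_test_loss(draft, context, init_hidden, target_tokens, steps=3):
    """Unroll `steps` draft steps; the loss is token cross-entropy only."""
    hidden, loss = init_hidden, 0.0
    for t in range(steps):
        # The draft model's own hidden state is fed back in, simulating
        # inference-time multi-step drafting during training.
        hidden, logits = draft(context, hidden)
        loss = loss + F.cross_entropy(logits, target_tokens[:, t])
    return loss / steps

draft = TinyDraft()
ctx = torch.randn(4, 512)               # features extracted from the target model
h0 = torch.zeros(4, 512)
tgt = torch.randint(0, 32000, (4, 3))   # target tokens for 3 draft steps
training_time_test_loss(draft, ctx, h0, tgt).backward()
```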
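Multi-layer fusion can be sketched as a concatenate-and-project module; the specific layer indices below are illustrative assumptions, not the paper's choices.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        # One linear layer maps the concatenated 3*d_model features to d_model.
        self.proj = nn.Linear(3 * d_model, d_model)

    def forward(self, hidden_states):
        # hidden_states: per-layer activations from the target model,
        # each of shape (batch, seq, d_model).
        low = hidden_states[2]                        # an early layer
        mid = hidden_states[len(hidden_states) // 2]  # a middle layer
        high = hidden_states[-1]                      # the top layer EAGLE used alone
        fused = torch.cat([low, mid, high], dim=-1)
        return self.proj(fused)

# Usage with dummy activations standing in for a 24-layer target model:
acts = [torch.randn(2, 16, 512) for _ in range(24)]
ctx = FeatureFusion()(acts)   # (2, 16, 512), fed to the draft model as context
```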