In this episode:
• Honey, I Shrunk the Activated Parameters: Linda introduces the massive Kimi K2 paper, focusing on its 'agentic intelligence' and its surprisingly small share of activated parameters (32 billion out of roughly a trillion total). Professor Norris offers some witty initial skepticism about yet another trillion-parameter model.
• Taming the Exploding Logits: The hosts get technical, discussing the novel MuonClip optimizer, designed to solve training instability by keeping attention logits from blowing up (see the sketch after this list). They also explore the clever pre-training data strategy of 'rephrasing' to maximize token utility from a limited data pool.
• Teaching a Model to Use Tools: This chapter focuses on post-training, where Linda explains the large-scale synthetic data pipeline for teaching tool use. They also delve into the reinforcement learning framework that combines verifiable rewards with a self-critique mechanism (a toy reward-mixing example appears after the list).
• Climbing the Leaderboard: Linda and Professor Norris unpack Kimi K2's impressive benchmark performance, highlighting its state-of-the-art results on agentic and coding tasks. They conclude with final thoughts on what this powerful open-weight model means for the field.
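
For listeners who want to see the logit-taming idea in code: a minimal sketch of the trick discussed in the MuonClip chapter, assuming it works by rescaling a head's query/key projection weights whenever the largest observed attention logit exceeds a threshold τ. The function name, tensor shapes, and threshold value here are illustrative, not the paper's exact implementation.

```python
import torch

def qk_clip_(w_q: torch.Tensor, w_k: torch.Tensor,
             max_logit: float, tau: float = 100.0) -> None:
    """Illustrative per-head logit clipping (names and threshold are assumptions).

    w_q, w_k:   query/key projection weights for one attention head
    max_logit:  largest pre-softmax attention logit observed for that head
    tau:        target cap on attention logits

    If the observed logit exceeds tau, both projections are scaled in place by
    sqrt(tau / max_logit), so the q·k product (and hence the logit) is pulled
    back down toward tau without touching any other weights.
    """
    if max_logit > tau:
        scale = (tau / max_logit) ** 0.5
        w_q.mul_(scale)
        w_k.mul_(scale)

# Example: a head whose logits have drifted up to 400 gets rescaled.
torch.manual_seed(0)
w_q = torch.randn(128, 64)
w_k = torch.randn(128, 64)
qk_clip_(w_q, w_k, max_logit=400.0)  # each projection scaled by sqrt(100/400) = 0.5
```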
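
And for the reinforcement-learning chapter, a toy sketch of how a verifiable reward (say, a unit-test pass) might be blended with a self-critique score. The `Rollout` class, the mixing weight, and the stub critic are all hypothetical; the episode only covers the high-level framework, so treat this as an illustration rather than the paper's recipe.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Rollout:
    prompt: str
    response: str
    unit_tests_passed: Optional[bool]  # None when the task has no automatic checker

def combined_reward(rollout: Rollout,
                    self_critique: Callable[[str, str], float],
                    critique_weight: float = 0.3) -> float:
    """Return a scalar reward for one rollout (hypothetical reward shaping).

    Verifiable tasks (code with unit tests, math with checkable answers) get a
    hard 0/1 reward; every rollout also gets a softer 0..1 score from a
    self-critique pass, where the model rates its own response against a rubric.
    """
    critique_score = self_critique(rollout.prompt, rollout.response)
    if rollout.unit_tests_passed is None:
        # Open-ended task: only the rubric-based self-critique is available.
        return critique_score
    verifiable = 1.0 if rollout.unit_tests_passed else 0.0
    return (1.0 - critique_weight) * verifiable + critique_weight * critique_score

# Toy usage with a stub critic that just rewards longer answers.
stub_critic = lambda prompt, response: min(len(response) / 200.0, 1.0)
reward = combined_reward(
    Rollout("fix the bug", "patched the off-by-one error", unit_tests_passed=True),
    stub_critic,
)
```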