Learning GenAI via SOTA Papers

EP101: Kimi k1.5 Breaks the AI Data Wall



The paper, "TECHNICAL REPORT OF KIMI K1.5," details the development of Kimi k1.5, a state-of-the-art multi-modal Large Language Model (LLM) trained primarily through Reinforcement Learning (RL). The research explores a new axis for scaling AI by allowing models to learn and explore through rewards, bypassing the limitations of relying solely on static, human-generated pre-training datasets.

Here is a short summary of the paper's key contributions and findings:

  • Simple but Powerful RL Framework: The developers established a highly effective RL framework that achieves advanced reasoning capabilities without relying on more complex techniques such as Monte Carlo tree search (MCTS), value functions, or process reward models. Instead, they used a variant of online mirror descent alongside length penalties and optimized data sampling.
  • Long Context Scaling & Partial Rollouts: The model's RL context window was scaled up to 128k tokens, allowing it to perform deep planning, reflection, and error correction natively within its Chain-of-Thought (CoT). To handle these extremely long reasoning trajectories efficiently during training, the team introduced "partial rollouts." This technique caps rollout lengths and saves unfinished segments to a replay buffer, preventing long outputs from monopolizing system resources and avoiding redundant computation.
  • Mitigating "Overthinking": To prevent the model from generating unnecessarily lengthy reasoning steps, the researchers integrated a length penalty during RL training. This promotes shorter, more token-efficient responses while maintaining accuracy.
  • Long2Short Context Compression: The paper presents methods to transfer the advanced reasoning priors learned by the long-CoT model into a short-CoT model. Using techniques like Shortest Rejection Sampling, Direct Preference Optimization (DPO), and specific Long2short RL training, they successfully compressed the model's reasoning process for greater inference efficiency.
  • State-of-the-Art Results: The Kimi k1.5 long-CoT model achieves reasoning performance that matches OpenAI's o1 model, scoring 77.5 on AIME, 96.2 on MATH-500, and 74.9 on MathVista. The Kimi k1.5 short-CoT model, in turn, outperforms existing short-CoT models such as GPT-4o and Claude 3.5 Sonnet by a large margin, delivering strong reasoning with a much smaller test-time token budget than the long-CoT model.
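
To make the "overthinking" and Long2Short points above more concrete, here is a minimal Python sketch of two of the ideas: a length-based reward computed over a group of responses sampled for one prompt, and shortest rejection sampling. It is an illustration under stated assumptions, not the authors' code. The function names and toy inputs are hypothetical, the reward formula follows the paper's description as best understood here, and response length is measured in characters for simplicity where a real system would count tokens.

```python
from typing import List, Optional


def length_rewards(lengths: List[int], correct: List[bool]) -> List[float]:
    """Length-based reward for a group of responses sampled for ONE prompt.

    Assumed form (hypothetical helper, modeled on the paper's description):
      lam = 0.5 - (len_i - min_len) / (max_len - min_len)
      reward = lam           if the response is correct
      reward = min(0, lam)   if the response is incorrect
    Short correct answers are rewarded, long ones are penalized, and an
    incorrect answer never receives a positive length reward.
    """
    min_len, max_len = min(lengths), max(lengths)
    if max_len == min_len:
        # All responses have the same length: no length signal at all.
        return [0.0] * len(lengths)
    rewards = []
    for n, ok in zip(lengths, correct):
        lam = 0.5 - (n - min_len) / (max_len - min_len)
        rewards.append(lam if ok else min(0.0, lam))
    return rewards


def shortest_correct(responses: List[str], correct: List[bool]) -> Optional[str]:
    """Shortest rejection sampling: keep the shortest correct response.

    Returns None when no sampled response is correct. Length is measured
    in characters here (an assumption made for simplicity).
    """
    candidates = [r for r, ok in zip(responses, correct) if ok]
    return min(candidates, key=len) if candidates else None


if __name__ == "__main__":
    # Toy group of three sampled responses: two correct, one incorrect.
    print(length_rewards([100, 200, 300], [True, True, False]))
    print(shortest_correct(["long correct answer", "short", "wrong"],
                           [True, True, False]))
```

In this sketch the length reward would be added to the correctness reward during RL training, while the shortest correct sample would be kept as a supervised fine-tuning target for the short-CoT model.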

By Yun Wu