
Seventy3: Turning papers into podcasts with NotebookLM, so everyone can keep learning alongside AI.
Today's topic: Kimi k1.5: Scaling Reinforcement Learning with LLMs

Summary
This technical report introduces Kimi k1.5, a multimodal large language model trained with reinforcement learning (RL). The report highlights the model's training techniques, including long-context scaling and policy optimization, emphasizing a simple yet effective RL framework. Kimi k1.5 achieves state-of-the-art reasoning performance across several benchmarks, outperforming models such as OpenAI's o1 and GPT-4o on certain short-CoT reasoning tasks. A key aspect is the exploration of long-context RL: the model is trained on sequences of up to 128k tokens, with policy updates based on a variant of online mirror descent for robustness. The report also details long2short methods, infrastructure optimization, and ablation studies, showcasing Kimi k1.5's advances in multimodal capability and token efficiency.
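To make the policy-optimization step concrete, here is a minimal sketch of the relative-entropy-regularized surrogate loss that the report derives from its online mirror descent variant. This is an illustrative reading, not the authors' code: the function name k15_policy_loss is hypothetical, the empirical mean reward over the k sampled responses is used as a stand-in for the tau*log Z partition term, and details such as sequence-level log-probabilities and the tau value are assumptions.

```python
import torch

def k15_policy_loss(logp_new: torch.Tensor,
                    logp_ref: torch.Tensor,
                    rewards: torch.Tensor,
                    tau: float = 0.1) -> torch.Tensor:
    """Squared-error surrogate for mirror-descent-style policy optimization.

    All tensors have shape [k], one entry per sampled response to one prompt:
      logp_new: sequence log-prob under the current policy pi_theta.
      logp_ref: sequence log-prob under the previous iterate pi_theta_i.
      rewards:  scalar reward per response (e.g. 1.0 if the answer is correct).
      tau:      strength of the KL regularization toward pi_theta_i.
    """
    baseline = rewards.mean()                 # empirical stand-in for tau * log Z
    log_ratio = logp_new - logp_ref.detach()  # reference policy is held fixed
    residual = rewards - baseline - tau * log_ratio
    return 0.5 * (residual ** 2).mean()

# Toy usage: k = 4 sampled responses to the same prompt.
logp_new = torch.tensor([-12.0, -15.0, -11.0, -14.0], requires_grad=True)
logp_ref = torch.tensor([-12.5, -14.0, -11.5, -13.0])
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = k15_policy_loss(logp_new, logp_ref, rewards)
loss.backward()  # gradients raise log-probs of above-baseline responses
print(float(loss))
```

Intuitively, the residual drives tau * log(pi_theta / pi_theta_i) toward the reward advantage over the sample mean, so better-than-average responses gain probability mass while the squared log-ratio term keeps the update close to the previous policy.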
Original paper: https://arxiv.org/abs/2501.12599