

This paper discusses how successful RL fine-tuning uncovers an emergent two-phase hierarchical reasoning dynamic in LLMs, mirroring human cognition by separating high-level strategic planning from low-level procedural execution. The authors argue that conventional RL methods, which apply optimization pressure agnostically to all tokens, are inefficient because they fail to concentrate learning on the true bottleneck: mastering strategic planning tokens. The proposed method, HICRA, addresses this by selectively amplifying the learning signal for these high-impact planning tokens, and extensive experiments show that this targeted approach significantly outperforms baselines like GRPO across various mathematical and multimodal benchmarks. The paper also introduces Strategic Grams and Semantic Entropy as diagnostic tools for tracking strategic exploration, revealing why common metrics like token-level entropy are often misleading.
By Enoch H. Kang
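
For illustration only, the core idea behind HICRA's selective credit assignment can be sketched as a reweighting of per-token advantages: tokens identified as strategic planning tokens (for example, by matching a list of Strategic Grams) have their learning signal amplified, while execution tokens are left unchanged. The function name, the multiplicative form, and the `alpha` coefficient below are assumptions for this sketch; the paper only states that HICRA amplifies the learning signal on planning tokens within a GRPO-style objective.

```python
import torch

def hicra_weighted_advantages(advantages: torch.Tensor,
                              planning_mask: torch.Tensor,
                              alpha: float = 1.0) -> torch.Tensor:
    """Hypothetical sketch of HICRA-style credit assignment.

    advantages:    (batch, seq_len) per-token advantages from a GRPO-style objective
    planning_mask: (batch, seq_len) 1.0 where the token falls inside a strategic
                   planning span (e.g., matched against Strategic Grams), else 0.0
    alpha:         assumed amplification coefficient; the paper's exact weighting
                   scheme and value are not specified here
    """
    # Amplify the learning signal on planning tokens; execution tokens keep
    # their original advantage.
    return advantages * (1.0 + alpha * planning_mask)
```

The reweighted advantages would then replace the uniform per-token advantages in the policy-gradient update, concentrating optimization pressure on the planning tokens the paper identifies as the bottleneck.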