The September 25, 2025 paper introduces a novel reinforcement learning (RL) algorithm, Controlling Entropy via Gradient-Preserving Policy Optimization (CE-GPPO), designed to fine-tune large language models (LLMs) for complex reasoning tasks. The authors analyze how policy entropy, which reflects the balance between exploration and exploitation, becomes unstable in existing methods such as Proximal Policy Optimization (PPO) because the clipping mechanism discards gradients from low-probability tokens. CE-GPPO addresses this by reintroducing gradients from these clipped tokens, specifically Positive-advantage Low-Probability (PA&LP) and Negative-advantage Low-Probability (NA&LP) tokens, in a bounded and controlled manner; a sketch of one plausible formulation follows below. The goal is to regulate entropy dynamics and prevent both entropy collapse and entropy explosion. Empirical results on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines while maintaining more stable entropy throughout training. Source: https://arxiv.org/pdf/2509.20712
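
The gradient-preserving idea lends itself to a short sketch. The PyTorch snippet below is a minimal, illustrative reading of the mechanism, not the paper's verified implementation: it keeps the standard clipped surrogate in the forward pass but uses a stop-gradient (straight-through) term to let bounded gradients flow back from the two clipped token groups. The function name `gppo_token_loss`, the coefficients `beta_palp` and `beta_nalp`, and the exact masking conditions are assumptions made for illustration.

```python
import torch

def gppo_token_loss(logp_new, logp_old, advantage,
                    eps_low=0.2, eps_high=0.2,
                    beta_palp=0.1, beta_nalp=0.1):
    """Per-token PPO-style surrogate loss with gradient-preserving clipping.

    Standard PPO clipping zeroes the gradient whenever the importance
    ratio leaves [1 - eps_low, 1 + eps_high]. This sketch keeps the
    clipped value in the forward pass but reintroduces a small, bounded
    gradient from the otherwise-clipped tokens. Hyperparameter names
    and masking conditions are assumptions, not the paper's notation.
    """
    ratio = torch.exp(logp_new - logp_old)

    # Vanilla clipped surrogate: min() selects the clipped branch for
    # out-of-range tokens, which kills their gradient.
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    surrogate = torch.min(ratio * advantage, clipped * advantage)

    # One plausible masking of the two clipped low-probability cases:
    # PA&LP tokens (positive advantage, ratio clipped at the upper bound)
    # and NA&LP tokens (negative advantage, ratio clipped at the lower bound).
    pa_lp = (advantage > 0) & (ratio > 1.0 + eps_high)
    na_lp = (advantage < 0) & (ratio < 1.0 - eps_low)

    # Straight-through trick: (ratio - ratio.detach()) is identically
    # zero in the forward pass, so the surrogate's value is unchanged,
    # but its gradient w.r.t. the policy is beta * advantage * d(ratio),
    # i.e. the clipped token's gradient scaled by a bounded coefficient.
    def preserve(mask, beta):
        return beta * mask.float() * (ratio - ratio.detach()) * advantage

    surrogate = surrogate + preserve(pa_lp, beta_palp) + preserve(na_lp, beta_nalp)
    return -surrogate  # negate so minimizing the loss maximizes the surrogate

# Toy usage with dummy per-token tensors.
logp_new = torch.tensor([-2.0, -5.0, -0.5, -4.0], requires_grad=True)
logp_old = torch.tensor([-2.1, -6.0, -0.4, -3.0])
adv = torch.tensor([1.0, 1.0, -1.0, -1.0])
loss = gppo_token_loss(logp_new, logp_old, adv).mean()
loss.backward()
```

Under this reading, the two coefficients give a direct handle on entropy: raising `beta_palp` strengthens gradients that boost low-probability tokens (pushing entropy up), while raising `beta_nalp` strengthens gradients that suppress them (pushing entropy down), which matches the paper's stated goal of steering entropy away from both collapse and explosion.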