AI: post transformers

CE-GPPO: Controlling Entropy via Gradient-Preserving Policy Optimization



The September 25, 2025 paper introduces a novel reinforcement learning (RL) algorithm, **Controlling Entropy via Gradient-Preserving Policy Optimization (CE-GPPO)**, designed to fine-tune large language models (LLMs) for complex reasoning tasks. The authors analyze how **policy entropy**, which reflects the balance between exploration and exploitation, becomes unstable in existing methods such as Proximal Policy Optimization (PPO) because clipping discards the gradients of **low-probability tokens**. CE-GPPO addresses this by reintroducing gradients from these clipped tokens, specifically **positive-advantage low-probability (PA-LP)** and **negative-advantage low-probability (NA-LP)** tokens, in a bounded and controlled manner. The goal is to regulate entropy dynamics and prevent both **entropy collapse** and **entropy explosion**. Empirical results on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines by maintaining more **stable, near-optimal entropy** throughout training.
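Below is a minimal sketch of the gradient-preserving idea in PyTorch, assuming a token-level PPO-style loss: vanilla clipping zeroes the gradient for tokens whose importance ratio leaves the interval [1 − ε_low, 1 + ε_high], whereas this sketch keeps a small, bounded gradient for those tokens via a stop-gradient trick. The function name `ce_gppo_token_loss`, the coefficients `beta_pa`/`beta_na`, and the exact re-weighting are illustrative assumptions, not the paper's reference implementation.

```python
import torch

def ce_gppo_token_loss(logp_new, logp_old, advantages,
                       eps_low=0.2, eps_high=0.2,
                       beta_pa=0.1, beta_na=0.1):
    """PPO-style token loss that keeps a bounded gradient for tokens
    that vanilla clipping would silence (illustrative sketch only)."""
    ratio = torch.exp(logp_new - logp_old)               # importance ratio r_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * advantages
    ppo_obj = torch.minimum(unclipped, clipped)          # vanilla clipped objective

    # Tokens whose gradients vanilla PPO discards (assumed classification):
    #   PA-LP: positive advantage, ratio above the upper clip bound
    #   NA-LP: negative advantage, ratio below the lower clip bound
    pa_lp = (advantages > 0) & (ratio > 1 + eps_high)
    na_lp = (advantages < 0) & (ratio < 1 - eps_low)

    # Stop-gradient trick: (ratio - ratio.detach()) is zero in the forward
    # pass but carries d(ratio)/d(theta) in the backward pass, so beta_*
    # bounds the magnitude of the reintroduced gradient.
    grad_only = ratio - ratio.detach()
    extra = torch.zeros_like(ppo_obj)
    extra = torch.where(pa_lp, beta_pa * grad_only * advantages, extra)
    extra = torch.where(na_lp, beta_na * grad_only * advantages, extra)

    return -(ppo_obj + extra).mean()                     # loss to minimize
```

Because the detached term contributes nothing to the forward value, the reported objective matches standard clipped PPO; only the backward pass picks up the extra signal, with the beta coefficients bounding how strongly the clipped PA-LP and NA-LP tokens steer entropy up or down.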


Source:

https://arxiv.org/pdf/2509.20712


AI: post transformers, by mcgrof