
This September 2025 source is a research paper from Tencent AI Lab and academic collaborators introducing EVOL-RL, an Evolution-Oriented and Label-free Reinforcement Learning framework for Large Language Models (LLMs). The paper addresses a critical flaw, termed entropy collapse, in existing label-free self-improvement methods such as Test-Time Reinforcement Learning (TTRL): relying solely on a majority vote shrinks solution diversity and hurts generalization. EVOL-RL counters this with a reward design that explicitly balances majority-based selection (for stability) against a novelty-aware reward (for variation), preventing the model from converging on repetitive, low-entropy solutions. Experimental results on mathematical reasoning benchmarks show that EVOL-RL significantly improves accuracy and generalization by sustaining diverse, longer chains of thought compared with the TTRL baseline.
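The selection/variation balance described above can be sketched in a few lines. Below is a minimal, hypothetical Python illustration, not the paper's exact formulation: the unit-normalized reasoning embeddings, the ±1 majority reward, and the novelty_weight parameter are all assumptions made for clarity.

```python
import numpy as np
from collections import Counter

def majority_novelty_rewards(answers, reasoning_embs, novelty_weight=0.5):
    """Toy reward for one group of sampled responses to a single prompt.

    answers:        final answers extracted from each response (strings)
    reasoning_embs: unit-normalized embeddings of each chain of thought
    Returns one scalar reward per response. This is an illustrative sketch
    of the majority-selection / novelty-variation balance, not EVOL-RL's
    exact reward.
    """
    n = len(answers)
    majority_answer, _ = Counter(answers).most_common(1)[0]

    # Novelty: 1 minus each response's mean cosine similarity to the
    # other responses, so dissimilar reasoning scores higher.
    embs = np.asarray(reasoning_embs)
    sims = embs @ embs.T                      # pairwise cosine similarities
    novelty = 1.0 - (sims.sum(axis=1) - 1.0) / (n - 1)

    # Selection: +1 for matching the majority answer, -1 otherwise,
    # plus a weighted novelty bonus to keep the group diverse.
    return [(1.0 if a == majority_answer else -1.0)
            + novelty_weight * novelty[i]
            for i, a in enumerate(answers)]
```

When every response in a group gives the same final answer, the majority term is flat and only the novelty term differentiates them, which is exactly the regime where a pure majority-vote reward like TTRL's drives the collapse the paper documents.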
Source:
https://arxiv.org/pdf/2509.15194
By mcgrof