The September 25, 2025 paper introduces Tree-based Group Relative Policy Optimization (Tree-GRPO), a reinforcement learning (RL) method designed to improve the agentic capabilities of large language models (LLMs) in multi-turn tasks, where supervision is typically sparse. Existing chain-based RL samples each trajectory independently, which makes rollouts expensive and leaves only a single outcome reward per trajectory. Tree-GRPO addresses both problems with a tree-search sampling strategy in which each node represents one complete agent interaction step, so trajectories that branch from the same node share a common prefix and consume less of the rollout budget. Comparing branches within the tree also turns outcome rewards into finer-grained process supervision signals, a mechanism the paper shows to be structurally equivalent to step-level direct preference learning. Empirical results across multiple datasets show that the tree-based approach consistently outperforms chain-based methods while using a smaller rollout budget. Source: https://arxiv.org/pdf/2509.21240
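
As a rough illustration of the mechanism summarized above, the following Python sketch shows how outcome rewards at the leaves of a rollout tree could be turned into step-level signals by comparing sibling branches, and how prefix sharing reduces the number of sampled steps. It is not the paper's implementation: the `Node` class, the helper names, the choice of the mean leaf reward as a node's value, the `eps` constant, and the toy rewards are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Dict, List, Optional


@dataclass
class Node:
    """One complete agent interaction step (hypothetical structure)."""
    name: str
    children: List["Node"] = field(default_factory=list)
    reward: Optional[float] = None  # outcome reward, set on leaves only


def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of all leaves below (or at) this node."""
    if not node.children:
        return [node.reward]
    return [r for child in node.children for r in leaf_rewards(child)]


def node_value(node: Node) -> float:
    """Assumed node value: mean outcome reward of the node's leaves."""
    return mean(leaf_rewards(node))


def sibling_advantages(root: Node, eps: float = 1e-6) -> Dict[str, float]:
    """Group-relative advantage of each step, normalized among its siblings."""
    advantages: Dict[str, float] = {}
    stack = [root]
    while stack:
        parent = stack.pop()
        if not parent.children:
            continue
        values = [node_value(child) for child in parent.children]
        mu, sigma = mean(values), pstdev(values)
        for child, value in zip(parent.children, values):
            advantages[child.name] = (value - mu) / (sigma + eps)
        stack.extend(parent.children)
    return advantages


def count_steps(node: Node) -> int:
    """Number of sampled steps in the tree (every node except the root)."""
    return sum(1 + count_steps(child) for child in node.children)


# Toy tree: the root is the task prompt, each other node is one agent step.
tree = Node("root", [
    Node("a", [Node("a1", reward=1.0), Node("a2", reward=0.0)]),
    Node("b", [Node("b1", reward=0.0), Node("b2", reward=0.0)]),
])

for name, adv in sorted(sibling_advantages(tree).items()):
    print(f"{name}: {adv:+.3f}")
print("steps sampled in tree:", count_steps(tree))
```

In this toy tree, step `a` receives a positive advantage and step `b` a negative one even though the only rewards are at the leaves, which is the sense in which branching yields step-level signals from outcome rewards. The tree also samples six steps to produce four complete trajectories, whereas four independent two-step chains would need eight; real budgets are usually measured in tokens or tool calls, so the step count here is only a stand-in.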