A paper dated October 7, 2025, from Stanford University, Texas A&M University, UC San Diego, and Lambda introduces **AGENTFLOW**, an agentic system designed to enhance the reasoning capabilities of Large Language Models (LLMs) by framing complex tasks as a multi-turn Markov Decision Process (MDP). The system coordinates four specialized modules (an **Action Planner**, a **Tool Executor**, an **Execution Verifier**, and a **Solution Generator**), of which only the Planner is trained. Training uses **Flow-GRPO**, an on-policy Reinforcement Learning (RL) algorithm that optimizes the planner's strategy with a reward based on the final outcome, which addresses the problem of long-horizon credit assignment in multi-step reasoning. Experiments across diverse domains, including mathematical and scientific reasoning, show that AGENTFLOW tuned with Flow-GRPO **outperforms baseline LLMs and other specialized systems** in accuracy and exhibits robust, adaptive tool use and self-correction.
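To make the module loop and the outcome-based reward more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the names `Trajectory`, `rollout`, and `broadcast_advantages` are hypothetical, the four modules are passed in as plain callables, and the advantage is computed with a GRPO-style group normalization (reward minus group mean, divided by group standard deviation), which is an assumption here. The sketch also leaves out the policy update itself.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Callable, List, Tuple

Memory = List[Tuple[str, str]]  # (action, tool result) pairs accumulated over turns


@dataclass
class Trajectory:
    """One multi-turn rollout: the planner's actions plus the final answer."""
    actions: List[str] = field(default_factory=list)
    answer: str = ""


def rollout(
    query: str,
    planner: Callable[[str, Memory], str],     # trainable module
    executor: Callable[[str], str],            # frozen: runs the chosen tool
    verifier: Callable[[str, Memory], bool],   # frozen: decides whether to stop
    generator: Callable[[str, Memory], str],   # frozen: writes the final answer
    max_turns: int = 3,
) -> Trajectory:
    """Run plan -> execute -> verify turns, then generate a solution."""
    memory: Memory = []
    traj = Trajectory()
    for _ in range(max_turns):
        action = planner(query, memory)
        result = executor(action)
        memory.append((action, result))
        traj.actions.append(action)
        if verifier(query, memory):
            break
    traj.answer = generator(query, memory)
    return traj


def broadcast_advantages(
    trajectories: List[Trajectory],
    rewards: List[float],
    eps: float = 1e-8,
) -> List[List[float]]:
    """Give every turn of a rollout the same group-normalized advantage.

    Assumption: one scalar reward per rollout (for example 1.0 if the final
    answer is correct, else 0.0), normalized across the sampled group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    per_turn = []
    for traj, reward in zip(trajectories, rewards):
        advantage = (reward - mu) / (sigma + eps)
        # The same value is copied to each planner action in the rollout,
        # so early turns receive credit for the final outcome.
        per_turn.append([advantage] * len(traj.actions))
    return per_turn
```

Copying a single trajectory-level signal to every turn is one simple way to sidestep per-step reward design: each planner decision is scored by whether the whole rollout succeeded, relative to the other rollouts sampled for the same query.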
Source:
https://arxiv.org/pdf/2510.05592