The October 7, 2025 paper, a collaboration between Stanford University, Texas A&M University, UC San Diego, and Lambda, introduces AGENTFLOW, an agentic system designed to enhance the reasoning capabilities of Large Language Models (LLMs) by formulating complex tasks as a multi-turn Markov Decision Process (MDP). The system coordinates four specialized modules (an Action Planner, a Tool Executor, an Execution Verifier, and a Solution Generator), of which only the Planner is trained. Training uses Flow-GRPO, an on-policy Reinforcement Learning (RL) algorithm that optimizes the planner's policy with a reward based on the final outcome, which addresses the problem of long-horizon credit assignment in multi-step reasoning. Experiments across diverse domains, including mathematical and scientific reasoning, show that AGENTFLOW tuned with Flow-GRPO significantly outperforms baseline LLMs and other specialized systems in accuracy, and exhibits robust, adaptive tool usage and self-correction. Source: https://arxiv.org/pdf/2510.05592
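To make the module loop and the outcome-based reward more concrete, here is a minimal Python sketch. It is not the authors' implementation: the names `Trajectory`, `run_episode`, `group_advantages`, and `per_turn_advantages`, the `max_turns` parameter, the list-based memory, and the boolean verifier interface are all hypothetical. The mean and standard-deviation normalization is the usual group-relative form used by GRPO-style methods, and copying one trajectory-level advantage to every planner turn is an assumption about how a single final-outcome reward could reach each decision.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import List


@dataclass
class Trajectory:
    """One multi-turn rollout (hypothetical container, not from the paper)."""
    actions: List[str] = field(default_factory=list)  # one planner action per turn
    answer: str = ""
    reward: float = 0.0  # final-outcome reward, assigned once after the rollout


def run_episode(query, planner, executor, verifier, generator, max_turns=5):
    """Planner -> Executor -> Verifier loop, followed by the Solution Generator.

    All four modules are plain callables here; only `planner` would be
    trained, the other three are treated as fixed.
    """
    memory = []  # accumulated (action, result) pairs: the evolving state
    traj = Trajectory()
    for _ in range(max_turns):
        action = planner(query, memory)   # choose the next sub-goal or tool call
        result = executor(action)         # run the chosen tool
        memory.append((action, result))
        traj.actions.append(action)
        if verifier(query, memory):       # assumed: True means "enough to answer"
            break
    traj.answer = generator(query, memory)
    return traj


def group_advantages(rewards, eps=1e-8):
    """Normalize final rewards within a group of rollouts for the same query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_turn_advantages(group):
    """Give every planner turn in a trajectory that trajectory's advantage."""
    advantages = group_advantages([t.reward for t in group])
    return [[a] * len(t.actions) for t, a in zip(group, advantages)]
```

In use, one would sample several rollouts per query with `run_episode`, set each `reward` from a check of the final answer, and feed the output of `per_turn_advantages` into a policy-gradient update of the planner. The update step itself is omitted here.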