Learning GenAI via SOTA Papers

EP155: [Agentic Proposing] Small models beat giants with logic bricks



The paper "Agentic Proposing: Enhancing Large Language Model Reasoning via Compositional Skill Synthesis" introduces a novel framework designed to overcome the limitations of traditional synthetic data generation for training reasoning models. While existing methods often struggle with logical inconsistency or limited problem complexity, Agentic Proposing treats problem synthesis as a goal-driven process of compositional logic engineering.

The framework operates through a specialized agent that dynamically selects and orchestrates modular reasoning skills from an autonomous library. Problem synthesis is modeled as a sequential decision process, and the proposer agent is built in three main stages:

  1. Skill Acquisition: Extracting and formalizing atomic reasoning modules from diverse corpora into a structured library.
  2. Agentic Supervised Fine-Tuning (SFT): Training the proposer to mimic expert trajectories that include internal reflection, tool use, and dynamic self-correction.
  3. Agentic Reinforcement Learning: Optimizing the proposer using a novel Multi-Granularity Policy Optimization (MGPO) algorithm, which provides fine-grained rewards for both intermediate steps and final outcomes.

Key technical innovations include:

  • Iterative Workflow: The agent follows a structured "Draft → Check → Refine → Finalize" loop.
  • Dynamic Self-Correction: Using internal reflection ($\tau_{think}$) and tool execution ($\tau_{exec}$), the agent can proactively prune or update misaligned skills during the generation process to maintain logical integrity.
  • Difficulty Calibration: The framework uses a curriculum-based distribution and solver-based "probers" to ensure problems are precisely targeted at the model's reasoning frontier.
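
To make the loop concrete, here is a minimal Python sketch of a "Draft → Check → Refine → Finalize" cycle with skill pruning and a solver-based difficulty probe. It is an illustration under stated assumptions, not the paper's implementation: the names `Proposal`, `propose`, `pass_rate`, `compose`, `find_misaligned`, and the `band` parameter are all hypothetical, and the LLM reflection and tool-execution steps are replaced by plain Python callables.

```python
import itertools
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Proposal:
    skills: List[str]                      # skills currently composed into the problem
    text: str = ""                         # drafted problem statement
    log: List[str] = field(default_factory=list)  # which loop steps ran, in order


def pass_rate(text: str, solver: Callable[[str], bool], attempts: int = 8) -> float:
    """Hypothetical 'prober': fraction of solver attempts that succeed."""
    return sum(bool(solver(text)) for _ in range(attempts)) / attempts


def propose(
    skills: List[str],
    compose: Callable[[List[str]], str],
    find_misaligned: Callable[[List[str]], List[str]],
    solver: Callable[[str], bool],
    band: Tuple[float, float] = (0.2, 0.6),   # assumed target pass-rate window
    max_rounds: int = 4,
) -> Optional[Proposal]:
    proposal = Proposal(skills=list(skills))
    for _ in range(max_rounds):
        proposal.text = compose(proposal.skills)        # Draft
        proposal.log.append("draft")
        misaligned = find_misaligned(proposal.skills)   # Check (stand-in for reflection/tools)
        proposal.log.append("check")
        if misaligned:
            # Refine: prune skills that break logical consistency, then redraft.
            proposal.skills = [s for s in proposal.skills if s not in misaligned]
            proposal.log.append("refine")
            if not proposal.skills:
                return None
            continue
        # Consistent draft: probe difficulty before accepting it.
        rate = pass_rate(proposal.text, solver)
        if band[0] <= rate <= band[1]:
            proposal.log.append("finalize")             # Finalize
            return proposal
        return None  # off-target difficulty; this sketch simply gives up
    return None


# Toy stand-ins so the sketch runs end to end.
def compose(skills: List[str]) -> str:
    return "Combine: " + ", ".join(skills)


def find_misaligned(skills: List[str]) -> List[str]:
    # Pretend "geometry" clashes with "modular arithmetic" in this problem.
    if "modular arithmetic" in skills:
        return [s for s in skills if s == "geometry"]
    return []


_attempts = itertools.count()


def toy_solver(text: str) -> bool:
    # Deterministic fake solver: succeeds on every fourth attempt.
    return next(_attempts) % 4 == 0


result = propose(["modular arithmetic", "geometry", "induction"],
                 compose, find_misaligned, toy_solver)
```

In this toy run the first draft fails the consistency check, one skill is pruned, and the second draft is accepted because the fake solver's pass rate falls inside the assumed window. A real system would replace the stubs with model calls and would likely adjust difficulty rather than discard an off-target draft.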

The researchers developed Agentic-Proposer-4B, which generates high-precision trajectories across mathematics, coding, and science. Key performance highlights include:

  • State-of-the-Art Performance: A 30B solver trained on only 11,000 trajectories achieved 91.6% accuracy on AIME 2025, rivaling frontier-scale proprietary models like GPT-5 and Gemini-3-Pro.
  • Parameter Efficiency: The framework demonstrates that a small volume of high-quality, high-difficulty synthetic signals can effectively substitute for massive, lower-quality datasets.
  • Robust Generalization: Models trained on this data showed significant gains in multidisciplinary reasoning, including breakthroughs in graduate-level science benchmarks like GPQA.

Ultimately, the paper concludes that the primary bottleneck for advanced reasoning in LLMs is not parameter scale, but the density and precision of high-quality training signals.



Learning GenAI via SOTA Papers, by Yun Wu