Learning GenAI via SOTA Papers

EP105: iStar: Autonomous Agents Grading Their Own Homework



This paper introduces iStar (implicit step rewards for agentic RL), a novel credit-assignment strategy designed to improve the training of Large Language Models (LLMs) as autonomous agents.

The Problem: Training LLM agents with traditional Reinforcement Learning (RL) is highly challenging: rewards in interactive environments are typically sparse and delayed, trajectories are long, and the environments are often open-ended with unverifiable rewards. Previous methods that integrate process supervision to provide denser feedback frequently suffer from biased manual annotations, reward hacking, high variance from overly fine-grained token-level rewards, or failures in open-ended language environments where exact state overlaps are rare.

The iStar Solution: To address these limitations, iStar establishes a stable, self-reinforcing training loop by alternately optimizing an implicit Process Reward Model (PRM) alongside the policy model (the LLM agent).

  • Trajectory-Based DPO: The implicit PRM is trained using a trajectory-based Direct Preference Optimization (DPO) objective. This allows the model to generate implicit step rewards for each action directly from trajectory preferences, entirely removing the need for costly step labels or additional rollouts.
  • Combined Advantages: During policy training, iStar calculates a combined advantage that merges episode-level advantages (derived from final outcome rewards) with step-level advantages (derived from the implicit step rewards). This dual-level approach captures both global task success and the specific contributions of individual actions, providing dense feedback to guide exploration while controlling variance.
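The trajectory-based DPO mechanism above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the implicit PRM scores each action as a scaled log-probability ratio against a frozen reference policy (the standard implicit-reward construction in DPO), and all names and the `BETA` scale are our assumptions.

```python
import math

# Assumed hyperparameter: scale of the implicit reward (illustrative).
BETA = 0.1

def implicit_step_rewards(logp_prm, logp_ref, beta=BETA):
    """Implicit per-action reward for one trajectory:
    beta * (log pi_prm(a_t|s_t) - log pi_ref(a_t|s_t)).
    No step labels are needed: the PRM's own log-probs define the rewards."""
    return [beta * (p - r) for p, r in zip(logp_prm, logp_ref)]

def trajectory_dpo_loss(chosen_step_rewards, rejected_step_rewards):
    """Trajectory-level DPO objective: push the summed implicit step
    rewards of the preferred trajectory above the dispreferred one,
    via -log sigmoid(margin)."""
    margin = sum(chosen_step_rewards) - sum(rejected_step_rewards)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Training the PRM on whole-trajectory preferences with this loss is what yields per-step rewards "for free": each action's reward falls out of the log-ratio, with no step annotations or extra rollouts.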

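The combined advantage can be sketched similarly. A minimal illustration assuming group-normalized episode advantages (GRPO-style) and per-trajectory normalized step advantages; the mixing weight `w` and all names are illustrative assumptions, not the paper's exact formulation.

```python
def _normalize(xs):
    """Standardize a list of scores (zero mean, unit std)."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    sd = var ** 0.5 or 1.0  # guard against zero variance
    return [(x - mu) / sd for x in xs]

def combined_advantages(outcome_rewards, step_rewards_per_traj, w=1.0):
    """For trajectory i and step t:
       A[i][t] = A_episode[i] + w * A_step[i][t],
    where A_episode is normalized over the group of trajectories and
    A_step is the normalized implicit step reward within a trajectory."""
    a_ep = _normalize(outcome_rewards)  # one scalar per trajectory
    combined = []
    for a, steps in zip(a_ep, step_rewards_per_traj):
        a_step = _normalize(steps)
        combined.append([a + w * s for s in a_step])
    return combined
```

The episode term carries the global success signal; the step term adds dense, low-variance per-action credit, which is what lets the policy update reward individual good actions even inside failed episodes.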
Key Results: The researchers evaluated iStar on three complex agent benchmarks: WebShop (web navigation), VisualSokoban (spatial reasoning and planning), and SOTOPIA (open-ended social interactions). The results demonstrate that iStar:

  • Achieves state-of-the-art performance, significantly outperforming frontier LLMs and strong RL baselines (such as PPO, GRPO, RLOO, and REINFORCE++) across all tested domains.
  • Exhibits superior sample efficiency and training stability compared to vanilla RL and recent token-level PRM methods.
  • Drives efficient exploration, allowing agents to achieve task success in fewer steps while increasing both step- and episode-level rewards.
  • Is highly versatile and can be seamlessly integrated into various existing RL algorithms to consistently boost their performance.

Learning GenAI via SOTA Papers, by Yun Wu