


This paper introduces Natural Language Actor-Critic (NLAC), a novel off-policy reinforcement learning algorithm for training Large Language Model (LLM) agents on complex, multi-turn tasks. NLAC addresses the limitations of traditional methods, which rely on sparse scalar rewards and unstable on-policy training, by employing a generative LLM critic that outputs its training signal as natural language critiques rather than scalar values. This textual feedback, which explains why an action is suboptimal by predicting and analyzing future rollouts, allows the LLM policy to improve its actions through a self-refinement paradigm. The system leverages a language Bellman backup to train a language successor model off-policy, and it demonstrates superior empirical performance and data efficiency across benchmarks spanning reasoning, dialogue, and tool-use tasks.
By Enoch H. Kang
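
To make the loop described above concrete, here is a minimal, illustrative sketch of one NLAC-style update: a language successor model predicts a future rollout, a generative critic turns that prediction into a textual critique, and the policy self-refines its action conditioned on the critique. This is not the paper's implementation; all function and variable names (critique, refine, language_bellman_target, nlac_step, the LLM stub) are hypothetical placeholders, and the actual prompting, successor-model training, and policy fine-tuning details are omitted.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any text-in/text-out LLM call would do here.
LLM = Callable[[str], str]

@dataclass
class Transition:
    state: str       # task or dialogue context so far
    action: str      # the agent's candidate response or tool call
    next_state: str  # environment observation after the action
    reward: float    # sparse scalar reward, often only at episode end

def critique(critic_llm: LLM, successor_llm: LLM, state: str, action: str) -> str:
    """Generative critic: predict a plausible future rollout with the language
    successor model, then explain in natural language why the action is
    (sub)optimal, instead of emitting a scalar value."""
    predicted_future = successor_llm(
        f"Context:\n{state}\nAction:\n{action}\n"
        "Predict how the rest of the episode is likely to unfold."
    )
    return critic_llm(
        f"Context:\n{state}\nAction:\n{action}\n"
        f"Predicted outcome:\n{predicted_future}\n"
        "Explain whether this action helps solve the task and why."
    )

def refine(policy_llm: LLM, state: str, action: str, feedback: str) -> str:
    """Self-refinement: the policy rewrites its action conditioned on the
    critic's textual feedback."""
    return policy_llm(
        f"Context:\n{state}\nPrevious action:\n{action}\n"
        f"Critique:\n{feedback}\nPropose an improved action."
    )

def language_bellman_target(successor_llm: LLM, t: Transition) -> str:
    """Language Bellman backup (sketch): bootstrap a textual description of the
    future from the next state, analogous in spirit to r + gamma * V(s')."""
    return successor_llm(
        f"Observed reward: {t.reward}\nNext context:\n{t.next_state}\n"
        "Summarize the likely remainder of the episode from here."
    )

def nlac_step(policy_llm: LLM, critic_llm: LLM, successor_llm: LLM,
              replay_buffer: List[Transition]) -> List[str]:
    """One off-policy pass over logged transitions: critique each stored action,
    then produce refined actions that could supervise the policy update."""
    refined_actions = []
    for t in replay_buffer:
        feedback = critique(critic_llm, successor_llm, t.state, t.action)
        refined_actions.append(refine(policy_llm, t.state, t.action, feedback))
        # Target text for the successor model; the actual training step is
        # omitted in this sketch.
        _target = language_bellman_target(successor_llm, t)
    return refined_actions

if __name__ == "__main__":
    # Stub LLM so the sketch runs end to end without any model.
    stub: LLM = lambda prompt: f"[stub LLM output for a {len(prompt)}-char prompt]"
    demo = [Transition(state="User asks the agent to book a cheap flight.",
                       action="Book the first available flight.",
                       next_state="User complains the fare is too expensive.",
                       reward=0.0)]
    print(nlac_step(stub, stub, stub, demo))
```

Because the critique and refinement operate on stored transitions rather than fresh rollouts from the current policy, the loop above illustrates why the approach can reuse off-policy data, which is the source of the data-efficiency claim in the description.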