


This paper introduces Natural Language Actor-Critic (NLAC), a novel off-policy reinforcement learning algorithm for training Large Language Model (LLM) agents on complex, multi-turn tasks. NLAC addresses the limitations of traditional methods, which rely on sparse scalar rewards and unstable on-policy training, by employing a generative LLM critic that outputs its training signal as natural language critiques rather than scalar values. This textual feedback, which explains why an action is suboptimal by predicting and analyzing future rollouts, allows the LLM policy to improve its actions through a self-refinement paradigm. The system leverages a language Bellman backup to train a language successor model off-policy, and it demonstrates superior empirical performance and data efficiency across benchmarks spanning reasoning, dialogue, and tool-use tasks.
By Enoch H. Kang
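
To make the loop described above concrete, here is a minimal, illustrative sketch of one NLAC-style update: a language successor model predicts a future rollout, a generative critic turns that prediction into a textual critique, and the policy self-refines its action conditioned on the critique. This is not the paper's implementation; all function and variable names (critique, refine, language_bellman_target, nlac_step, the LLM stub) are hypothetical placeholders, and the actual prompting, successor-model training, and policy fine-tuning details are omitted.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any text-in/text-out LLM call would do here.
LLM = Callable[[str], str]

@dataclass
class Transition:
    state: str       # task or dialogue context so far
    action: str      # the agent's candidate response or tool call
    next_state: str  # environment observation after the action
    reward: float    # sparse scalar reward, often only at episode end

def critique(critic_llm: LLM, successor_llm: LLM, state: str, action: str) -> str:
    """Generative critic: predict a plausible future rollout with the language
    successor model, then explain in natural language why the action is
    (sub)optimal, instead of emitting a scalar value."""
    predicted_future = successor_llm(
        f"Context:\n{state}\nAction:\n{action}\n"
        "Predict how the rest of the episode is likely to unfold."
    )
    return critic_llm(
        f"Context:\n{state}\nAction:\n{action}\n"
        f"Predicted outcome:\n{predicted_future}\n"
        "Explain whether this action helps solve the task and why."
    )

def refine(policy_llm: LLM, state: str, action: str, feedback: str) -> str:
    """Self-refinement: the policy rewrites its action conditioned on the
    critic's textual feedback."""
    return policy_llm(
        f"Context:\n{state}\nPrevious action:\n{action}\n"
        f"Critique:\n{feedback}\nPropose an improved action."
    )

def language_bellman_target(successor_llm: LLM, t: Transition) -> str:
    """Language Bellman backup (sketch): bootstrap a textual description of the
    future from the next state, analogous in spirit to r + gamma * V(s')."""
    return successor_llm(
        f"Observed reward: {t.reward}\nNext context:\n{t.next_state}\n"
        "Summarize the likely remainder of the episode from here."
    )

def nlac_step(policy_llm: LLM, critic_llm: LLM, successor_llm: LLM,
              replay_buffer: List[Transition]) -> List[str]:
    """One off-policy pass over logged transitions: critique each stored action,
    then produce refined actions that could supervise the policy update."""
    refined_actions = []
    for t in replay_buffer:
        feedback = critique(critic_llm, successor_llm, t.state, t.action)
        refined_actions.append(refine(policy_llm, t.state, t.action, feedback))
        # Target text for the successor model; the actual training step is
        # omitted in this sketch.
        _target = language_bellman_target(successor_llm, t)
    return refined_actions

if __name__ == "__main__":
    # Stub LLM so the sketch runs end to end without any model.
    stub: LLM = lambda prompt: f"[stub LLM output for a {len(prompt)}-char prompt]"
    demo = [Transition(state="User asks the agent to book a cheap flight.",
                       action="Book the first available flight.",
                       next_state="User complains the fare is too expensive.",
                       reward=0.0)]
    print(nlac_step(stub, stub, stub, demo))
```

Because the critique and refinement operate on stored transitions rather than fresh rollouts from the current policy, the loop above illustrates why the approach can reuse off-policy data, which is the source of the data-efficiency claim in the description.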