

The paper addresses the limitation that Large Language Model (LLM) agents trained with standard reinforcement learning (RL) often struggle to actively explore their environments and adapt from trial-and-error experiences in multi-turn, long-horizon tasks.
To address this, the authors introduce LAMER (LLM Agent with Meta-RL), a general Meta-RL framework designed to help agents actively explore and learn from environmental feedback at test time. LAMER balances exploration and exploitation through two key components.
Extensive evaluations across complex environments—including Sokoban, MineSweeper, Webshop, and ALFWorld—demonstrate that LAMER significantly outperforms both prompting-based methods and standard RL baselines. By internalizing exploration strategies, LAMER produces more diverse trajectories, exhibits much stronger test-time scaling across multiple attempts, and generalizes significantly better to harder and out-of-distribution tasks.
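To make the test-time behavior concrete, here is a minimal sketch in Python of a multi-attempt adaptation loop. It illustrates only the general pattern the summary describes, conditioning later attempts on feedback from earlier ones, and is not LAMER's actual algorithm; the `env` and `agent` interfaces, the `run_with_adaptation` helper, and the success check are all hypothetical.

```python
# Minimal sketch of test-time adaptation across multiple attempts.
# All interfaces here (env, agent, the prompt/feedback format) are
# illustrative assumptions, not LAMER's published API.

def run_with_adaptation(env, agent, max_attempts=4, max_steps=30):
    """Give the agent several attempts at one task, feeding each attempt's
    trajectory back into its context so later attempts can exploit what
    earlier, more exploratory attempts revealed about the environment."""
    history = []  # cross-attempt memory: (trajectory, final_reward) pairs

    for attempt in range(max_attempts):
        obs = env.reset()          # hypothetical: returns initial observation
        trajectory = []
        reward = 0.0
        for _ in range(max_steps):
            # The agent conditions on prior attempts as well as the current one.
            action = agent.act(obs, trajectory, past_attempts=history)
            obs, reward, done = env.step(action)  # hypothetical step interface
            trajectory.append((action, obs, reward))
            if done:
                break
        history.append((trajectory, reward))
        if reward > 0:             # assumed convention: positive reward = success
            return attempt + 1, history
    return None, history
```

Under this pattern, extra attempts are not independent retries: each one sees the accumulated trial-and-error record, which is one plausible reading of why internalized exploration would yield the stronger test-time scaling reported above.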
By Yun Wu