Daily Tech Feed: From the Labs

ΔBelief-RL: Rethinking How AI Learns to Act


DTF:FTL Episode 0007 — ΔBelief-RL: Intrinsic Credit Assignment for Long Horizon Interaction
Paper
  • Title: Intrinsic Credit Assignment for Long Horizon Interaction
  • Authors: Ilze Amanda Auzina, Joschka Strüber, Sergio Hernández-Gutiérrez, Shashwat Goel, Ameya Prabhu, Matthias Bethge
  • Institution: University of Tübingen / Tübingen AI Center / MPI for Intelligent Systems / ELLIS Institute Tübingen
  • arXiv: https://arxiv.org/abs/2602.12342
  • Project Page: https://bethgelab.github.io/delta-belief-rl/
  • Code: https://github.com/bethgelab/delta-belief-rl/
  • Models: https://huggingface.co/collections/bethgelab/delta-belief-rl
What It Does

  Uses an LLM's own change in belief (ΔBelief) about the correct answer as a per-turn reward signal for RL training. Instead of a sparse outcome reward (right/wrong at the end), each intermediate action is rewarded based on whether the model's confidence in the right answer increased. This trains information-seeking behavior that generalizes across domains and scales beyond the training horizon.
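The per-turn reward idea above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the `belief` estimator (the model's probability of the correct answer given the interaction so far) and all names here are assumptions.

```python
# Hypothetical sketch of a per-turn "delta-belief" reward (not the paper's code).
# `belief(history, answer)` is assumed to return the model's probability of the
# correct answer given the interaction history so far -- e.g. estimated from
# the model's token log-probabilities when asked to guess.

from typing import Callable, List


def delta_belief_rewards(
    histories: List[str],                  # history prefix after each turn; histories[0] = before turn 1
    correct_answer: str,
    belief: Callable[[str, str], float],   # P(correct_answer | history), in [0, 1]
) -> List[float]:
    """Reward each turn by how much it increased confidence in the answer."""
    beliefs = [belief(h, correct_answer) for h in histories]
    # rewards[t] = change in belief caused by turn t's action
    return [beliefs[t + 1] - beliefs[t] for t in range(len(beliefs) - 1)]


if __name__ == "__main__":
    # Toy usage: a stub belief that grows with the number of questions asked.
    stub = lambda h, a: min(1.0, 0.1 * len(h.split("|")))
    rewards = delta_belief_rewards(["q1", "q1|q2", "q1|q2|q3"], "zebra", stub)
    print(rewards)
```

Note that per-turn belief deltas sum (telescope) to the overall change in belief over the episode, so this dense signal stays consistent with the final outcome while still crediting individual questions.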

Key Results
  • A 1.7B-parameter model outperforms DeepSeek-V3.2 (670B) by 10.45% on 20 Questions
  • A 4B-parameter model outperforms DeepSeek-V3.2 by 19.37%
  • Generalizes to unseen tasks: customer service, user personalization, murder mystery, city guessing
  • Scales beyond the training horizon: trained at 20 turns, it continues improving up to 50 turns
Critical Framing — Richard Sutton's Perspective

  Richard Sutton (2024 Turing Award) would credit the paper for addressing credit assignment — the core RL problem — with an intrinsic reward signal. But he would identify fundamental limitations inherited from the LLM substrate:

  1. Frozen weights — no continual learning during interaction
  2. Fixed context window — belief changes are ephemeral, not permanent learning
  3. No ground truth — confidence is measured in text space, not against real-world outcomes
  4. No sensation-action-reward stream — synthetic text interaction, not embodied experience
  5. Imitation substrate — RL grafted onto an imitation learner

References
  • The Bitter Lesson — Richard Sutton (2019)
  • Richard Sutton interview — Dwarkesh Patel (Sep 2025)
  • Sutton on continual learning — NextBigFuture
Voices
  • FRY (stephen_fry) — Mentor/explainer
  • BOB (aiden) — Sharp provocateur

Episode Info
  • Date: 2026-02-16
  • Runtime: ~15 minutes
  • Tone: Supportive but straightforward

Daily Tech Feed: From the Labs, by Daily Tech Feed