


The paper uses an LLM's own change in belief (ΔBelief) about the correct answer as a per-turn reward signal for RL training. Instead of sparse outcome rewards (right or wrong only at the end of an episode), each intermediate action is rewarded according to whether it increased the model's confidence in the right answer. This trains information-seeking behavior that generalizes across domains and scales beyond the training horizon.
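A minimal sketch of the per-turn ΔBelief reward idea, assuming the belief signal is simply the model's probability of the correct answer given the dialogue so far. All names here (compute_delta_belief_rewards, belief_fn, toy_belief) are illustrative assumptions, not the paper's actual API.

```python
from typing import Callable, List


def compute_delta_belief_rewards(
    belief_fn: Callable[[List[str]], float],  # P(correct answer | turns so far)
    turns: List[str],                         # agent actions/observations, in order
) -> List[float]:
    """Reward each turn by how much it raised the model's belief in the
    correct answer, giving dense per-turn credit instead of one sparse
    end-of-episode reward."""
    rewards = []
    prev_belief = belief_fn([])  # belief before any information is gathered
    for t in range(len(turns)):
        belief = belief_fn(turns[: t + 1])    # belief after turn t
        rewards.append(belief - prev_belief)  # ΔBelief for turn t
        prev_belief = belief
    return rewards


if __name__ == "__main__":
    # Toy belief function, purely illustrative: each informative turn
    # raises confidence in the correct answer by 0.2, capped at 1.0.
    def toy_belief(turns: List[str]) -> float:
        return min(1.0, 0.1 + 0.2 * sum("useful" in t for t in turns))

    episode = ["search: useful fact", "search: dead end", "read: useful source"]
    print(compute_delta_belief_rewards(toy_belief, episode))
    # -> approximately [0.2, 0.0, 0.2]: informative turns earn positive
    #    reward, the uninformative turn earns zero.
```

Under this framing, an RL algorithm can assign credit to each intermediate information-gathering step directly, rather than propagating a single terminal right/wrong signal back through the whole trajectory.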
Richard Sutton (2024 Turing Award) would credit the paper for addressing credit assignment, the core RL problem, with an intrinsic reward signal. But he would identify fundamental limitations inherited from the LLM substrate.
By Daily Tech Feed