
Reinforcement learning on robots is sharply limited by our ability to design good reward functions; strong, generalizable reward functions are therefore a key enabler of progress in real-world reinforcement learning.
But we already have a very general class of models: VLMs. Wouldn’t it be great if you could just use a VLM to generate rewards? TOPReward does exactly that: it derives rewards from the probability of the “True” token in a VLM question-answering response, which makes it easy to implement, incredibly general, and surprisingly powerful. We talked to Shirui Chen and Cole Harrison to learn more.
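To make the mechanism concrete, here is a minimal sketch of a “True”-token reward in Python, assuming a HuggingFace-style tokenizer and access to the VLM’s next-token logits; the function name, the exact token string, and the choice to renormalize over just the True/False pair are illustrative assumptions, not the paper’s implementation.

```python
import torch

def true_token_reward(logits: torch.Tensor, tokenizer) -> float:
    """Reward = P("True") for a yes/no prompt, e.g.
    "Has the robot made progress on the task? Answer True or False."

    `logits` is the VLM's next-token logit vector at the answer position.
    Note: depending on the tokenizer, the answer token may differ (e.g. "▁True").
    """
    true_id = tokenizer.convert_tokens_to_ids("True")
    false_id = tokenizer.convert_tokens_to_ids("False")
    # Renormalize over just the two answer tokens so the reward lies in [0, 1].
    pair = torch.stack([logits[true_id], logits[false_id]])
    return torch.softmax(pair, dim=0)[0].item()

# Toy check with a stand-in tokenizer (in practice, use the VLM's own):
class _ToyTok:
    def convert_tokens_to_ids(self, t):
        return {"True": 0, "False": 1}[t]

print(true_token_reward(torch.tensor([2.0, -1.0]), _ToyTok()))  # ≈ 0.95
```

Because the reward is read off a probability rather than generated text, it is continuous and cheap: one forward pass per query, with no decoding or parsing of numeric outputs.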
Watch Episode #75 of RoboPapers now to learn more, with Chris Paxton and Jiafei Duan!
Abstract
While Vision-Language-Action (VLA) models have seen rapid progress in pretraining, their advancement in Reinforcement Learning (RL) remains hampered by low sample efficiency and sparse rewards in real-world settings. Developing generalizable process reward models is essential for providing the fine-grained feedback necessary to bridge this gap, yet existing temporal value functions often fail to generalize beyond their training domains. We introduce TOPReward, a novel, probabilistically grounded temporal value function that leverages the latent world knowledge of pretrained video Vision-Language Models (VLMs) to estimate robotic task progress. Unlike prior methods that prompt VLMs to directly output progress values, which are prone to numerical misrepresentation, TOPReward extracts task progress directly from the VLM's internal token logits. In zero-shot evaluations across 130+ distinct real-world tasks and multiple robot platforms (e.g., Franka, YAM, SO-100/101), TOPReward achieves 0.947 mean Value-Order Correlation (VOC) on Qwen3-VL, dramatically outperforming the state-of-the-art GVL baseline, which achieves near-zero correlation on the same open-source model. We further demonstrate that TOPReward serves as a versatile tool for downstream applications, including success detection and reward-aligned behavior cloning.
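For readers unfamiliar with the metric, the following sketch shows one plausible reading of Value-Order Correlation: a per-trajectory rank correlation between predicted progress values and the frames’ true temporal order. This is an assumption based on the metric’s name and its use in the GVL line of work, not a definition lifted from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

def value_order_correlation(values: np.ndarray) -> float:
    """VOC for one trajectory: 1.0 means the predicted progress values
    are perfectly rank-ordered with the ground-truth frame order."""
    order = np.arange(len(values))  # ground-truth temporal indices
    rho, _ = spearmanr(values, order)
    return float(rho)

# Noisy-but-mostly-increasing progress estimates score close to 1:
print(value_order_correlation(np.array([0.05, 0.2, 0.18, 0.5, 0.9])))  # ≈ 0.9
```

Under this reading, a mean VOC of 0.947 across 130+ tasks means the predicted progress curves are almost perfectly monotone with true task progress, while near-zero VOC means the predictions carry essentially no ordering information.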
Learn More
Project Page: https://topreward.github.io/webpage/
arXiv: https://arxiv.org/abs/2602.19313
By Chris Paxton and Michael Cho