RoboPapers

Ep#44: From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models


Reasoning over long horizons would allow robots to generalize zero-shot to unseen environments and settings. One mechanism for this kind of reasoning is a world model, but traditional video world models still tend to struggle with long horizons and are very data-intensive to train. But what if, instead of predicting future images, we predicted only the symbolic information needed for reasoning?

Nishanth Kumar tells us about Pixels to Predicates, a symbol-grounding method that uses a pretrained VLM to plan sequences of robot skills to achieve unseen goals in previously unseen settings.

To find out more, watch episode #44 of RoboPapers with Michael Cho and Chris Paxton now!

Abstract:

Our aim is to learn to solve long-horizon decision-making problems in complex robotics domains given low-level skills and a handful of short-horizon demonstrations containing sequences of images. To this end, we focus on learning abstract symbolic world models that facilitate zero-shot generalization to novel goals via planning. A critical component of such models is the set of symbolic predicates that define properties of and relationships between objects. In this work, we leverage pretrained vision language models (VLMs) to propose a large set of visual predicates potentially relevant for decision-making, and to evaluate those predicates directly from camera images. At training time, we pass the proposed predicates and demonstrations into an optimization-based model-learning algorithm to obtain an abstract symbolic world model that is defined in terms of a compact subset of the proposed predicates. At test time, given a novel goal in a novel setting, we use the VLM to construct a symbolic description of the current world state, and then use a search-based planning algorithm to find a sequence of low-level skills that achieves the goal. We demonstrate empirically across experiments in both simulation and the real world that our method can generalize aggressively, applying its learned world model to solve problems with a wide variety of object types, arrangements, numbers of objects, and visual backgrounds, as well as novel goals and much longer horizons than those seen at training time.
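To make the pipeline in the abstract concrete, here is a minimal sketch of its three stages: a VLM proposes and evaluates visual predicates, a model-learning step keeps a compact predicate subset and derives one operator per skill, and a search-based planner sequences skills toward a goal. Everything below is hypothetical: the VLM calls are stubbed with toy dictionaries, the selection criterion (the smallest subset whose operators let the planner reproduce the demonstrations) is a simplified stand-in for the paper's optimization-based model learning, and none of the names correspond to the authors' code.

```python
# A toy, illustrative version of the three-stage pipeline: propose/evaluate
# predicates with a (stubbed) VLM, learn a compact symbolic world model from
# demonstrations, and plan skill sequences with search at test time.
from collections import deque
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Predicate:
    """A named boolean property of the scene, e.g. 'holding_cup'."""
    name: str


def vlm_propose_predicates(task_description: str) -> list[Predicate]:
    # Stand-in for prompting a pretrained VLM for candidate visual predicates.
    return [Predicate(n) for n in
            ("cup_on_table", "gripper_empty", "holding_cup", "cup_in_bin")]


def vlm_evaluate_predicates(image: dict, predicates) -> frozenset:
    # Stand-in for asking the VLM which predicates hold in a camera image;
    # here an "image" is just a dict of ground-truth predicate values.
    return frozenset(p for p in predicates if image.get(p.name, False))


def derive_operators(demos, predicates):
    """Read one (precondition, add, delete) operator per skill off the demos."""
    operators = {}
    for images, skills in demos:  # len(images) == len(skills) + 1
        states = [vlm_evaluate_predicates(im, predicates) for im in images]
        for skill, s, s_next in zip(skills, states, states[1:]):
            add, delete = s_next - s, s - s_next
            if skill in operators:
                pre, a, d = operators[skill]
                if (a, d) != (add, delete):
                    return None  # inconsistent effects under this predicate set
                operators[skill] = (pre & s, a, d)  # intersect preconditions
            else:
                operators[skill] = (s, add, delete)
    return operators


def plan(state, goal, operators, max_depth=20):
    """Breadth-first search over abstract states using the learned operators."""
    frontier, visited = deque([(state, [])]), {state}
    while frontier:
        s, skills = frontier.popleft()
        if goal <= s:
            return skills
        if len(skills) >= max_depth:
            continue
        for skill, (pre, add, delete) in operators.items():
            if pre <= s:
                nxt = (s - delete) | add
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, skills + [skill]))
    return None


def learn_world_model(demos, predicates):
    """Keep the smallest predicate subset whose operators reproduce the demos."""
    for k in range(1, len(predicates) + 1):
        for subset in map(list, combinations(predicates, k)):
            operators = derive_operators(demos, subset)
            if operators is not None and all(
                plan(vlm_evaluate_predicates(images[0], subset),
                     vlm_evaluate_predicates(images[-1], subset),
                     operators) == skills
                for images, skills in demos
            ):
                return subset, operators
    return list(predicates), derive_operators(demos, predicates)


if __name__ == "__main__":
    predicates = vlm_propose_predicates("put the cup in the bin")
    # One short demonstration: pick up the cup, then place it in the bin.
    demo_images = [
        {"cup_on_table": True, "gripper_empty": True},
        {"holding_cup": True},
        {"cup_in_bin": True, "gripper_empty": True},
    ]
    demos = [(demo_images, ["pick_cup", "place_in_bin"])]
    subset, operators = learn_world_model(demos, predicates)

    # Test time: describe the current image symbolically, then plan to the goal.
    initial = vlm_evaluate_predicates(demo_images[0], subset)
    goal = frozenset({Predicate("cup_in_bin")})
    print([p.name for p in subset])        # ['holding_cup', 'cup_in_bin']
    print(plan(initial, goal, operators))  # ['pick_cup', 'place_in_bin']
```

In this toy run the selection step keeps only holding_cup and cup_in_bin out of the four proposed predicates, and the planner then reaches the goal from a fresh symbolic state with the two-skill plan pick_cup, place_in_bin: a small illustration of how a compact symbolic model supports planning to goals without predicting any future pixels.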

Project Page

arXiv

Thread on X



This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com