The Daily ML

Ep31. ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting



This paper introduces ROCKET-1, the low-level policy in a hierarchical agent architecture that uses visual-temporal context prompting to help agents master open-world interaction in Minecraft. Guided by a high-level reasoner, ROCKET-1 predicts actions from visual observations concatenated with segmentation masks. The key innovation is the visual-temporal context prompting protocol, which uses object segmentation from both past and present observations to communicate spatial information to the policy. With this approach, the agent completes complex tasks such as crafting and mining that the paper reports were previously unattainable with traditional language-based prompting. The paper argues that visual-temporal context prompting can overcome the limitations of existing approaches and make vision-language models more effective for embodied decision-making.
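To make the idea of "concatenated visual observations and segmentation masks" concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names `concat_frame_and_mask` and `build_context`, the nested-list image layout, and the choice to append the mask as a fourth channel are all assumptions made for illustration.

```python
from typing import List

Frame = List[List[List[int]]]  # H x W x 3 RGB image (assumed layout)
Mask = List[List[int]]         # H x W binary segmentation mask (assumed layout)


def concat_frame_and_mask(frame: Frame, mask: Mask) -> List[List[List[int]]]:
    """Append the mask to each pixel as an extra channel: H x W x 3 -> H x W x 4."""
    same_height = len(frame) == len(mask)
    same_width = all(len(f_row) == len(m_row) for f_row, m_row in zip(frame, mask))
    if not (same_height and same_width):
        raise ValueError("frame and mask must share the same height and width")
    return [
        [list(pixel) + [m] for pixel, m in zip(f_row, m_row)]
        for f_row, m_row in zip(frame, mask)
    ]


def build_context(frames: List[Frame], masks: List[Mask]) -> List[List[List[List[int]]]]:
    """Pair past and present frames with their masks, one 4-channel image per timestep."""
    if len(frames) != len(masks):
        raise ValueError("need exactly one mask per frame")
    return [concat_frame_and_mask(f, m) for f, m in zip(frames, masks)]


# Hypothetical 2x2 frame; the mask marks the top-right pixel as the target object.
frame = [[[10, 20, 30], [40, 50, 60]], [[70, 80, 90], [100, 110, 120]]]
mask_with_object = [[0, 1], [0, 0]]
mask_empty = [[0, 0], [0, 0]]  # object not visible at this timestep

context = build_context([frame, frame], [mask_with_object, mask_empty])
print(context[0][0][1])  # pixel at row 0, column 1 of the first timestep
```

In this sketch, a policy network would consume `context` as a sequence of 4-channel images, so the mask tells it where the object of interest is at each timestep without any language description.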

By The Daily ML