This paper presents a hierarchical agent architecture for open-world interaction in Minecraft, built around a novel visual-temporal context prompting protocol. A high-level reasoner guides a low-level policy, ROCKET-1, which predicts actions from visual observations concatenated with object segmentation masks. The key innovation is the prompting protocol itself: object segmentations drawn from both past and present observations communicate spatial intent to the policy more precisely than language alone. With this protocol, the agent completes complex tasks in Minecraft, such as crafting and mining, that traditional language-based prompting methods had failed to solve. The paper argues that visual-temporal context prompting overcomes the limitations of existing approaches and unlocks the full potential of vision-language models for embodied decision-making.
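To make the interface concrete, the sketch below illustrates one way such a policy could consume a visual-temporal context prompt: each RGB frame in a temporal window is concatenated channel-wise with a binary segmentation mask highlighting the object the high-level reasoner selects, and the stacked frames are aggregated over time before an action is predicted. This is a minimal illustration under stated assumptions, not the authors' implementation; all module names, dimensions, and the choice of a transformer for temporal aggregation are assumptions made here for clarity.

```python
# Minimal sketch of a visual-temporal context-prompted policy.
# Illustrative only: architecture details (channel counts, window
# size, transformer aggregator, action space) are assumptions,
# not the ROCKET-1 implementation.
import torch
import torch.nn as nn

class VisualTemporalPolicy(nn.Module):
    def __init__(self, num_actions: int, hidden_dim: int = 256):
        super().__init__()
        # 4 input channels per frame: RGB + one segmentation-mask channel.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(hidden_dim),
        )
        # Aggregate per-frame features across the observation window.
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, frames: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) past-and-present observations.
        # masks:  (B, T, 1, H, W) binary masks marking the target object.
        b, t = frames.shape[:2]
        x = torch.cat([frames, masks], dim=2)   # (B, T, 4, H, W)
        feats = self.encoder(x.flatten(0, 1))   # (B*T, hidden_dim)
        feats = feats.view(b, t, -1)            # (B, T, hidden_dim)
        feats = self.temporal(feats)            # attend across time steps
        return self.action_head(feats[:, -1])   # act from the latest step

# Usage: masks over past and present frames form the spatial prompt.
policy = VisualTemporalPolicy(num_actions=20)
frames = torch.rand(1, 8, 3, 128, 128)
masks = (torch.rand(1, 8, 1, 128, 128) > 0.95).float()
logits = policy(frames, masks)  # (1, 20) action logits
```

The essential point the sketch captures is that the prompt is spatial and temporal rather than linguistic: the high-level reasoner communicates "interact with this object" by where the mask falls in past and present frames, and the policy only needs to ground that region, not parse a language description.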