This paper presents a hierarchical agent architecture for open-world interaction in Minecraft, built around a novel visual-temporal context prompting protocol. A high-level reasoner guides a low-level policy, ROCKET-1, which predicts actions from visual observations concatenated with object segmentation masks. The key innovation is the prompting protocol itself: object segmentations drawn from both past and present observations communicate spatial intent to the policy more precisely than language alone. With this protocol, the agent completes complex tasks in Minecraft, such as crafting and mining, that traditional language-based prompting methods had failed to solve. The paper argues that visual-temporal context prompting overcomes the limitations of existing approaches and unlocks the full potential of vision-language models for embodied decision-making.
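To make the interface concrete, the sketch below illustrates one way such a policy could consume a visual-temporal context prompt: each RGB frame in a temporal window is concatenated channel-wise with a binary segmentation mask highlighting the object the high-level reasoner selects, and the stacked frames are aggregated over time before an action is predicted. This is a minimal illustration under stated assumptions, not the authors' implementation; all module names, dimensions, and the choice of a transformer for temporal aggregation are assumptions made here for clarity.

```python
# Minimal sketch of a visual-temporal context-prompted policy.
# Illustrative only: architecture details (channel counts, window
# size, transformer aggregator, action space) are assumptions,
# not the ROCKET-1 implementation.
import torch
import torch.nn as nn

class VisualTemporalPolicy(nn.Module):
    def __init__(self, num_actions: int, hidden_dim: int = 256):
        super().__init__()
        # 4 input channels per frame: RGB + one segmentation-mask channel.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(hidden_dim),
        )
        # Aggregate per-frame features across the observation window.
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, frames: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) past-and-present observations.
        # masks:  (B, T, 1, H, W) binary masks marking the target object.
        b, t = frames.shape[:2]
        x = torch.cat([frames, masks], dim=2)   # (B, T, 4, H, W)
        feats = self.encoder(x.flatten(0, 1))   # (B*T, hidden_dim)
        feats = feats.view(b, t, -1)            # (B, T, hidden_dim)
        feats = self.temporal(feats)            # attend across time steps
        return self.action_head(feats[:, -1])   # act from the latest step

# Usage: masks over past and present frames form the spatial prompt.
policy = VisualTemporalPolicy(num_actions=20)
frames = torch.rand(1, 8, 3, 128, 128)
masks = (torch.rand(1, 8, 1, 128, 128) > 0.95).float()
logits = policy(frames, masks)  # (1, 20) action logits
```

The essential point the sketch captures is that the prompt is spatial and temporal rather than linguistic: the high-level reasoner communicates "interact with this object" by where the mask falls in past and present frames, and the policy only needs to ground that region, not parse a language description.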