
To be useful, robots must be able to move around and interact with objects in previously unseen environments. The interaction part is the key challenge: to do it, they need to perceive and act on the world using only onboard sensing.
Enter VisualMimic. Shaofeng Yin and Yanjie Ze show us how to use visual sim-to-real training to learn policies for a wide range of loco-manipulation tasks, which even transfer to diverse outdoor environments.
Learn more in Episode #48 of RoboPapers today, hosted by Michael Cho and Chris Paxton.
Abstract:
Humanoid loco-manipulation in unstructured environments demands tight integration of egocentric perception and whole-body control. However, existing approaches either depend on external motion capture systems or fail to generalize across diverse tasks. We introduce VisualMimic, a visual sim-to-real framework that unifies egocentric vision with hierarchical whole-body control for humanoid robots. VisualMimic combines a task-agnostic low-level keypoint tracker -- trained from human motion data via a teacher-student scheme -- with a task-specific high-level policy that generates keypoint commands from visual and proprioceptive input. To ensure stable training, we inject noise into the low-level policy and clip high-level actions using human motion statistics. VisualMimic enables zero-shot transfer of visuomotor policies trained in simulation to real humanoid robots, accomplishing a wide range of loco-manipulation tasks such as box lifting, pushing, football dribbling, and kicking. Beyond controlled laboratory settings, our policies also generalize robustly to outdoor environments. Videos are available at: https://visualmimic.github.io/.
Project Page: https://visualmimic.github.io/
arXiv: https://arxiv.org/abs/2509.20322
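
The abstract describes a hierarchical setup: a task-specific high-level policy turns egocentric vision and proprioception into keypoint commands, those commands are clipped using human motion statistics, and a task-agnostic low-level tracker converts them into whole-body control. Here is a minimal Python sketch of that control loop; all class and function names are illustrative assumptions, not the authors' actual API.

# Hypothetical sketch of the hierarchical control loop described in the abstract.
# High-level policy: egocentric image + proprioception -> keypoint commands.
# Commands are clipped to ranges derived from human motion statistics, then a
# task-agnostic low-level tracker turns them into whole-body joint targets.
import numpy as np

class HierarchicalController:
    def __init__(self, high_level_policy, keypoint_tracker, kp_min, kp_max):
        self.high = high_level_policy      # task-specific visuomotor policy (assumed callable)
        self.low = keypoint_tracker        # task-agnostic keypoint tracker (assumed callable)
        self.kp_min = np.asarray(kp_min)   # per-dimension lower bounds from human motion data
        self.kp_max = np.asarray(kp_max)   # per-dimension upper bounds from human motion data

    def step(self, rgb_image, proprioception):
        # High-level policy proposes keypoint commands for the current task.
        keypoints = self.high(rgb_image, proprioception)
        # Clip to the range seen in human motion statistics so the low-level
        # tracker stays inside the distribution it was trained on.
        keypoints = np.clip(keypoints, self.kp_min, self.kp_max)
        # Low-level tracker outputs whole-body joint targets for the robot.
        return self.low(keypoints, proprioception)

This is only a sketch of the interface implied by the abstract; the paper and project page have the actual training details (teacher-student distillation for the tracker and noise injection during high-level training).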
By Chris Paxton and Michael Cho