For robots to be useful, they must be able to interact with a wide variety of environments, yet scaling up real-world interaction data is difficult, expensive, and time-consuming. Instead, much research revolves around sim-to-real manipulation, though until recently this has largely not included mobile manipulation.
Recently, though, this has begun to change. Two recent papers from Tairan He and Haoru Xue show us how to unlock the potential of this technique, building policies that, without any real-world data at all, can move objects around and open doors in the real world with a humanoid robot.
Watch Episode #60 of RoboPapers, hosted by Chris Paxton and Jiafei Duan, to learn more. In this episode, we cover two papers. First is VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation; second is DoorMan: Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer.
Paper #1: VIRAL
Abstract:
A key barrier to the real-world deployment of humanoid robots is the lack of autonomous loco-manipulation skills. We introduce VIRAL, a visual sim-to-real framework that learns humanoid loco-manipulation entirely in simulation and deploys it zero-shot to real hardware. VIRAL follows a teacher-student design: a privileged RL teacher, operating on full state, learns long-horizon loco-manipulation using a delta action space and reference state initialization. A vision-based student policy is then distilled from the teacher via large-scale simulation with tiled rendering, trained with a mixture of online DAgger and behavior cloning. We find that compute scale is critical: scaling simulation to tens of GPUs (up to 64) makes both teacher and student training reliable, while low-compute regimes often fail. To bridge the sim-to-real gap, VIRAL combines large-scale visual domain randomization (over lighting, materials, camera parameters, image quality, and sensor delays) with real-to-sim alignment of the dexterous hands and cameras. Deployed on a Unitree G1 humanoid, the resulting RGB-based policy performs continuous loco-manipulation for up to 54 cycles, generalizing to diverse spatial and appearance variations without any real-world fine-tuning, and approaching expert-level teleoperation performance. Extensive ablations dissect the key design choices required to make RGB-based humanoid loco-manipulation work in practice.
Project page: https://viral-humanoid.github.io/
arXiv: https://arxiv.org/abs/2511.15200
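The student training described in the abstract, a mixture of online DAgger and behavior cloning, is worth unpacking. Below is a minimal sketch of what such a mixture can look like: some batches clone the privileged teacher on its own rollouts (BC), while others relabel the student's own visited states with teacher actions (DAgger). All names, dimensions, the mixture weight, and the placeholder rollout are assumptions for illustration, not the authors' code.

# Hypothetical sketch of mixed DAgger + BC distillation, not VIRAL's code.
import torch
import torch.nn as nn

OBS_DIM, STATE_DIM, ACT_DIM = 64, 32, 12  # assumed dimensions

teacher = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(), nn.Linear(256, ACT_DIM))
student = nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ReLU(), nn.Linear(256, ACT_DIM))
optimizer = torch.optim.Adam(student.parameters(), lr=3e-4)

def rollout(policy, horizon=16):
    """Stand-in for a simulator rollout (policy unused in this placeholder).

    Returns paired visual observations and privileged full states.
    """
    obs = torch.randn(horizon, OBS_DIM)      # rendered RGB features (placeholder)
    state = torch.randn(horizon, STATE_DIM)  # privileged state (placeholder)
    return obs, state

bc_fraction = 0.5  # assumed mixture weight between BC and DAgger batches

for step in range(1000):
    if torch.rand(()) < bc_fraction:
        # Behavior cloning: supervise on states the teacher itself visits.
        obs, state = rollout(teacher)
    else:
        # Online DAgger: roll out the current student, so the teacher labels
        # exactly the states the student will actually encounter.
        obs, state = rollout(student)
    with torch.no_grad():
        target_actions = teacher(state)  # privileged teacher supervision
    loss = nn.functional.mse_loss(student(obs), target_actions)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

The appeal of the mixture is that BC batches are cheap and stable, while DAgger batches correct the compounding-error problem of pure cloning by supervising on the student's own state distribution.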
Paper #2: DoorMan
Abstract:
Recent progress in GPU-accelerated, photorealistic simulation has opened a scalable data-generation path for robot learning, where massive physics and visual randomization allow policies to generalize beyond curated environments. Building on these advances, we develop a teacher-student-bootstrap learning framework for vision-based humanoid loco-manipulation, using articulated-object interaction as a representative high-difficulty benchmark. Our approach introduces a staged-reset exploration strategy that stabilizes long-horizon privileged-policy training, and a GRPO-based fine-tuning procedure that mitigates partial observability and improves closed-loop consistency in sim-to-real RL. Trained entirely on simulation data, the resulting policy achieves robust zero-shot performance across diverse door types and outperforms human teleoperators by up to 31.7% in task completion time under the same whole-body control stack. This represents the first humanoid sim-to-real policy capable of diverse articulated loco-manipulation using pure RGB perception.
Project page: https://doorman-humanoid.github.io/
arXiv: https://arxiv.org/abs/2512.01061
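The GRPO-based fine-tuning mentioned in the abstract replaces a learned value function with advantages computed relative to a group of rollouts from the same initial condition. Here is a minimal sketch of that update rule; the shapes, toy reward, and group size are assumptions illustrating the general technique, not the DoorMan training code.

# Hypothetical GRPO-style update: z-scored advantages within a rollout group.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, GROUP = 64, 12, 8  # assumed dimensions and group size

policy = nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ReLU(), nn.Linear(256, ACT_DIM))
log_std = torch.zeros(ACT_DIM, requires_grad=True)
optimizer = torch.optim.Adam(list(policy.parameters()) + [log_std], lr=1e-4)

def simulate(actions):
    """Stand-in for a door-opening rollout returning one scalar reward each."""
    return -actions.pow(2).mean(dim=-1)  # placeholder task reward

for step in range(1000):
    obs = torch.randn(1, OBS_DIM).repeat(GROUP, 1)  # same door, GROUP rollouts
    dist = torch.distributions.Normal(policy(obs), log_std.exp())
    actions = dist.sample()
    rewards = simulate(actions)
    # Group-relative advantage: normalize each reward against its own group,
    # so no learned critic is needed.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    logp = dist.log_prob(actions).sum(dim=-1)
    loss = -(adv.detach() * logp).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()

Because every rollout in the group starts from the same door configuration, the group baseline absorbs much of the variance that partial observability would otherwise inject into the gradient estimate.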
This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com