


Evaluating robot policies is hard. Ideally, instead of testing every new policy on a real robot, you could test in simulation; but simulations rarely correlate well with real-world performance. Building simulations that do correlate takes a great deal of time and effort.
That’s where PolaRiS comes in: it’s a toolkit that lets you take a short video of a real scene and turn it into a high-fidelity simulation. It provides what you need to build a good evaluation environment, and it “ships” with off-the-shelf environments that already show strong sim-to-real correlation, meaning their results can be trusted as a signal of real-world policy performance.
Arhan Jain and Karl Pertsch join us to talk about what they have built, why, and how you can use it.
Watch Episode #62 of RoboPapers, with Chris Paxton and Jiafei Duan, now!
Abstract:
A significant challenge for robot learning research is our ability to accurately measure and compare the performance of robot policies. Benchmarking in robotics is historically challenging due to the stochasticity, reproducibility, and time-consuming nature of real-world rollouts. This challenge is exacerbated for recent generalist policies, which have to be evaluated across a wide variety of scenes and tasks. Evaluation in simulation offers a scalable complement to real-world evaluations, but the visual and physical domain gap between existing simulation benchmarks and the real world has made them an unreliable signal for policy improvement. Furthermore, building realistic and diverse simulated environments has traditionally required significant human effort and expertise. To bridge the gap, we introduce Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS), a scalable real-to-sim framework for high-fidelity simulated robot evaluation. PolaRiS utilizes neural reconstruction methods to turn short video scans of real-world scenes into interactive simulation environments. Additionally, we develop a simple simulation data co-training recipe that bridges remaining real-to-sim gaps and enables zero-shot evaluation in unseen simulation environments. Through extensive paired evaluations between simulation and the real world, we demonstrate that PolaRiS evaluations provide a much stronger correlation to real-world generalist policy performance than existing simulated benchmarks. Its simplicity also enables rapid creation of diverse simulated environments. As such, this work takes a step towards distributed and democratized evaluation for the next generation of robotic foundation models.
Learn More:
Project Page: https://polaris-evals.github.io/
arXiv: https://arxiv.org/abs/2512.16881
This post on X
By Chris Paxton and Michael Cho