Generative Depth Supervision for Embodied Vision-Language Models
Vision-language model that adds generative depth prediction during pre-training for physical grounding; achieves SOTA on embodied benchmarks and transfers directly to real-robot tasks.