RoboPapers

Ep#8: VGGT: Visual Geometry Grounded Transformer



3D spatial information provides a really strong signal for robotics policies, something we’ve discussed in previous episodes. But computing this 3D structure is hard, and often relies on noisy, low-quality depth sensors. It would be great if we could reconstruct this information from cameras alone, with little prior information.

Well, that’s exactly what VGGT does!

We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. This approach is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. It is also simple and efficient, reconstructing images in under one second, while still outperforming alternatives that rely on post-processing with visual geometry optimization techniques. The network achieves state-of-the-art results in multiple 3D tasks, including camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction, and point tracking. We also show that using pretrained VGGT as a feature backbone significantly enhances downstream tasks, such as non-rigid point tracking and feed-forward novel view synthesis.
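To make the "single feed-forward pass" idea concrete, here is a minimal sketch of what calling the model might look like. The module paths, the load_and_preprocess_images helper, and the "facebook/VGGT-1B" checkpoint name are assumptions based on the usage pattern in the project repository, not a verified API; check the README on the project page for the exact interface.

```python
# Hedged sketch: the vggt import paths, the load_and_preprocess_images
# helper, and the "facebook/VGGT-1B" checkpoint name are assumed from
# the project repository and may differ from the released API.
import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained model (assumed checkpoint identifier).
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)

# One, a few, or hundreds of views of the same scene (hypothetical paths).
images = load_and_preprocess_images(
    ["scene/view_00.png", "scene/view_01.png", "scene/view_02.png"]
).to(device)

# A single forward pass, no per-scene optimization: the predictions
# cover camera parameters, depth maps, point maps, and point tracks.
with torch.no_grad():
    predictions = model(images)
```

The point of the interface is what it leaves out: unlike classical structure-from-motion pipelines, there is no bundle adjustment or other visual geometry optimization step after the forward pass.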

Project Page




RoboPapers, by Chris Paxton and Michael Cho