Embodied AI 101

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding



NVIDIA Research presents LocateAnything, a unified generative framework for grounding and detection built on Parallel Box Decoding (PBD). Prior VLMs serialize each bounding box into a sequence of coordinate tokens; PBD instead treats each box as an atomic unit and predicts all coordinates in a single forward pass. This preserves intra-box geometric coherence and yields 2.5x higher decoding throughput. The model supports diverse localization tasks, including document understanding, GUI grounding, dense object detection, and OCR localization. It is built on the Moon-ViT vision encoder and the Qwen2.5 language decoder, and is trained on LocateAnything-Data, which contains 138M language queries and 785M bounding boxes. It achieves state-of-the-art results on the LVIS, M6Doc, and ScreenSpot-Pro benchmarks. Models and a demo are available on Hugging Face.
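To make the contrast between serialized coordinate tokens and Parallel Box Decoding concrete, here is a toy Python sketch that only counts decoder forward passes. It is not the LocateAnything implementation: `ToyDecoder`, `decode_serialized`, `decode_parallel`, and `compare` are hypothetical names, the "model" simply replays known boxes, and the sketch assumes one token per coordinate and one forward pass per box. Real serialized formats may spend more tokens per box, and the toy ratio is not meant to reproduce the 2.5x figure quoted above.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


@dataclass
class ToyDecoder:
    """Hypothetical stand-in for a language decoder; it replays known boxes
    and counts how many forward passes were requested."""
    targets: List[Box]
    forward_passes: int = 0

    def next_coordinate(self, box_idx: int, coord_idx: int) -> int:
        # Serialized style: one forward pass yields one coordinate token.
        self.forward_passes += 1
        return self.targets[box_idx][coord_idx]

    def next_box(self, box_idx: int) -> Box:
        # Parallel style (assumed): one forward pass yields a whole box.
        self.forward_passes += 1
        return self.targets[box_idx]


def decode_serialized(decoder: ToyDecoder) -> List[Box]:
    """Emit each box as four sequential coordinate tokens."""
    boxes: List[Box] = []
    for i in range(len(decoder.targets)):
        x1, y1, x2, y2 = (decoder.next_coordinate(i, j) for j in range(4))
        boxes.append((x1, y1, x2, y2))
    return boxes


def decode_parallel(decoder: ToyDecoder) -> List[Box]:
    """Emit each box as a single atomic unit."""
    return [decoder.next_box(i) for i in range(len(decoder.targets))]


def compare(targets: List[Box]) -> Tuple[int, int]:
    """Return (serialized_passes, parallel_passes) for the same targets."""
    serial, parallel = ToyDecoder(list(targets)), ToyDecoder(list(targets))
    assert decode_serialized(serial) == decode_parallel(parallel) == list(targets)
    return serial.forward_passes, parallel.forward_passes


if __name__ == "__main__":
    demo_boxes = [(10, 20, 110, 220), (30, 40, 90, 80), (0, 0, 50, 50)]
    print(compare(demo_boxes))
```

Under these assumptions, both paths recover the same boxes, but the serialized path needs four forward passes per box while the parallel path needs one. Because a box is never emitted partially, its four coordinates cannot drift apart across decoding steps, which is the intuition behind the coherence claim.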

Embodied AI 101, by Shaoqing Tan