May 31, 2026

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

32 minutes

NVIDIA Research presents LocateAnything, a unified generative framework for grounding and detection based on Parallel Box Decoding (PBD). Unlike prior vision-language models (VLMs) that serialize each bounding box into a sequence of coordinate tokens, PBD treats each box as an atomic unit and predicts all of its coordinates in a single forward pass. This preserves intra-box geometric coherence and yields 2.5x higher decoding throughput. The model supports diverse localization tasks, including document understanding, GUI grounding, dense object detection, and OCR localization. It is built on the Moon-ViT vision encoder and the Qwen2.5 language decoder, and it is trained on LocateAnything-Data, which contains 138M language queries and 785M bounding boxes. It achieves state-of-the-art results on the LVIS, M6Doc, and ScreenSpot-Pro benchmarks. Models and a demo are available on Hugging Face.
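To make the contrast between serialized coordinate tokens and box-level decoding concrete, here is a toy sketch that only counts decoder forward passes. It is not the LocateAnything implementation: `CountingModel`, `next_coordinate`, `next_box`, and the sample boxes are hypothetical names and values invented for illustration, and the stand-in "model" simply returns stored targets.

```python
from typing import List, Tuple

# (x1, y1, x2, y2); the coordinate format is an assumption for this sketch.
Box = Tuple[int, int, int, int]


class CountingModel:
    """Hypothetical stand-in for a decoder that counts forward passes."""

    def __init__(self, targets: List[Box]) -> None:
        self.targets = targets
        self.forward_passes = 0

    def next_coordinate(self, box_idx: int, coord_idx: int) -> int:
        # Serialized style: one forward pass emits one coordinate token.
        self.forward_passes += 1
        return self.targets[box_idx][coord_idx]

    def next_box(self, box_idx: int) -> Box:
        # Box-level style: one forward pass emits all four coordinates.
        self.forward_passes += 1
        return self.targets[box_idx]


def decode_sequential(model: CountingModel, num_boxes: int) -> List[Box]:
    """Decode each box as four separate coordinate tokens."""
    boxes: List[Box] = []
    for b in range(num_boxes):
        x1, y1, x2, y2 = (model.next_coordinate(b, c) for c in range(4))
        boxes.append((x1, y1, x2, y2))
    return boxes


def decode_parallel(model: CountingModel, num_boxes: int) -> List[Box]:
    """Decode each box as one atomic unit."""
    return [model.next_box(b) for b in range(num_boxes)]


# Made-up boxes, used only to exercise the two decoding loops.
SAMPLE_BOXES: List[Box] = [(10, 20, 110, 220), (5, 5, 50, 60), (0, 0, 8, 8)]
```

With the three sample boxes, the serialized loop makes 12 forward passes and the box-level loop makes 3. That 4:1 ratio is a property of this toy only; the 2.5x figure quoted above is the reported end-to-end decoding throughput, which depends on everything else the model generates.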

By Shaoqing Tan
