
This paper introduces AIMV2, a family of large-scale vision encoders pre-trained using a novel multimodal autoregressive method. Unlike previous methods, AIMV2 simultaneously predicts image patches and text tokens, leading to improved performance across various downstream tasks, including image recognition, object detection, and multimodal understanding. The approach is notably scalable and simpler to implement than comparable models. AIMV2 consistently outperforms state-of-the-art contrastive models on many benchmarks, showcasing its effectiveness as a generalist vision encoder. Extensive experiments demonstrate its strong scaling properties and compatibility with different model architectures and training techniques.
https://arxiv.org/pdf/2411.14402
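The core idea — one autoregressive objective that jointly regresses image patches and classifies text tokens — can be illustrated with a minimal numpy sketch. Everything here is an assumption for illustration: the shapes, the mean-squared-error patch loss, the cross-entropy text loss, and the unweighted sum are toy stand-ins, not the paper's actual architecture or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only, not the paper's sizes).
num_patches, patch_dim = 4, 8   # image patches to regress
num_tokens, vocab = 5, 16       # text tokens to classify

# Hypothetical decoder outputs: each position predicts the next element
# in the multimodal sequence (image patches first, then text tokens).
pred_patches = rng.normal(size=(num_patches, patch_dim))
true_patches = rng.normal(size=(num_patches, patch_dim))
text_logits = rng.normal(size=(num_tokens, vocab))
true_tokens = rng.integers(0, vocab, size=num_tokens)

# Image branch: patch-level regression loss (MSE as a stand-in).
img_loss = np.mean((pred_patches - true_patches) ** 2)

# Text branch: next-token cross-entropy from the logits.
log_probs = text_logits - np.log(
    np.sum(np.exp(text_logits), axis=1, keepdims=True)
)
txt_loss = -np.mean(log_probs[np.arange(num_tokens), true_tokens])

# Combined multimodal autoregressive objective
# (equal weighting is an assumption, not from the paper).
total_loss = img_loss + txt_loss
```

Both branches share one causal decoder in the paper's setup, which is what lets a single next-element prediction loss cover both modalities.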