We show that CLIP (Contrastive Language-Image Pre-training), used as the visual encoder in vision-and-language (V&L) models, significantly outperforms widely used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown. We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks.
2021: Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer
https://arxiv.org/pdf/2107.06383v1.pdf
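
For context, CLIP's role in this setting is as a drop-in visual encoder: its image backbone produces features that a downstream V&L model consumes in place of region features from a detector such as BottomUp-TopDown. Below is a minimal sketch of extracting such features, assuming the Hugging Face transformers CLIP implementation and a hypothetical local image file; it is not the paper's own CLIP-ViL pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Load only CLIP's vision backbone (ViT-B/32 checkpoint released by OpenAI).
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
encoder.eval()

image = Image.open("example.jpg")  # hypothetical local image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# Patch-level (grid) features: one vector per image patch plus the [CLS] token.
# Features like these can stand in for detector-based region features.
grid_features = outputs.last_hidden_state   # shape (1, 50, 768) for ViT-B/32
# Pooled global image embedding.
global_feature = outputs.pooler_output      # shape (1, 768)
```

In a V&L model, the grid features would typically be projected into the model's hidden size and fed to its cross-modal layers alongside the text tokens; that integration step is specific to each downstream architecture and is not shown here.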