Papers Read on AI

Efficient Self-supervised Vision Transformers for Representation Learning



This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning. First, we show through a comprehensive empirical study that multi-stage architectures with sparse self-attention can significantly reduce modeling complexity, but at the cost of losing the ability to capture fine-grained correspondences between image regions. Second, we propose a new pre-training task of region matching that allows the model to capture fine-grained region dependencies and, as a result, significantly improves the quality of the learned vision representations.
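
The region-matching task pairs each region-level (patch) feature from one augmented view with its most similar counterpart from the other view, and trains the student network to predict the teacher network's output for the matched region, where the teacher is an exponential moving average of the student (as in DINO). The following is a minimal PyTorch sketch of that idea, not the paper's implementation; the function name, tensor shapes, and temperature values are illustrative assumptions.

import torch
import torch.nn.functional as F

def region_matching_loss(student_regions, teacher_regions,
                         student_temp=0.1, teacher_temp=0.04):
    """Sketch of a non-contrastive region-matching loss (assumed shapes).

    student_regions: (B, N, K) per-region prototype scores, student view.
    teacher_regions: (B, M, K) per-region prototype scores, teacher view.
    """
    # Cosine similarity between every teacher/student region pair: (B, M, N)
    t_norm = F.normalize(teacher_regions, dim=-1)
    s_norm = F.normalize(student_regions, dim=-1)
    sim = torch.bmm(t_norm, s_norm.transpose(1, 2))

    # Index of the closest student region for each teacher region: (B, M)
    match_idx = sim.argmax(dim=-1)

    # Gather the matched student scores: (B, M, K)
    idx = match_idx.unsqueeze(-1).expand(-1, -1, student_regions.size(-1))
    matched_student = student_regions.gather(1, idx)

    # Teacher target: sharpened softmax, no gradient (teacher is EMA-updated)
    teacher_probs = F.softmax(teacher_regions.detach() / teacher_temp, dim=-1)
    student_log_probs = F.log_softmax(matched_student / student_temp, dim=-1)

    # Cross-entropy between teacher and student region distributions,
    # averaged over batch and regions
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

Because each region is supervised only by its best-matching counterpart rather than by all regions, the loss stays non-contrastive while still propagating fine-grained, region-level signal that a single image-level objective would miss.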
2021: Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, Jianfeng Gao
https://arxiv.org/pdf/2106.09785v1.pdf

Papers Read on AI, by Rob

Rating: 3.7 (3 ratings)