January 06, 2022

Vision Transformer for Small-Size Datasets

30 minutes

The Vision Transformer (ViT), which applies the transformer architecture to image classification, has recently outperformed convolutional neural networks. However, this high performance depends on pre-training on a large dataset such as JFT-300M, a dependence attributed to the ViT's weak locality inductive bias. This paper proposes Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), which address the lack of locality inductive bias and allow the ViT to be trained from scratch even on small datasets.
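The summary above names LSA without saying how it works. As a rough illustration only, the NumPy sketch below assumes that LSA combines two ingredients described in the paper: diagonal masking (each token's attention to itself is excluded) and a temperature that would be learnable in a real model. The function name `locality_self_attention`, the single-head layout, and the fixed temperature value are assumptions made for this example, not the authors' code.

```python
import numpy as np

def locality_self_attention(q, k, v, temperature):
    """Single-head attention with diagonal masking (illustrative sketch).

    q, k, v: arrays of shape (tokens, dim).
    temperature: positive scalar; assumed to be a learnable parameter
    in a real model, but passed in as a fixed value here.
    """
    scores = (q @ k.T) / temperature
    # Diagonal masking: set each token's score against itself to -inf
    # so the softmax assigns it zero weight.
    n = scores.shape[0]
    scores = np.where(np.eye(n, dtype=bool), -np.inf, scores)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens, dim = 5, 8
q, k, v = (rng.standard_normal((tokens, dim)) for _ in range(3))
# sqrt(dim) is the usual fixed scaling, used here as a starting value.
out, weights = locality_self_attention(q, k, v, temperature=np.sqrt(dim))
```

With the diagonal masked, every row of `weights` still sums to 1 but places no weight on the token's own position, so attention is spread over the other tokens. A smaller temperature would sharpen that distribution further.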

2021: Seung Hoon Lee, Seunghyun Lee, Byung Cheol Song

https://arxiv.org/pdf/2112.13492v1.pdf
