Transformers have emerged as a powerful tool for visual recognition. In addition to demonstrating competitive performance on a broad range of visual benchmarks, recent works also argue that Transformers are much more robust than Convolutional Neural Networks (CNNs). Surprisingly, however, we find that these conclusions are drawn from unfair experimental settings, where Transformers and CNNs are compared at different scales and trained with distinct training frameworks. In this paper, we aim to provide the first fair and in-depth comparison between Transformers and CNNs, focusing on robustness evaluations.
2021: Yutong Bai, Jieru Mei, A. Yuille, Cihang Xie
Ranked #1 on Adversarial Robustness on ImageNet-A
https://arxiv.org/pdf/2111.05464v1.pdf