In this work, we design a novel Transformer-style module, i.e., the Contextual Transformer (CoT) block, for visual recognition. Such design fully capitalizes on the contextual information among input keys to guide the learning of a dynamic attention matrix and thus strengthens the capacity of visual representation. Technically, the CoT block first contextually encodes input keys via a 3×3 convolution, leading to a static contextual representation of the inputs. The encoded keys are then concatenated with the input queries to learn a dynamic multi-head attention matrix through two consecutive 1×1 convolutions. The learnt attention matrix is multiplied by the input values to achieve a dynamic contextual representation of the inputs, and the fusion of the static and dynamic contextual representations is finally taken as the output.
Authors: Yehao Li, Ting Yao, Yingwei Pan, Tao Mei (2021)
Keywords: Instance segmentation, Object detection
https://arxiv.org/pdf/2107.12292v1.pdf
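Below is a minimal PyTorch sketch of the idea, not the authors' implementation: it keeps the static-context 3×3 convolution on the keys and the two consecutive 1×1 convolutions on the concatenated keys and queries, but replaces the paper's multi-head local attention with a simple sigmoid gate for brevity. The class name `CoTBlock`, the `attn_embed` layer sizes, and the addition-based fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn


class CoTBlock(nn.Module):
    """Simplified Contextual Transformer (CoT) style block (sketch, not the official code)."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        # Static context: encode the keys with a k x k convolution.
        self.key_embed = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        # Value projection via a 1 x 1 convolution.
        self.value_embed = nn.Sequential(
            nn.Conv2d(dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
        )
        # Two consecutive 1 x 1 convolutions compute the attention map
        # from the concatenation of encoded keys and queries.
        self.attn_embed = nn.Sequential(
            nn.Conv2d(2 * dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The input feature map itself serves as the queries.
        k_static = self.key_embed(x)          # static contextual representation
        v = self.value_embed(x)               # values
        # Attention from [encoded keys ; queries]; sigmoid gating is a
        # simplification of the paper's multi-head local attention.
        attn = torch.sigmoid(self.attn_embed(torch.cat([k_static, x], dim=1)))
        k_dynamic = attn * v                  # dynamic contextual representation
        return k_static + k_dynamic           # fuse static and dynamic context


# Quick shape check on a dummy feature map.
if __name__ == "__main__":
    block = CoTBlock(dim=64)
    out = block(torch.randn(2, 64, 32, 32))
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

Since the block maps a C×H×W feature map to the same shape, it can stand in for a 3×3 convolution inside a ResNet-style bottleneck, which is how the paper deploys it.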