
JBE, vol. 27, no. 4, pp. 538-547, July 2022


Window Attention Module Based Transformer for Image Classification

Sanghoon Kim and Wonjun Kim



Recently introduced Transformer-based image classification methods show remarkable performance improvements over conventional neural network-based methods. To capture regional features effectively, many studies apply Transformers by dividing the image into multiple windows, but the relationships between windows are still not learned sufficiently. To overcome this problem, we propose a Transformer structure that reflects inter-window relationships during learning. The proposed method computes the importance of each window by compressing the result of the self-attention operation in that window and passing it through a fully connected layer. The computed importance, which serves as a learned weight for the relationships between windows, is then used to scale each window and re-calibrate its feature values. Experimental results show that the proposed method effectively improves the performance of existing Transformer-based methods.

Keywords: Image classification, Transformer, Self-attention, Window-attention
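
The abstract describes the window re-calibration step only in words, so a small sketch may help make the data flow concrete. The code below is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the compression is assumed to be a global average over each window's tokens and channels, the fully connected part is assumed to be two small layers with a ReLU and a sigmoid, and the self-attention has no learned projections. The function names (`window_partition`, `window_self_attention`, `window_recalibration`), the weights `w1` and `w2`, and all sizes are hypothetical.

```python
import numpy as np


def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping square windows.

    Returns an array of shape (num_windows, window_size * window_size, C).
    """
    H, W, C = x.shape
    assert H % window_size == 0 and W % window_size == 0
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    x = x.transpose(0, 2, 1, 3, 4)  # group the two window-grid axes together
    return x.reshape(-1, window_size * window_size, C)


def window_self_attention(windows):
    """Single-head self-attention inside each window.

    Assumption: queries, keys, and values are the raw tokens (no learned
    projections), which keeps the sketch short.
    """
    d = windows.shape[-1]
    scores = windows @ windows.transpose(0, 2, 1) / np.sqrt(d)  # (N, T, T)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ windows  # (N, T, C)


def window_recalibration(attended, w1, w2):
    """Scale each window by a learned importance score (hypothetical design).

    Compression: mean over tokens and channels, one scalar per window.
    Fully connected part: FC -> ReLU -> FC -> sigmoid (an assumption).
    """
    squeezed = attended.mean(axis=(1, 2))  # (N,)
    hidden = np.maximum(squeezed @ w1, 0.0)  # (hidden_dim,)
    importance = 1.0 / (1.0 + np.exp(-(hidden @ w2)))  # (N,), each in (0, 1)
    recalibrated = attended * importance[:, None, None]  # broadcast per window
    return recalibrated, importance


rng = np.random.default_rng(0)
feature_map = rng.standard_normal((8, 8, 16))  # toy (H, W, C) feature map
windows = window_partition(feature_map, window_size=4)  # shape (4, 16, 16)
attended = window_self_attention(windows)

num_windows = windows.shape[0]
hidden_dim = 2  # hypothetical bottleneck size
w1 = rng.standard_normal((num_windows, hidden_dim)) * 0.1  # random stand-ins
w2 = rng.standard_normal((hidden_dim, num_windows)) * 0.1  # for learned weights
recalibrated, importance = window_recalibration(attended, w1, w2)
print(importance.shape, recalibrated.shape)
```

Because the first fully connected layer mixes the compressed values of all windows, each importance score depends on every window, which is one plausible way to expose inter-window relationships to the model. In a real network the weights would be trained with the rest of the model rather than drawn at random.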


