In the vision transformer (ViT) architecture, image data are transformed into sequential data for processing, which may cause a loss of spatial positional information. While the self-attention mechanism enhances the capacity of ViT to capture global features, it weakens the preservation of fine-grained local features. To address these challenges, we propose a spatial positional enhancement module and a wavelet transform enhancement module tailored for ViT models. These modules aim to reduce the loss of spatial positional information during patch embedding and to strengthen the model's feature extraction capability. The spatial positional enhancement module reinforces spatial information in the sequential data through convolutional operations and multi-scale feature extraction. Meanwhile, the wavelet transform enhancement module exploits multi-scale analysis and frequency decomposition to improve the ViT's understanding of both global and local image structure, and thereby its ability to handle complex structures and intricate image details. Experiments on the CIFAR-10, CIFAR-100 and ImageNet-1k datasets are conducted to compare the proposed method with advanced classification methods. The results show that the proposed model achieves higher classification accuracy, confirming its effectiveness and competitive advantage.
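The frequency decomposition underlying the wavelet transform enhancement module can be illustrated with a single level of the 2-D Haar wavelet transform, which splits an image into a low-frequency approximation (global structure) and three high-frequency detail subbands (local structure). The following is a minimal NumPy sketch of that decomposition, not the paper's exact module:

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2-D Haar wavelet transform of an (H, W) array
    with even H and W. Returns four half-resolution subbands:
    LL (approximation), LH/HL (detail), HH (diagonal detail)."""
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    # Orthonormal Haar combinations (factor 1/2 = (1/sqrt(2))^2)
    ll = (a + b + c + d) / 2.0  # low-pass in both directions
    lh = (a - b + c - d) / 2.0  # high-pass horizontally
    hl = (a + b - c - d) / 2.0  # high-pass vertically
    hh = (a - b - c + d) / 2.0  # high-pass in both directions
    return ll, lh, hl, hh

img = np.arange(16, dtype=float).reshape(4, 4)
ll, lh, hl, hh = haar_dwt2(img)
```

Because the transform is orthonormal, the total energy of the four subbands equals that of the input, so no information is discarded; an enhancement module of this kind can process the low- and high-frequency subbands separately before recombining them.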
|
Funding
National Natural Science Foundation of China (62176052)