A Wavelet Transform and Spatial Positional Enhanced Method for Vision Transformer

Runyu HU, Xuesong TANG, Kuangrong HAO

Journal of Donghua University (English Edition), 2025, 42(3): 330-338. DOI: 10.19884/j.1672-5220.202412001

Information Technology and Artificial Intelligence

Abstract

In the vision transformer (ViT) architecture, image data are transformed into sequential data for processing, which may result in the loss of spatial positional information. While the self-attention mechanism strengthens the ability of ViT to capture global features, it compromises the preservation of fine-grained local features. To address these challenges, we propose a spatial positional enhancement module and a wavelet transform enhancement module tailored for ViT models. These modules aim to reduce the loss of spatial positional information during patch embedding and to enhance the model's feature extraction capability. The spatial positional enhancement module reinforces spatial information in the sequential data through convolutional operations and multi-scale feature extraction. Meanwhile, the wavelet transform enhancement module uses multi-scale analysis and frequency decomposition to improve the ViT's understanding of global and local image structures, as well as its ability to process complex structures and intricate image details. Experiments on the CIFAR-10, CIFAR-100 and ImageNet-1k datasets are conducted to compare the proposed method with advanced classification methods. The results show that the proposed model achieves higher classification accuracy, confirming its effectiveness and competitive advantage.
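As a rough illustration of the two ideas described in the abstract, the PyTorch sketch below shows (a) a convolutional spatial-positional enhancement applied to the 2-D grid of patch tokens, and (b) a one-level Haar wavelet decomposition that separates low-frequency (global) structure from high-frequency (local) detail. This is a minimal sketch under assumed settings, not the authors' implementation; the module name, kernel sizes, and dimensions are illustrative only.

```python
# Minimal sketch (assumptions, not the authors' code): a convolutional
# spatial-positional enhancement for ViT patch tokens, and a one-level
# Haar wavelet decomposition separating global structure (LL) from
# local detail (LH, HL, HH).
import torch
import torch.nn as nn


class SpatialPositionalEnhancement(nn.Module):
    """Reinforce spatial position in the token sequence with multi-scale
    depthwise convolutions over the 2-D patch grid (class token excluded)."""

    def __init__(self, dim: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim) for k in kernel_sizes
        )

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, N, C) with N = h * w patch tokens
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, h, w)       # back to a 2-D grid
        x = x + sum(branch(x) for branch in self.branches)   # residual multi-scale conv
        return x.flatten(2).transpose(1, 2)                  # back to a sequence


def haar_dwt2d(x: torch.Tensor):
    """One level of the 2-D Haar wavelet transform on (B, C, H, W) maps.
    Returns the low-frequency band LL and the detail bands (LH, HL, HH)."""
    a = x[..., 0::2, 0::2]   # top-left pixel of each 2x2 block
    b = x[..., 0::2, 1::2]   # top-right
    c = x[..., 1::2, 0::2]   # bottom-left
    d = x[..., 1::2, 1::2]   # bottom-right
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, (lh, hl, hh)


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 384)          # 14 x 14 patch tokens, embed dim 384
    spe = SpatialPositionalEnhancement(384)
    print(spe(tokens, 14, 14).shape)           # torch.Size([2, 196, 384])

    img = torch.randn(2, 3, 224, 224)
    ll, (lh, hl, hh) = haar_dwt2d(img)
    print(ll.shape)                            # torch.Size([2, 3, 112, 112])
```

In a ViT pipeline, a positional block of this kind would typically act on the patch tokens right after patch embedding, while the wavelet sub-bands could feed a parallel frequency branch; how the paper actually integrates the two modules is not reproduced here.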

Keywords

transformer / wavelet transform / image classification / computer vision

Cite this article

Runyu HU, Xuesong TANG, Kuangrong HAO. A Wavelet Transform and Spatial Positional Enhanced Method for Vision Transformer. Journal of Donghua University (English Edition), 2025, 42(3): 330-338. DOI: 10.19884/j.1672-5220.202412001



Funding

National Natural Science Foundation of China (62176052)
