Multi-Scale Transformer for Image Restoration

Wuzhen Shi , Youwei Pan , Chun Zhao , Yuqing Liu , Shaobo Zhang , Heng Zhang , Yang Wen

CAAI Transactions on Intelligence Technology, 2026, Vol. 11, Issue 1: 41-54. DOI: 10.1049/cit2.70079

ORIGINAL RESEARCH

Abstract

Although Transformer-based image restoration methods have demonstrated impressive performance, existing Transformers still insufficiently exploit multiscale information. Previous non-Transformer-based studies have shown that incorporating multiscale features is crucial for improving restoration results. In this paper, we propose a multiscale Transformer (MST) that captures cross-scale attention among tokens, thereby effectively leveraging the multiscale patch recurrence prior of natural images. Furthermore, we introduce a channel-gate feed-forward network (CGFN) to enhance inter-channel information aggregation and reduce channel redundancy. To simultaneously utilise global, local and multiscale features, we design a multitype feature integration block (MFIB). Extensive experiments on both image super-resolution and HEVC compressed video artefact reduction demonstrate that the proposed MST achieves state-of-the-art performance. Ablation studies further verify the effectiveness of each proposed module.
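The abstract's central idea, cross-scale attention, can be illustrated with a toy sketch: queries come from the full-resolution token grid, while keys and values come from a coarser grid obtained by pooling groups of tokens, so each fine-scale token attends across scales. This is only a rough illustration of the general idea; the function names, the average-pooling choice, and the absence of learned projections below are assumptions, not MST's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_scale_attention(tokens, stride=2):
    """Toy cross-scale attention (illustrative, not the paper's module).

    tokens: (n, d) array of fine-scale tokens.
    Keys/values come from a coarse grid built by average-pooling
    non-overlapping groups of `stride` tokens.
    """
    n, d = tokens.shape
    # Coarse tokens: average-pool groups of `stride` fine tokens.
    coarse = tokens[: (n // stride) * stride].reshape(-1, stride, d).mean(axis=1)
    q, k, v = tokens, coarse, coarse           # no learned projections in this toy
    attn = softmax(q @ k.T / np.sqrt(d))       # (n, n // stride) cross-scale weights
    return attn @ v                            # fine tokens aggregate coarse tokens

x = np.random.default_rng(0).normal(size=(8, 4))
y = cross_scale_attention(x)
assert y.shape == (8, 4)
```

With constant input the attention weights are uniform and the output equals the input, which is a quick sanity check that the pooling and softmax are wired up consistently.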

Keywords

computer vision / image enhancement / image processing / image reconstruction / image resolution

Cite this article

Wuzhen Shi, Youwei Pan, Chun Zhao, Yuqing Liu, Shaobo Zhang, Heng Zhang, Yang Wen. Multi-Scale Transformer for Image Restoration. CAAI Transactions on Intelligence Technology, 2026, 11(1): 41-54. DOI: 10.1049/cit2.70079


Funding

This work was supported in part by the National Natural Science Foundation of China under Grants 62101346 and 62301330, in part by the Guangdong Basic and Applied Basic Research Foundation under Grants 2021A1515011702 and 2022A1515110101, in part by the Shenzhen Science and Technology Programme under Grants JCYJ20240813141358076 and 20231121103807001 and in part by the Guangdong Provincial Key Laboratory under Grant 2023B1212060076.

Conflicts of Interest

The authors declare no conflicts of interest.

Data Availability Statement

The authors declare that the data supporting the findings of this study are available within the paper.

