2SWUNet: small window SWinUNet based on transformer for building extraction from high-resolution remote sensing images
Jiamin Yu, Sixian Chan, Yanjing Lei, Wei Wu, Yuan Wang, Xiaolong Zhou
Models dedicated to building long-range dependencies often exhibit degraded performance when transferred to remote sensing images. The vision transformer (ViT) is a new paradigm in computer vision that uses multi-head self-attention (MSA) rather than convolution as its main computational module, giving it global modeling capability. However, its performance on small datasets is usually far inferior to that of convolutional neural networks (CNNs). In this work, we propose a small window SWinUNet (2SWUNet) for building extraction from high-resolution remote sensing images. First, 2SWUNet is built on the Swin Transformer with a fully symmetric U-shaped encoder-decoder architecture. Second, to construct a reasonable U-shaped architecture for building extraction from high-resolution remote sensing images, different forms of patch expansion are explored to perform up-sampling and recover feature-map resolution. Third, a small window-based multi-head self-attention (W-MSA) is designed to reduce the computational and memory burden, making it better suited to the characteristics of remote sensing images. Meanwhile, a pre-training mechanism is introduced to compensate for the lack of decoder parameters. Finally, comparison experiments with mainstream CNNs and ViTs validate the superiority of the proposed model.
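The small-window W-MSA mentioned above restricts self-attention to non-overlapping local windows, which is what lowers the computational and memory cost relative to global MSA. The following is a minimal sketch of that idea, not the paper's implementation: the class name `SmallWindowMSA`, the default `window_size=4`, and the PyTorch layout are illustrative assumptions.

```python
# Hypothetical sketch of small window-based multi-head self-attention (W-MSA).
# Assumes a PyTorch setting; all names and defaults are illustrative.
import torch
import torch.nn as nn


class SmallWindowMSA(nn.Module):
    def __init__(self, dim, num_heads, window_size=4):
        super().__init__()
        self.num_heads = num_heads
        self.window_size = window_size
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C); H and W are assumed divisible by the window size
        B, H, W, C = x.shape
        ws = self.window_size
        # Partition the feature map into non-overlapping ws x ws windows
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        # Standard multi-head self-attention, computed inside each window only
        qkv = self.qkv(x).reshape(x.shape[0], ws * ws, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(x.shape[0], ws * ws, C)
        out = self.proj(out)
        # Reverse the window partition back to (B, H, W, C)
        out = out.view(B, H // ws, W // ws, ws, ws, C)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return out
```

Because attention is evaluated within ws x ws windows, the attention matrix grows with the window area rather than with the full image resolution, which is why a smaller window reduces cost on large remote sensing tiles.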