2SWUNet: small window SWinUNet based on transformer for building extraction from high-resolution remote sensing images
Jiamin Yu, Sixian Chan, Yanjing Lei, Wei Wu, Yuan Wang, Xiaolong Zhou
Models dedicated to building long-range dependencies often exhibit degraded performance when transferred to remote sensing images. The vision transformer (ViT) is a new paradigm in computer vision that uses multi-head self-attention (MSA) rather than convolution as its main computational module, giving it global modeling capability. However, its performance on small datasets is usually far inferior to that of convolutional neural networks (CNNs). In this work, we propose a small window SWinUNet (2SWUNet) for building extraction from high-resolution remote sensing images. First, 2SWUNet is built on the Swin Transformer with a fully symmetric U-shaped encoder-decoder architecture. Second, to construct a reasonable U-shaped architecture for building extraction from high-resolution remote sensing images, different forms of patch expansion are explored to perform up-sampling and recover feature-map resolution. Third, a small window-based multi-head self-attention (W-MSA) is designed to reduce the computational and memory burden, making it better suited to the characteristics of remote sensing images. Meanwhile, a pre-training mechanism is introduced to compensate for the lack of decoder parameters. Finally, comparison experiments with mainstream CNNs and ViTs validate the superiority of the proposed model.
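The small-window W-MSA mentioned above restricts self-attention to non-overlapping local windows, which is what lowers the computational and memory cost relative to global MSA. The following is a minimal sketch of that idea, not the paper's implementation: the class name `SmallWindowMSA`, the default `window_size=4`, and the PyTorch layout are illustrative assumptions.

```python
# Hypothetical sketch of small window-based multi-head self-attention (W-MSA).
# Assumes a PyTorch setting; all names and defaults are illustrative.
import torch
import torch.nn as nn


class SmallWindowMSA(nn.Module):
    def __init__(self, dim, num_heads, window_size=4):
        super().__init__()
        self.num_heads = num_heads
        self.window_size = window_size
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C); H and W are assumed divisible by the window size
        B, H, W, C = x.shape
        ws = self.window_size
        # Partition the feature map into non-overlapping ws x ws windows
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        # Standard multi-head self-attention, computed inside each window only
        qkv = self.qkv(x).reshape(x.shape[0], ws * ws, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(x.shape[0], ws * ws, C)
        out = self.proj(out)
        # Reverse the window partition back to (B, H, W, C)
        out = out.view(B, H // ws, W // ws, ws, ws, C)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return out
```

Because attention is evaluated within ws x ws windows, the attention matrix grows with the window area rather than with the full image resolution, which is why a smaller window reduces cost on large remote sensing tiles.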