Detection using mask adaptive transformers in unmanned aerial vehicle imagery

Huibiao Ye , Weiming Fan , Yuping Guo , Xuna Wang , Dalin Zhou

Optoelectronics Letters, 2025, Vol. 21, Issue 2: 113-120. DOI: 10.1007/s11801-025-4185-7

Abstract

Drone photography is an essential building block of intelligent transportation, enabling wide-ranging monitoring, precise positioning, and rapid transmission. However, the high computational cost of transformer-based object detection methods hinders real-time transmission of results in drone target detection applications. We therefore propose a mask adaptive transformer (MAT) tailored to such scenarios. Specifically, we introduce a structure that supports collaborative token sparsification within support windows, enhancing fault tolerance and reducing computational overhead. This structure comprises two modules: a binary mask strategy and adaptive window self-attention (A-WSA). The binary mask strategy focuses computation on significant objects across various complex scenes. The A-WSA mechanism self-attends over the selected objects, balancing performance against computational cost and isolating them from contextual leakage. Extensive experiments on the challenging CARPK and VisDrone datasets demonstrate the effectiveness and superiority of the proposed method: it achieves a mean average precision (mAP@0.5) improvement of 1.25% over the car detector based on you only look once version 5 (CD-YOLOv5) on the CARPK dataset and a 3.75% average precision (AP@0.5) improvement over the cascaded zoom-in detector (CZ Det) on the VisDrone dataset.
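The binary-mask-plus-windowed-attention idea described above can be sketched in a few lines. The following is an illustrative reconstruction, not the authors' implementation: the function name, the fixed window size, and the way the binary mask blocks masked tokens from serving as keys/values are all assumptions made for the sake of a runnable example.

```python
import numpy as np

def masked_window_attention(x, keep_mask, window=4):
    """Single-head self-attention computed independently within fixed
    windows. A binary keep_mask (1 = significant token, 0 = pruned)
    prevents attention to masked tokens, so pruned tokens leak no
    context into the window -- a sketch of the binary-mask + A-WSA idea.
    """
    n, d = x.shape
    out = np.zeros_like(x)
    for start in range(0, n, window):
        idx = slice(start, start + window)
        xw, mw = x[idx], keep_mask[idx]
        # scaled dot-product attention restricted to the window
        scores = xw @ xw.T / np.sqrt(d)
        # binary mask: masked-out tokens contribute no keys/values
        scores = np.where(mw[None, :] == 1, scores, -1e9)
        # numerically stable softmax over each query's window scores
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)
        out[idx] = attn @ xw
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
mask = np.array([1, 1, 0, 1, 1, 0, 1, 1])  # hypothetical token mask
y = masked_window_attention(x, mask)
print(y.shape)  # (8, 16)
```

Because masked tokens receive an attention score of effectively minus infinity, perturbing a pruned token leaves the outputs of the kept tokens in its window unchanged, which is the "no contextual leakage" property the abstract claims.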

Cite this article

Huibiao Ye, Weiming Fan, Yuping Guo, Xuna Wang, Dalin Zhou. Detection using mask adaptive transformers in unmanned aerial vehicle imagery. Optoelectronics Letters, 2025, 21(2): 113-120. DOI: 10.1007/s11801-025-4185-7


References

[1] LeCun Y, Bengio Y, Hinton G. Deep learning[J]. Nature, 2015, 521(7553): 436-444.
[2] Redmon J. YOLOv3: an incremental improvement[EB/OL]. (2018-04-08) [2024-03-12]. https://arxiv.org/abs/1804.02767.
[3] Carion N, Massa F, Synnaeve G, et al. End-to-end object detection with transformers[C]//European Conference on Computer Vision, August 23-28, 2020, Glasgow, UK. Berlin: Springer, 2020: 213-229.
[4] Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, July 21-26, 2017, Hawaii, USA. New York: IEEE, 2017: 4700-4708.
[5] Yu J, Gao H, Chen Y, et al. Deep object detector with attentional spatiotemporal LSTM for space human-robot interaction[J]. IEEE Transactions on Human-Machine Systems, 2022, 52(4): 784-793.
[6] Chen X, Fan H, Girshick R, et al. Improved baselines with momentum contrastive learning[EB/OL]. (2020-03-09) [2024-03-12]. https://arxiv.org/abs/2003.04297.
[7] Zhang H, Lu C, Chen E. Obstacle detection: improved YOLOX-S based on Swin Transformer-tiny[J]. Optoelectronics Letters, 2023, 19(11): 698-704.
[8] Du B, Huang Y, Chen J, et al. Adaptive sparse convolutional networks with global context enhancement for faster object detection on drone images[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 18-22, 2023, Vancouver, Canada. New York: IEEE, 2023: 13435-13444.
[9] Bao F, Nie S, Xue K, et al. All are worth words: a ViT backbone for diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 18-22, 2023, Vancouver, Canada. New York: IEEE, 2023: 22669-22679.
[10] Liu Z, Lin Y, Cao Y, et al. Swin Transformer: hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, October 11-17, 2021, Montreal, Canada. New York: IEEE, 2021: 10012-10022.
[11] Li T, Wang J, Zhang T. L-DETR: a light-weight detector for end-to-end object detection with transformers[J]. IEEE Access, 2022, 10: 105685-105692.
[12] Dong X, Bao J, Chen D, et al. CSWin Transformer: a general vision transformer backbone with cross-shaped windows[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 19-24, 2022, Louisiana, USA. New York: IEEE, 2022: 12124-12134.
[13] Wang W, Xie E, Li X, et al. PVT v2: improved baselines with pyramid vision transformer[J]. Computational Visual Media, 2022, 8(3): 415-424.
[14] Wang W, Xie E, Li X, et al. Pyramid vision transformer: a versatile backbone for dense prediction without convolutions[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, October 11-17, 2021, Montreal, Canada. New York: IEEE, 2021: 568-578.
[15] Hsieh M R, Lin Y L, Hsu W H. Drone-based object counting by spatially regularized regional proposal network[C]//Proceedings of the IEEE International Conference on Computer Vision, October 22-29, 2017, Venice, Italy. New York: IEEE, 2017: 4145-4153.
[16] Du D, Zhu P, Wen L, et al. VisDrone-DET2019: the vision meets drone object detection in image challenge results[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, October 27-November 2, 2019, Seoul, South Korea. New York: IEEE, 2019.
[17] Mo N, Yan L. Oriented vehicle detection in high-resolution remote sensing images based on feature amplification and category balance by oversampling data augmentation[J]. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2020, 43: 153-159.
[18] Tang T, Zhou S, Deng Z, et al. Arbitrary-oriented vehicle detection in aerial imagery with single convolutional neural networks[J]. Remote Sensing, 2017, 9(11): 1170.
[19] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 7-12, 2015, Boston, USA. New York: IEEE, 2015: 3431-3440.
[20] Badrinarayanan V, Kendall A, Cipolla R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481-2495.
[21] Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation[C]//Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), October 5-9, 2015, Munich, Germany. Berlin: Springer International Publishing, 2015: 234-241.
[22] Yu J, Gao H, Sun J, et al. Spatial cognition-driven deep learning for car detection in unmanned aerial vehicle imagery[J]. IEEE Transactions on Cognitive and Developmental Systems, 2021, 14(4): 1574-1583.
[23] Yang C, Huang Z, Wang N. QueryDet: cascaded sparse query for accelerating high-resolution small object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 19-24, 2022, Louisiana, USA. New York: IEEE, 2022: 13668-13677.
[24] Meethal A, Granger E, Pedersoli M. Cascaded zoom-in detector for high resolution aerial images[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 18-22, 2023, Vancouver, Canada. New York: IEEE, 2023: 2046-2055.
[25] Nguyen D L, Vo X T, Priadana A, et al. Car detector based on YOLOv5 for parking management[C]//Conference on Information Technology and Its Applications, July 28-29, 2023, Da Nang, Vietnam. Cham: Springer Nature Switzerland, 2023: 102-113.
[26] Zhu C, He Y, Savvides M. Feature selective anchor-free module for single-shot object detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 16-20, 2019, Long Beach, USA. New York: IEEE, 2019: 840-849.
[27] Zhang H, Wang Y, Dayoub F, et al. VarifocalNet: an IoU-aware dense object detector[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 19-25, 2021, Nashville, TN, USA. New York: IEEE, 2021: 8514-8523.
[28] Feng C, Zhong Y, Gao Y, et al. TOOD: task-aligned one-stage object detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, October 11-17, 2021, Montreal, Canada. New York: IEEE, 2021: 3490-3499.
[29] Chen Z, Yang C, Li Q, et al. Disentangle your dense object detector[C]//Proceedings of the 29th ACM International Conference on Multimedia, October 21-25, 2021, Chengdu, China. New York: ACM, 2021: 4939-4948.
[30] Jocher G, Chaurasia A, Qiu J. YOLO by Ultralytics[EB/OL]. (2023-01-01) [2024-03-12]. https://github.com/ultralytics/ultralytics/blob/main/CITATION.cff.
[31] Wang X, Yao F, Li A, et al. DroneNet: rescue drone-view object detection[J]. Drones, 2023, 7(7): 441.
[32] Wei Z, Duan C, Song X, et al. AMRNet: chips augmentation in aerial images object detection[EB/OL]. (2020-09-15) [2024-03-12]. https://arxiv.org/abs/2009.07168.
[33] Zhang H, Li F, Liu S, et al. DINO: DETR with improved denoising anchor boxes for end-to-end object detection[EB/OL]. (2022-03-07) [2024-03-12]. https://arxiv.org/abs/2203.03605.

RIGHTS & PERMISSIONS

Tianjin University of Technology
