Detection using mask adaptive transformers in unmanned aerial vehicle imagery
Huibiao Ye, Weiming Fan, Yuping Guo, Xuna Wang, Dalin Zhou
Optoelectronics Letters, 2025, Vol. 21, Issue 2: 113-120.
Drone photography is an essential building block of intelligent transportation, enabling wide-ranging monitoring, precise positioning, and rapid transmission. However, the high computational cost of transformer-based object detection methods hinders real-time transmission of results in drone target detection applications. We therefore propose a mask adaptive transformer (MAT) tailored to such scenarios. Specifically, we introduce a structure that supports collaborative token sparsification in support windows, enhancing fault tolerance and reducing computational overhead. This structure comprises two modules: a binary mask strategy and adaptive window self-attention (A-WSA). The binary mask strategy focuses on significant objects in various complex scenes. The A-WSA mechanism applies self-attention within windows to select objects and isolate all contextual leakage while balancing performance against computational cost. Extensive experiments on the challenging CarPK and VisDrone datasets demonstrate the effectiveness and superiority of the proposed method. In particular, it improves mean average precision (mAP@0.5) by 1.25% over the car detector based on you only look once version 5 (CD-YOLOv5) on the CarPK dataset, and improves average precision (AP@0.5) by 3.75% over the cascaded zoom-in detector (CZ Det) on the VisDrone dataset.
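The general idea of combining a binary token mask with window-restricted self-attention can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `masked_window_attention`, the identity query/key/value projections, the fixed non-overlapping windows, and the externally supplied `keep` mask are all assumptions made for illustration, and the abstract does not specify how MAT computes its mask or adapts its windows.

```python
import numpy as np

def masked_window_attention(x, keep, window):
    """Illustrative sketch only (hypothetical helper, not the paper's A-WSA).

    x:      (N, D) array of token features.
    keep:   (N,) boolean binary mask; True marks tokens that are kept.
    window: number of consecutive tokens per non-overlapping window.

    Kept tokens attend only to other kept tokens in the same window.
    Masked-out tokens are skipped and passed through unchanged.
    """
    n, d = x.shape
    assert n % window == 0, "token count must be divisible by window size"
    out = x.copy()
    for start in range(0, n, window):
        idx = np.arange(start, start + window)
        kept = idx[keep[idx]]            # indices of kept tokens in this window
        if kept.size == 0:
            continue                     # nothing to attend to in this window
        # Assumption: identity projections stand in for learned Q, K, V weights.
        q = k = v = x[kept]
        scores = q @ k.T / np.sqrt(d)
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        out[kept] = weights @ v
    return out

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))
mask = np.array([1, 0, 1, 1, 0, 0, 1, 1], dtype=bool)
result = masked_window_attention(tokens, mask, window=4)
```

In this sketch the attention cost in each window scales with the square of the number of kept tokens rather than the full window size, which is the sense in which sparsification reduces computation. Because attention never crosses a window boundary, changing a token in one window leaves the outputs of every other window unchanged.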
Tianjin University of Technology