Multi-level temporal feature fusion with feature exchange strategy for multiple object tracking

Yisu Ge; Wenjie Ye; Guodao Zhang; Mengying Lin

doi:10.1007/s11801-024-4139-5

Optoelectronics Letters, 2024, Vol. 20, Issue 8: 505-512. DOI: 10.1007/s11801-024-4139-5



Abstract

As neural network research has deepened, object detection has developed rapidly in recent years, and video object detection methods have gradually attracted scholarly attention, especially frameworks that combine multiple object tracking and detection. Most current works build the joint tracking-and-detection paradigm through multi-task learning. In contrast, this paper proposes a multi-level temporal feature fusion structure that improves framework performance by exploiting the constraint of video temporal consistency. To train the temporal network end-to-end, a feature exchange training strategy is put forward so that the temporal feature fusion structure can be trained efficiently. The proposed method is tested on several recognized benchmarks and obtains encouraging results compared with a well-known joint detection and tracking framework. An ablation experiment examines where in the network temporal feature fusion is best placed.

Cite this article


Yisu Ge, Wenjie Ye, Guodao Zhang, Mengying Lin. Multi-level temporal feature fusion with feature exchange strategy for multiple object tracking. Optoelectronics Letters, 2024, 20(8): 505-512. https://doi.org/10.1007/s11801-024-4139-5
