Abstract
Action recognition, a fundamental task in video understanding, has been extensively researched and applied. In contrast to an image, a video introduces an extra temporal dimension. However, many existing action recognition networks either perform simple temporal fusion through averaging or rely on models pre-trained for image recognition, resulting in limited capacity to extract temporal information. This work proposes a highly efficient temporal decoding module that can be seamlessly integrated into any action recognition backbone network to strengthen the modeling of temporal relationships between video frames. First, the decoder initializes a set of learnable queries, termed video-level action category prediction queries. These queries, after self-attention learning, are then combined with the video frame features extracted by the backbone network to capture video context information. Finally, the prediction queries, now enriched with temporal features, are used for category prediction. Experimental results on the HMDB51, MSRDailyAct3D, Diving48 and Breakfast datasets show that, with TokShift-Transformer and VideoMAE as encoders, introducing the proposed temporal decoder yields a significant improvement in Top-1 accuracy over the original models: an average gain of more than 11% for TokShift-Transformer and nearly 5% for VideoMAE across the four datasets. Furthermore, the work explores combining the decoder with other action recognition networks, including TimeSformer, as encoders, achieving an average accuracy improvement of more than 3.5% on the HMDB51 dataset. The code is available at https://github.com/huangturbo/TempDecoder.
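To make the described pipeline concrete, the sketch below illustrates one plausible reading of the temporal decoding module: learnable video-level prediction queries first interact through self-attention, then attend to the per-frame features of an arbitrary backbone via cross-attention, and are finally mapped to action categories. All class names, layer counts and dimensions here are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class TemporalDecoder(nn.Module):
    """Minimal sketch of the temporal decoding module described in the abstract.

    Learnable video-level prediction queries -> self-attention -> cross-attention
    over backbone frame features -> category prediction. Hyperparameters and
    naming are assumptions for illustration only.
    """

    def __init__(self, num_classes, embed_dim=768, num_queries=4, num_heads=8):
        super().__init__()
        # Learnable video-level action category prediction queries.
        self.queries = nn.Parameter(torch.randn(num_queries, embed_dim))
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, frame_features):
        # frame_features: (batch, num_frames, embed_dim) from any backbone encoder.
        b = frame_features.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)        # (B, Q, D)
        q = self.norm1(q + self.self_attn(q, q, q)[0])         # query self-attention
        q = self.norm2(q + self.cross_attn(q, frame_features,  # attend to frame features
                                           frame_features)[0])
        return self.classifier(q.mean(dim=1))                  # video-level prediction


# Usage with assumed shapes: 8 frames of 768-d features per video, 51 classes (e.g. HMDB51).
decoder = TemporalDecoder(num_classes=51)
feats = torch.randn(2, 8, 768)        # (batch, frames, dim) from a backbone
print(decoder(feats).shape)           # torch.Size([2, 51])
```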
Keywords
action recognition / video understanding / temporal relationship / temporal decoder / Transformer
Cite this article
Qiubo HUANG, Jianmin MEI, Wupeng ZHAO, Yiru LU, Mei WANG, Dehua CHEN.
An Efficient Temporal Decoding Module for Action Recognition.
Journal of Donghua University (English Edition), 2025, 42(2): 187-196. DOI: 10.19884/j.1672-5220.202403011
Funding
Shanghai Municipal Commission of Economy and Information Technology, China (202301054)