Weakly supervised temporal action localization with proxy metric modeling
Hongsheng XU, Zihan CHEN, Yu ZHANG, Xin GENG, Siya MI, Zhihong YANG
Temporal localization is crucial for recognizing actions in videos. Since manual annotation of videos is expensive and time-consuming, temporal localization with only weak video-level labels is challenging but indispensable. In this paper, we propose a weakly supervised temporal action localization approach for untrimmed videos. To address this problem, we train the model based on the proxies of each action class. The proxies are used to measure the distances between action segments and the original features of the different action classes. A proxy-based metric clusters segments of the same action together and separates actions from the background. Compared with state-of-the-art methods, our method achieves competitive results on the THUMOS14 and ActivityNet1.2 datasets.
temporal action localization / weakly supervised videos / proxy metric
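A minimal sketch of the proxy-based metric described in the abstract is given below, assuming a PyTorch-style implementation in which each action class owns a learnable proxy vector, segment features are pulled toward the proxy of their labeled class, and pushed away from the proxies of all other classes (in the spirit of proxy-anchor metric learning). The function name, the cosine similarity, and the scaling/margin parameters alpha and delta are illustrative assumptions rather than the authors' exact formulation.

import torch
import torch.nn.functional as F

def proxy_metric_loss(features, labels, proxies, alpha=32.0, delta=0.1):
    # features: (N, D) segment-level features from an untrimmed video
    # labels:   (N,) action-class index assigned to each segment
    #           (derived from the video-level label under weak supervision)
    # proxies:  (C, D) one learnable proxy per action class
    # Cosine similarity between every segment and every class proxy.
    sim = F.normalize(features, dim=1) @ F.normalize(proxies, dim=1).t()   # (N, C)

    pos_mask = F.one_hot(labels, num_classes=proxies.size(0)).float()      # (N, C)
    neg_mask = 1.0 - pos_mask

    # Pull term: segments should be close to the proxy of their own class.
    pos_exp = (torch.exp(-alpha * (sim - delta)) * pos_mask).sum(dim=0)
    has_pos = pos_mask.sum(dim=0) > 0
    pull = torch.log1p(pos_exp[has_pos]).mean() if has_pos.any() else sim.new_zeros(())

    # Push term: segments should be far from the proxies of every other class.
    neg_exp = (torch.exp(alpha * (sim + delta)) * neg_mask).sum(dim=0)
    push = torch.log1p(neg_exp).mean()

    return pull + push

Under weak supervision, every segment of a video labeled with class c would be treated as a (possibly noisy) positive for proxy c, while segments judged to be background (for example, by low attention scores) can be made negatives for all proxies by zeroing their rows in pos_mask. This is one plausible way to realize the clustering of same-class actions and the separation of actions from the background, not necessarily the exact loss used in the paper.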
Hongsheng Xu received the BSc degree from Southeast University, Nanjing, China in 2009, and the PhD degree in electrical engineering from Iowa State University, USA in 2015. He is currently a Machine Learning Scientist with NARI Research Institute, NARI Group Corporation, China. His current research interests include the development and application of deep reinforcement learning in smart grids and energy markets, as well as deep learning approaches for operation and maintenance applications in power systems.
Zihan Chen received the BS degree in computer science and technology from the University of Electronic Science and Technology of China. He is currently a master's student at the School of Computer Science and Engineering, Southeast University, China. His research interests include machine learning and computer vision.
Yu Zhang received the BS and MS degrees in telecommunications engineering from Xidian University, China, and the PhD degree in computer engineering from Nanyang Technological University, Singapore. He was a postdoctoral fellow at the Bioinformatics Institute, A*STAR, Singapore, and is now an Associate Professor at Southeast University, China. His research interest is computer vision.
Xin Geng is currently a professor and the dean of the School of Computer Science and Engineering at Southeast University, China. He received the BSc (2001) and MSc (2004) degrees in computer science from Nanjing University, China, and the PhD (2008) degree in computer science from Deakin University, Australia. His research interests include machine learning, pattern recognition, and computer vision.
Siya Mi received double BS degrees from the Beijing University of Posts and Telecommunications, China, and the University of London, UK in 2010, and the MS and PhD degrees from Nanyang Technological University, Singapore in 2011 and 2018, respectively. She is currently a lecturer at Southeast University, China. Her research interests include data processing and computer vision for cyber security.
Zhihong Yang received the BSc degree from Nanjing University, China in 1990, and the MSc degree from Southeast University, China in 1998, both in computer science. He has been with NARI Group Corporation, China, for 22 years, and has been the vice president of NARI Research Institute, NARI Group Corporation, since 2018. He led the development of novel automation technologies that have been turned into series products widely used in the grid dispatching industry. His research interests include power system automation, integrated energy systems, big data analysis, and AI applications in power systems. He is also a member of the National Power System Management and Information Exchange Standardization Technical Committee.