Weakly supervised temporal action localization with proxy metric modeling
Hongsheng XU, Zihan CHEN, Yu ZHANG, Xin GENG, Siya MI, Zhihong YANG
Temporal localization is crucial for recognizing actions in videos. Since manual annotation of videos is expensive and time-consuming, temporal localization with only weak video-level labels is challenging but indispensable. In this paper, we propose a weakly supervised temporal action localization approach for untrimmed videos. To address this problem, we train the model based on the proxies of each action class. The proxies are used to measure the distances between action segments and the original features of the different action classes. A proxy-based metric clusters segments of the same action together and separates actions from the background. Compared with state-of-the-art methods, our method achieves competitive results on the THUMOS14 and ActivityNet1.2 datasets.
temporal action localization / weakly supervised videos / proxy metric
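A minimal sketch of the proxy-based metric described in the abstract is given below, assuming a PyTorch-style implementation in which each action class owns a learnable proxy vector, segment features are pulled toward the proxy of their labeled class, and pushed away from the proxies of all other classes (in the spirit of proxy-anchor metric learning). The function name, the cosine similarity, and the scaling/margin parameters alpha and delta are illustrative assumptions rather than the authors' exact formulation.

import torch
import torch.nn.functional as F

def proxy_metric_loss(features, labels, proxies, alpha=32.0, delta=0.1):
    # features: (N, D) segment-level features from an untrimmed video
    # labels:   (N,) action-class index assigned to each segment
    #           (derived from the video-level label under weak supervision)
    # proxies:  (C, D) one learnable proxy per action class
    # Cosine similarity between every segment and every class proxy.
    sim = F.normalize(features, dim=1) @ F.normalize(proxies, dim=1).t()   # (N, C)

    pos_mask = F.one_hot(labels, num_classes=proxies.size(0)).float()      # (N, C)
    neg_mask = 1.0 - pos_mask

    # Pull term: segments should be close to the proxy of their own class.
    pos_exp = (torch.exp(-alpha * (sim - delta)) * pos_mask).sum(dim=0)
    has_pos = pos_mask.sum(dim=0) > 0
    pull = torch.log1p(pos_exp[has_pos]).mean() if has_pos.any() else sim.new_zeros(())

    # Push term: segments should be far from the proxies of every other class.
    neg_exp = (torch.exp(alpha * (sim + delta)) * neg_mask).sum(dim=0)
    push = torch.log1p(neg_exp).mean()

    return pull + push

Under weak supervision, every segment of a video labeled with class c would be treated as a (possibly noisy) positive for proxy c, while segments judged to be background (for example, by low attention scores) can be made negatives for all proxies by zeroing their rows in pos_mask. This is one plausible way to realize the clustering of same-class actions and the separation of actions from the background, not necessarily the exact loss used in the paper.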
Hongsheng Xu received the BSc degree from Southeast University, Nanjing, China in 2009, and the PhD degree in electrical engineering from Iowa State University, USA in 2015. He is currently a Machine Learning Scientist with NARI Research Institute, NARI Group Corporation, China. His current research interests include the development and application of deep reinforcement learning in smart grids and energy markets, as well as deep learning approaches for operation and maintenance applications in power systems.
Zihan Chen received the BS degree in computer science and technology from the University of Electronic Science and Technology of China. He is currently a master's student at the School of Computer Science and Engineering, Southeast University, China. His research interests include machine learning and computer vision.
Yu Zhang received the BS and MS degrees in telecommunications engineering from Xidian University, China, and the PhD degree in computer engineering from Nanyang Technological University, Singapore. He was a postdoctoral fellow at the Bioinformatics Institute, A*STAR, Singapore, and is now an Associate Professor at Southeast University, China. His research interest is computer vision.
Xin Geng is currently a professor and the dean of the School of Computer Science and Engineering at Southeast University, China. He received the BSc (2001) and MSc (2004) degrees in computer science from Nanjing University, China, and the PhD (2008) degree in computer science from Deakin University, Australia. His research interests include machine learning, pattern recognition, and computer vision.
Siya Mi received double BS degrees from the Beijing University of Posts and Telecommunications, China, and the University of London, UK in 2010, and the MS and PhD degrees from Nanyang Technological University, Singapore in 2011 and 2018, respectively. She is currently a lecturer at Southeast University, China. Her research interests include data processing and computer vision for cyber security.
Zhihong Yang received the BSc degree from Nanjing University, China in 1990, and the MSc degree from Southeast University, China in 1998, both in computer science. He has been with NARI Group Corporation, China, for 22 years, and has been the vice president of NARI Research Institute, NARI Group Corporation, since 2018. He led the development of novel automation technologies that have been turned into series products widely used in the grid dispatching industry. His research interests include power system automation, integrated energy systems, big data analysis, and AI applications in power systems. He is also a member of the National Power System Management and Information Exchange Standardization Technical Committee.