Soft video parsing by label distribution learning

Miaogen LING, Xin GENG

Front. Comput. Sci., 2019, Vol. 13, Issue 2: 302-317. DOI: 10.1007/s11704-018-8015-y
RESEARCH ARTICLE

Abstract

In this paper, we tackle the problem of segmenting a sequence of actions out of videos. The videos contain background and actions, and the actions are usually composed of ordered sub-actions. We refer to the sub-actions and the background as semantic units. Considering the possible overlap between two adjacent semantic units, we propose a bidirectional sliding window method to generate the label distributions for various segments in the video. The label distribution covers a certain number of semantic unit labels and represents the degree to which each label describes the video segment. The mapping from a video segment to its label distribution is then learned by a Label Distribution Learning (LDL) algorithm. Based on the LDL model, a soft video parsing method with segmental regular grammars is proposed to construct a tree structure for the video. Each leaf of the tree stands for a video clip of background or sub-action. The proposed method shows promising results on the THUMOS’14, MSR-II and UCF101 datasets, and its computational complexity is much lower than that of the compared state-of-the-art video parsing method.
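
To make the notion of a label distribution concrete, the following minimal Python sketch describes a video segment by how much of its duration each semantic unit covers. It is an illustration only and not the bidirectional sliding window method of the paper: the overlap-proportion rule, the function name `segment_label_distribution`, and the `(label, start, end)` annotation format are all assumptions introduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical annotation format: (label, start time, end time).
Unit = Tuple[str, float, float]


def segment_label_distribution(seg_start: float, seg_end: float,
                               units: List[Unit]) -> Dict[str, float]:
    """Return, for one segment, the share of its annotated duration per label.

    Assumption: the description degree of a label is proportional to the
    temporal overlap between the segment and that semantic unit.
    """
    if seg_end <= seg_start:
        raise ValueError("segment must have positive duration")
    overlaps: Dict[str, float] = {}
    for label, start, end in units:
        # Length of the intersection of [seg_start, seg_end] and [start, end].
        overlap = min(seg_end, end) - max(seg_start, start)
        if overlap > 0:
            overlaps[label] = overlaps.get(label, 0.0) + overlap
    total = sum(overlaps.values())
    if total == 0:
        return {}
    # Normalize so that the description degrees sum to 1.
    return {label: value / total for label, value in overlaps.items()}


# Toy annotation: background followed by two ordered sub-actions.
units = [
    ("background", 0.0, 4.0),
    ("sub-action 1", 4.0, 7.0),
    ("sub-action 2", 7.0, 10.0),
]
dist = segment_label_distribution(3.0, 8.0, units)
print(dist)
```

In this toy case the segment spans three adjacent semantic units, so its description is spread over three labels rather than forced onto a single hard label. Under the stated assumption, a distribution of this kind would be the target that an LDL model learns to predict from the segment's features.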

Keywords

video parsing / label distribution learning / subactions / graduality

Cite this article

Miaogen LING, Xin GENG. Soft video parsing by label distribution learning. Front. Comput. Sci., 2019, 13(2): 302‒317 https://doi.org/10.1007/s11704-018-8015-y

RIGHTS & PERMISSIONS

2018 Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature