Attribute-based supervised deep learning model for action recognition

Kai CHEN, Guiguang DING, Jungong HAN

Front. Comput. Sci., 2017, Vol. 11, Issue (2): 219-229. DOI: 10.1007/s11704-016-6066-5
RESEARCH ARTICLE

Abstract

Deep learning has been the most popular feature learning method across a variety of computer vision applications in the past three years. Not surprisingly, this technique, especially the convolutional neural networks (ConvNets) structure, has been applied to human action recognition with great success. Most existing algorithms directly adopt the basic ConvNets structure, which works well under ideal conditions, e.g., stable lighting. However, its performance degrades significantly when image appearance varies widely within the same category. To solve this problem, we propose a new method that integrates semantically meaningful attributes into the hierarchical structure of deep learning. The idea is to add simple yet effective attributes at the category level of ConvNets so that the attribute information drives the learning procedure. Experimental results on three popular action recognition databases show that embedding multiple auxiliary attributes into the deep learning framework significantly improves classification accuracy.
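
One common way to let attribute labels drive training alongside category labels is a joint objective: a softmax loss over action categories plus a weighted sum of per-attribute binary losses. The following is only a minimal sketch of that general idea in plain Python. The abstract does not specify the loss, so the function names, the sigmoid form of the attribute loss, and the `attr_weight` parameter are all assumptions made for illustration, not the authors' formulation.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of a softmax over category logits for an integer label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def sigmoid_cross_entropy(logits, targets):
    """Sum of independent binary cross-entropies, one per attribute."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Stable form of -[t*log(s) + (1-t)*log(1-s)] with s = sigmoid(z)
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total


def joint_loss(class_logits, label, attr_logits, attr_targets, attr_weight=0.5):
    """Category loss plus weighted attribute loss.

    attr_weight is a hypothetical trade-off parameter, not taken from the paper.
    """
    category_term = softmax_cross_entropy(class_logits, label)
    attribute_term = sigmoid_cross_entropy(attr_logits, attr_targets)
    return category_term + attr_weight * attribute_term


if __name__ == "__main__":
    # Three action categories and two binary attributes (made-up values).
    loss = joint_loss([2.0, 0.5, -1.0], 0, [1.5, -0.5], [1, 0])
    print(f"joint loss: {loss:.4f}")
```

Setting `attr_weight` to zero reduces the objective to plain category classification, which makes it easy to compare training with and without the attribute term.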

Keywords

action recognition / convolutional neural network / attribute

Cite this article

Kai CHEN, Guiguang DING, Jungong HAN. Attribute-based supervised deep learning model for action recognition. Front. Comput. Sci., 2017, 11(2): 219‒229 https://doi.org/10.1007/s11704-016-6066-5

RIGHTS & PERMISSIONS

2016 Higher Education Press and Springer-Verlag Berlin Heidelberg