ResLNet: deep residual LSTM network with longer input for action recognition
Tian WANG, Jiakun LI, Huai-Ning WU, Ce LI, Hichem SNOUSSI, Yang WU
Action recognition is an important research topic in video analysis that remains very challenging. Effective recognition relies on learning a good representation of both spatial information (appearance) and temporal information (motion). These two kinds of information are highly correlated yet have quite different properties, so both chaining independent models (e.g., CNN-LSTM) and direct unbiased co-modeling (e.g., 3D CNN) yield unsatisfying results. Moreover, deep models for this task have traditionally taken only 8 or 16 consecutive frames as input, which makes it hard to extract discriminative motion features. In this work, we propose a novel network structure called ResLNet (deep residual LSTM network), which accepts longer inputs (e.g., 64 frames) and, through the proposed embedded variable-stride convolution, lets convolutions and the LSTM collaborate more effectively within a residual structure to learn better spatial-temporal representations without extra computational cost. The advantages of the proposal are demonstrated, together with an ablation study, on three popular benchmark datasets: Kinetics, HMDB51, and UCF101. The proposed network can accept various input features, such as RGB and optical flow; owing to the limited computing power of our experimental equipment and the real-time requirement, we evaluate it on RGB inputs only, where it shows strong performance.
action recognition / deep learning / neural network
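To make the abstract's core idea concrete, the following is a minimal PyTorch sketch of a residual block that combines a temporally strided 3D convolution with an LSTM over a long (e.g., 64-frame) clip. All layer sizes, the placement of the LSTM, the pooling step, and the stride schedule (the names ResLSTMBlock and temporal_stride) are illustrative assumptions, not the paper's exact ResLNet configuration.

```python
# A hedged sketch of the ideas described above: a residual block where a
# variable-stride 3D convolution shortens the temporal axis, an LSTM models
# the resulting sequence, and a residual connection merges the two paths.
import torch
import torch.nn as nn


class ResLSTMBlock(nn.Module):
    def __init__(self, channels, temporal_stride=2):
        super().__init__()
        # Striding only along time lets the block absorb a longer input clip
        # without increasing the computation of the layers that follow it.
        self.conv = nn.Conv3d(channels, channels, kernel_size=3,
                              stride=(temporal_stride, 1, 1), padding=1)
        self.bn = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)
        # LSTM over the temporal axis of globally pooled spatial features.
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))
        self.lstm = nn.LSTM(channels, channels, batch_first=True)

    def forward(self, x):                      # x: (N, C, T, H, W)
        y = self.relu(self.bn(self.conv(x)))   # (N, C, T/stride, H, W)
        # Pool spatially, then run the LSTM over the shortened time axis.
        seq = self.pool(y).squeeze(-1).squeeze(-1).transpose(1, 2)  # (N, T', C)
        out, _ = self.lstm(seq)                                     # (N, T', C)
        # Residual connection: broadcast the LSTM's temporal summary back
        # over the spatial dimensions of the convolutional features.
        out = out.transpose(1, 2).unsqueeze(-1).unsqueeze(-1)       # (N, C, T', 1, 1)
        return y + out


clip = torch.randn(2, 64, 64, 56, 56)   # batch of 64-frame, 64-channel clips
block = ResLSTMBlock(channels=64, temporal_stride=2)
print(block(clip).shape)                 # torch.Size([2, 64, 32, 56, 56])
```

In this sketch, each block halves the temporal length, so deeper blocks process progressively shorter sequences; this is one plausible reading of how a 64-frame input can be handled "without the cost of extra computations."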