Multipath affinage stacked-hourglass networks for human pose estimation

Guoguang HUA; Lihong LI; Shiguang LIU

doi:10.1007/s11704-019-8266-2

Front. Comput. Sci., 2020, Vol. 14, Issue 4: 144701

RESEARCH ARTICLE

Abstract

Recently, the stacked hourglass network has shown outstanding performance in human pose estimation. However, the repeated bottom-up and top-down strided convolution operations in deep convolutional neural networks significantly reduce the initial image resolution. To address this problem, we propose incorporating an affinage module and a residual attention module into the stacked hourglass network for human pose estimation. We introduce a novel network architecture that replaces the up-sampling operation of the stacked hourglass network in order to obtain high-resolution features. We refer to this architecture as the affinage module, which is critical to improving the performance of the stacked hourglass network. Additionally, we propose a novel residual attention module to increase the supervision of the up-sampling process. The effectiveness of the introduced modules is evaluated on standard benchmarks. Experimental results demonstrate that our method achieves more accurate and more robust human pose estimation in images with complex backgrounds.

Keywords

human pose estimation / stacked hourglass network / affinage module / residual attention module

Cite this article

Guoguang HUA, Lihong LI, Shiguang LIU. Multipath affinage stacked-hourglass networks for human pose estimation. Front. Comput. Sci., 2020, 14(4): 144701. DOI: 10.1007/s11704-019-8266-2
