Graph-based method for human-object interactions detection

Li-min Xia; Wei Wu

doi:10.1007/s11771-021-4597-x

Journal of Central South University, 2021, Vol. 28, Issue 1: 205-218.

Article


Abstract

Human-object interaction (HOI) detection is a new branch of visual relationship detection and plays an important role in image understanding. Because image content is complex and diverse, detecting HOIs remains challenging. Unlike most current HOI detection methods, which rely only on the pairwise information of a human and an object, we propose a graph-based HOI detection method that models context and global structure information. First, to better exploit the relations between humans and objects, the detected humans and objects are treated as nodes of a fully connected undirected graph, and the graph is pruned to obtain an HOI graph that preserves only the edges connecting human and object nodes. Then, to obtain more robust features for the human and object nodes, two different attention-based feature extraction networks are proposed, which model global and local contexts, respectively. Finally, a graph attention network is introduced to pass messages iteratively between nodes in the HOI graph and to detect potential HOIs. Experiments on the V-COCO and HICO-DET datasets verify the effectiveness of the proposed method and show that it is superior to many existing methods.
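The graph construction and message-passing steps described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_hoi_graph` and `attention_step`, the `"human"`/`"object"` label format, and the toy feature vectors are all hypothetical, and plain dot-product scores stand in for the learned attention of a graph attention network.

```python
from itertools import combinations
import math


def build_hoi_graph(labels):
    """Return the edges of the pruned HOI graph.

    labels: one "human" or "object" string per detection (hypothetical format).
    """
    # Start from a fully connected undirected graph over all detections.
    full_edges = list(combinations(range(len(labels)), 2))
    # Prune: keep only edges that join a human node and an object node.
    return [
        (i, j) for i, j in full_edges
        if {labels[i], labels[j]} == {"human", "object"}
    ]


def attention_step(features, edges):
    """One round of attention-weighted message passing (no learned weights)."""
    neighbors = {i: [] for i in range(len(features))}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    updated = []
    for i, feat in enumerate(features):
        if not neighbors[i]:
            updated.append(list(feat))  # an isolated node keeps its feature
            continue
        # Assumption: dot-product scores replace the learned attention function.
        scores = [
            sum(a * b for a, b in zip(feat, features[j])) for j in neighbors[i]
        ]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]  # stable softmax numerators
        total = sum(weights)
        # Aggregate neighbor features, weighted by normalized attention.
        message = [
            sum(w / total * features[j][k] for w, j in zip(weights, neighbors[i]))
            for k in range(len(feat))
        ]
        updated.append([f + m for f, m in zip(feat, message)])
    return updated


# Toy example: two humans and two objects with made-up 2-D features.
labels = ["human", "human", "object", "object"]
edges = build_hoi_graph(labels)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.0]]
new_features = attention_step(features, edges)
```

In this toy case the pruning drops the human-human and object-object edges, so every remaining edge is a candidate interaction. Calling `attention_step` repeatedly would correspond to the iterative message passing mentioned in the abstract, under the simplifications stated above.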

Keywords

human-object interactions / visual relationship / context information / graph attention network

Cite this article


Li-min Xia, Wei Wu. Graph-based method for human-object interactions detection. Journal of Central South University, 2021, 28(1): 205-218. DOI: 10.1007/s11771-021-4597-x

