Graph-based method for human-object interactions detection
Li-min Xia, Wei Wu
Journal of Central South University, 2021, Vol. 28, Issue 1: 205-218.
Human-object interaction (HOI) detection is a new branch of visual relationship detection that plays an important role in image understanding. Because image content is complex and diverse, detecting HOIs remains a difficult challenge. Unlike most current HOI detection methods, which rely only on the pairwise information of a human and an object, we propose a graph-based HOI detection method that models context and global structure information. First, to better exploit the relations between humans and objects, the detected humans and objects are treated as nodes of a fully connected undirected graph, which is then pruned to obtain an HOI graph that preserves only the edges connecting human and object nodes. Then, to obtain more robust features for the human and object nodes, two attention-based feature extraction networks are proposed, which model global and local contexts, respectively. Finally, a graph attention network is introduced to pass messages iteratively between the nodes of the HOI graph and to detect potential HOIs. Experiments on the V-COCO and HICO-DET datasets verify the effectiveness of the proposed method and show that it is superior to many existing methods.
human-object interactions / visual relationship / context information / graph attention network
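The graph construction and message passing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_hoi_graph` and `attention_step`, the `"person"` label convention, and the toy feature vectors are all hypothetical, and the attention weights use a plain dot-product softmax as a stand-in for the learned attention of a graph attention network.

```python
from itertools import combinations
import math


def build_hoi_graph(labels):
    """Build the fully connected graph, then prune it to an HOI graph.

    labels: one class name per detected node; "person" marks a human node
    (an assumed convention for this sketch).
    """
    n = len(labels)
    # Step 1: fully connected undirected graph over all detections.
    full_edges = list(combinations(range(n), 2))
    # Step 2: keep only edges joining exactly one human and one object node.
    hoi_edges = [
        (i, j)
        for i, j in full_edges
        if (labels[i] == "person") != (labels[j] == "person")
    ]
    return full_edges, hoi_edges


def attention_step(features, edges):
    """One round of attention-weighted message passing over the HOI graph.

    Scores are dot products between node features (a simplification, not the
    learned scoring function of a graph attention network).
    """
    neighbors = {i: [] for i in range(len(features))}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    updated = []
    for i, feat in enumerate(features):
        if not neighbors[i]:
            updated.append(list(feat))  # isolated node: nothing to aggregate
            continue
        scores = [
            sum(a * b for a, b in zip(feat, features[j])) for j in neighbors[i]
        ]
        # Numerically stable softmax over the neighbor scores.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of neighbor features, added to the node's own feature.
        message = [
            sum(w * features[j][k] for w, j in zip(weights, neighbors[i]))
            for k in range(len(feat))
        ]
        updated.append([a + b for a, b in zip(feat, message)])
    return updated


labels = ["person", "person", "bicycle", "dog"]
full_edges, hoi_edges = build_hoi_graph(labels)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
new_features = attention_step(features, hoi_edges)
```

In this toy example the human-human edge and the object-object edge are removed by pruning, so each node exchanges messages only with nodes of the other type. Calling `attention_step` repeatedly on its own output would mimic the iterative message passing mentioned in the abstract.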