Instance-sequence reasoning for video question answering

Rui LIU, Yahong HAN

Front. Comput. Sci., 2022, Vol. 16, Issue 6: 166708. DOI: 10.1007/s11704-021-1248-1
Image and Graphics
RESEARCH ARTICLE


Abstract

Video question answering (Video QA) requires a thorough understanding of both the video content and the question language, as well as grounding the textual semantics in the visual content of the video. Thus, to answer questions more accurately, not only should each semantic entity be associated with a specific visual instance in the video frames, but the action or event in the question should also be localized to a corresponding temporal slot. This makes Video QA a challenging task that requires reasoning over correlations between instances across temporal frames. In this paper, we propose an instance-sequence reasoning network for video question answering with instance grounding and temporal localization. In our model, both visual instances and textual representations are first embedded as graph nodes, which facilitates the integration of intra- and inter-modality information. We then propose graph causal convolution (GCC) on graph-structured sequences; its large receptive field captures more causal connections, which is vital for visual grounding and instance-sequence reasoning. Finally, we evaluate our model on the TVQA+ dataset, which provides ground truth for instance grounding and temporal localization, as well as on three other Video QA datasets and three multimodal language processing datasets. Extensive experiments demonstrate the effectiveness and generalization ability of the proposed method. Specifically, our method outperforms state-of-the-art methods on these benchmarks.
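
Below is a minimal, hypothetical sketch (in PyTorch) of how the pipeline described above might be realized: per-frame visual-instance and textual features are treated as graph nodes, aggregated with a graph convolution, and then passed through a causal dilated convolution along the frame axis as a stand-in for graph causal convolution (GCC). All module names, shapes, and hyperparameters here are illustrative assumptions and do not reproduce the authors' implementation.

    # Illustrative sketch only: graph aggregation over per-frame nodes
    # followed by a causal dilated convolution along the frame axis.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GraphCausalConv(nn.Module):
        def __init__(self, dim, kernel_size=3, dilation=2):
            super().__init__()
            self.node_proj = nn.Linear(dim, dim)          # GCN-style node update
            self.pad = (kernel_size - 1) * dilation       # left-pad so the conv is causal
            self.temporal = nn.Conv1d(dim, dim, kernel_size, dilation=dilation)

        def forward(self, x, adj):
            # x:   (batch, frames, nodes, dim)   node features per frame
            # adj: (batch, frames, nodes, nodes) normalized adjacency per frame
            h = F.relu(self.node_proj(torch.matmul(adj, x)))   # intra-frame graph convolution
            g = h.mean(dim=2)                                  # pool nodes -> (B, T, D)
            g = g.transpose(1, 2)                              # (B, D, T) for Conv1d
            g = F.pad(g, (self.pad, 0))                        # causal: pad the past side only
            g = F.relu(self.temporal(g)).transpose(1, 2)       # back to (B, T, D)
            return g

    # Toy usage: 2 clips, 8 frames, 5 nodes (instances + question words), 64-dim features.
    x = torch.randn(2, 8, 5, 64)
    adj = torch.softmax(torch.randn(2, 8, 5, 5), dim=-1)
    out = GraphCausalConv(64)(x, adj)
    print(out.shape)  # torch.Size([2, 8, 64])

The left-only padding is what makes the temporal convolution causal: the representation at frame t depends only on frames up to t, while the dilation enlarges the receptive field without adding parameters.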

Keywords

video question answering / instance grounding / graph causal convolution

Cite this article

Rui LIU, Yahong HAN. Instance-sequence reasoning for video question answering. Front. Comput. Sci., 2022, 16(6): 166708 https://doi.org/10.1007/s11704-021-1248-1


Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61876130, 61932009).

RIGHTS & PERMISSIONS

© 2022 Higher Education Press