Coreference resolution helps visual dialogs to focus

Tianwei Yue, Wenping Wang, Chen Liang, Dachi Chen, Congrui Hetang, Xuewei Wang

High-Confidence Computing, 2024, Vol. 4, Issue 2: 100184. DOI: 10.1016/j.hcc.2023.100184

Research Article

Abstract

Visual Dialog is a multi-modal task at the intersection of computer vision and dialog systems: given an image as context, the goal is to answer a sequence of questions in a conversational style. Neural networks with attention modules are widely used for this task because of their effectiveness in reasoning about the relevance between text and images. In this work, we study how to further improve the quality of such reasoning, which remains an open challenge. Our baseline is the Recursive Visual Attention (RVA) model, which refines vision-text attention by iteratively revisiting the dialog history. Building on top of it, we propose to improve the attention mechanism with contrastive learning: we train a Matching-Aware Attention Kernel (MAAK) by aligning the deep feature embeddings of an image and its caption, which provides better attention scores. Experiments show consistent improvements from MAAK. In addition, we study the effect of Multimodal Compact Bilinear (MCB) pooling as a three-way fusion of the visual, textual, and dialog-history embeddings. We analyze the performance of both methods in the discussion section and propose further ideas for resolving their current limitations.
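To make the two proposed components concrete, here is a minimal PyTorch sketch, not the authors' implementation: an InfoNCE-style contrastive loss as one plausible way to train the Matching-Aware Attention Kernel by aligning image and caption embeddings, and count-sketch-based MCB pooling following Fukui et al. (2016). All function names, dimensions, and the nested pairwise scheme for three-way fusion are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def maak_alignment_loss(img_emb, cap_emb, temperature=0.07):
    """InfoNCE-style contrastive loss: pull each image embedding toward
    its own caption embedding and away from the other captions in the
    batch. One plausible objective for a matching-aware attention
    kernel; the paper's exact loss may differ."""
    img = F.normalize(img_emb, dim=-1)    # (B, D) unit-norm image features
    cap = F.normalize(cap_emb, dim=-1)    # (B, D) unit-norm caption features
    logits = img @ cap.t() / temperature  # (B, B) cosine-similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric: image-to-caption and caption-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def mcb_pool(x, y, output_dim=8000, seed=0):
    """Multimodal Compact Bilinear pooling (Fukui et al., 2016):
    approximate the outer product of x and y by count-sketch
    projections multiplied element-wise in the frequency domain."""
    gen = torch.Generator().manual_seed(seed)

    def count_sketch(v):
        d_in = v.size(1)
        h = torch.randint(0, output_dim, (d_in,), generator=gen)  # hash buckets
        s = torch.randint(0, 2, (d_in,), generator=gen) * 2 - 1   # random +/-1 signs
        sketch = v.new_zeros(v.size(0), output_dim)
        sketch.index_add_(1, h, v * s)  # scatter-add signed features into buckets
        return sketch

    fx = torch.fft.rfft(count_sketch(x))
    fy = torch.fft.rfft(count_sketch(y))
    # Element-wise product in the frequency domain is a circular
    # convolution of the sketches, i.e. a sketch of the outer product.
    return torch.fft.irfft(fx * fy, n=output_dim)

# Toy usage: fuse visual, textual, and dialog-history features pairwise.
v, t, hist = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
fused = mcb_pool(mcb_pool(v, t, output_dim=4000), hist, output_dim=4000)
loss = maak_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```

In a real model the count-sketch hashes would be sampled once and frozen as module buffers rather than re-seeded per call as in this sketch, and the fused vector would typically be signed-square-root and L2 normalized before use.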

Keywords

Multi-modal machine learning / Visual dialog / Coreference resolution

Cite this article

Tianwei Yue, Wenping Wang, Chen Liang, Dachi Chen, Congrui Hetang, Xuewei Wang. Coreference resolution helps visual dialogs to focus. High-Confidence Computing, 2024, 4(2): 100184. DOI: 10.1016/j.hcc.2023.100184

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

