Context-Aware Visual Entailment Driven by Specific Instructions

Yufeng HAN, Kuangrong HAO, Xuesong TANG, Bing WEI

Journal of Donghua University (English Edition) ›› 2025, Vol. 42 ›› Issue (2): 177-186. DOI: 10.19884/j.1672-5220.202403004
Information Technology and Artificial Intelligence

Abstract

Visual entailment (VE) is a prototypical task in multimodal visual reasoning, where current methods frequently use large language models (LLMs) as a knowledge base to assist in answering questions. These methods rely heavily on the textual modality, which inherently cannot capture the full extent of the information contained in images. We propose a context-aware visual entailment (CAVE) model, which introduces a novel aggregation module designed to extract high-level semantic features from images. This module integrates lower-level semantic image features into high-level visual tokens, formatting them similarly to text tokens so that they can serve as inputs to LLMs. The CAVE model thereby compensates for the loss of image information and integrates it more effectively with textual comprehension. Additionally, the CAVE model adopts a new input format and training methodology rooted in instruction tuning and in-context learning, with the aim of maximizing the inherent logical reasoning capabilities of LLMs. Experimental results on the E-SNLI-VE dataset show that the proposed CAVE model achieves outstanding performance.
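The abstract describes the aggregation module only at a high level, so the following is a minimal, hypothetical PyTorch sketch of what such a module could look like: a small set of learnable query tokens cross-attends to patch-level features from a frozen vision encoder and is then projected into the LLM's embedding space, yielding "visual tokens" that can be concatenated with text token embeddings. The class name, dimensions, and query-token design are illustrative assumptions, not the CAVE authors' implementation.

```python
# Illustrative sketch only: one possible aggregation module that compresses
# low-level patch features into a few "visual tokens" shaped like text tokens.
# The query-token / cross-attention design and all dimensions are assumptions.
import torch
import torch.nn as nn


class VisualTokenAggregator(nn.Module):
    def __init__(self, patch_dim=1024, llm_dim=4096, num_tokens=32, num_heads=8):
        super().__init__()
        # Learnable query tokens that will absorb image information.
        self.queries = nn.Parameter(torch.randn(num_tokens, patch_dim) * 0.02)
        # Cross-attention: queries attend to the image patch features.
        self.cross_attn = nn.MultiheadAttention(patch_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(patch_dim)
        # Project aggregated tokens into the LLM's embedding space so they can
        # be concatenated with text token embeddings.
        self.proj = nn.Linear(patch_dim, llm_dim)

    def forward(self, patch_feats):
        # patch_feats: (batch, num_patches, patch_dim) from a frozen vision encoder
        b = patch_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        attended, _ = self.cross_attn(q, patch_feats, patch_feats)
        tokens = self.norm(q + attended)
        return self.proj(tokens)  # (batch, num_tokens, llm_dim)


if __name__ == "__main__":
    agg = VisualTokenAggregator()
    fake_patches = torch.randn(2, 257, 1024)   # e.g. a ViT patch grid plus CLS token
    visual_tokens = agg(fake_patches)
    print(visual_tokens.shape)                 # torch.Size([2, 32, 4096])
```

Conceptually this mirrors the query-token resamplers used in several recent vision-language models; the key property the abstract calls for is that dense, low-level image features end up as a short token sequence in the same space as the LLM's text embeddings.

The abstract also mentions a new input format grounded in instruction tuning and in-context learning but does not show the template itself. A purely illustrative format, again an assumption rather than the CAVE model's actual input, might place the projected visual tokens alongside an instruction and a worked in-context example before the query hypothesis:

```python
# Purely illustrative prompt template; the placeholder tags, wording, and the
# in-context example are assumptions, not the exact format used by CAVE.
INSTRUCTION = (
    "You are given an image and a hypothesis. Decide whether the image "
    "entails, contradicts, or is neutral to the hypothesis, then explain why."
)

IN_CONTEXT_EXAMPLE = (
    "<image_tokens>\n"
    "Hypothesis: Two dogs are playing in the snow.\n"
    "Answer: entailment. The image shows two dogs running through snow."
)

def build_prompt(hypothesis: str) -> str:
    # The <image_tokens> placeholder marks where the projected visual tokens
    # would be spliced into the LLM's input embedding sequence.
    return (
        f"{INSTRUCTION}\n\n"
        f"{IN_CONTEXT_EXAMPLE}\n\n"
        f"<image_tokens>\n"
        f"Hypothesis: {hypothesis}\n"
        f"Answer:"
    )

print(build_prompt("A man is riding a horse on the beach."))
```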

Keywords

visual entailment (VE) / textual-visual integration / instruction tuning / in-context learning

Cite this article

Yufeng HAN, Kuangrong HAO, Xuesong TANG, Bing WEI. Context-Aware Visual Entailment Driven by Specific Instructions. Journal of Donghua University (English Edition), 2025, 42(2): 177-186. DOI: 10.19884/j.1672-5220.202403004

Funding

Fundamental Research Funds for the Central Universities, China (2232021A-10)

Shanghai Pujiang Program, China (22PJ1423400)
