Retrieve-then-compare mitigates visual hallucination in multi-modal large language models

Dingchen Yang, Bowen Cao, Sanqing Qu, Fan Lu, Shangding Gu, Guang Chen

Intelligence & Robotics, 2025, Vol. 5, Issue 2: 248-75. DOI: 10.20517/ir.2025.13

Research Article

Abstract

Multi-modal large language models (MLLMs) demonstrate remarkable success in a range of vision-language tasks. However, they are prone to visual hallucinations, where their textual responses diverge from the provided image. Such inaccurate visual understanding poses risks to the practical deployment of MLLMs. Are MLLMs oblivious to accurate visual cues when they hallucinate? Our investigation indicates that the visual branch of MLLMs may support erroneous and accurate content almost equally, indicating a high level of uncertainty. To address this issue, we propose retrieval contrastive decoding (RCD), a training-free method that mitigates visual hallucinations by exploiting analogous hallucinations induced by images that share semantic and appearance characteristics with the test image. Specifically, RCD retrieves relevant images to serve as references for the MLLM and compares their visual content with the test image through confidence-score subtraction. Additionally, RCD coordinates the correction of hallucinations from both the visual and textual branches of MLLMs by adaptively scaling the subtracted scores. Experiments on public hallucination benchmarks demonstrate the efficacy of RCD in mitigating visual hallucinations for three state-of-the-art MLLMs, surpassing other advanced decoding strategies. Furthermore, we validate the effectiveness of RCD in enhancing the capability of MLLMs to comprehend complex and potentially hazardous situations in real-world traffic scenarios. By improving the accuracy of scene understanding and strengthening reasoning, RCD enhances the reliability of MLLMs in real-world applications.
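
The abstract describes RCD only at a high level. The minimal sketch below illustrates one decoding step of the retrieve-then-compare idea as stated here: logits conditioned on a retrieved reference image are subtracted from logits conditioned on the test image, with the subtraction scaled adaptively. The contrastive form, the entropy-based scaling rule, and the helper names (rcd_step, softmax) are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def rcd_step(logits_test, logits_ref, base_alpha=1.0):
    """One decoding step of retrieval contrastive decoding (illustrative sketch).

    logits_test: next-token logits conditioned on the test image.
    logits_ref:  next-token logits conditioned on a retrieved reference image
                 that shares semantics/appearance with the test image and is
                 therefore expected to induce analogous hallucinations.

    Assumption: the paper says the subtracted scores are scaled "adaptively";
    here we scale by the normalized entropy of the test-image distribution,
    so confident predictions are corrected less.
    """
    p_test = softmax(logits_test)
    entropy = -(p_test * np.log(p_test + 1e-12)).sum()
    alpha = base_alpha * entropy / np.log(len(p_test))  # in [0, base_alpha]
    contrasted = (1.0 + alpha) * logits_test - alpha * logits_ref
    return softmax(contrasted)

# Toy example: 5-token vocabulary where the hallucinated token (index 3)
# is promoted by both the test and the reference image; the subtraction
# shifts probability mass away from it.
logits_test = np.array([1.0, 0.2, 0.1, 1.1, 0.0])
logits_ref  = np.array([0.1, 0.1, 0.0, 1.2, 0.0])
print(rcd_step(logits_test, logits_ref))
```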

Keywords

Vision language model / visual hallucination / autonomous driving

Cite this article

Dingchen Yang, Bowen Cao, Sanqing Qu, Fan Lu, Shangding Gu, Guang Chen. Retrieve-then-compare mitigates visual hallucination in multi-modal large language models. Intelligence & Robotics, 2025, 5(2): 248-75. DOI: 10.20517/ir.2025.13
