Hacking reference-free image captioning metrics

Zheng MA , Chang-Xin WANG , Ya-Wen OUYANG , Fei ZHAO , Jian-Bing ZHANG , Shu-Jian HUANG , Jia-Jun CHEN

Front. Comput. Sci. ›› 2026, Vol. 20 ›› Issue (8) : 2008343 DOI: 10.1007/s11704-025-50178-6
Artificial Intelligence
RESEARCH ARTICLE

Abstract

Assessing the alignment between textual descriptions and their corresponding images is fundamental to multi-modal research. In recent years, reference-free metrics built on visual-language pre-trained models (VLMs) have been widely adopted, and empirical evidence shows that they correlate more closely with human judgment, marking notable progress in the field. However, because the judgment mechanisms inside VLMs are not well understood, metrics built on them may harbor unidentified flaws. To uncover such issues, we employ a reinforcement learning approach to hack these metrics, optimizing the captioning model to generate sentences that score as highly as possible under each metric. If a metric is flawed, those flaws will surface in the generated sentences. In our hacking experiments, the generated sentences achieve higher metric scores yet become unreadable; this inconsistency exposes the inherent flaws of the metrics themselves. To address these issues, we propose a simple but effective remedy, Negative Text Contrastive Learning (NTCL), which introduces such flawed sentences as negative samples during contrastive learning. Using GPT-4V as an evaluation tool to analyze the generated sentences, we show that the NTCL-trained metric is more robust and achieves state-of-the-art performance. We hope our findings raise the community's awareness of hacking attacks on reference-free image captioning metrics and pave the way for the design of more robust metrics.
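To make the two ideas above concrete, the sketch below (not the authors' released code) illustrates, under stated assumptions, how a reference-free metric such as CLIPScore can serve as a reinforcement-learning reward that "hacks" a captioner, and how an NTCL-style contrastive loss can treat flawed captions as extra negatives. The CLIP checkpoint name, the captioner.sample()/captioner.greedy() interface, and the source of the flawed captions are illustrative assumptions, not details from the paper.

# A minimal, illustrative sketch (not the authors' code) of (a) using a
# reference-free metric as an RL reward to "hack" a captioning model, and
# (b) an NTCL-style contrastive loss with flawed captions as negatives.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clipscore(images, captions):
    # Reference-free CLIPScore: 2.5 * max(cosine(image, text), 0).
    inp = proc(text=captions, images=images, return_tensors="pt",
               padding=True, truncation=True).to(device)
    img = F.normalize(clip.get_image_features(pixel_values=inp["pixel_values"]), dim=-1)
    txt = F.normalize(clip.get_text_features(input_ids=inp["input_ids"],
                                             attention_mask=inp["attention_mask"]), dim=-1)
    return 2.5 * (img * txt).sum(-1).clamp(min=0)

def metric_hacking_loss(captioner, images):
    # Self-critical REINFORCE step whose only training signal is the metric
    # score, so any blind spot of the metric gets rewarded
    # (captioner.sample/greedy is a hypothetical interface).
    sampled, logprobs = captioner.sample(images)   # stochastic captions, summed log-probs
    greedy = captioner.greedy(images)              # greedy baseline captions
    advantage = clipscore(images, sampled) - clipscore(images, greedy)
    return -(advantage.detach() * logprobs).mean()

def ntcl_loss(img_emb, pos_emb, flawed_emb, tau=0.07):
    # InfoNCE over [matching captions | flawed captions]: each image must rank its
    # own caption above in-batch negatives AND the deliberately flawed sentences.
    img_emb, pos_emb, flawed_emb = (F.normalize(x, dim=-1)
                                    for x in (img_emb, pos_emb, flawed_emb))
    logits = img_emb @ torch.cat([pos_emb, flawed_emb]).t() / tau   # [B, 2B]
    targets = torch.arange(img_emb.size(0), device=img_emb.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)

In the spirit of the paper, the unreadable captions produced when training against metric_hacking_loss are exactly the kind of flawed sentences that ntcl_loss then feeds back as negatives to harden the metric.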

Keywords

image captioning / reference-free metric / visual-language pre-trained model / reinforcement learning

Cite this article

Zheng MA, Chang-Xin WANG, Ya-Wen OUYANG, Fei ZHAO, Jian-Bing ZHANG, Shu-Jian HUANG, Jia-Jun CHEN. Hacking reference-free image captioning metrics. Front. Comput. Sci., 2026, 20(8): 2008343. DOI: 10.1007/s11704-025-50178-6

RIGHTS & PERMISSIONS

Higher Education Press
