Hacking reference-free image captioning metrics
Zheng MA, Chang-Xin WANG, Ya-Wen OUYANG, Fei ZHAO, Jian-Bing ZHANG, Shu-Jian HUANG, Jia-Jun CHEN
Front. Comput. Sci., 2026, Vol. 20, Issue (8): 2008343
Assessing the alignment between textual descriptions and their corresponding images is fundamental to multi-modal research. Recent years have seen a surge in reference-free methods that build on visual-language pre-trained models (VLMs). Empirical evidence shows that these methods correlate more closely with human judgment, a notable advance for the field. However, because the underlying judgment mechanisms of VLMs are not well understood, metrics built on them may harbor unidentified flaws. To uncover such issues, we employ a reinforcement learning approach to hack these metrics, guiding a captioning model to generate sentences that better satisfy the metric criteria; if a metric is flawed, the deficiencies surface in the generated sentences. In the hacking experiments, we observe that the generated sentences achieve higher metric scores yet become unreadable. These inconsistencies reflect inherent flaws in the metrics themselves. To address them, we propose a simple but effective approach, Negative Text Contrastive Learning (NTCL), which introduces flawed sentences as negative samples during contrastive learning. Using GPT-4V as an evaluation tool to analyze the generated sentences, we demonstrate that NTCL is more robust and achieves state-of-the-art performance. We hope our findings raise awareness in the community of the importance of hacking reference-free image captioning metrics and pave the way for the design of more robust metrics.
Keywords: image captioning / reference-free metric / visual-language pre-trained model / reinforcement learning
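The abstract describes "hacking" a reference-free metric by using it as a reinforcement-learning reward. The paper's exact training setup is not given here, so the following is a minimal sketch under stated assumptions: it computes a CLIPScore-style reward (2.5 · max(cos(image, text), 0), following Hessel et al.'s CLIPScore) with a Hugging Face CLIP model and plugs it into a self-critical, REINFORCE-style loss with a greedy-decoding baseline. The model checkpoint, function names, and baseline choice are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): CLIPScore-style reward for RL metric hacking.
# Assumes torch and transformers are installed; the checkpoint name is illustrative.
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clipscore_reward(images, captions):
    """Reference-free reward: 2.5 * max(cos(image, caption), 0) per image-caption pair."""
    inputs = proc(text=captions, images=images, return_tensors="pt",
                  padding=True, truncation=True).to(device)
    img = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                 attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img, txt, dim=-1)
    return 2.5 * cos.clamp(min=0.0)


def reinforce_loss(sample_logprobs, sample_reward, baseline_reward):
    """Self-critical policy gradient: reward sampled captions that beat the greedy baseline.

    sample_logprobs: (B,) summed log-probabilities of the sampled captions.
    sample_reward / baseline_reward: (B,) metric scores of sampled vs. greedy captions.
    """
    advantage = (sample_reward - baseline_reward).detach()
    return -(advantage * sample_logprobs).mean()
```

Maximizing this reward directly is what lets unreadable but high-scoring captions emerge, which is the failure mode the paper reports. NTCL itself is described only at a high level (flawed sentences serve as extra negative samples in contrastive learning), so the loss below is a hedged sketch rather than the authors' formulation: a standard InfoNCE-style objective over image-text pairs whose candidate set is augmented with embeddings of flawed (hacked) captions. The tensor shapes, temperature value, and the way negatives are gathered are assumptions.

```python
import torch
import torch.nn.functional as F


def ntcl_loss(img_emb, pos_txt_emb, flawed_txt_emb, temperature=0.07):
    """Contrastive loss with flawed captions as extra negatives (illustrative sketch).

    img_emb:        (B, D) image embeddings.
    pos_txt_emb:    (B, D) embeddings of the matching human captions.
    flawed_txt_emb: (B, K, D) embeddings of K flawed/hacked captions per image.
    """
    img = F.normalize(img_emb, dim=-1)
    pos = F.normalize(pos_txt_emb, dim=-1)
    neg = F.normalize(flawed_txt_emb, dim=-1)

    # In-batch logits: each image against every ground-truth caption in the batch.
    in_batch = img @ pos.t()                                    # (B, B)
    # Extra logits: each image against its own flawed captions.
    extra = torch.einsum("bd,bkd->bk", img, neg)                # (B, K)

    logits = torch.cat([in_batch, extra], dim=1) / temperature  # (B, B + K)
    targets = torch.arange(img.size(0), device=img.device)      # positives on the diagonal
    return F.cross_entropy(logits, targets)
```

In this formulation the flawed sentences compete with the true caption for each image, so a model trained this way is explicitly penalized for assigning high scores to unreadable, metric-hacking text.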
|