CLIP-IML: A novel approach for CLIP-based image manipulation localization

Xue-Yang Hou, Yilihamu Yaermaimaiti, Shuo-Qi Cheng

Journal of Electronic Science and Technology ›› 2025, Vol. 23 ›› Issue (3) : 100325

DOI: 10.1016/j.jnlest.2025.100325
Research article

Abstract

Existing image manipulation localization (IML) techniques require large, densely annotated sets of forged images. This requirement greatly increases labeling costs and limits a model's ability to handle manipulation types that are novel or absent from the training data. To address these issues, we present CLIP-IML, an IML framework that leverages contrastive language-image pre-training (CLIP). A lightweight feature-reconstruction module transforms CLIP token sequences into spatial tensors, after which a compact feature-pyramid network and a multi-scale fusion decoder jointly capture information from fine to coarse levels. We evaluated CLIP-IML on ten public datasets covering copy-move, splicing, removal, and artificial intelligence (AI)-generated forgeries. The framework raises the average F1-score by 7.85% relative to the strongest recent baselines and ranks first or second on every dataset. Ablation studies show that CLIP pre-training, higher-resolution inputs, and the multi-scale decoder each make complementary contributions. Under six common post-processing perturbations, as well as the compression pipelines used by Facebook, Weibo, and WeChat, performance never declines by more than 2.2%, confirming strong practical robustness. Moreover, CLIP-IML requires only a few thousand annotated images for training, which markedly reduces data-collection and labeling effort compared with previous methods. These results indicate that CLIP-IML generalizes well for image tampering localization across a wide range of tampering scenarios.
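The core idea of the feature-reconstruction step described above can be illustrated with a minimal sketch. The paper's implementation details are not given here, so the function name `tokens_to_spatial` and the assumption of a CLIP ViT-B/16 backbone at 224-pixel input (yielding 1 class token plus 14×14 patch tokens of dimension 768) are illustrative assumptions, not the authors' code:

```python
import numpy as np

def tokens_to_spatial(tokens):
    """Hypothetical sketch of feature reconstruction: reshape a CLIP ViT
    token sequence of shape (B, 1+N, C) back into a spatial tensor
    (B, C, H, W) that a convolutional pyramid/decoder can consume."""
    b, n, c = tokens.shape
    patch = tokens[:, 1:, :]                 # drop the [CLS] token
    h = w = int(round((n - 1) ** 0.5))       # e.g. 14x14 for ViT-B/16 @ 224 px
    # (B, N, C) -> (B, C, N) -> (B, C, H, W)
    return patch.transpose(0, 2, 1).reshape(b, c, h, w)

# Toy usage: 197 tokens = 1 [CLS] + 14*14 patches, embedding dim 768
tokens = np.random.randn(2, 197, 768)
feat = tokens_to_spatial(tokens)
print(feat.shape)  # (2, 768, 14, 14)
```

Once the tokens are in this spatial layout, standard multi-scale machinery (a feature-pyramid network followed by a fusion decoder, as the abstract describes) can operate on them like on any convolutional feature map.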

Keywords

Image manipulation localization / Multi-scale feature / Pre-trained model / Vision-language model / Vision Transformer

Cite this article

Download citation ▾
Xue-Yang Hou, Yilihamu Yaermaimaiti, Shuo-Qi Cheng. CLIP-IML: A novel approach for CLIP-based image manipulation localization. Journal of Electronic Science and Technology, 2025, 23(3): 100325. DOI: 10.1016/j.jnlest.2025.100325

CRediT authorship contribution statement

Xue-Yang Hou: Conceptualization, Investigation, Methodology, Software, Visualization, Writing—original draft. Yilihamu Yaermaimaiti: Funding acquisition, Project administration, Supervision, Writing—review & editing. Shuo-Qi Cheng: Data curation, Formal analysis, Validation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the Natural Science Foundation of Xinjiang Uygur Autonomous Region under Grant No. 2023D01C21 and the National Natural Science Foundation of China under Grant No. 62362063.

