CLIP-IML: A novel approach for CLIP-based image manipulation localization
Xue-Yang Hou, Yilihamu Yaermaimaiti, Shuo-Qi Cheng
Journal of Electronic Science and Technology, 2025, Vol. 23, Issue 3: 100325
Existing image manipulation localization (IML) techniques require large, densely annotated sets of forged images. This requirement greatly increases labeling costs and limits a model's ability to handle manipulation types that are novel or absent from the training data. To address these issues, we present CLIP-IML, an IML framework built on contrastive language-image pre-training (CLIP). A lightweight feature-reconstruction module transforms CLIP token sequences into spatial tensors, after which a compact feature-pyramid network and a multi-scale fusion decoder jointly capture information from fine to coarse levels. We evaluated CLIP-IML on ten public datasets covering copy-move, splicing, removal, and artificial-intelligence (AI)-generated forgeries. The framework raises the average F1-score by 7.85% relative to the strongest recent baselines and ranks first or second on every dataset. Ablation studies show that CLIP pre-training, higher-resolution inputs, and the multi-scale decoder each make complementary contributions. Under six common post-processing perturbations, as well as the compression pipelines used by Facebook, Weibo, and WeChat, the performance drop never exceeds 2.2%, confirming strong practical robustness. Moreover, CLIP-IML requires only a few thousand annotated images for training, which markedly reduces data-collection and labeling effort compared with previous methods. Together, these results indicate that CLIP-IML generalizes well across a wide range of tampering scenarios.
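To make the pipeline described above concrete, the sketch below illustrates the general idea under stated assumptions: reshaping a CLIP Vision Transformer token sequence into a spatial feature map, building a small feature pyramid from it, and decoding a per-pixel tampering mask with multi-scale fusion. The module names, channel widths, patch grid, and pyramid construction here are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch (not the paper's implementation): CLIP ViT tokens ->
# spatial feature reconstruction -> compact FPN -> multi-scale fusion decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureReconstruction(nn.Module):
    """Reshape a CLIP ViT token sequence (B, 1+N, D) into a spatial tensor (B, C, H, W)."""

    def __init__(self, embed_dim: int = 768, out_channels: int = 256):
        super().__init__()
        self.proj = nn.Conv2d(embed_dim, out_channels, kernel_size=1)

    def forward(self, tokens: torch.Tensor, grid_hw: tuple) -> torch.Tensor:
        h, w = grid_hw
        patch_tokens = tokens[:, 1:, :]                      # drop the [CLS] token
        b, n, d = patch_tokens.shape
        assert n == h * w, "token count must match the patch grid"
        x = patch_tokens.transpose(1, 2).reshape(b, d, h, w)
        return self.proj(x)


class CompactFPN(nn.Module):
    """Build a small 3-level pyramid from the single-scale ViT feature map (assumed design)."""

    def __init__(self, channels: int = 256):
        super().__init__()
        self.up = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)      # finer level
        self.keep = nn.Conv2d(channels, channels, kernel_size=3, padding=1)            # same level
        self.down = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)  # coarser level

    def forward(self, x: torch.Tensor):
        return [self.up(x), self.keep(x), self.down(x)]


class MultiScaleFusionDecoder(nn.Module):
    """Upsample every pyramid level to the finest resolution, fuse, and predict a mask."""

    def __init__(self, channels: int = 256, num_levels: int = 3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(channels * num_levels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),           # 1-channel tampering logit map
        )

    def forward(self, feats, out_hw) -> torch.Tensor:
        target = feats[0].shape[-2:]
        aligned = [F.interpolate(f, size=target, mode="bilinear", align_corners=False) for f in feats]
        logits = self.fuse(torch.cat(aligned, dim=1))
        return F.interpolate(logits, size=out_hw, mode="bilinear", align_corners=False)


if __name__ == "__main__":
    # Example: a 352x352 input with 16x16 patches gives a 22x22 grid (484 patch tokens + 1 CLS).
    tokens = torch.randn(2, 1 + 22 * 22, 768)                # stand-in for frozen CLIP ViT-B/16 output
    spatial = FeatureReconstruction()(tokens, grid_hw=(22, 22))          # (2, 256, 22, 22)
    mask_logits = MultiScaleFusionDecoder()(CompactFPN()(spatial), out_hw=(352, 352))
    print(mask_logits.shape)                                  # torch.Size([2, 1, 352, 352])
```

In this sketch the CLIP backbone is kept frozen and only the reconstruction, pyramid, and decoder modules would be trained, which is consistent with the abstract's point that only a few thousand annotated images are needed; the exact training setup is not specified here.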
Image manipulation localization / Multi-scale feature / Pre-trained model / Vision-language model / Vision Transformer