CMMF-Net: a generative network based on CLIP-guided multi-modal feature fusion for thermal infrared image colorization

Qian Jiang, Tao Zhou, Youwei He, Wenjun Ma, Jingyu Hou, Ahmad Shahrizan Abdul Ghani, Shengfa Miao, Xin Jin

Intelligence & Robotics ›› 2025, Vol. 5 ›› Issue (1): 34-49. DOI: 10.20517/ir.2025.03

Research Article

Abstract

Thermal infrared (TIR) images are unaffected by variations in light and atmospheric conditions, which makes them widely used in nocturnal traffic scenarios. However, they still suffer from low contrast and an absence of chromatic information. Image colorization is therefore a pivotal technique for improving the fidelity of TIR images, facilitating both human interpretation and downstream analytical tasks. Because the features of TIR images are blurred and intricate, it is difficult for a network to extract and process their feature information accurately from images alone. Hence, we propose a multi-modal model that integrates text features describing TIR images with image features to jointly perform TIR image colorization. A vision transformer (ViT) model is employed to extract features from the original TIR images. Concurrently, we manually observe and summarize textual descriptions of the images and feed these descriptions into a pretrained contrastive language-image pretraining (CLIP) model to capture text-based features. The two sets of features are then fed into a cross-modal interaction (CI) module to establish the relationship between text and image. Subsequently, the text-enhanced image features are processed by a U-Net network to generate the final colorized images. Additionally, we utilize a comprehensive loss function to ensure that the network generates high-quality colorized images. The effectiveness of the proposed method is evaluated on the KAIST dataset. The experimental results demonstrate the superior performance of our CMMF-Net in comparison with other methods for TIR image colorization.
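To make the pipeline in the abstract concrete, the following is a minimal structural sketch of a CLIP-guided multi-modal colorization network: a ViT-style encoder for the TIR image, a cross-modal interaction step in which image tokens attend to a frozen CLIP text embedding, and a lightweight U-Net-style decoder that produces the RGB output. All module names, layer sizes, and the cross-attention formulation here are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the CMMF-Net pipeline (assumed shapes and modules).
import torch
import torch.nn as nn


class CrossModalInteraction(nn.Module):
    """Image patch tokens attend to the CLIP text embedding (assumed design)."""

    def __init__(self, dim=256, text_dim=512, heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, text_emb):
        # img_tokens: (B, N, dim); text_emb: (B, text_dim) from a frozen CLIP text encoder
        text_tok = self.text_proj(text_emb).unsqueeze(1)        # (B, 1, dim)
        fused, _ = self.attn(img_tokens, text_tok, text_tok)    # query = image tokens
        return self.norm(img_tokens + fused)                    # text-enhanced tokens


class CMMFNetSketch(nn.Module):
    """ViT-style encoder -> cross-modal interaction -> U-Net-style decoder (stand-in)."""

    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.ci = CrossModalInteraction(dim=dim)
        self.decoder = nn.Sequential(                           # simplified decoder
            nn.ConvTranspose2d(dim, 128, 4, stride=4), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=4), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),          # 3-channel color output
        )

    def forward(self, tir, text_emb):
        # tir: (B, 1, H, W) thermal image; text_emb: (B, 512) CLIP text feature
        feat = self.patch_embed(tir)                            # (B, dim, H/16, W/16)
        b, c, h, w = feat.shape
        tokens = self.encoder(feat.flatten(2).transpose(1, 2))  # (B, N, dim)
        tokens = self.ci(tokens, text_emb)
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.decoder(feat)                               # colorized (B, 3, H, W)


if __name__ == "__main__":
    model = CMMFNetSketch()
    tir = torch.randn(2, 1, 256, 256)
    text_emb = torch.randn(2, 512)        # in practice: output of CLIP's text encoder
    print(model(tir, text_emb).shape)     # torch.Size([2, 3, 256, 256])
```

In this sketch the text embedding acts as a single key/value token for cross-attention, which is one common way to inject language guidance into image tokens; the paper's CI module and U-Net decoder may differ in detail.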

Keywords

Thermal infrared image colorization / vision and language / transformer

Cite this article

Qian Jiang, Tao Zhou, Youwei He, Wenjun Ma, Jingyu Hou, Ahmad Shahrizan Abdul Ghani, Shengfa Miao, Xin Jin. CMMF-Net: a generative network based on CLIP-guided multi-modal feature fusion for thermal infrared image colorization. Intelligence & Robotics, 2025, 5(1): 34-49. DOI: 10.20517/ir.2025.03
