Cross-modal semantic retrieval for hydropower plant surveillance videos based on multi-level denoising
Xiaolian HU, Jiaqing TANG, Zhi YANG, Wen ZHOU, Kun HUANG, Liangliang CAO, Yijun MO, Hefei LING, Yuxuan SHI, Jianbo LI
Water Resources and Hydropower Engineering ›› 2025, Vol. 56 ›› Issue (11): 179-188.
[Objective] To apply cross-modal retrieval to scenarios such as personnel security, facility protection, and equipment status monitoring in hydropower video surveillance systems, a multi-modal mapping between texts and images is developed to enable flexible semantic content search through textual descriptions. [Methods] To address the slow inference speed of single-stream models and the lack of modal fusion in dual-stream models in existing cross-modal methods, a multi-level denoising multimodal fusion technique was proposed. Built on a dual-stream pre-trained model, the technique integrated masked language modeling with fine-grained cross-modal semantic alignment. A “noise addition followed by denoising” task was designed at multiple levels of the neural network to promote fine-grained interactions between texts and images. [Results] Extensive experiments under different settings showed that, compared with the fine-tuned CLIP baseline model, the R@1 recall rates for the image and text retrieval tasks increased by 4.1% and 2.7%, respectively, on the Flickr30K dataset. On the MS-COCO dataset, the recall rates increased by 4.3% and 3.2%, respectively. On a self-collected dataset of hydropower system surveillance scenarios, retrieval tests for personnel in dam areas, equipment operating status, and instrument anomalies were conducted and achieved satisfactory results. [Conclusion] The experiments verify the advantages of the multi-level denoising algorithm in cross-modal semantic retrieval tasks and demonstrate its applicability to hydropower plant surveillance video scenarios.
cross-modal retrieval / image-text retrieval / vision-language pre-training / contrastive learning / denoising
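The R@1 figures reported in the abstract are recall-at-K scores. As a minimal sketch of how such a score is typically computed for a dual-stream model, the Python function below ranks image embeddings by cosine similarity to each text embedding and checks whether the paired image appears in the top K. This is not the authors' evaluation code: the function name `recall_at_k`, the one-to-one pairing of row i in the text matrix with row i in the image matrix, and the toy embeddings are all assumptions made for illustration (benchmarks such as Flickr30K pair several captions with each image).

```python
import numpy as np


def recall_at_k(text_emb: np.ndarray, image_emb: np.ndarray, k: int = 1) -> float:
    """Fraction of text queries whose paired image ranks in the top k.

    Assumes text_emb[i] and image_emb[i] form the ground-truth pair
    (a simplification chosen for this sketch).
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)

    # Similarity matrix of shape (num_texts, num_images).
    sim = t @ v.T

    # Rank of the paired image = number of images scoring strictly higher.
    paired = np.diag(sim)[:, None]
    rank = (sim > paired).sum(axis=1)

    return float((rank < k).mean())


# Toy usage: two queries whose paired images are deliberately swapped,
# so neither query retrieves its pair at rank 1.
texts = np.array([[1.0, 0.0], [0.0, 1.0]])
images_swapped = np.array([[0.0, 1.0], [1.0, 0.0]])
r1_swapped = recall_at_k(texts, images_swapped, k=1)
r2_swapped = recall_at_k(texts, images_swapped, k=2)
r1_aligned = recall_at_k(texts, texts, k=1)
```

Swapping the roles of the two arguments gives the image-to-text direction, which is why results of this kind are usually reported as a pair of numbers.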