Cross-modal semantic retrieval for hydropower plant surveillance videos based on multi-level denoising

Xiaolian HU; Jiaqing TANG; Zhi YANG; Wen ZHOU; Kun HUANG; Liangliang CAO; Yijun MO; Hefei LING; Yuxuan SHI; Jianbo LI

doi:10.13928/j.cnki.wrahe.2025.11.014

Water Resources and Hydropower Engineering ›› 2025, Vol. 56 ›› Issue (11): 179-188. DOI: 10.13928/j.cnki.wrahe.2025.11.014

Abstract

[Objective] To apply cross-modal retrieval mechanisms to scenarios such as personnel security, facility protection, and equipment status monitoring in hydropower video surveillance systems, a multi-modal data mapping between texts and images is developed to enable flexible semantic content search through textual descriptions. [Methods] To address the slow inference speed of single-stream models and the lack of modal fusion in dual-stream models in existing cross-modal methods, a multi-level denoising multimodal fusion technique was proposed. Built on a dual-stream pre-trained model, this technique integrated masked language modeling with fine-grained cross-modal semantic alignment. A “noise addition followed by denoising” task was designed at multiple levels of the neural network to promote fine-grained interactions between texts and images. [Results] Extensive experiments under different settings showed that, compared with the fine-tuned CLIP baseline model, the R@1 recall rates for the image and text retrieval tasks increased by 4.1% and 2.7%, respectively, on the Flickr30K dataset, and by 4.3% and 3.2%, respectively, on the MS-COCO dataset. On a self-collected dataset of hydropower system surveillance scenarios, retrieval tests for personnel in dam areas, equipment operating status, and instrument anomalies were conducted and achieved satisfactory results. [Conclusion] The experiments verify the advantages of the multi-level denoising algorithm in cross-modal semantic retrieval tasks and demonstrate its applicability to hydropower plant surveillance video scenarios.
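
The R@1 figures above are recall-at-K scores for retrieval with a dual-stream model, where texts and images are embedded separately and ranked by similarity. The following is a minimal sketch of how such a score could be computed, not the authors' evaluation code. The function names (`l2_normalize`, `similarity_matrix`, `recall_at_k`), the toy embeddings, and the assumption that query i matches gallery item i one-to-one are all hypothetical; benchmarks such as Flickr30K pair each image with several captions, so a real evaluation needs an explicit ground-truth mapping.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def similarity_matrix(text_embs, image_embs):
    """Cosine similarity between every text query (rows) and every image (columns)."""
    texts = [l2_normalize(v) for v in text_embs]
    images = [l2_normalize(v) for v in image_embs]
    return [[sum(a * b for a, b in zip(t, i)) for i in images] for t in texts]


def recall_at_k(sim, k=1):
    """Fraction of queries whose matching item is ranked in the top k.

    Assumption (hypothetical): the correct item for query q is gallery item q.
    """
    hits = 0
    for q, row in enumerate(sim):
        # Rank gallery indices from most to least similar to this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if q in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy 2-D embeddings standing in for the outputs of the two encoders.
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]

sim = similarity_matrix(text_embs, image_embs)
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
r3 = recall_at_k(sim, k=3)
print(f"R@1={r1:.3f}  R@2={r2:.3f}  R@3={r3:.3f}")
```

Because the two encoders run independently in this setup, gallery embeddings can be computed once and reused for every query, which is the inference-speed advantage of dual-stream models that the abstract contrasts with single-stream models.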

Keywords

cross-modal retrieval / image-text retrieval / vision-language pre-training / contrastive learning / denoising

Cite this article

Xiaolian HU, Jiaqing TANG, Zhi YANG, Wen ZHOU, Kun HUANG, Liangliang CAO, Yijun MO, Hefei LING, Yuxuan SHI, Jianbo LI. Cross-modal semantic retrieval for hydropower plant surveillance videos based on multi-level denoising. Water Resources and Hydropower Engineering, 2025, 56(11): 179-188. DOI: 10.13928/j.cnki.wrahe.2025.11.014