Railway-CLIP: A multimodal model for abnormal object detection in high-speed railway

Jiayu Zhang, Qingji Guan, Junbo Liu, Yaping Huang, Jianyong Guo

High-speed Railway ›› 2025, Vol. 3 ›› Issue (3): 194-204. DOI: 10.1016/j.hspr.2025.06.001
Research article

Abstract

Automated, vision-based detection of anomalous objects suspended on high-speed railway catenary systems is a critical task for ensuring railway transportation safety. Despite its importance, conventional vision-based foreign object detection methods have concentrated predominantly on image data, neglecting the exploration and integration of textual information. The widely adopted multimodal model Contrastive Language-Image Pre-training (CLIP) employs contrastive learning to jointly understand the visual and textual modalities. Drawing inspiration from CLIP’s capabilities, this paper introduces a novel CLIP-based multimodal foreign object detection model tailored for railway applications, referred to as Railway-CLIP. The model leverages CLIP’s strong generalization ability to improve catenary foreign object detection. Railway-CLIP is composed primarily of an image encoder and a text encoder. First, the Segment Anything Model (SAM) is employed to preprocess raw images and identify candidate bounding boxes that may contain foreign objects. Both the original images and the candidate bounding boxes are then fed into the image encoder to extract their respective visual features. In parallel, distinct prompt templates are crafted for the original images and the candidate bounding boxes to serve as textual inputs; these prompts are processed by the text encoder to derive textual features. The two encoders project the multimodal features into a shared semantic space, where similarity scores between the visual and textual representations are computed, and the final detection results are determined from these scores, ensuring robust and accurate identification of anomalous objects. Extensive experiments on our collected Railway Anomaly Dataset (RAD) demonstrate that Railway-CLIP outperforms previous state-of-the-art methods, achieving 97.25% AUROC and 92.66% F1-score, validating the effectiveness and superiority of the proposed approach in real-world high-speed railway anomaly detection scenarios.
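To make the similarity-scoring step concrete, the sketch below illustrates how candidate crops and prompt templates could be scored with a frozen CLIP backbone. It is a minimal illustration under stated assumptions: it uses OpenAI's open-source clip package with a ViT-B/32 checkpoint, the prompt wording and decision threshold are hypothetical placeholders rather than the paper's actual templates, and the SAM preprocessing that produces the candidate boxes is assumed to have run separately.

# Minimal sketch of CLIP-style similarity scoring for candidate regions.
# Assumptions: OpenAI's `clip` package and PIL are available; candidate boxes
# come from a separate SAM preprocessing step (not shown); the prompts and
# the 0.5 threshold are illustrative, not the paper's exact templates.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical prompt templates for normal catenary scenes vs. foreign objects.
prompts = [
    "a photo of a normal high-speed railway catenary",
    "a photo of a foreign object suspended on a railway catenary",
]
text_tokens = clip.tokenize(prompts).to(device)

def score_regions(image_path, boxes):
    """Return a foreign-object probability for each candidate box (x1, y1, x2, y2)."""
    image = Image.open(image_path).convert("RGB")
    crops = [image.crop(box) for box in boxes]
    image_batch = torch.stack([preprocess(c) for c in crops]).to(device)
    with torch.no_grad():
        image_feats = model.encode_image(image_batch)
        text_feats = model.encode_text(text_tokens)
        # Project both modalities onto the unit sphere of the shared semantic space.
        image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        # Softmax over the two prompts; column 1 is the "foreign object" probability.
        probs = (100.0 * image_feats @ text_feats.T).softmax(dim=-1)
    return probs[:, 1].cpu().tolist()

# Usage example (hypothetical path and box): flag crops above an illustrative threshold.
# scores = score_regions("catenary.jpg", [(120, 40, 360, 280)])
# anomalies = [s > 0.5 for s in scores]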

Keywords

High-speed railway catenary systems / Anomalous object detection / Multimodal model / Railway-CLIP

Cite this article

Jiayu Zhang, Qingji Guan, Junbo Liu, Yaping Huang, Jianyong Guo. Railway-CLIP: A multimodal model for abnormal object detection in high-speed railway. High-speed Railway, 2025, 3(3): 194-204. DOI: 10.1016/j.hspr.2025.06.001

CRediT authorship contribution statement

Jiayu Zhang: Writing – original draft, Visualization, Validation, Methodology. Qingji Guan: Writing – review & editing, Methodology. Junbo Liu: Data curation. Yaping Huang: Writing – review & editing, Methodology. Jianyong Guo: Methodology.

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Junbo Liu is currently employed by China Academy of Railway Science Corporation Limited.

Acknowledgement

This work was supported by the Technology Research and Development Program of China National Railway Group (Q2024T002) and the Open Project Fund of the National Engineering Research Center of Digital Construction and Evaluation Technology of Urban Rail Transit (2024023).
