Semantic and lexical analysis of pre-trained vision language artificial intelligence models for automated image descriptions in civil engineering
Pedram Bazrafshan, Kris Melag, Arvin Ebrahimkhanlou
This paper investigates the application of pre-trained Vision-Language Models (VLMs) for describing images of civil engineering materials and construction sites, with a focus on construction components, structural elements, and materials. The novelty of this study lies in investigating VLMs for this specialized domain, which has not been previously addressed. As a case study, the paper evaluates the ability of ChatGPT-4V to serve as a description tool by comparing its output against descriptions written by three humans: a civil engineer and two engineering interns. The contributions of this work include adapting a pre-trained VLM to civil engineering applications without additional fine-tuning and benchmarking its performance using both semantic similarity analysis (SentenceTransformers) and lexical similarity methods. Using two datasets, one from a publicly available online repository and one collected manually by the authors, the study employs whole-text and sentence pair-wise similarity analyses to assess the model's alignment with the human descriptions. Results show that the best-performing model achieved an average similarity of 76% (4% standard deviation) against the human-generated descriptions. The analysis also reveals better performance on the publicly available dataset.
Vision language models / Artificial intelligence / Image description / Pre-trained transformers / Civil engineering / Digital twin
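The semantic comparison described in the abstract reduces to computing cosine similarity between text embeddings of a model-generated description and a human one. A minimal sketch of that metric, using toy bag-of-words vectors in place of SentenceTransformers dense embeddings (the example descriptions are illustrative assumptions, not the paper's data):

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding": a sparse term-count vector.
    # The study uses SentenceTransformers dense embeddings instead.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # Cosine of the angle between two sparse vectors:
    # dot(a, b) / (|a| * |b|); 0.0 if either vector is empty.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical model vs. human descriptions of the same site photo.
model_desc = "a reinforced concrete column with exposed rebar"
human_desc = "concrete column showing exposed steel rebar"
score = cosine_similarity(embed(model_desc), embed(human_desc))
```

Whole-text analysis applies this once per image; the pair-wise variant scores every model sentence against every human sentence and aggregates.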