VAGen: waterbody segmentation with prompting for visual in-context learning
Jiapei Zhao, Nobuyoshi Yabuki, Tomohiro Fukuda
AI in Civil Engineering, 2024, Vol. 3, Issue 1: 24
Effective water management and flood prevention are critical challenges for both urban and rural areas, requiring precise and timely monitoring of waterbodies. As a fundamental step in the monitoring process, waterbody segmentation delineates waterbody boundaries from imagery. Previous research based on satellite images often lacks the resolution and contextual detail needed for local-scale analysis. This study addresses these challenges by leveraging common natural images, which are more easily accessible and provide higher resolution and richer contextual information than satellite images. However, segmenting waterbodies from ordinary images faces several obstacles, including variations in lighting, occlusions from objects such as trees and buildings, and reflections on the water surface, all of which can mislead algorithms. The diverse shapes and textures of waterbodies, together with complex backgrounds, further complicate the task. While large-scale vision models pre-trained on large datasets are typically leveraged for their generalizability across downstream tasks, their application to waterbody segmentation from ground-level images remains underexplored. Hence, this research proposes the Visual Aquatic Generalist (VAGen), a lightweight model for waterbody segmentation inspired by visual In-Context Learning (ICL) and Visual Prompting (VP). VAGen refines large vision models by adding learnable perturbations that enhance the quality of prompts in ICL. In the experiments, VAGen increased the mean Intersection over Union (mIoU) by 22.38% compared to the baseline model without learnable prompts, and surpassed current state-of-the-art (SOTA) task-specific models for waterbody segmentation by 6.20%. The performance evaluation and analysis further showed that VAGen substantially reduces the number of trainable parameters and computational overhead, making it feasible to deploy on cost-limited devices such as unmanned aerial vehicles (UAVs) and mobile computing platforms. This study thereby contributes to the field of computer vision and offers practical solutions for engineering applications in urban flood monitoring, agricultural water resource management, and environmental conservation.
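The abstract describes VAGen's core mechanism only at a high level: a frozen large vision model performs in-context segmentation, and a small learnable perturbation is added to the visual prompt to improve its quality, with mIoU used for evaluation. Below is a minimal PyTorch sketch of that idea; the module names, the prompt-canvas layout, the frozen backbone interface, and the mIoU routine are illustrative assumptions, not the authors' actual implementation.

import torch
import torch.nn as nn

class VisualPromptPerturbation(nn.Module):
    # Learnable additive perturbation applied to the visual ICL prompt.
    def __init__(self, height: int = 448, width: int = 448, channels: int = 3):
        super().__init__()
        # Only this tensor is trained; the large vision backbone stays frozen,
        # which is what keeps the number of trainable parameters small.
        self.delta = nn.Parameter(torch.zeros(1, channels, height, width))

    def forward(self, prompt_canvas: torch.Tensor) -> torch.Tensor:
        # prompt_canvas: (B, C, H, W) canvas built from an in-context example
        # image and its mask; the perturbation broadcasts over the batch.
        return (prompt_canvas + self.delta).clamp(0.0, 1.0)

def segmentation_step(frozen_icl_model: nn.Module,
                      prompt_module: VisualPromptPerturbation,
                      prompt_canvas: torch.Tensor,
                      query_image: torch.Tensor) -> torch.Tensor:
    # Gradients flow through the frozen backbone back to the perturbation;
    # the backbone parameters themselves have requires_grad=False.
    enhanced_prompt = prompt_module(prompt_canvas)
    return frozen_icl_model(enhanced_prompt, query_image)

def mean_iou(pred_mask: torch.Tensor, gt_mask: torch.Tensor,
             eps: float = 1e-6) -> torch.Tensor:
    # Binary waterbody IoU averaged over the batch.
    pred, gt = pred_mask.bool(), gt_mask.bool()
    inter = (pred & gt).flatten(1).sum(dim=1).float()
    union = (pred | gt).flatten(1).sum(dim=1).float()
    return ((inter + eps) / (union + eps)).mean()

In such a setup only prompt_module.parameters() would be passed to the optimizer, which is consistent with the abstract's claims of low trainable-parameter counts and suitability for UAV and mobile deployment; the exact training recipe is not given in the abstract and would have to be taken from the paper.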