MAU-Depth: a multi-attention-based underwater lightweight self-supervised monocular depth estimation method
Peng Yao, Yalu Wang, Dongdong Yang, Qiming Liu, Jiatao Yu
Intelligent Marine Technology and Systems, 2025, Vol. 3, Issue 1: 30
Accurate depth estimation is essential for unmanned underwater vehicles to perceive their environment effectively during target tracking tasks. We therefore propose a self-supervised monocular depth estimation framework tailored to underwater scenes, incorporating multi-attention mechanisms and the distinctive optical characteristics of underwater imagery. To address color distortion in underwater images, caused primarily by wavelength-dependent light attenuation, we design an adaptive underwater light attenuation loss function that improves the model’s adaptability and generalization across diverse underwater scenes. The inherent blurriness of underwater images poses considerable challenges for feature extraction and semantic interpretation, so we combine dilated convolutions with linear spatial reduction attention (CDC Joint Linear SRA) to capture both local and global features of underwater images, which are then integrated through feature map fusion. A multi-attention feature enhancement module subsequently enriches the spatial and semantic information of the extracted features. To mitigate fusion interference arising from semantic discrepancies between feature maps, we introduce a progressive fusion module that balances cross-module features via a two-step feature refinement strategy. Comparative, ablation, and generalization experiments on the FLSea dataset verify the superiority of the proposed model.
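To make the role of the light-attenuation prior concrete, the sketch below shows the Beer-Lambert attenuation model that losses of this kind typically build on. The per-channel coefficients and the simple L1 consistency term are illustrative assumptions, not the paper's actual loss (the adaptive loss and backscatter handling are omitted); red light attenuates fastest in water, which is what drives the color distortion the abstract mentions.

```python
import numpy as np

# Hypothetical per-channel attenuation coefficients in 1/m (R, G, B).
# Illustrative placeholders only, not values from the paper.
BETA = np.array([0.60, 0.10, 0.08])

def transmission(depth_m, beta=BETA):
    """Per-channel Beer-Lambert transmission t_c = exp(-beta_c * d).

    depth_m: array of shape (N,) in metres; returns shape (N, 3).
    """
    return np.exp(-np.outer(depth_m, beta))

def attenuation_loss(observed, restored, depth_m, beta=BETA):
    """Toy L1 consistency term: the observed colour should equal the
    restored (attenuation-free) colour scaled by the depth-dependent
    transmission. Backscatter is ignored for brevity."""
    t = transmission(depth_m, beta)  # (N, 3)
    return float(np.mean(np.abs(observed - restored * t)))
```

A loss of this form couples the predicted depth map to the observed colors: if the predicted depth is too small or too large, the implied transmission no longer explains the per-channel dimming, and the penalty grows.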
Keywords: Monocular depth estimation / Underwater depth estimation / Self-supervised learning / Underwater vision
[1] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai XH, Unterthiner T et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. Preprint at arXiv:2010.11929
[2] Amitai S, Klein I, Treibitz T (2023) Self-supervised monocular depth underwater. In: IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp 1098–1104. https://doi.org/10.1109/ICRA48891.2023.10161161
[3]
[4]
[5]
[6] Eigen D, Puhrsch C, Fergus R (2014) Depth map prediction from a single image using a multi-scale deep network. In: 28th Conference on Neural Information Processing Systems (NIPS). NIPS, pp 1–9
[7] Godard C, Mac Aodha O, Firman M, Brostow GJ (2019) Digging into self-supervised monocular depth estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, pp 3827–3837. https://doi.org/10.1109/ICCV.2019.00393
[8] Guizilini V, Ambrus R, Pillai S, Raventos A, Gaidon A (2020) 3D packing for self-supervised monocular depth estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, pp 2482–2491. https://doi.org/10.1109/CVPR42600.2020.00256
[9] Gupta H, Mitra K (2019) Unsupervised single image underwater depth estimation. In: 2019 IEEE International Conference on Image Processing (ICIP). IEEE, pp 624–628. https://doi.org/10.1109/icip.2019.8804200
[10]
[11] He M, Hui L, Bian YK, Ren J, Xie J, Yang J (2022) RA-Depth: resolution adaptive self-supervised monocular depth estimation. In: 17th European Conference on Computer Vision. Springer, Cham, pp 565–581. https://doi.org/10.1007/978-3-031-19812-0_33
[12] Hinton G, Vinyals O, Dean J (2015) Distilling the knowledge in a neural network. Preprint at arXiv:1503.02531
[13] Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp 7132–7141. https://doi.org/10.1109/CVPR.2018.00745
[14]
[15]
[16]
[17] Lyu XY, Liu L, Wang MM, Kong X, Liu LN, Liu Y et al (2021) HR-Depth: high resolution self-supervised monocular depth estimation. In: Proceedings of the AAAI Conference on Artificial Intelligence. AAAI, pp 2294–2301. https://doi.org/10.1609/aaai.v35i3.16329
[18] Peng YT, Zhao XY, Cosman PC (2015) Single underwater image enhancement using depth estimation based on blurriness. In: 2015 IEEE International Conference on Image Processing (ICIP). IEEE, pp 4952–4956. https://doi.org/10.1109/ICIP.2015.7351749
[19]
[20] Randall Y (2023) FLSea: underwater visual-inertial and stereo-vision forward-looking datasets. PhD thesis, University of Haifa (Israel)
[21] Ren WS, Wang LJ, Piao YR, Zhang M, Lu HC, Liu T (2022) Adaptive co-teaching for unsupervised monocular depth estimation. In: 17th European Conference on Computer Vision. Springer, pp 89–105. https://doi.org/10.1007/978-3-031-19769-7_6
[22]
[23] Song W, Wang Y, Huang DM, Tjondronegoro D (2018) A rapid scene depth estimation model based on underwater light attenuation prior for underwater image restoration. In: Advances in Multimedia Information Processing – PCM 2018: 19th Pacific-Rim Conference on Multimedia. Springer, Cham, pp 678–688. https://doi.org/10.1007/978-3-030-00776-8_62
[24] Sun JM, Xie YM, Chen LH, Zhou XW, Bao HJ (2021) NeuralRecon: real-time coherent 3D reconstruction from monocular video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, pp 15593–15602. https://doi.org/10.1109/CVPR46437.2021.01534
[25]
[26] Wang WH, Xie EZ, Li X, Fan DP, Song KT, Liang D et al (2021b) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, pp 548–558. https://doi.org/10.1109/ICCV48922.2021.00061
[27]
[28] Wu CY, Wang JL, Hall M, Neumann U, Su SC (2022) Toward practical monocular indoor depth estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, pp 3804–3814. https://doi.org/10.1109/CVPR52688.2022.00379
[29] Zhang F, You S, Li Y, Fu Y (2024) Atlantis: enabling underwater depth estimation with stable diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, pp 11852–11861. https://doi.org/10.1109/CVPR52733.2024.01126
[30] Zhang N, Nex F, Vosselman G, Kerle N (2023) Lite-Mono: a lightweight CNN and Transformer architecture for self-supervised monocular depth estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, pp 18537–18546. https://doi.org/10.1109/CVPR52729.2023.01778
[31] Zhao CQ, Zhang YM, Poggi M, Tosi F, Guo XD, Zhu Z et al (2022) MonoViT: self-supervised monocular depth estimation with a vision transformer. In: International Conference on 3D Vision (3DV). IEEE, pp 668–678. https://doi.org/10.1109/3DV57658.2022.00077
[32] Zhou H, Greenwood D, Taylor S (2021) Self-supervised monocular depth estimation with internal feature fusion. Preprint at arXiv:2110.09482
[33] Zhou KY, Chen J, Gui SC, Wang ZK (2024) Towards lightweight underwater depth estimation. In: 2024 IEEE Conference on Artificial Intelligence (CAI). IEEE, pp 1442–1445. https://doi.org/10.1109/CAI59869.2024.00258
[34] Zhou TH, Brown M, Snavely N, Lowe DG (2017) Unsupervised learning of depth and ego-motion from video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, pp 6612–6619. https://doi.org/10.1109/CVPR.2017.700
© The Author(s)