MVI-Depth: Multi-View Indoor Depth Estimation Based on the Fusion of Semantic Information

Ying Zhu, Buyun Chen, Hong Liu, Xia Li

CAAI Transactions on Intelligence Technology ›› 2026, Vol. 11 ›› Issue (1): 98-110. DOI: 10.1049/cit2.70061

ORIGINAL RESEARCH

Abstract

Compared to monocular depth estimation, multi-view depth estimation often yields more accurate results. However, traditional multi-view depth estimation methods often fail to fully leverage semantic information and struggle to effectively fuse information from multiple views, leading to suboptimal prediction performance in challenging scenarios such as texture-less regions and reflective surfaces. To address these limitations, we present MVI-Depth, a novel framework with two core innovations: (1) a Semantic Fusion Module (SFM) that establishes semantic correspondence, and (2) a Depth Updating Module (DUM) that enables iterative depth refinement. Specifically, MVI-Depth first establishes a main-view representation that integrates single-view depth, depth features, and semantic features. Feature extraction from neighbouring views then enables construction of the original cost volume. Recognising the inherent limitations of using this cost volume directly in complex scenes, the proposed SFM constructs an aligned semantic cost volume that exploits the complementarity between semantic and depth information, forming an improved final cost volume. The final cost volume is then processed by the proposed DUM to achieve iterative depth optimisation. Comprehensive evaluations demonstrate that MVI-Depth achieves superior performance across all standard metrics on both the ScanNet and KITTI benchmarks, outperforming existing methods. Additional experiments on the 7-Scenes dataset further confirm the framework's robust generalisation capabilities in diverse environments.
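The pipeline outlined in the abstract — building a matching cost volume from features warped in from neighbouring views, blending it with semantic cues, and selecting the lowest-cost depth hypothesis — can be sketched roughly as follows. This is a minimal illustrative sketch under stated assumptions, not the paper's implementation: the actual SFM and DUM are learned network modules, and the names `build_cost_volume`, `fuse_semantic`, `winner_take_all`, and the `warp` callback are hypothetical, assuming NumPy arrays as stand-ins for feature maps.

```python
import numpy as np

def build_cost_volume(ref_feat, src_feat, depth_hyps, warp):
    """For each hypothesised depth, warp the source-view features into
    the reference view and accumulate a per-pixel L1 matching cost.
    ref_feat, src_feat: (C, H, W) feature maps."""
    _, H, W = ref_feat.shape
    cost = np.zeros((len(depth_hyps), H, W))
    for i, d in enumerate(depth_hyps):
        warped = warp(src_feat, d)
        cost[i] = np.abs(ref_feat - warped).mean(axis=0)
    return cost

def fuse_semantic(cost, sem_sim, alpha=0.5):
    """Blend the geometric matching cost with a semantic dissimilarity
    term (1 - similarity), loosely mirroring an aligned semantic cost
    volume; sem_sim has the same (D, H, W) shape as cost."""
    return alpha * cost + (1.0 - alpha) * (1.0 - sem_sim)

def winner_take_all(cost, depth_hyps):
    """Pick the depth hypothesis with minimum fused cost per pixel."""
    idx = cost.argmin(axis=0)
    return np.asarray(depth_hyps)[idx]
```

As a toy usage, a `warp` that only matches perfectly at depth 2.0 (e.g. `lambda f, d: f + abs(d - 2.0)`) makes the fused volume select 2.0 everywhere; in the real system the selection would additionally be refined iteratively by a learned update module rather than a hard argmin.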

Keywords

computer vision / deep learning / depth

Cite this article

Ying Zhu, Buyun Chen, Hong Liu, Xia Li. MVI-Depth: Multi-View Indoor Depth Estimation Based on the Fusion of Semantic Information. CAAI Transactions on Intelligence Technology, 2026, 11(1): 98-110. DOI: 10.1049/cit2.70061


Conflicts of Interest

Hong Liu is an Executive Editor-in-Chief for the journal and was not involved in the peer review process or the decision to publish this article. The authors declare that they have no conflict of interest.

Data Availability Statement

The data that support the findings of this study are available in the public domain: https://github.com/ScanNet/ScanNet.


Funding

National Natural Science Foundation of China (62373009)

National Natural Science Foundation of China (62073004)
