Real-time lightweight self-supervised monocular depth estimation

Tianxiang YANG; Lingjun MENG; Hong JIN; Wenjie FENG; Xinhao LIU

doi:10.62756/jmsi.1674-8042.2026024

Journal of Measurement Science and Instrumentation, 2026, Vol. 17, Issue 2: 278-296. DOI: 10.62756/jmsi.1674-8042.2026024

Signal and image processing technology

Abstract

Monocular depth estimation aims to predict scene depth from a single RGB image, but many models remain too computationally intensive for real-time inference on resource-constrained edge devices. This paper presents a lightweight self-supervised monocular depth estimation network that balances accuracy and efficiency through targeted encoder–decoder design. The encoder combines decomposable large-kernel convolutions with local depthwise convolutions to capture both long-range context and local details at low computational cost. The decoder uses cross-scale feature differences as guidance to dynamically fuse multi-scale features, enhancing detail recovery and geometric consistency under lightweight constraints. In addition, a temporal soft fusion reprojection loss better leverages the complementary information in forward and backward frames, improving the robustness of self-supervised training. The model contains 3.0 M parameters and requires 3.5 GFLOPs of computation. On KITTI, it achieves Abs Rel = 0.105 and δ₁ = 0.892. On Make3D, it achieves Abs Rel = 0.308 in a zero-shot setting. On a Rockchip RK3588S, a hybrid-quantized multi-thread implementation runs at 67 frames/s. These results show that the proposed method achieves a favorable accuracy–efficiency balance on edge devices, making it suitable for real-time monocular depth estimation.
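For readers unfamiliar with the two accuracy figures quoted in the abstract, the following is a minimal sketch of how Abs Rel and δ₁ are conventionally computed in the monocular depth literature: Abs Rel is the mean of |d_pred − d_gt| / d_gt over valid pixels, and δ₁ is the fraction of valid pixels with max(d_pred/d_gt, d_gt/d_pred) < 1.25. The function name `depth_metrics`, the flat-list inputs, and the `min_depth` / `max_depth` defaults are illustrative assumptions, not details taken from the paper, and any scale alignment the authors may apply before evaluation is not shown.

```python
def depth_metrics(pred, gt, min_depth=1e-3, max_depth=80.0):
    """Return (abs_rel, delta1) for paired predicted / ground-truth depths.

    pred, gt: equal-length sequences of depths in the same unit.
    min_depth / max_depth are assumed evaluation bounds (hypothetical
    defaults); ground-truth values outside them are treated as invalid.
    """
    rel_errors = []
    hits = 0
    for p, g in zip(pred, gt):
        # Skip pixels without usable ground truth.
        if not (min_depth < g < max_depth):
            continue
        # Clamp the prediction so the ratio test never divides by zero.
        p = min(max(p, min_depth), max_depth)
        rel_errors.append(abs(p - g) / g)
        # delta_1: prediction within a factor of 1.25 of the ground truth.
        if max(p / g, g / p) < 1.25:
            hits += 1
    if not rel_errors:
        raise ValueError("no valid ground-truth pixels")
    n = len(rel_errors)
    return sum(rel_errors) / n, hits / n


if __name__ == "__main__":
    # Toy values only; the last pixel has no ground truth and is ignored.
    pred = [10.0, 20.0, 33.0, 5.0]
    gt = [10.0, 30.0, 30.0, 0.0]
    abs_rel, delta1 = depth_metrics(pred, gt)
    print(f"Abs Rel = {abs_rel:.3f}, delta_1 = {delta1:.3f}")
```

Lower Abs Rel and higher δ₁ indicate better accuracy, which is how the KITTI and Make3D figures above should be read.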

Keywords

monocular depth estimation / deep learning / self-supervised learning / large-kernel attention / differential-driven dynamic fusion / lightweight network / RK3588S

Cite this article

Tianxiang YANG, Lingjun MENG, Hong JIN, Wenjie FENG, Xinhao LIU. Real-time lightweight self-supervised monocular depth estimation. Journal of Measurement Science and Instrumentation, 2026, 17(2): 278-296. DOI: 10.62756/jmsi.1674-8042.2026024

Acknowledgement

This work was supported by the National Natural Science Foundation of China General Program (No. 52475575) and the 2024 Shanxi Provincial Basic Research Program (No. 202403021211076).

Declaration of conflicting interests

The authors declare no conflicts of interest related to this publication.
