UNO: Unified Self-Supervised Monocular Odometry for Platform-Agnostic Deployment

Wentao Zhao, Yihe Niu, Yanbo Wang, Tianchen Deng, Shenghai Yuan, Zhenli Wang, Rui Guo, Jingchuan Wang

CAAI Transactions on Intelligence Technology ›› 2026, Vol. 11 ›› Issue (1): 205-222. DOI: 10.1049/cit2.70095

ORIGINAL RESEARCH

Abstract

This work presents UNO, a unified monocular visual odometry framework that enables robust and adaptable pose estimation across diverse environments, platforms and motion patterns. Unlike traditional methods that rely on deployment-specific tuning or predefined motion priors, our approach generalises effectively across a wide range of real-world scenarios, including autonomous vehicles, aerial drones, mobile robots and handheld devices. To this end, we introduce a mixture-of-experts strategy for local state estimation, with several specialised decoders that each handle a distinct class of ego-motion patterns. Moreover, we introduce a fully differentiable Gumbel-softmax module that constructs a robust inter-frame correlation graph, selects the optimal expert decoder and prunes erroneous estimates. These cues are then fed into a unified back-end that combines pretrained scale-independent depth priors with a lightweight bundle adjustment to enforce geometric consistency. We extensively evaluate our method on three major benchmark datasets: KITTI (outdoor/autonomous driving), EuRoC-MAV (indoor/aerial drones) and TUM-RGBD (indoor/handheld), demonstrating state-of-the-art performance.
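For readers unfamiliar with differentiable expert routing, the PyTorch sketch below illustrates the general idea behind a Gumbel-softmax mixture-of-experts pose head: a router samples a (nearly) one-hot gate over several expert decoders while remaining differentiable end to end. This is a minimal illustration under stated assumptions, not the authors' implementation; the class names, feature dimension, expert count and 6-DoF output parameterisation are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertPoseDecoder(nn.Module):
    """One expert head: maps a fused frame-pair feature to a 6-DoF pose
    (3 translation + 3 rotation parameters). Layer sizes are assumptions."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 6),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)

class GumbelMoEPose(nn.Module):
    """Mixture-of-experts pose head with a Gumbel-softmax router.
    With hard=True (straight-through), the forward pass picks a single
    expert, yet gradients still flow to the router in the backward pass."""
    def __init__(self, feat_dim: int = 256, num_experts: int = 3, tau: float = 1.0):
        super().__init__()
        self.router = nn.Linear(feat_dim, num_experts)  # routing logits
        self.experts = nn.ModuleList(
            ExpertPoseDecoder(feat_dim) for _ in range(num_experts)
        )
        self.tau = tau

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.router(x)                                   # (B, E)
        gate = F.gumbel_softmax(logits, tau=self.tau, hard=True)  # one-hot (B, E)
        poses = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, 6)
        return (gate.unsqueeze(-1) * poses).sum(dim=1)            # (B, 6)

# Usage: route a batch of fused frame-pair features, one expert per sample.
model = GumbelMoEPose()
pose = model(torch.randn(4, 256))  # (4, 6) relative pose parameters
```

The straight-through trick keeps the expert choice discrete at inference while training the router jointly with the experts; the paper's full module additionally builds the inter-frame correlation graph and prunes erroneous estimates, which this sketch does not attempt to reproduce.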

Keywords

computer vision / pose estimation / robotics / unsupervised learning

Cite this article

Wentao Zhao, Yihe Niu, Yanbo Wang, Tianchen Deng, Shenghai Yuan, Zhenli Wang, Rui Guo, Jingchuan Wang. UNO: Unified Self-Supervised Monocular Odometry for Platform-Agnostic Deployment. CAAI Transactions on Intelligence Technology, 2026, 11(1): 205-222. DOI: 10.1049/cit2.70095


Acknowledgements

The authors gratefully acknowledge financial and technical support from the State Grid Corporation of China through the Technology Programme (Grant 5700-202416334A-2-1-ZX).

Funding

This research was supported by the Technology Project managed by the State Grid Corporation of China (Grant 5700-202416334A-2-1-ZX).

Ethics Statement

As the research involved only computational procedures, no institutional ethics review was necessary.

Consent

The authors have nothing to report.

Conflicts of Interest

The authors declare no conflicts of interest.

Data Availability Statement

The data and code that support the findings of this study are available from the corresponding author upon request. The data and code are not publicly available due to privacy or ethical restrictions.

Use of Third-Party Material

The authors have nothing to report.

