UNO: Unified Self-Supervised Monocular Odometry for Platform-Agnostic Deployment
Wentao Zhao, Yihe Niu, Yanbo Wang, Tianchen Deng, Shenghai Yuan, Zhenli Wang, Rui Guo, Jingchuan Wang
CAAI Transactions on Intelligence Technology, 2026, Vol. 11, Issue 1: 205-222.
This work presents UNO, a unified monocular visual odometry framework that enables robust and adaptable pose estimation across diverse environments, platforms and motion patterns. Unlike traditional methods that rely on deployment-specific tuning or predefined motion priors, our approach generalises effectively across a wide range of real-world scenarios, including autonomous vehicles, aerial drones, mobile robots and handheld devices. To this end, we introduce a mixture-of-experts strategy for local state estimation, with several specialised decoders that each handle a distinct class of ego-motion patterns. Moreover, we introduce a fully differentiable Gumbel-softmax module that constructs a robust inter-frame correlation graph, selects the optimal expert decoder and prunes erroneous estimates. These cues are then fed into a unified back-end that combines pretrained scale-independent depth priors with lightweight bundle adjustment to enforce geometric consistency. We extensively evaluate our method on three major benchmark datasets: KITTI (outdoor/autonomous driving), EuRoC-MAV (indoor/aerial drones) and TUM-RGBD (indoor/handheld), demonstrating state-of-the-art performance.
computer vision / pose estimation / robotics / unsupervised learning
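The abstract's expert-selection mechanism follows a standard pattern: a gating head produces logits over the expert decoders, and a hard Gumbel-softmax sample picks one expert in the forward pass while straight-through gradients keep the gate trainable end to end. Below is a minimal PyTorch sketch of that idea; the class name `ExpertSelector`, the `feature_dim=256` and `num_experts=3` settings, and the two-layer expert MLPs are illustrative assumptions, not the architecture reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertSelector(nn.Module):
    """Select one of K pose-expert decoders with a hard Gumbel-softmax gate.

    Hypothetical sketch: sizes and expert heads are illustrative only.
    """

    def __init__(self, feature_dim: int = 256, num_experts: int = 3, tau: float = 1.0):
        super().__init__()
        self.tau = tau
        # One lightweight decoder per ego-motion class (e.g. driving, aerial,
        # handheld); each regresses a 6-DoF inter-frame pose increment.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(), nn.Linear(128, 6))
             for _ in range(num_experts)]
        )
        # Gating head producing unnormalised logits over the experts.
        self.gate = nn.Linear(feature_dim, num_experts)

    def forward(self, frame_feat: torch.Tensor) -> torch.Tensor:
        logits = self.gate(frame_feat)                                # (B, K)
        # hard=True yields a one-hot choice in the forward pass, while the
        # straight-through estimator keeps the gate differentiable.
        weights = F.gumbel_softmax(logits, tau=self.tau, hard=True)   # (B, K)
        # Evaluate every expert, then keep only the selected one's output.
        poses = torch.stack([e(frame_feat) for e in self.experts], dim=1)  # (B, K, 6)
        return (weights.unsqueeze(-1) * poses).sum(dim=1)             # (B, 6)


# Usage: pick a pose estimate for a batch of four inter-frame feature vectors.
selector = ExpertSelector()
pose = selector(torch.randn(4, 256))   # -> tensor of shape (4, 6)
```

The hard one-hot selection makes the choice discrete at inference time, while the relaxed gradient lets the gating network learn which motion class each frame pair belongs to without a separate classification label.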