High-precision 3D perception in autonomous driving remains constrained by its dependence on expensive LiDAR sensors and computationally intensive models. These requirements put advanced perception out of reach of resource-constrained platforms, hindering the widespread adoption of autonomous technology. We present T-3MS Fusion, a transformer-based middle-fusion framework that achieves state-of-the-art 3D object detection using only a Velodyne VLP-32C LiDAR and a consumer-grade 360° camera, without test-time augmentation and at modest computational cost. In contrast to early-fusion strategies, which weaken spatial fidelity, and late-fusion methods, which lose geometric consistency, T-3MS employs a transformer-based middle-deep fusion architecture. This design combines hierarchical gated residual transformers with adaptive cross-modal reactivation to preserve LiDAR geometry and camera semantics while enabling effective multi-scale feature integration. Sparse bird's-eye-view processing and quantization-aware training enable real-time inference on embedded platforms. On the nuScenes benchmark, T-3MS Fusion achieves 74.9% NDS and 72.8% mAP, while evaluation on a self-collected semi-urban dataset acquired with low-cost, accessible hardware demonstrates resilience under occlusion, adverse illumination, and sparse point-cloud conditions, under which even high-resolution LiDAR systems often degrade significantly. These results establish T-3MS Fusion as an effective approach for jointly advancing accuracy, efficiency, and affordability in next-generation autonomous driving perception systems.
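The gated residual fusion named above can be illustrated with a minimal per-channel sketch in plain Python. All names, the scalar gate parameterization, and the weights below are illustrative assumptions for exposition, not the paper's actual formulation, which uses learned transformer sub-networks over multi-scale BEV features:

```python
import math

def sigmoid(x):
    """Standard logistic function, used here as the gate nonlinearity."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_residual_fuse(lidar_feat, cam_feat, gate_w_l, gate_w_c, gate_b):
    """Toy per-channel gated residual fusion (hypothetical simplification).

    Each channel computes a gate g from both modalities, then admits a
    g-weighted share of the camera feature on top of a LiDAR residual path,
    so LiDAR geometry is preserved even when the gate closes.
    """
    fused = []
    for l, c, wl, wc, b in zip(lidar_feat, cam_feat, gate_w_l, gate_w_c, gate_b):
        g = sigmoid(wl * l + wc * c + b)  # how much camera evidence to admit
        fused.append(l + g * c)           # residual keeps the LiDAR feature intact
    return fused

# With zero weights the gate is exactly 0.5, so half the camera feature is added:
print(gated_residual_fuse([1.0], [2.0], [0.0], [0.0], [0.0]))  # [2.0]
# A strongly negative bias closes the gate, falling back to near-raw LiDAR:
print(gated_residual_fuse([1.0], [2.0], [0.0], [0.0], [-20.0]))
```

The residual path is the key design choice this sketch tries to convey: when camera input degrades (occlusion, poor illumination), a closed gate reduces the block to an identity map over the LiDAR feature rather than corrupting it.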
|
Funding
Sirindhorn International Institute of Technology, Thammasat University