GRAMO: geometric resampling augmentation for monocular 3D object detection
He GUAN, Chunfeng SONG, Zhaoxiang ZHANG
Data augmentation is widely recognized as an effective means of improving model robustness. However, when applied to monocular 3D object detection, non-geometric image augmentation neglects the critical link between image space and physical space, leading to semantic collapse of the augmented scene. To address this issue, we propose two geometric-level data augmentation operators, Geometric-Copy-Paste (Geo-CP) and Geometric-Crop-Shrink (Geo-CS). Both operators enforce geometric consistency based on the principle of perspective projection, complementing the augmentation options available for monocular 3D detection. Specifically, Geo-CP replicates local patches and reorders object depths to avoid perspective occlusion conflicts, while Geo-CS re-crops local patches and scales distance and apparent size simultaneously so that appearance and annotation remain consistent. These operations also mitigate class imbalance in the monocular paradigm by increasing the number and broadening the distribution of geometrically consistent samples. Experiments demonstrate that our geometric-level augmentation operators effectively improve robustness and performance on the KITTI and Waymo monocular 3D detection benchmarks.
3D detection / monocular / augmentation / geometry
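To make the perspective-projection consistency concrete, the following is a minimal sketch (not the authors' code; the helper name is hypothetical) of the relation Geo-CS-style crop-shrink relies on: under a pinhole camera, an object of physical height H at depth Z has apparent height h = f·H/Z, so shrinking a patch by a factor s is geometrically consistent with re-annotating the object at depth Z/s while its physical size stays fixed.

```python
def geo_consistent_depth(z: float, scale: float) -> float:
    """Depth that keeps a pinhole projection consistent when the
    object's image patch is resized by `scale` (hypothetical helper).

    Apparent size is proportional to 1/Z, so resizing a patch by
    `scale` matches moving the object to depth z / scale.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return z / scale


# Shrinking a car patch to half its size corresponds to doubling its
# annotated depth, e.g. 20 m -> 40 m:
print(geo_consistent_depth(20.0, 0.5))  # -> 40.0
```

This is why scaling appearance without updating the depth annotation, as non-geometric augmentation does, breaks the image-to-physical-space link the abstract refers to.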
He Guan is a PhD candidate at the University of Chinese Academy of Sciences, China. He received his bachelor's degree from Harbin Institute of Technology, China in 2015, and his master's degree from the Institute of Automation, Chinese Academy of Sciences, China in 2018. His research interests include computer graphics and computer vision.
Chunfeng Song received his PhD degree from the University of Chinese Academy of Sciences, China in 2020. He is an Assistant Professor at the Center for Research on Intelligent Perception and Computing, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. He has published more than 20 conference and journal papers in venues such as IEEE TPAMI, TIP, IJCV, CVPR, ECCV, and AAAI. His current research focuses on person identification, image segmentation, and unsupervised learning.
Zhaoxiang Zhang received his bachelor's degree in Circuits and Systems from the University of Science and Technology of China in 2004, and his PhD degree in 2009. He is a full Professor at the Center for Research on Intelligent Perception and Computing and the State Key Laboratory of Multimodal Artificial Intelligence Systems, China. His research interests include computer vision, pattern recognition, and machine learning; he recently focuses on biologically inspired intelligent computing and its applications in human analysis and scene understanding. He has published more than 150 papers in international journals and conferences, such as IEEE TPAMI, TIP, TIFS, IJCV, CVPR, ICCV, ECCV, and NeurIPS.