Multi-Modal Multi-View 3D Hand Pose Estimation

Hao WANG; Ping WANG; Haoran YU; Dong DING; Weiming XIANG

doi:10.19884/j.1672-5220.202411014

Journal of Donghua University (English Edition), 2025, Vol. 42, Issue 6: 673-682. DOI: 10.19884/j.1672-5220.202411014

Information Technology and Artificial Intelligence


Abstract

With the rapid progress of artificial intelligence (AI) technology and the mobile internet, 3D hand pose estimation has become critical to various intelligent applications, e.g., human-computer interaction. To avoid the low accuracy of single-modal estimation and the high complexity of traditional multi-modal 3D estimation, this paper proposes a novel multi-modal multi-view (MMV) 3D hand pose estimation system. The system introduces a joint registration-before-translation (RT) and translation-before-registration (TR) conditional generative adversarial network (cGAN) to train a multi-modal registration network, and then employs multi-modal feature fusion to achieve high-quality estimation with low hardware and software costs in both data acquisition and processing. Experimental results demonstrate that the MMV system is effective and feasible in various scenarios, making it promising for a broad range of intelligent applications.
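The abstract names the RT and TR orderings but does not give their formulation. As a rough illustration only, the toy sketch below composes a stand-in "translation" and a stand-in "registration" in both orders and penalizes each result against a target image. Every name here (`translate`, `warp`, `l1`, `joint_rt_tr_loss`), the intensity inversion, the circular row shift, and the L1 penalty are assumptions made for this example, not the networks or losses used in the paper.

```python
# Toy illustration only: all functions are hypothetical stand-ins,
# not the cGAN or registration network described in the paper.

def translate(img):
    """Stand-in for modality translation: invert 8-bit intensities."""
    return [[255 - v for v in row] for row in img]

def warp(img, dx):
    """Stand-in for a registration warp: circularly shift each row right by dx."""
    out = []
    for row in img:
        k = dx % len(row)
        out.append(row[len(row) - k:] + row[:len(row) - k])
    return out

def l1(a, b):
    """Mean absolute difference between two equally sized images."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))

def joint_rt_tr_loss(src, target, dx):
    """Sum of the penalties from both orderings (assumed form)."""
    rt = translate(warp(src, dx))   # registration before translation
    tr = warp(translate(src), dx)   # translation before registration
    return l1(rt, target) + l1(tr, target)

# A source image and a target that is its shifted, "translated" counterpart.
src = [[0, 10, 200, 0], [0, 20, 220, 0]]
target = translate(warp(src, 1))

print(joint_rt_tr_loss(src, target, 1))  # shift matches the target
print(joint_rt_tr_loss(src, target, 0))  # shift does not match the target
```

In this toy setting the stand-in translation acts on each pixel independently, so the two orderings give the same image and the joint loss vanishes at the correct shift. With a learned translation network the two paths can disagree, which is presumably why training on both is useful.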

Keywords

3D hand pose estimation / registration network / multi-modal / multi-view / conditional generative adversarial network (cGAN)

Cite this article

Hao WANG, Ping WANG, Haoran YU, Dong DING, Weiming XIANG. Multi-Modal Multi-View 3D Hand Pose Estimation. Journal of Donghua University (English Edition), 2025, 42(6): 673-682. DOI: 10.19884/j.1672-5220.202411014
