Generative masked text-to-motion model with hybrid vector quantization
Jiaqi ZHANG, Jiajun WANG, Fanglue ZHANG, Miao WANG
Front. Comput. Sci., 2027, Vol. 21, Issue (4): 2104701
Text-based motion generation enhances the flexibility of human motion design and editing, enabling applications in animation, virtual reality, and beyond. However, diffusion-based methods for text-to-motion generation often produce low-quality results. Conditional autoregressive approaches built on vector quantized variational autoencoders (VQ-VAEs) struggle with quantization error and therefore resort to hierarchical or residual quantization. This lengthens the quantized token sequences, forcing the model to predict more tokens from the text input and making high-quality generation harder. To address this, we introduce HyT2M, a text-to-motion model based on a hybrid VQ-VAE framework. Our approach decomposes motion into global and local components: local motion is quantized with a single vector quantization layer to preserve fine details, while global motion is reconstructed via residual vector quantization (RVQ) to compensate for errors caused by the limited perceptual range of the local components. This hybrid strategy shortens token sequences while maintaining high reconstruction quality, easing the burden on the second-stage model. Furthermore, we develop a conditional masked transformer with a hybrid cross-guidance module that leverages global motion tokens to enhance local motion predictions, improving accuracy and usability for motion editing. Experiments on the HumanML3D, KIT-ML, and Motion-X datasets show that HyT2M achieves competitive results and excels at tasks such as motion completion and long-motion generation.
text-to-motion / VQ-VAE / masked transformer / generative model
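The hybrid quantization scheme described in the abstract can be illustrated with a minimal sketch: local motion features pass through a single vector quantization layer, while global motion features are quantized with residual vector quantization. This is a PyTorch-style illustration under our own assumptions; the class names, codebook sizes, feature dimensions, and the straight-through estimator below are not taken from the authors' implementation.

# Minimal sketch of hybrid quantization: single-layer VQ for local motion,
# residual VQ (RVQ) for global motion. All names and sizes are assumptions.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Single-codebook VQ layer with nearest-neighbour lookup."""

    def __init__(self, num_codes=512, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                # z: (batch, frames, dim)
        flat = z.reshape(-1, z.shape[-1])
        dist = torch.cdist(flat, self.codebook.weight)   # distance to every code
        idx = dist.argmin(dim=-1)
        z_q = self.codebook(idx).view_as(z)
        z_q = z + (z_q - z).detach()                     # straight-through estimator
        return z_q, idx.view(z.shape[:-1])


class ResidualVQ(nn.Module):
    """RVQ: each layer quantizes the residual left by the previous layers."""

    def __init__(self, num_layers=4, num_codes=512, dim=256):
        super().__init__()
        self.layers = nn.ModuleList(
            [VectorQuantizer(num_codes, dim) for _ in range(num_layers)]
        )

    def forward(self, z):
        residual, z_q, indices = z, torch.zeros_like(z), []
        for layer in self.layers:
            q, idx = layer(residual)
            residual = residual - q.detach()             # pass the remaining error on
            z_q = z_q + q                                # sum of all layer outputs
            indices.append(idx)
        return z_q, indices


class HybridQuantizer(nn.Module):
    """Local motion -> one VQ layer; global motion -> residual VQ."""

    def __init__(self, dim=256):
        super().__init__()
        self.local_vq = VectorQuantizer(dim=dim)
        self.global_rvq = ResidualVQ(dim=dim)

    def forward(self, z_local, z_global):
        local_q, local_idx = self.local_vq(z_local)
        global_q, global_idx = self.global_rvq(z_global)
        return local_q, global_q, local_idx, global_idx


if __name__ == "__main__":
    quantizer = HybridQuantizer()
    z_local = torch.randn(2, 49, 256)   # downsampled local motion features
    z_global = torch.randn(2, 49, 256)  # downsampled global motion features
    local_q, global_q, _, _ = quantizer(z_local, z_global)
    print(local_q.shape, global_q.shape)

The point of the split is visible in the token budget: the local stream contributes only one token per downsampled frame, so the second-stage masked transformer predicts a short local sequence, while the multi-layer RVQ codes are confined to the global stream, where they compensate for reconstruction error without multiplying the local token count.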