Next-Gen AIGC: a review of multimodal foundation models for text-to-media innovations
Cong JIN, Jingru FAN, Jinfa HUANG, Jinyuan FU, Tao MEI, Li YUAN, Jiebo LUO
Front. Comput. Sci., 2026, Vol. 20, Issue 12: 2012368
Multimodal foundation models (MFMs), including diffusion models and multimodal large language models, have attracted significant interest owing to their scalable capabilities in vision and vision-language understanding and generation tasks. Despite the growing body of research on MFMs, a comprehensive review of their applications in the text-to-media domain remains limited. This review aims to bridge that gap by offering an exhaustive overview of MFMs' development within the text-to-media landscape. We focus on four popular text prompt-based AI-generated content (AIGC) tasks: text-to-image, text-to-video, text-to-music, and text-to-motion generation. For each task, we delve into fundamental concepts, model architectures, training strategies, and dataset settings. This work serves as a crucial resource for researchers aiming to develop text-to-media models using MFMs tailored to specific requirements. Moreover, we trace the evolution from traditional AIGC to Next-Gen AIGC by discussing the adaptation of advanced MFMs for text-to-media innovations. We identify existing challenges and outline future directions to help researchers gain deeper insights into the future trajectory of AIGC development and deployment. In summary, our review provides a comprehensive understanding of the advancement of MFMs in text-to-media applications, which may have a far-reaching impact on the community.
multimodal foundation models / text-to-media / next-gen AIGC / training strategy / dataset