Next-Gen AIGC: a review of multimodal foundation models for text-to-media innovations
Cong JIN, Jingru FAN, Jinfa HUANG, Jinyuan FU, Tao MEI, Li YUAN, Jiebo LUO
Front. Comput. Sci., 2026, Vol. 20, Issue 12: 2012368
Multimodal foundation models (MFMs), including diffusion models and multimodal large language models, have attracted significant interest owing to their scalable capabilities in vision and vision-language understanding and generation tasks. Despite the growing body of research on MFMs, a comprehensive review of their applications in the text-to-media domain remains limited. This review aims to bridge that gap by offering an exhaustive overview of MFMs' development within the text-to-media landscape. We focus on four popular text prompt-based AI-generated content (AIGC) tasks: text-to-image, text-to-video, text-to-music, and text-to-motion generation. For each task, we delve into fundamental concepts, model architectures, training strategies, and dataset settings. This work serves as a crucial resource for researchers aiming to develop text-to-media models using MFMs tailored to specific requirements. Moreover, we trace the evolution from traditional AIGC to Next-Gen AIGC by discussing the adaptation of advanced MFMs for text-to-media innovations. We identify existing challenges and outline future directions to help researchers gain deeper insights into the future trajectory of AIGC development and deployment. In summary, our review provides a comprehensive understanding of the advancement of MFMs in text-to-media applications, which may have a far-reaching impact on the community.
multimodal foundation models / text-to-media / next-gen AIGC / training strategy / dataset