Next-Gen AIGC: a review of multimodal foundation models for text-to-media innovations
Cong JIN, Jingru FAN, Jinfa HUANG, Jinyuan FU, Tao MEI, Li YUAN, Jiebo LUO
Front. Comput. Sci., 2026, Vol. 20, Issue (12): 2012368
Multimodal Foundation Models (MFMs), including diffusion models and multimodal large language models, have attracted significant interest owing to their scalable capabilities in vision and vision-language understanding and generation. Despite the growing body of research on MFM advancements, a comprehensive review of their applications in the text-to-media domain remains lacking. This review bridges that gap by offering an extensive overview of MFM development across the text-to-media landscape. We focus on four popular text prompt-based AI-generated content (AIGC) tasks: text-to-image, text-to-video, text-to-music, and text-to-motion generation. For each task, we examine fundamental concepts, model architectures, training strategies, and dataset settings. This work therefore serves as a resource for researchers seeking to develop text-to-media models with MFMs tailored to specific requirements. Moreover, we trace the evolution from traditional AIGC to Next-Gen AIGC by discussing how advanced MFMs are adapted for text-to-media innovations. We identify existing challenges and outline future directions to help researchers gain deeper insight into the future trajectory of AIGC development and deployment. In summary, our review provides a comprehensive account of MFMs in text-to-media applications, which may have a far-reaching impact on the community.
multimodal foundation models / text-to-media / next-gen AIGC / training strategy / dataset