Translation-based multimodal learning: a survey
Zhengyi Lu, Yunhong Liao, Jia Li
Intelligence & Robotics, 2025, Vol. 5, Issue 3: 783-804.
Translation-based multimodal learning addresses the challenge of reasoning across heterogeneous data modalities by enabling translation between modalities or into a shared latent space. In this survey, we categorize the field into two primary paradigms: end-to-end translation and representation-level translation. End-to-end methods leverage architectures such as encoder–decoder networks, conditional generative adversarial networks, diffusion models, and text-to-image generators to learn direct mappings between modalities. These approaches achieve high perceptual fidelity but often depend on large paired datasets and entail substantial computational overhead. In contrast, representation-level methods focus on aligning multimodal signals within a common embedding space using techniques such as multimodal transformers, graph-based fusion, and self-supervised objectives, resulting in robustness to noisy inputs and missing data. We distill insights from over forty benchmark studies and highlight two notable recent models. The Explainable Diffusion Model via Schrödinger Bridge Multimodal Image Translation (xDSBMIT) framework employs stochastic diffusion combined with the Schrödinger Bridge to enable stable synthetic aperture radar-to-electro-optical image translation under limited data conditions, while TransTrans utilizes modality-specific backbones with a translation-driven transformer to impute missing views in multimodal sentiment analysis tasks. Both methods demonstrate superior performance on benchmarks such as UNICORN-2008 and CMU-MOSI, illustrating the efficacy of integrating optimal transport theory (via the Schrödinger Bridge in xDSBMIT) with transformer-based cross-modal attention mechanisms (in TransTrans). 
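To make the cross-modal attention idea concrete, the following is a minimal, generic sketch (not the actual TransTrans implementation) of scaled dot-product attention in which queries come from one modality and keys/values from another, so each query-side token aggregates content from the other modality; all array shapes and the toy text/audio data are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(queries, keys, values):
    """Scaled dot-product cross-attention: queries from one modality,
    keys/values from another. The output keeps the query sequence's
    positions but each vector is a mixture of the other modality's values."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (Tq, Tk) cross-modal affinities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ values                  # (Tq, d_v) fused representation

# Toy example: 4 text tokens attend over 6 audio frames, feature dim 8.
rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))
audio = rng.standard_normal((6, 8))
fused = cross_modal_attention(text, audio, audio)
print(fused.shape)  # (4, 8)
```

In a translation-driven setup, an output like `fused` could serve as a text-conditioned estimate of the audio view, which is how attention-based imputation of a missing modality is typically framed.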
Finally, we identify open challenges and future directions, including the development of hybrid diffusion–transformer pipelines, cross-domain generalization to emerging modalities such as light detection and ranging and hyperspectral imaging, and the necessity for transparent, ethically guided generation techniques. This survey aims to inform the design of versatile, trustworthy multimodal systems.
Keywords: Multimodal / cross-modal translation / sensor fusion / representation / deep learning