Label distribution for multimodal machine learning

Yi REN, Ning XU, Miaogen LING, Xin GENG

Front. Comput. Sci., 2022, 16(1): 161306. DOI: 10.1007/s11704-021-0611-6
Artificial Intelligence
RESEARCH ARTICLE


Abstract

Multimodal machine learning (MML) aims to understand the world from multiple related modalities. It has attracted much attention as multimodal data has become increasingly available in real-world applications. MML has been shown to outperform single-modal machine learning, since multiple modalities contain more information and can complement each other. However, fusing multiple modalities remains a key challenge in MML. Unlike previous work, we further consider side information, which reflects the situation and influences the fusion of the modalities. By leveraging the side information, we recover a multimodal label distribution (MLD), which represents the degree to which each modality contributes to describing the instance. Accordingly, a novel framework named multimodal label distribution learning (MLDL) is proposed to recover the MLD and to fuse the modalities under its guidance, learning an in-depth joint feature representation. Moreover, two versions of MLDL are proposed to deal with sequential data. Experiments on multimodal sentiment analysis and disease prediction show that the proposed approaches perform favorably against state-of-the-art methods.
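
To make the fusion idea concrete, the following is a minimal sketch (not the authors' implementation) of label-distribution-guided fusion in PyTorch: a side-information vector is mapped to a distribution over modalities, which then weights each modality's embedding before a joint classifier. All module names, dimensions, and the softmax-based MLD head are illustrative assumptions.

import torch
import torch.nn as nn

class MLDLFusionSketch(nn.Module):
    """Illustrative sketch of label-distribution-guided multimodal fusion."""
    def __init__(self, modal_dims, side_dim, hidden_dim, num_classes):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.encoders = nn.ModuleList(
            nn.Linear(d, hidden_dim) for d in modal_dims
        )
        # Recover a multimodal label distribution (MLD) from side information:
        # one weight per modality, normalized to sum to 1 via softmax.
        # (The softmax head is an assumption made for this sketch.)
        self.mld_head = nn.Sequential(
            nn.Linear(side_dim, len(modal_dims)),
            nn.Softmax(dim=-1),
        )
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, modal_inputs, side_info):
        # modal_inputs: list of (batch, d_m) tensors, one per modality
        # side_info:    (batch, side_dim) tensor
        embeds = torch.stack(
            [enc(x) for enc, x in zip(self.encoders, modal_inputs)], dim=1
        )                               # (batch, M, hidden_dim)
        mld = self.mld_head(side_info)  # (batch, M): per-modality contribution
        fused = (mld.unsqueeze(-1) * embeds).sum(dim=1)  # MLD-weighted fusion
        return self.classifier(fused), mld

# Toy usage: three modalities (e.g., text/audio/vision) with random features;
# the feature dimensions below are hypothetical.
if __name__ == "__main__":
    model = MLDLFusionSketch(modal_dims=[300, 74, 35], side_dim=16,
                             hidden_dim=64, num_classes=3)
    xs = [torch.randn(4, d) for d in [300, 74, 35]]
    side = torch.randn(4, 16)
    logits, mld = model(xs, side)
    print(logits.shape, mld.shape)  # torch.Size([4, 3]) torch.Size([4, 3])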

Keywords

multimodal machine learning / label distribution learning / sentiment analysis / disease prediction

Cite this article

Yi REN, Ning XU, Miaogen LING, Xin GENG. Label distribution for multimodal machine learning. Front. Comput. Sci., 2022, 16(1): 161306. https://doi.org/10.1007/s11704-021-0611-6


RIGHTS & PERMISSIONS

© 2022 Higher Education Press