LMR-CBT: learning modality-fused representations with CB-Transformer for multimodal emotion recognition from unaligned multimodal sequences
Ziwang FU, Feng LIU, Qing XU, Xiangling FU, Jiayin QI
Learning modality-fused representations and processing unaligned multimodal sequences are meaningful and challenging in multimodal emotion recognition. Existing approaches use directional pairwise attention or a message hub to fuse language, visual, and audio modalities. However, these fusion methods are often quadratic in complexity with respect to the modality sequence length, introduce redundant information, and are inefficient. In this paper, we propose an efficient neural network to learn modality-fused representations with a CB-Transformer (LMR-CBT) for multimodal emotion recognition from unaligned multimodal sequences. Specifically, we first perform feature extraction for each of the three modalities to obtain the local structure of the sequences. Then, we design an innovative asymmetric transformer with cross-modal blocks (CB-Transformer) that enables complementary learning across modalities, mainly comprising local temporal learning, cross-modal feature fusion, and global self-attention representation. In addition, we splice the fused features with the original features to classify the emotions of the sequences. Finally, we conduct word-aligned and unaligned experiments on three challenging datasets: IEMOCAP, CMU-MOSI, and CMU-MOSEI. The experimental results show the superiority and efficiency of our proposed method in both settings. Compared with mainstream methods, our approach reaches the state of the art with a minimal number of parameters.
modality-fused representations / cross-modal blocks / multimodal emotion recognition / unaligned multimodal sequences / computational affection
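To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of the described flow: per-modality temporal convolutions for local structure, cross-modal blocks that fuse the modalities, a global self-attention stage, and a classifier over the fused features spliced with the original ones. All module names, dimensions, and hyperparameters here are illustrative assumptions, not the authors' implementation.

```python
# Minimal, illustrative sketch of an LMR-CBT-style pipeline (assumptions only).
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """Fuses a target modality with a source modality via cross-attention."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # Query comes from the target modality; keys/values from the source.
        fused, _ = self.attn(target, source, source)
        return self.norm(target + fused)


class LMRCBTSketch(nn.Module):
    """Hypothetical skeleton: per-modality Conv1d feature extraction,
    cross-modal fusion, global self-attention, then concatenation of the
    fused representation with the original features for classification."""

    def __init__(self, dims=(300, 35, 74), hidden: int = 40, classes: int = 4):
        super().__init__()
        # 1) Local feature extraction: one temporal convolution per modality
        #    (language, visual, audio) to capture local sequence structure.
        self.convs = nn.ModuleList(
            nn.Conv1d(d, hidden, kernel_size=3, padding=1) for d in dims
        )
        # 2) Cross-modal blocks fusing language and visual cues into one stream.
        self.fuse_l = CrossModalBlock(hidden)
        self.fuse_v = CrossModalBlock(hidden)
        # 3) Global self-attention over the fused representation.
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.global_attn = nn.TransformerEncoder(layer, num_layers=2)
        # 4) Classifier over the fused features spliced with the originals.
        self.classifier = nn.Linear(hidden * 4, classes)

    def forward(self, lang, vis, aud):
        # Inputs: (batch, seq_len_m, dim_m); the three sequences may be unaligned.
        l, v, a = (conv(x.transpose(1, 2)).transpose(1, 2)
                   for conv, x in zip(self.convs, (lang, vis, aud)))
        fused = self.fuse_l(a, l)          # audio stream attends to language
        fused = self.fuse_v(fused, v)      # then to visual
        fused = self.global_attn(fused)
        # Temporal mean-pooling, then splice fused with the original features.
        pooled = torch.cat([t.mean(dim=1) for t in (fused, l, v, a)], dim=-1)
        return self.classifier(pooled)
```

Because the cross-modal blocks attend from a single fused stream to the other modalities rather than between every modality pair, the sketch avoids the pairwise attention pattern the abstract identifies as redundant.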
Ziwang Fu is currently a Master's degree candidate at Beijing University of Posts and Telecommunications, China. His main research interests include multimodal machine learning and computer vision
Feng Liu has been a PhD candidate in the School of Computer Science and Technology, East China Normal University, China since 2020. He is a senior member of the China Computer Federation, a member of the blockchain technical committee of the Chinese Association of Automation, a member of IEEE, and a member of the Chinese Psychological Society. His research interests include deep learning, computational affection, and blockchain technology
Qing Xu is currently a Master's degree candidate at Beijing University of Posts and Telecommunications, China. Her main research interests include computer vision and attack and defense
Xiangling Fu received the PhD degree in information science from Peking University, China in 2004. She is currently an associate professor and PhD supervisor in the School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, China. She has published more than 50 papers in international journals and conferences. Her research interests include information management, data mining, and medical informatics
Jiayin Qi is a professor and PhD supervisor in the School of Cyberspace Security, Guangzhou University, China. She was listed among the “Leading Talents of Shanghai” in 2017 and was recognized by the “New Century Excellent Talents Program” of the Ministry of Education, China in 2009. She has been conducting research in the field of advanced technologies and management innovation. She has been listed in Elsevier’s “Highly Cited Scholars in China” list