LMR-CBT: learning modality-fused representations with CB-Transformer for multimodal emotion recognition from unaligned multimodal sequences
Ziwang FU, Feng LIU, Qing XU, Xiangling FU, Jiayin QI
Front. Comput. Sci., 2024, Vol. 18, Issue 4: 184314
Learning modality-fused representations and processing unaligned multimodal sequences are meaningful and challenging tasks in multimodal emotion recognition. Existing approaches fuse the language, visual, and audio modalities with directional pairwise attention or a message hub. However, these fusion methods are often quadratic in complexity with respect to the modal sequence lengths, introduce redundant information, and are inefficient. In this paper, we propose an efficient neural network that learns modality-fused representations with a CB-Transformer (LMR-CBT) for multimodal emotion recognition from unaligned multimodal sequences. Specifically, we first extract features from each of the three modalities to capture the local structure of the sequences. We then design a novel asymmetric transformer with cross-modal blocks (CB-Transformer) that enables complementary learning across modalities, consisting mainly of local temporal learning, cross-modal feature fusion, and global self-attention representation. In addition, we splice the fused features with the original features to classify the emotions of the sequences. Finally, we conduct word-aligned and unaligned experiments on three challenging datasets: IEMOCAP, CMU-MOSI, and CMU-MOSEI. The experimental results demonstrate the superiority and efficiency of the proposed method in both settings; compared with mainstream methods, our approach achieves state-of-the-art performance with the fewest parameters.
modality-fused representations / cross-modal blocks / multimodal emotion recognition / unaligned multimodal sequences / computational affection
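The abstract describes a three-stage pipeline: per-modality feature extraction that preserves local sequence structure, an asymmetric CB-Transformer performing local temporal learning, cross-modal feature fusion, and global self-attention, and a classifier over the fused features spliced with the original ones. The following is a minimal PyTorch sketch of that flow, not the authors' implementation: the layer sizes, single-layer encoders, the choice of language as the fusion anchor, the mean-pooling, and all class and parameter names (CrossModalBlock, LMRCBTSketch, dim, heads) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """Cross-attention block: queries from the target modality, keys/values
    from the source modality (hypothetical design; details are assumptions)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, target, source):
        # Target and source sequence lengths may differ: no alignment needed.
        fused, _ = self.attn(target, source, source)
        x = self.norm1(target + fused)
        return self.norm2(x + self.ff(x))


class LMRCBTSketch(nn.Module):
    """Illustrative end-to-end flow; not the paper's exact architecture."""

    def __init__(self, d_l, d_a, d_v, dim=40, heads=4, n_classes=4):
        super().__init__()
        # 1) Per-modality 1D convolutions capture local sequence structure.
        self.conv_l = nn.Conv1d(d_l, dim, kernel_size=3, padding=1)
        self.conv_a = nn.Conv1d(d_a, dim, kernel_size=3, padding=1)
        self.conv_v = nn.Conv1d(d_v, dim, kernel_size=3, padding=1)
        # 2) Local temporal learning for audio and vision (single layers assumed).
        self.temp_a = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.temp_v = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        # 3) Cross-modal blocks fuse language with audio and with vision.
        self.cb_a = CrossModalBlock(dim, heads)
        self.cb_v = CrossModalBlock(dim, heads)
        # 4) Global self-attention over the fused representation.
        self.global_enc = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        # 5) Splice fused features with original features, then classify.
        self.classifier = nn.Linear(2 * dim, n_classes)

    def forward(self, x_l, x_a, x_v):
        # Inputs: (batch, seq_len, feat_dim); seq_len may differ per modality.
        h_l = self.conv_l(x_l.transpose(1, 2)).transpose(1, 2)
        h_a = self.conv_a(x_a.transpose(1, 2)).transpose(1, 2)
        h_v = self.conv_v(x_v.transpose(1, 2)).transpose(1, 2)
        h_a, h_v = self.temp_a(h_a), self.temp_v(h_v)
        fused = self.global_enc(self.cb_a(h_l, h_a) + self.cb_v(h_l, h_v))
        # Mean-pool over time and concatenate fused with original features.
        rep = torch.cat([fused.mean(dim=1), h_l.mean(dim=1)], dim=-1)
        return self.classifier(rep)
```

Because the cross-modal blocks draw queries from one modality and keys/values from another, the three streams never require word-level alignment, which is consistent with the unaligned setting the paper targets.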