Abstract
Speech emotion recognition is challenging, yet it has broad application prospects in human-computer interaction: a system that can accurately and reliably recognize emotions from human speech can provide a much better user experience. However, current unimodal emotion feature representations are not distinctive enough for reliable recognition, and they fail to model the inter-modality dynamics involved in speech emotion recognition. This paper proposes a multimodal method that exploits both the audio signal and its semantic content. The method consists of three parts: two high-level feature extractors, one for the audio modality and one for the text modality, and an autoencoder-based feature fusion. For the audio modality, we propose a Temporal Global Feature Extractor (TGFE) that extracts high-level features capturing the time-frequency relationships in the original speech signal. Because text lacks frequency information, the text modality uses only a Bidirectional Long Short-Term Memory network (BLSTM) with an attention mechanism to model intra-modal dynamics. The resulting high-level text and audio features are then fed to the autoencoder in parallel to learn a shared representation for the final emotion classification. We conducted extensive experiments on three public benchmark datasets. The results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database and the Multimodal EmotionLines Dataset (MELD) outperform existing methods, and the results on the CMU Multimodal Opinion-level Sentiment Intensity (CMU-MOSI) corpus are competitive. Furthermore, the experiments show that, compared with unimodal information, the joint multimodal information (audio and text) improves overall performance, and the autoencoder-based feature-level fusion achieves higher accuracy than simple feature concatenation.
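To make the described pipeline concrete, below is a minimal Python/PyTorch sketch of the bimodal flow the abstract outlines: a BLSTM-with-attention text encoder and an autoencoder that fuses audio and text features into a shared representation used for emotion classification. The choice of PyTorch, all module names (TextEncoder, FusionAutoencoder), layer sizes, and loss weighting are illustrative assumptions, not the authors' implementation; the TGFE audio branch is treated here simply as a precomputed feature vector.

import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    # BLSTM over word embeddings followed by additive attention (illustrative sizes).
    def __init__(self, embed_dim=300, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, x):                        # x: (batch, seq_len, embed_dim)
        h, _ = self.blstm(x)                     # (batch, seq_len, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)   # attention weights over time steps
        return (w * h).sum(dim=1)                # utterance-level text feature

class FusionAutoencoder(nn.Module):
    # Autoencoder over the concatenated audio+text features; the bottleneck code
    # serves as the shared representation fed to the emotion classifier.
    def __init__(self, audio_dim=256, text_dim=256, code_dim=128, n_classes=4):
        super().__init__()
        in_dim = audio_dim + text_dim
        self.encoder = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, in_dim)        # reconstruction branch
        self.classifier = nn.Linear(code_dim, n_classes)  # emotion head

    def forward(self, audio_feat, text_feat):
        x = torch.cat([audio_feat, text_feat], dim=-1)
        z = self.encoder(x)
        return self.classifier(z), self.decoder(z), x

# Toy usage: random tensors stand in for TGFE audio features and word embeddings.
audio = torch.randn(8, 256)                      # hypothetical TGFE output
words = torch.randn(8, 20, 300)                  # hypothetical word embeddings
text = TextEncoder()(words)                      # (8, 256)
logits, recon, target = FusionAutoencoder()(audio, text)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 4, (8,))) \
       + nn.functional.mse_loss(recon, target)

In this sketch, training minimizes the classification loss plus a reconstruction loss so that the bottleneck code retains information from both modalities; how the paper actually trains or weights these objectives is not specified in the abstract.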
Keywords
Attention mechanism / Autoencoder / Bimodal fusion / Emotion recognition
Cite this article
Peng Shixin, Chen Kai, Tian Tian, Chen Jingying. An autoencoder-based feature level fusion for speech emotion recognition. Digital Communications and Networks, 2024, 10(5): 1341-1351. DOI: 10.1016/j.dcan.2022.10.018