Cross-feature fusion speech emotion recognition based on attention mask residual network and Wav2vec 2.0

Xiaoke Li, Zufan Zhang

2025, Vol. 11, Issue 5: 1567-1577. DOI: 10.1016/j.dcan.2024.10.007
Regular Papers

Abstract

Speech Emotion Recognition (SER) has received widespread attention as a crucial way of understanding human emotional states. However, irrelevant information in speech signals and data sparsity limit the development of SER systems. To address these issues, this paper proposes a framework that combines an Attention Mask Residual Network (AM-ResNet) with the self-supervised learning model Wav2vec 2.0 to obtain AM-ResNet features and Wav2vec 2.0 features, respectively, together with a cross-attention module that interacts and fuses these two features. The AM-ResNet branch consists mainly of maximum amplitude difference detection, a mask residual block, and an attention mechanism. The maximum amplitude difference detection and the mask residual block act on pre-processing and on the network, respectively, to reduce the impact of silent frames, while the attention mechanism assigns different weights to unvoiced and voiced speech to reduce the redundant emotional information introduced by unvoiced speech. In the Wav2vec 2.0 branch, the model serves as a feature extractor: pre-trained on a large amount of unlabeled speech data, it yields general speech features (Wav2vec 2.0 features) that assist the SER task and mitigate the data sparsity problem. In the cross-attention module, the AM-ResNet features and Wav2vec 2.0 features interact and are fused into cross-fused features, which are used to predict the final emotion. Furthermore, multi-label learning is employed to include ambiguous emotion utterances and thereby cope with data limitations. Finally, experimental results demonstrate the effectiveness and superiority of the proposed framework over existing state-of-the-art approaches.
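To make the fusion step concrete, the following is a minimal, illustrative PyTorch sketch of cross-attention between two feature streams, assuming the AM-ResNet and Wav2vec 2.0 features have already been projected to a common dimension. It is not the authors' implementation: the class name, layer sizes, pooling choice, and four-class output head are all hypothetical, chosen only to show the general mechanism of each stream attending to the other.

```python
# Hypothetical sketch of cross-attention fusion between two feature streams.
# Dimensions, head count, and class count are illustrative assumptions,
# not values taken from the paper.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 4):
        super().__init__()
        # Each stream queries the other: query from one branch,
        # key/value from the other branch.
        self.resnet_to_w2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.w2v_to_resnet = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, resnet_feat: torch.Tensor, w2v_feat: torch.Tensor) -> torch.Tensor:
        # resnet_feat: (batch, T1, dim) AM-ResNet features
        # w2v_feat:    (batch, T2, dim) Wav2vec 2.0 features
        a, _ = self.resnet_to_w2v(resnet_feat, w2v_feat, w2v_feat)
        b, _ = self.w2v_to_resnet(w2v_feat, resnet_feat, resnet_feat)
        # Pool over time and concatenate to form the cross-fused representation.
        fused = torch.cat([a.mean(dim=1), b.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Usage with dummy feature sequences of different lengths:
logits = CrossAttentionFusion()(torch.randn(2, 120, 256), torch.randn(2, 149, 256))
```

Because each branch attends to the other, the fused representation carries complementary cues from both feature sets, which matches the cross-feature interaction the abstract describes at a high level.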

Keywords

Speech emotion recognition / Residual network / Mask / Attention / Wav2vec 2.0 / Cross-feature fusion

Cite this article

Xiaoke Li, Zufan Zhang. Cross-feature fusion speech emotion recognition based on attention mask residual network and Wav2vec 2.0. Digital Communications and Networks, 2025, 11(5): 1567-1577. DOI: 10.1016/j.dcan.2024.10.007


