Self-supervised Learning for Speech Emotion Recognition Task Using Audio-visual Features and Distil Hubert Model on BAVED and RAVDESS Databases

Karim Dabbabi, Abdelkarim Mars

Journal of Systems Science and Systems Engineering: 1–31. DOI: 10.1007/s11518-024-5607-y
Abstract

Pre-trained models such as Distil HuBERT excel at uncovering hidden patterns and enabling accurate recognition across diverse data types, including audio and visual information. We harnessed this capability to develop a deep learning model that uses Distil HuBERT to jointly learn combined audio-visual features for speech emotion recognition (SER). Our experiments highlight its distinct advantages: it significantly outperforms Wav2vec 2.0 in both offline and real-time accuracy on the RAVDESS and BAVED datasets. Although it slightly trails HuBERT in offline accuracy, Distil HuBERT delivers comparable performance at a fraction of the model size, making it well suited to resource-constrained environments such as mobile devices. This compactness comes with only a slight trade-off in accuracy: in offline evaluation, Distil HuBERT achieved 96.33% on the BAVED database and 87.01% on the RAVDESS database, while in real-time evaluation accuracy decreased to 79.3% on BAVED and 77.87% on RAVDESS. The decrease likely stems from the challenges of real-time processing, including latency and noise, yet the model still performs strongly in practical scenarios. Distil HuBERT therefore emerges as a compelling choice for SER, particularly when accuracy is prioritized over real-time constraints, and its compact size makes it a versatile tool for a wide range of resource-limited applications.
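To make the pipeline concrete, here is a minimal sketch of the core idea: a pre-trained Distil HuBERT encoder feeding a small emotion-classification head. This is an illustration under stated assumptions, not the paper's exact architecture; in particular, the Hugging Face checkpoint name (ntu-spml/distilhubert), the mean-pooling over frames, and the single linear head are choices made here for brevity, and the audio-visual feature fusion is omitted.

```python
# Minimal sketch (assumptions noted above): Distil HuBERT as a speech
# encoder with a linear emotion-classification head on top.
import torch
import torch.nn as nn
from transformers import AutoFeatureExtractor, AutoModel

class DistilHubertSER(nn.Module):
    def __init__(self, num_emotions: int = 8):  # e.g., the 8 RAVDESS classes
        super().__init__()
        # Assumed public checkpoint; swap in the encoder actually used.
        self.encoder = AutoModel.from_pretrained("ntu-spml/distilhubert")
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden, num_emotions)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        # (batch, samples) raw 16 kHz waveform -> frame-level hidden states
        states = self.encoder(input_values).last_hidden_state
        pooled = states.mean(dim=1)      # mean-pool over time frames
        return self.classifier(pooled)   # emotion logits

extractor = AutoFeatureExtractor.from_pretrained("ntu-spml/distilhubert")
model = DistilHubertSER()
waveform = torch.randn(16000)            # 1 s of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000,
                   return_tensors="pt")
logits = model(inputs.input_values)      # shape: (1, num_emotions)
```

Fine-tuning such a head on labeled emotional speech (e.g., RAVDESS or BAVED) is the usual recipe; the compact encoder is what makes the on-device deployment discussed above plausible.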

Keywords

Wav2vec 2.0 / Distil HuBERT / HuBERT / SER / audio and audio-visual features

Cite this article

Karim Dabbabi, Abdelkarim Mars. Self-supervised Learning for Speech Emotion Recognition Task Using Audio-visual Features and Distil Hubert Model on BAVED and RAVDESS Databases. Journal of Systems Science and Systems Engineering: 1–31. DOI: 10.1007/s11518-024-5607-y


