Self-supervised Learning for Speech Emotion Recognition Task Using Audio-visual Features and Distil Hubert Model on BAVED and RAVDESS Databases
Karim Dabbabi, Abdelkarim Mars
Journal of Systems Science and Systems Engineering: 1–31.
Pre-trained models such as Distil HuBERT excel at uncovering hidden patterns and enable accurate recognition across diverse data types, including audio and visual information. We harness this capability to develop a deep learning model that uses Distil HuBERT to jointly learn combined audio-visual features for speech emotion recognition (SER). Our experiments highlight its distinct advantages: it significantly outperforms Wav2vec 2.0 in both offline and real-time accuracy on the RAVDESS and BAVED datasets. Although it trails HuBERT slightly in offline accuracy, Distil HuBERT delivers comparable performance at a fraction of the model size, making it well suited to resource-constrained environments such as mobile devices. This reduced size comes with only a modest trade-off in accuracy: in offline evaluation, Distil HuBERT achieved 96.33% on the BAVED database and 87.01% on the RAVDESS database. In real-time evaluation, accuracy decreased to 79.3% on BAVED and 77.87% on RAVDESS, a drop most likely caused by the challenges of real-time processing, such as latency and noise, yet the model still performs strongly in practical scenarios. Distil HuBERT therefore emerges as a compelling choice for SER, particularly when accuracy is prioritized over real-time constraints, and its compact size makes it a versatile tool for a wide range of resource-limited applications.
Keywords: Wav2vec 2.0 / Distil HuBERT / HuBERT / SER / audio and audio-visual features
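To make the approach concrete, the sketch below shows one plausible way to attach an emotion classifier to a DistilHuBERT encoder with the HuggingFace Transformers library. It is a minimal illustration of the audio side only, not the authors' exact audio-visual pipeline; the checkpoint name (ntu-spml/distilhubert), the number of emotion classes, and the mean-pooling head are assumptions made for the example.

```python
# Minimal sketch (not the paper's exact architecture): DistilHuBERT encoder
# with a mean-pooling classification head for speech emotion recognition.
# Checkpoint name, class count, and pooling strategy are assumptions.
import torch
import torch.nn as nn
from transformers import AutoFeatureExtractor, AutoModel

CHECKPOINT = "ntu-spml/distilhubert"   # assumed public DistilHuBERT checkpoint
NUM_EMOTIONS = 8                       # e.g., the eight RAVDESS emotion labels

feature_extractor = AutoFeatureExtractor.from_pretrained(CHECKPOINT)
backbone = AutoModel.from_pretrained(CHECKPOINT)

class DistilHubertSER(nn.Module):
    """DistilHuBERT encoder followed by mean pooling and a linear classifier."""
    def __init__(self, backbone, num_classes):
        super().__init__()
        self.backbone = backbone
        self.classifier = nn.Linear(backbone.config.hidden_size, num_classes)

    def forward(self, input_values):
        hidden = self.backbone(input_values).last_hidden_state  # (batch, frames, hidden)
        pooled = hidden.mean(dim=1)                              # utterance-level embedding
        return self.classifier(pooled)                           # emotion logits

# Usage: one second of 16 kHz audio -> emotion logits
model = DistilHubertSER(backbone, NUM_EMOTIONS)
waveform = torch.randn(16000)
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
logits = model(inputs.input_values)
print(logits.shape)  # torch.Size([1, 8])
```

In a full audio-visual setup of the kind the abstract describes, visual features would be extracted separately and fused with the pooled speech embedding before classification; the fusion strategy is not specified here and would need to follow the paper.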