Speech emotion recognition with unsupervised feature learning

Zheng-wei HUANG; Wen-tao XUE; Qi-rong MAO

doi:10.1631/FITEE.1400323

Front. Inform. Technol. Electron. Eng, 2015, Vol. 16, Issue 5: 358-366. DOI: 10.1631/FITEE.1400323

Abstract

Emotion-based features are critical for achieving high performance in a speech emotion recognition (SER) system. In general, it is difficult to develop these features because the ground truth is ambiguous. In this paper, we apply several unsupervised feature learning algorithms (K-means clustering, the sparse auto-encoder, and sparse restricted Boltzmann machines), which show promise for learning task-related features from unlabeled data, to speech emotion recognition. We then evaluate the performance of the proposed approach and present a detailed analysis of the effect of two important factors in the model setup: the content window size and the number of hidden layer nodes. Experimental results show that larger content windows and more hidden nodes contribute to higher performance. We also show that a two-layer network yields no clear performance improvement over a single-layer network.
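As a rough illustration of the kind of pipeline the abstract describes, the sketch below learns K-means features from unlabeled frames stacked into content windows, then encodes and pools them into one vector per utterance. It is not the authors' implementation: the synthetic 13-dimensional frames, the "triangle" soft-assignment encoding, the average pooling, and all function names and parameter values are assumptions chosen for illustration. Here `window` plays the role of the content window size and `k` the number of hidden nodes.

```python
import numpy as np

def stack_context(frames, window):
    """Concatenate `window` consecutive frames: (T, D) -> (T - window + 1, window * D)."""
    T, _ = frames.shape
    return np.stack([frames[t:t + window].ravel() for t in range(T - window + 1)])

def kmeans(X, k, iters=10, seed=0):
    """Plain Lloyd's algorithm; returns k centroids learned without labels."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every sample to every centroid: (N, k)
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return centroids

def encode(X, centroids):
    """Assumed 'triangle' encoding: max(0, mean distance - distance to centroid)."""
    d = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))
    return np.maximum(0.0, d.mean(axis=1, keepdims=True) - d)

def utterance_features(frames, centroids, window):
    """Encode every content window, then average-pool over time (assumed pooling)."""
    return encode(stack_context(frames, window), centroids).mean(axis=0)

# Synthetic stand-in for unlabeled acoustic frames (hypothetical data, not a real corpus).
rng = np.random.default_rng(0)
unlabeled = rng.normal(size=(200, 13))
window, k = 5, 16  # content window size and number of hidden nodes (illustrative values)

centroids = kmeans(stack_context(unlabeled, window), k)
feat = utterance_features(rng.normal(size=(40, 13)), centroids, window)
```

The resulting `feat` vector has one entry per centroid and could be passed to any supervised classifier trained on labeled utterances. Varying `window` and `k` in such a sketch corresponds to the two factors the abstract analyzes.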

Keywords

Speech emotion recognition / Unsupervised feature learning / Neural network / Affect computing

Cite this article

Zheng-wei HUANG, Wen-tao XUE, Qi-rong MAO. Speech emotion recognition with unsupervised feature learning. Front. Inform. Technol. Electron. Eng, 2015, 16(5): 358‒366 https://doi.org/10.1631/FITEE.1400323
