
Speech emotion recognition with unsupervised feature learning
Zheng-wei HUANG, Wen-tao XUE, Qi-rong MAO
Front. Inform. Technol. Electron. Eng., 2015, 16(5): 358-366.
Emotion-based features are critical to achieving high performance in a speech emotion recognition (SER) system. In general, such features are difficult to design by hand because the emotional ground truth is ambiguous. In this paper, we apply several unsupervised feature learning algorithms (including K-means clustering, the sparse auto-encoder, and the sparse restricted Boltzmann machine), which show promise for learning task-related features from unlabeled data, to speech emotion recognition. We then evaluate the performance of the proposed approach and present a detailed analysis of two important factors in the model setup: the size of the content window and the number of hidden nodes. Experimental results show that larger content windows and more hidden nodes both contribute to higher performance. We also show that a two-layer network does not clearly improve performance over a single-layer network.
Speech emotion recognition / Unsupervised feature learning / Neural network / Affective computing
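
To make the pipeline concrete, below is a minimal, illustrative sketch of single-layer unsupervised feature learning on speech, in the spirit of the K-means variant named in the abstract. It is not the authors' implementation: the mel-spectrogram input, the function names (extract_windows, kmeans_dictionary, soft_threshold_encode), and the triangle-style encoding are assumptions chosen for brevity. The two factors analyzed in the paper appear as the window_size and n_centroids parameters.

```python
import numpy as np

def extract_windows(spectrogram, window_size):
    """Slice a spectrogram (n_frames x n_bands) into overlapping context
    windows of consecutive frames, each flattened to one vector."""
    n_frames, _ = spectrogram.shape
    return np.asarray([spectrogram[t:t + window_size].ravel()
                       for t in range(n_frames - window_size + 1)])

def kmeans_dictionary(X, n_centroids, n_iter=50, seed=0):
    """Learn a feature dictionary with plain Lloyd's K-means."""
    rng = np.random.default_rng(seed)
    D = X[rng.choice(len(X), size=n_centroids, replace=False)]
    for _ in range(n_iter):
        # Assign every window to its nearest centroid.
        dists = ((X[:, None, :] - D[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (skip empty clusters).
        for k in range(n_centroids):
            members = X[labels == k]
            if len(members) > 0:
                D[k] = members.mean(axis=0)
    return D

def soft_threshold_encode(X, D):
    """'Triangle' encoding: a centroid fires in proportion to how much
    closer a window lies to it than the mean centroid distance."""
    dists = np.sqrt(((X[:, None, :] - D[None, :, :]) ** 2).sum(axis=-1))
    return np.maximum(0.0, dists.mean(axis=1, keepdims=True) - dists)

# Toy usage: 200 frames x 40 mel bands, a 9-frame content window, and
# 64 hidden nodes (centroids) -- the two factors studied in the paper.
spec = np.random.rand(200, 40)
X = extract_windows(spec, window_size=9)
D = kmeans_dictionary(X, n_centroids=64)
codes = soft_threshold_encode(X, D)   # (n_windows, 64) activations
utterance_vec = codes.mean(axis=0)    # mean-pool for a downstream classifier
```

Mean-pooling the encoded windows yields one utterance-level vector that a standard classifier could consume; a sparse auto-encoder or sparse RBM would simply replace kmeans_dictionary and soft_threshold_encode as the dictionary learner and encoder.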