Performance analysis of ASR system in hybrid DNN-HMM framework using a PWL euclidean activation function
Anirban DUTTA, Gudmalwar ASHISHKUMAR, Ch V Rama RAO
Automatic Speech Recognition (ASR) is the process of mapping an acoustic speech signal into a human-readable text format. Traditional systems model the acoustic component of ASR using the Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) approach. Deep Neural Networks (DNNs) open up new possibilities to overcome the shortcomings of conventional statistical algorithms. Recent studies model the acoustic component of the ASR system using a DNN in the so-called hybrid DNN-HMM approach. Among the activation functions used to model the non-linearity in DNNs, Rectified Linear Units (ReLU) and maxout units are the most widely used in ASR systems. This paper concentrates on the acoustic component of a hybrid DNN-HMM system by proposing an efficient activation function for the DNN. Inspired by previous works, a Euclidean norm activation function is proposed to model the non-linearity of the DNN. This non-linearity is shown to belong to the family of Piecewise Linear (PWL) functions, which have distinct features and can capture deep hierarchical features of the pattern. The relevance of the proposal is examined in depth both theoretically and experimentally. The performance of the developed ASR system is evaluated in terms of Phone Error Rate (PER) on the TIMIT database. Experimental results show a relative improvement in performance with the proposed function over conventional activation functions.
deep learning / Euclidean / piecewise linear / speech recognition / activation function
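The abstract does not give the exact formulation of the proposed activation. As a rough illustrative sketch only (not the authors' definition), one way a Euclidean norm non-linearity can be organized is analogous to maxout: the layer's pre-activations are split into small groups, and each group is reduced to a single output, here by its Euclidean (L2) norm rather than by the maximum. The function name and the grouping scheme below are assumptions for illustration:

```python
import math

def euclidean_norm_activation(z, group_size=2):
    """Hypothetical sketch of a norm-pooling activation.

    Like maxout, the pre-activations `z` are partitioned into groups of
    `group_size` linear units; each group is reduced to one output value,
    here the Euclidean (L2) norm of the group.
    """
    if len(z) % group_size != 0:
        raise ValueError("layer width must be divisible by group size")
    return [
        math.sqrt(sum(x * x for x in z[i:i + group_size]))
        for i in range(0, len(z), group_size)
    ]

# A group containing pre-activations 3.0 and 4.0 is pooled to their
# L2 norm, 5.0.
print(euclidean_norm_activation([3.0, 4.0]))  # [5.0]
```

Note that on each region where the signs of the group members are fixed and one member dominates, such a pooled response is bounded between PWL envelopes (e.g., the max and the sum of absolute values), which is the kind of structure the PWL analysis in the paper concerns; the precise claim should be taken from the paper itself.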