Pattern Recognition and Intelligent System Laboratory, Beijing University of Posts and Telecommunications, Beijing 100876, China
anniefangyu@gmail.com
Received: 2011-05-06
Accepted: 2011-07-14
Published: 2011-12-05
Issue Date
Revised Date
2011-12-05
Abstract
To solve the frame delay problem of the decision-directed (DD) approach, whose gain matches the previous frame rather than the current one, Plapous et al. [IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(6): 2098–2108] introduced the two-step noise reduction (TSNR) technique to improve the performance of speech enhancement systems. However, the TSNR approach produces spectral peaks of short duration and breaks spectral outliers, which degrades the spectral characteristics of the speech. To solve this problem, a cepstral smoothing step is added to remove the spectral peaks introduced by the TSNR approach. Theoretical analysis shows that the proposed approach can effectively smooth the spectral peaks while keeping the spectral outliers, thereby protecting the speech characteristics. Experimental results also show that the proposed approach brings significant improvement over the DD and TSNR approaches, especially in non-stationary noisy environments.
Yu FANG, Gang LIU, Jun GUO. Speech enhancement based on modified a priori SNR estimation. Front. Electr. Electron. Eng., 2011, 6(4): 542-546. DOI: 10.1007/s11460-011-0181-8
The development of mobile communication has introduced new challenges. To improve communication quality in mobile environments and achieve communication “anywhere and anytime”, the noisy speech signal should be enhanced before it is transmitted to the far end. Therefore, the suppression of background acoustic noise is becoming more and more important.
Many speech enhancement approaches have been investigated in the past few decades, such as minimum mean square error (MMSE) estimation [1] and spectral subtraction (SS) [2]. In the speech enhancement process, the estimation of the a priori signal-to-noise ratio (SNR) is one of the most important parts, especially in non-stationary environments. The decision-directed (DD) approach is a practical way of estimating the a priori SNR [1] and has been widely used. The DD approach can reduce musical noise, which is a very common artifact of speech enhancement approaches. Some modifications of the DD approach are given in Refs. [3,4]. However, the DD approach relies on the previous frame rather than the current frame and therefore introduces a delay of one frame, which degrades the noise reduction performance. To solve this problem, Ref. [5] proposes a method called two-step noise reduction (TSNR) to refine the estimation of the a priori SNR. However, the musical noise phenomenon still exists, which can lead to speech distortion and affect the perceived quality.
Recently, temporal smoothing in the cepstral domain was found to be a promising approach for speech enhancement, especially in non-stationary environments [6]. In the cepstral domain, noisy speech can be decomposed into coefficients related to the speech envelope, the excitation, and the noise [7]. Applying selective temporal smoothing to these different kinds of coefficients preserves the speech characteristics better than approaches that apply spectral smoothing in the short-time Fourier transform (STFT) domain. In this paper, a modified a priori SNR estimation using cepstral smoothing is proposed for noisy speech enhancement, which can suppress musical noise. Experimental results show that the proposed approach improves performance over the DD and TSNR approaches.
The paper is organized as follows. In Sect. 2, the DD approach, which is the most widely used a priori SNR estimator in state-of-the-art systems, is reviewed. Section 3 describes the TSNR approach and gives a modified a priori SNR estimation. Then, experiments and evaluation are given in Sect. 4. Finally, conclusions are given in Sect. 5.
Review of a priori SNR estimation
Speech enhancement in DFT domain
Consider an additive noise model in which speech and noise are independent:

$$y(n) = x(n) + d(n), \tag{1}$$

where $y(n)$, $x(n)$, and $d(n)$ represent the observed noisy speech, clean speech, and noise, respectively, and are considered to be random processes. The noisy signal is transformed into the frequency domain by applying a Hanning window $h(n)$, $n = 0, 1, \ldots, M-1$, to a frame of $M$ consecutive samples of $y(n)$ and by computing the discrete Fourier transform (DFT) of size $M$ on the windowed data. Before the next DFT computation, the window is shifted by $M/2$ samples. The sliding-window DFT analysis results in a set of frequency-domain signals that can be written as

$$Y(k,l) = X(k,l) + D(k,l), \tag{2}$$

where $l$ denotes the frame index, $k$ denotes the frequency bin index, and $X(k,l)$ and $D(k,l)$ denote the speech and noise in the DFT domain. In this paper, the sampling frequency of the signal is $f_s = 8$ kHz, and the DFT length is $M = 256$.
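The analysis stage described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `stft_frames` and the random test signal are assumptions, and `np.fft.rfft` is used to keep the one-sided spectrum (bins 0 to M/2) of each real-valued frame rather than all M bins.

```python
import numpy as np

def stft_frames(y, M=256):
    """Sliding-window DFT: Hanning window of length M, shifted by M/2 samples."""
    hop = M // 2
    window = np.hanning(M)
    n_frames = 1 + (len(y) - M) // hop
    # Rows are frames l, columns are frequency bins k = 0..M/2.
    Y = np.empty((n_frames, M // 2 + 1), dtype=complex)
    for l in range(n_frames):
        frame = y[l * hop : l * hop + M] * window
        Y[l] = np.fft.rfft(frame)  # one-sided spectrum of a real frame
    return Y

rng = np.random.default_rng(0)
y = rng.standard_normal(1024)  # placeholder signal, not real speech
Y = stft_frames(y)
print(Y.shape)  # (7, 129)
```

Because windowing and the DFT are linear, the additive model carries over from the time domain to every time-frequency bin, which is what the frequency-domain relation above expresses.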
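It is useful to consider the amplitude estimator $|\hat{X}(k,l)|$ as being obtained from $|Y(k,l)|$ by a multiplicative nonlinear gain function, which is defined by

$$|\hat{X}(k,l)| = G(k,l)\,|Y(k,l)|, \tag{3}$$

where $G(k,l)$ is the gain function. Generally, the a posteriori SNR and the a priori SNR are defined as follows:

$$\gamma(k,l) = \frac{|Y(k,l)|^2}{\lambda_d(k,l)}, \tag{4}$$

$$\xi(k,l) = \frac{\lambda_x(k,l)}{\lambda_d(k,l)}, \tag{5}$$

where $Y(k,l)$ denotes the observed signal in the DFT domain, and $\lambda_x(k,l) = E\{|X(k,l)|^2\}$ and $\lambda_d(k,l) = E\{|D(k,l)|^2\}$ denote the speech and noise variances. In this paper, the gain function is the Wiener filter, similar to [1]:

$$G(k,l) = \frac{\hat{\xi}(k,l)}{1 + \hat{\xi}(k,l)}, \tag{6}$$

where $\hat{\xi}(k,l)$ denotes the estimate of the a priori SNR.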
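To make the role of the gain function concrete, the short sketch below evaluates the a posteriori SNR and the Wiener gain for a single time-frequency bin. The function names and the sample SNR values are illustrative assumptions, not part of the paper.

```python
def a_posteriori_snr(Y_mag2, noise_var):
    """Ratio of the observed power |Y|^2 to the noise variance for one bin."""
    return Y_mag2 / noise_var

def wiener_gain(xi):
    """Wiener gain xi / (1 + xi) for an a priori SNR estimate on a linear scale."""
    return xi / (1.0 + xi)

# Gain as a function of the a priori SNR expressed in dB (sample values only).
for xi_db in (-10, 0, 10, 20):
    xi = 10 ** (xi_db / 10)
    print(xi_db, round(wiener_gain(xi), 3))
```

At an a priori SNR of 0 dB the gain is exactly 0.5. It approaches 1 as the SNR grows and 0 as the SNR falls, so errors in the SNR estimate translate directly into over- or under-suppression.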
DD approach
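The DD approach [1], which has been found to be very useful when combined with either the MMSE or the Wiener amplitude estimator, is described as follows. The derivation of the a priori SNR estimator is based on the definition of $\xi(k,l)$ (see Eq. (5)) and its relation to the a posteriori SNR $\gamma(k,l)$,

$$\xi(k,l) = E\{\gamma(k,l) - 1\}. \tag{7}$$

Using Eqs. (5) and (7), we can write

$$\xi(k,l) = E\left\{\frac{1}{2}\,\frac{|X(k,l)|^2}{\lambda_d(k,l)} + \frac{1}{2}\,[\gamma(k,l) - 1]\right\}, \tag{8}$$

where $\lambda_d(k,l)$ is the noise variance. The proposed estimator of $\xi(k,l)$ is deduced from Eq. (8) and is given by

$$\hat{\xi}_{\mathrm{DD}}(k,l) = \beta\,\frac{|\hat{X}(k,l-1)|^2}{\lambda_d(k,l-1)} + (1-\beta)\,P[\gamma(k,l) - 1], \tag{9}$$

where $|\hat{X}(k,l-1)|$ is the amplitude estimator of the $k$th signal spectral component in the $(l-1)$th analysis frame; $\beta$ is usually set to 0.98, which results in a great reduction of the noise and provides enhanced speech with colorless residual noise. $P[\cdot]$ is an operator defined by

$$P[x] = \begin{cases} x, & x \geq 0, \\ 0, & \text{otherwise}. \end{cases} \tag{10}$$

$P[\cdot]$ is used to ensure the positiveness of the proposed estimator in case $\gamma(k,l) - 1$ is negative. It is also possible to apply the operator $P$ on the right-hand side of Eq. (9) rather than on $\gamma(k,l) - 1$ only. Equation (9) can also be written in the following way:

$$\hat{\xi}_{\mathrm{DD}}(k,l) = \max\left\{\beta\,\frac{|\hat{X}(k,l-1)|^2}{\lambda_d(k,l-1)} + (1-\beta)\,[\gamma(k,l) - 1],\ \xi_{\min}\right\}, \tag{11}$$

where $\xi_{\min}$ is a minimum value that is used to control the distortion of speech transients.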
Modified a priori SNR estimation
TSNR approach
Some analysis of DD approach has been given in Refs. [5,8], and the results indicated that DD approach can limit the level of musical noise, but the estimated a priori SNR is biased since it depends on the speech spectrum estimation in the previous frame. Therefore, the gain function matches the previous frame rather than the current one, which degrades the noise reduction performance. Reference [5] proposed a method called TSNR technique, which can solve this problem while maintaining the benefits of DD approach. TSNR can be divided into the following steps.
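In the first step, the gain function is computed using the DD approach described in Sect. 2, that is,

$$G_{\mathrm{DD}}(k,l) = \frac{\hat{\xi}_{\mathrm{DD}}(k,l)}{1 + \hat{\xi}_{\mathrm{DD}}(k,l)}. \tag{12}$$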
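In the second step, the estimation of the a priori SNR is refined to remove the bias of the DD approach, thus removing the reverberation effect. The estimation of the a priori SNR using TSNR is obtained as follows:

$$\hat{\xi}_{\mathrm{TSNR}}(k,l) = \frac{|G_{\mathrm{DD}}(k,l)\,Y(k,l)|^2}{\lambda_d(k,l)}, \tag{13}$$

with $\lambda_d(k,l)$ the noise variance. Without loss of generality, in the following, the chosen spectral gain is also the Wiener filter:

$$G_{\mathrm{TSNR}}(k,l) = \frac{\hat{\xi}_{\mathrm{TSNR}}(k,l)}{1 + \hat{\xi}_{\mathrm{TSNR}}(k,l)}. \tag{14}$$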
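The two steps can be sketched for a single frame as shown below. The function name `tsnr_gain` and its arguments are hypothetical, and the first-step a priori SNR is taken as given (for example from a decision-directed estimator).

```python
def tsnr_gain(xi_dd, Y_mag2, noise_var):
    """Two-step refinement for one frame, one value per frequency bin.

    xi_dd:     first-step (decision-directed) a priori SNR estimates
    Y_mag2:    |Y(k,l)|^2 of the current frame
    noise_var: noise variance per bin
    """
    gains = []
    for xi, y2, lam in zip(xi_dd, Y_mag2, noise_var):
        g_dd = xi / (1.0 + xi)                    # step 1: Wiener gain from the DD estimate
        xi_tsnr = (g_dd ** 2) * y2 / lam          # step 2: refined SNR from the current frame
        gains.append(xi_tsnr / (1.0 + xi_tsnr))   # final Wiener gain
    return gains

# One bin: first-step SNR of 1 (0 dB), observed power 8 times the noise variance.
print(tsnr_gain([1.0], [8.0], [1.0]))
```

In this example the first-step gain is 0.5, the refined a priori SNR is 0.25 × 8 = 2, and the final gain is 2/3. The refined estimate uses the current frame's observation directly, which is how the second step removes the one-frame bias.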
Modified approach
The DD and TSNR approaches are both short-time spectrum estimation approaches. In such approaches, the estimation of the a priori SNR usually produces spectral peaks of short duration, which destroy the spectral outliers and thus degrade both the performance of the speech enhancement system and the speech characteristics. In the cepstral domain, the coefficients can be divided into different kinds related to the speech envelope, the excitation, and the noise. The speech envelope is always represented by the same small set of cepstral coefficients. The coefficients that represent the excitation can be found by searching for a cepstral peak in a defined range [7]. The remaining coefficients are dominated by noise. The speech characteristics and the noise can also be represented by the a priori SNR function in the cepstral domain. Applying selective cepstral smoothing to the noise-dominated and speech-dominated coefficients of the a priori SNR estimated by the DD and TSNR approaches can noticeably reduce the artificial noise. The block diagram of the modified noise reduction system is depicted in Fig. 1. It is described in detail as follows.
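At the end of the first step of the TSNR approach, a cepstral representation of $\hat{\xi}_{\mathrm{DD}}(k,l)$ is calculated for each frame $l$ as

$$\xi^{c}_{\mathrm{DD}}(q,l) = \mathrm{IDFT}\{\ln \hat{\xi}_{\mathrm{DD}}(k,l)\}, \tag{15}$$

where IDFT{·} is the inverse DFT of length $M$, resulting in cepstral bins $q = 0, 1, \ldots, M-1$. Note that the symmetry condition $\xi^{c}_{\mathrm{DD}}(q,l) = \xi^{c}_{\mathrm{DD}}(M-q,l)$ holds.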
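Define $q_{\mathrm{pitch}}(l)$ as the cepstral index that most likely represents the fundamental frequency $f_0$. It is found via a maximum search for a given frame $l$ [6],

$$q_{\mathrm{pitch}}(l) = \arg\max_{q}\left\{\xi^{c}_{\mathrm{DD}}(q,l) \;\middle|\; q_{\mathrm{low}} \le q \le q_{\mathrm{high}}\right\}, \tag{16}$$

where the search is limited to possible fundamental frequencies between $f_{0,\mathrm{low}}$ and $f_{0,\mathrm{high}}$, resulting in the range of $q_{\mathrm{low}} = \lfloor f_s / f_{0,\mathrm{high}} \rfloor$ to $q_{\mathrm{high}} = \lfloor f_s / f_{0,\mathrm{low}} \rfloor$, with the sampling frequency $f_s$ and the floor operator $\lfloor\cdot\rfloor$, which rounds down to the nearest integer.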
The detection of speech sounds is based on the following criteria [6]. First, a voiced speech sound has comparatively high energy, so the cepstral peak value should be above a certain threshold. Second, speech has more energy at low frequencies since it is spectrally tilted. Therefore, the range of cepstral bins that are most likely to represent the fundamental frequency is

$$\mathcal{Q}(l) = \left\{q : |q - q_{\mathrm{pitch}}(l)| \le \Delta q\right\}, \tag{17}$$

where $\Delta q$ is a small margin.
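The maximum search and the selection of bins around the peak can be sketched as follows. The function names are hypothetical, and the pitch limits of 70 Hz and 500 Hz, the threshold, and the margin in the usage example are assumed values for illustration, not the paper's settings. Only the peak-threshold criterion is implemented; the spectral-tilt criterion is left out of this sketch.

```python
import math

def pitch_bin_range(fs, f0_low, f0_high):
    """Cepstral search range for fundamental frequencies between f0_low and f0_high (Hz)."""
    return math.floor(fs / f0_high), math.floor(fs / f0_low)

def protected_bins(cepstrum, fs, f0_low, f0_high, threshold, margin):
    """Bins around the cepstral peak, or an empty set if the peak is too weak."""
    q_low, q_high = pitch_bin_range(fs, f0_low, f0_high)
    q_pitch = max(range(q_low, q_high + 1), key=lambda q: cepstrum[q])
    if cepstrum[q_pitch] <= threshold:
        return set()  # frame treated as not voiced
    return set(range(q_pitch - margin, q_pitch + margin + 1))

# Synthetic cepstrum with a single peak at bin 80 (a pitch of 100 Hz at fs = 8 kHz).
cepstrum = [0.0] * 256
cepstrum[80] = 1.0
print(pitch_bin_range(8000, 70, 500))  # (16, 114)
print(sorted(protected_bins(cepstrum, 8000, 70, 500, threshold=0.2, margin=2)))  # [78, 79, 80, 81, 82]
```

Returning an empty set for weak peaks means that, in frames without a clear pitch, every bin is eligible for smoothing.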
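Because of this symmetry, all modifications applied to the cepstral bins $q$ are applied accordingly to the symmetric counterparts $M - q$. Then, smoothing is applied in the cepstral domain:

$$\bar{\xi}^{c}_{\mathrm{DD}}(q,l) = \beta_c(q)\,\bar{\xi}^{c}_{\mathrm{DD}}(q,l-1) + \left(1 - \beta_c(q)\right)\xi^{c}_{\mathrm{DD}}(q,l), \tag{18}$$

where $\beta_c(q)$ is the smoothing factor. In addition, the cepstral bins that are likely to represent the fundamental frequency should not be smoothed, which means that the smoothing is applied to cepstral bins $q \notin \mathcal{Q}(l)$, and for $q \in \mathcal{Q}(l)$, the bins around the detected pitch peak, no smoothing is applied at all.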
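A smoothed estimate of the DD a priori SNR is finally obtained by transforming back to the spectral domain, that is,

$$\bar{\xi}_{\mathrm{DD}}(k,l) = \exp\left(\mathrm{DFT}\left\{\bar{\xi}^{c}_{\mathrm{DD}}(q,l)\right\}\right),$$

where $\bar{\xi}_{\mathrm{DD}}(k,l)$ is additionally constrained to values below or equal to one.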
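The chain of log, inverse DFT, recursive smoothing, DFT, and exponential can be sketched with NumPy as below. This is a sketch under stated assumptions: the function name `cepstral_smooth`, the array layout (full length-M spectra that are symmetric in the frequency index), and the per-bin smoothing factors are illustrative, and the upper bound on the result mentioned above is not applied.

```python
import numpy as np

def cepstral_smooth(xi_frames, beta, protected=None):
    """Recursive temporal smoothing of the log a priori SNR in the cepstral domain.

    xi_frames: (L, M) array of positive SNR values on the full DFT grid,
               symmetric in k so that the cepstrum is real and even
    beta:      length-M smoothing factors in [0, 1), assumed symmetric in q
    protected: optional per-frame collections of bins that are left unsmoothed
    """
    L, M = xi_frames.shape
    smoothed = np.empty_like(xi_frames, dtype=float)
    c_prev = None
    for l in range(L):
        c = np.fft.ifft(np.log(xi_frames[l])).real       # cepstral representation
        b = np.array(beta, dtype=float)                  # fresh copy for this frame
        if protected is not None:
            for q in protected[l]:
                b[q] = 0.0
                b[(M - q) % M] = 0.0                     # symmetric counterpart
        c_bar = c if c_prev is None else b * c_prev + (1.0 - b) * c
        smoothed[l] = np.exp(np.fft.fft(c_bar).real)     # back to the spectral domain
        c_prev = c_bar
    return smoothed

# Two synthetic frames with M = 16, built to be symmetric in k.
rng = np.random.default_rng(1)
half = rng.uniform(0.1, 10.0, size=(2, 9))
xi = np.concatenate([half, half[:, -2:0:-1]], axis=1)
out = cepstral_smooth(xi, np.full(16, 0.5))
print(np.allclose(out[1], np.sqrt(xi[0] * xi[1])))  # True
```

With a constant factor of 0.5, the second output frame is the geometric mean of the two input frames, since a weighted average of cepstra is a weighted average of log spectra. Setting the factor to zero for a bin disables smoothing for that bin, which is how the pitch-related bins are protected.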
Then, $\bar{\xi}_{\mathrm{DD}}(k,l)$ is used to calculate the a priori SNR of the TSNR approach by Eqs. (12) and (13), and the a priori SNR estimate of TSNR is obtained, represented by $\hat{\xi}_{\mathrm{TSNR}}(k,l)$. Applying cepstral smoothing again to $\hat{\xi}_{\mathrm{TSNR}}(k,l)$ by Eqs. (15)-(18), $\bar{\xi}_{\mathrm{TSNR}}(k,l)$ is obtained. In the end, the gain of TSNR is obtained through Eq. (14).
Experiments and evaluation
In this section, the performance of the proposed approach is tested for speech enhancement, compared to that of DD approach and TSNR approach. The test consisted of ten speech utterances, five male and five female, from the TIMIT database [9], resampled at 8 kHz. The noise types considered were white Gaussian noise, speech babble noise, and HF channel noise (from NOISEX-92) [10].
The parameters used in the proposed approach are listed as follows: , , , , and .
Table 1 shows the segmental SNR improvements for various noise types, in dB, where the segmental SNR is defined by

$$\mathrm{SegSNR} = \frac{10}{M}\sum_{m=0}^{M-1}\log_{10}\frac{\sum_{l=Nm}^{Nm+N-1} x^2(l)}{\sum_{l=Nm}^{Nm+N-1}\left[x(l) - \hat{x}(l)\right]^2},$$

where $x(l)$ and $\hat{x}(l)$ are the clean and enhanced speech signals, $M$ is the number of frames that contain active speech (not the DFT length of Sect. 2), $N$ is the number of samples per frame (corresponding to 32 ms frames), and $l$ is a discrete-time index. The SNR at each frame is limited to a perceptually meaningful range between -10 dB and 35 dB. This prevents the segmental SNR measure from being biased in either a positive or negative direction by a few silent or unusually high-SNR frames that do not contribute significantly to the overall speech quality. For each noise type and SNR value, the average segmental SNR is obtained by averaging the segmental SNRs of the ten sentences [11].
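A minimal sketch of the measure, using only the standard library, is given below. The function name and signature are assumptions, and every frame with non-zero clean-speech energy is counted as active, which is a crude stand-in for a real speech activity decision.

```python
import math

def segmental_snr(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0):
    """Average per-frame SNR in dB, with each frame clamped to [lo, hi]."""
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        x = clean[start:start + frame_len]
        x_hat = enhanced[start:start + frame_len]
        sig = sum(a * a for a in x)
        err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        if sig == 0.0:
            continue  # skip frames without clean-speech energy (assumed activity rule)
        snr = hi if err == 0.0 else 10.0 * math.log10(sig / err)
        snrs.append(min(max(snr, lo), hi))
    if not snrs:
        raise ValueError("no active frames found")
    return sum(snrs) / len(snrs)

# Toy check: an estimate that is 10% too small leaves an error 20 dB below the signal.
clean = [math.sin(0.1 * n) for n in range(1024)]
scaled = [0.9 * v for v in clean]
print(round(segmental_snr(clean, scaled), 2))  # 20.0
```

An improvement figure such as those reported in Table 1 would then presumably be the difference between this measure for the enhanced signal and for the unprocessed noisy signal, both computed against the same clean reference.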
As can be seen in Table 1, the modified approach achieves a higher average output segmental SNR than the DD and TSNR approaches. The improvement in noise reduction is obtained in both stationary and non-stationary environments.
Figure 2 shows the magnitude of the clean speech, of the noisy speech signal corrupted by HF channel noise at 5 dB, and of the noisy speech enhanced by the DD, TSNR, and proposed approaches. Figure 2 indicates that the proposed approach preserves the speech onsets and low-energy speech components better than the DD and TSNR approaches. Informal subjective listening also reveals that signals enhanced by the proposed approach sound clearer, which is consistent with the analysis above that the spectral outliers are kept.
Conclusions
A modified a priori SNR estimation approach for speech enhancement is proposed in this paper. It applies cepstral smoothing to the TSNR approach and thus avoids the disadvantages of TSNR by reducing the artificial musical noise. Simulation results show that the proposed algorithm performs well for speech enhancement in different kinds of noisy environments, especially in non-stationary environments.
References
[1] Ephraim Y, Malah D. Speech enhancement using a minimum mean square error short time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984, 32(6): 1109-1121
[2] Boll S F. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1979, 27(2): 113-120
[3] Cohen I. Relaxed statistical model for speech enhancement and a priori SNR estimation. IEEE Transactions on Speech and Audio Processing, 2005, 13(5): 870-881
[4] Cohen I. Speech enhancement using a noncausal a priori SNR estimator. IEEE Signal Processing Letters, 2004, 11(9): 725-728
[5] Plapous C, Marro C, Scalart P. Improved signal-to-noise ratio estimation for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(6): 2098-2108
[6] Mauler D, Gerkmann T, Martin R. An analysis of quefrency selective temporal smoothing of the cepstrum in speech enhancement. In: Proceedings of the 11th International Workshop on Acoustic Echo and Noise Control. 2008, 1-4
[7] Noll A M. Cepstrum pitch estimation. Journal of the Acoustical Society of America, 1967, 41(2): 293-309
[8] Cappe O. Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor. IEEE Transactions on Speech and Audio Processing, 1994, 2(2): 345-349
[9] Garofolo J S, Lamel L F, Fisher W M, Fiscus J G, Pallett D S, Dahlgren N L, Zue V. DARPA TIMIT Acoustic-phonetic continuous speech corpus. NIST Speech Disc1-1.1, 1993
[10] Varga A, Steeneken H J M, Tomlinson M, Jones D. The NOISEX-92 study on the effect of additive noise on automatic speech recognition. The NOISEX-92 CD-ROMs, 1992
[11] Deller J R Jr, Hansen J H L, Proakis J G. Discrete-Time Processing of Speech Signals. 2nd ed. New York: IEEE Press, 2000
RIGHTS & PERMISSIONS
Higher Education Press and Springer-Verlag Berlin Heidelberg