Copyright © 2008 The Institute of Electronics, Information and Communication Engineers
Special Section on Robust Speech Processing in Realistic Environments -- Papers -- ASR under Reverberant Conditions
Robust Speech Recognition by Combining Short-Term and Long-Term Spectrum Based Position-Dependent CMN with Conventional CMN
1 The authors are with Toyohashi University of Technology, Toyohashi-shi, 441–8580 Japan. E-mail: wang{at}slp.ics.tut.ac.jp, 2 The author is with Nagoya University, Nagoya-shi, 464–8603 Japan.
In a distant-talking environment, the channel impulse response is longer than the short-term spectral analysis window. Conventional short-term spectrum based Cepstral Mean Normalization (CMN) is therefore not effective under these conditions. In this paper, we propose a robust speech recognition method that combines a short-term spectrum based CMN with a long-term one. We assume that a static speech segment (such as a vowel) affected by reverberation can be modeled by a long-term cepstral analysis. Thus, the effect of long reverberation on a static speech segment may be compensated by the long-term spectrum based CMN. The cepstral distance between neighboring frames is used to discriminate between static speech segments (long-term spectrum) and non-static speech segments (short-term spectrum). The cepstra of the static and non-static speech segments are normalized by the corresponding cepstral means. In a previous study, we proposed an environmentally robust speech recognition method based on Position-Dependent CMN (PDCMN), which compensates for channel distortion depending on speaker position and is more efficient than conventional CMN. In this paper, the concept of combining short-term and long-term spectrum based CMN is extended to PDCMN. We call this Variable Term spectrum based PDCMN (VT-PDCMN). Because a position-dependent cepstral mean contains the average speaker characteristics over all speakers, PDCMN/VT-PDCMN cannot normalize speaker variations; we therefore also combine PDCMN/VT-PDCMN with conventional CMN in this study. We evaluated the proposed method on limited-vocabulary (100 words) distant-talking isolated word recognition in a real environment. The proposed method achieved a relative error reduction rate of 60.9% over the conventional short-term spectrum based CMN and 30.6% over the short-term spectrum based PDCMN.
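The static/non-static split described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions the abstract does not specify: the distance is taken to be Euclidean, the decision is a fixed `threshold`, the short-term and long-term cepstra are precomputed and aligned frame by frame, and each mean is estimated from the utterance itself rather than from position-dependent training data. The names `static_mask` and `variable_term_cmn` are hypothetical.

```python
import numpy as np

def static_mask(cepstra, threshold):
    """Flag frame t as static when its Euclidean cepstral distance to
    frame t-1 is below `threshold` (assumed criterion, not from the paper).
    Frame 0 has no predecessor, so it reuses the distance of frame 1."""
    cepstra = np.asarray(cepstra, dtype=float)
    if len(cepstra) < 2:
        # Too few frames to measure a distance: treat them as non-static.
        return np.zeros(len(cepstra), dtype=bool)
    dist = np.linalg.norm(np.diff(cepstra, axis=0), axis=1)  # shape (T-1,)
    dist = np.concatenate([dist[:1], dist])                  # shape (T,)
    return dist < threshold

def variable_term_cmn(short_cep, long_cep, threshold):
    """Normalize static frames with the long-term cepstral mean and
    non-static frames with the short-term cepstral mean.

    short_cep, long_cep: arrays of shape (T, D), assumed frame-aligned.
    Returns the normalized cepstra and the boolean static mask."""
    short_cep = np.asarray(short_cep, dtype=float)
    long_cep = np.asarray(long_cep, dtype=float)
    static = static_mask(short_cep, threshold)
    out = np.empty_like(short_cep)
    if static.any():
        # Static segment: long-term cepstra minus their own mean.
        out[static] = long_cep[static] - long_cep[static].mean(axis=0)
    if (~static).any():
        # Non-static segment: short-term cepstra minus their own mean.
        out[~static] = short_cep[~static] - short_cep[~static].mean(axis=0)
    return out, static
```

For the position-dependent variant, the two per-utterance means in this sketch would presumably be replaced by means estimated in advance for the speaker's position; that substitution is left out here because the abstract gives no detail on how those means are obtained.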
Key Words: robust speech recognition, distant-talking environment, CMN, long-term spectrum
Manuscript received July 2, 2007. Manuscript revised September 6, 2007.