IEICE Transactions on Information and Systems 2008 E91-D(3):457-466; doi:10.1093/ietisy/e91-d.3.457

Copyright © 2008 The Institute of Electronics, Information and Communication Engineers

Special Section on Robust Speech Processing in Realistic Environments -- Papers -- ASR under Reverberant Conditions

Robust Speech Recognition by Combining Short-Term and Long-Term Spectrum Based Position-Dependent CMN with Conventional CMN

Longbiao WANG1, Seiichi NAKAGAWA1 and Norihide KITAOKA2

1 The authors are with Toyohashi University of Technology, Toyohashi-shi, 441–8580 Japan. E-mail: wang{at}slp.ics.tut.ac.jp
2 The author is with Nagoya University, Nagoya-shi, 464–8603 Japan.

In a distant-talking environment, the channel impulse response is longer than the short-term spectral analysis window, so conventional short-term spectrum based Cepstral Mean Normalization (CMN) is not effective. In this paper, we propose a robust speech recognition method that combines short-term spectrum based CMN with long-term spectrum based CMN. We assume that a static speech segment (such as a vowel) affected by reverberation can be modeled by long-term cepstral analysis, so that the effect of long reverberation on a static speech segment may be compensated by long-term spectrum based CMN. The cepstral distance between neighboring frames is used to discriminate static speech segments (long-term spectrum) from non-static speech segments (short-term spectrum), and the cepstra of the static and non-static segments are normalized by their corresponding cepstral means. In a previous study, we proposed an environmentally robust speech recognition method based on Position-Dependent CMN (PDCMN), which compensates for channel distortion depending on speaker position and is more efficient than conventional CMN. In this paper, the concept of combining short-term and long-term spectrum based CMN is extended to PDCMN. We call this Variable Term spectrum based PDCMN (VT-PDCMN). Because a position-dependent cepstral mean contains the average speaker characteristics over all speakers, PDCMN/VT-PDCMN cannot normalize speaker variations; we therefore also combine PDCMN/VT-PDCMN with conventional CMN in this study. We evaluated the proposed method on limited-vocabulary (100 words) distant-talking isolated word recognition in a real environment. The proposed method achieved a relative error reduction rate of 60.9% over conventional short-term spectrum based CMN and 30.6% over short-term spectrum based PDCMN.
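As a rough illustration of the normalization idea in the abstract, the following Python sketch labels each frame as static or non-static from the cepstral distance to its neighboring frame and subtracts a separate cepstral mean for each class. It is a minimal sketch under stated assumptions, not the authors' implementation: the Euclidean distance, the `threshold` value, the rule for labeling the first frame, and all function names are hypothetical, and the long-term cepstral analysis of static segments (re-analysis with a longer window) is not reproduced here. Only the split into two classes and the per-class mean subtraction are shown.

```python
import math


def cepstral_mean(frames):
    """Per-coefficient mean over a list of cepstral vectors."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def cmn(frames):
    """Conventional CMN: subtract the utterance-level cepstral mean."""
    mean = cepstral_mean(frames)
    return [[c - m for c, m in zip(f, mean)] for f in frames]


def label_static(frames, threshold):
    """Label frame t as static when its distance to frame t-1 is small.

    Assumptions (not from the paper): Euclidean distance, a fixed
    threshold, and frame 0 inheriting the label of frame 1.
    """
    labels = [False] * len(frames)
    for t in range(1, len(frames)):
        labels[t] = math.dist(frames[t], frames[t - 1]) < threshold
    if len(frames) > 1:
        labels[0] = labels[1]
    return labels


def variable_term_cmn(frames, threshold):
    """Normalize static and non-static frames by their own cepstral means."""
    labels = label_static(frames, threshold)
    out = [None] * len(frames)
    for flag in (True, False):
        idx = [i for i, is_static in enumerate(labels) if is_static == flag]
        if not idx:
            continue  # this class does not occur in the utterance
        mean = cepstral_mean([frames[i] for i in idx])
        for i in idx:
            out[i] = [c - m for c, m in zip(frames[i], mean)]
    return out, labels


# Toy 2-dimensional "cepstra": two nearly identical frames, then two that jump.
frames = [[1.0, 2.0], [1.1, 2.0], [5.0, 0.0], [9.0, 4.0]]
normalized, labels = variable_term_cmn(frames, threshold=0.5)
baseline = cmn(frames)
```

In this toy input the first two frames fall into the static class and the last two into the non-static class, so each pair is centered on its own mean, whereas `cmn` centers all four frames on a single utterance-level mean. A position-dependent variant would replace the utterance-derived means with means estimated in advance for each speaker position; that step is omitted here.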

Key Words: robust speech recognition, distant-talking environment, CMN, long-term spectrum


Manuscript received July 2, 2007. Manuscript revised September 6, 2007.
