
IEICE Transactions on Information and Systems 2008 E91-D(3):393-401; doi:10.1093/ietisy/e91-d.3.393

Copyright © 2008 The Institute of Electronics, Information and Communication Engineers

Special Section on Robust Speech Processing in Realistic Environments -- Papers

Signal Processing Techniques for Robust Speech Recognition

Futoshi ASANO1

1 The author is with AIST, Tsukuba-shi, 305–8568 Japan. E-mail: f.asano{at}aist.go.jp

This paper reviews signal processing techniques that can be applied to automatic speech recognition to improve its robustness. The choice of technique depends strongly on the application scenario. Two examples show how a scenario is analyzed and how suitable signal processing techniques are chosen for it.

Key Words: signal processing, automatic speech recognition, robustness


Manuscript received August 3, 2007. Manuscript revised September 28, 2007.


