IEICE Transactions on Information and Systems 2006 E89-D(3):1006-1014; doi:10.1093/ietisy/e89-d.3.1006

Copyright © 2006 The Institute of Electronics, Information and Communication Engineers

Special Section on Statistical Modeling for Speech Processing -- Papers -- Speech Recognition

Production-Oriented Models for Speech Recognition

Erik MCDERMOTT and Atsushi NAKAMURA

The authors are with NTT Communication Science Laboratories, NTT Corporation, Kyoto-fu, 619–0237 Japan. E-mail: mcd{at}cslab.kecl.ntt.co.jp, E-mail: ats{at}cslab.kecl.ntt.co.jp

Acoustic modeling in speech recognition uses very little knowledge of the speech production process. At many levels, our models continue to model speech as a surface phenomenon. Typically, hidden Markov model (HMM) parameters operate primarily in the acoustic space or in a linear transformation thereof; state-to-state evolution is modeled only crudely, with no explicit relationship between states, such as would be afforded by the use of phonetic features commonly used by linguists to describe speech phenomena, or by the continuity and smoothness of the production parameters governing speech. This survey article attempts to provide an overview of proposals by several researchers for improving acoustic modeling in these regards. Topics covered include the controversial Motor Theory of Speech Perception, work by Hogden explicitly using a continuity constraint in a pseudo-articulatory domain, the Kalman-filter-based Hidden Dynamic Model, and work by many groups showing the benefits of using articulatory features instead of phones as the underlying units of speech.

Key Words: speech recognition, speech production, articulatory modeling, linear dynamical systems


Manuscript received July 11, 2005. Manuscript revised October 6, 2005.

References

[1] P. Ladefoged, A Course in Phonetics, Harcourt Brace, third ed., 1993.

[2] G. Fant, Acoustic theory of speech production, Mouton, Hague, 1960.

[3] K. Stevens, Acoustic Phonetics, MIT Press, 1998.

[4] C. Fowler and L.D. Rosenblum, "The perception of phonetic gestures," in Modularity and the Motor Theory of Speech Perception, ed. I. Mattingly and M. Studdert-Kennedy, chapter 3, Lawrence Erlbaum Associates, 1991.

[5] C. Browman and L. Goldstein, "Gestural structures: Distinctiveness, phonological processes, and historical change," in Modularity and the Motor Theory of Speech Perception, ed. I. Mattingly and M. Studdert-Kennedy, chapter 13, Lawrence Erlbaum Associates, 1991.

[6] P. Rubin and E. Vatikiotis-Bateson, "Measuring and modeling speech production in humans," in Animal Acoustic Communication: Recent Technical Advances, ed. S.L. Hopp and C.S. Evans, pp.251–290, Springer-Verlag, New York, 1998.

[7] A.M. Liberman and I.G. Mattingly, "The motor theory of speech perception revised," Cognition, vol.21, pp.1–36, 1985.

[8] D. Callan, A. Callan, K. Honda, and S. Masaki, "Single-sweep EEG analysis of neural processes underlying perception and production of vowels," Cognitive Brain Research, vol.10, pp.173–176, 2000.

[9] D. Callan, J. Jones, K. Munhall, A. Callan, C. Kroos, and E. Vatikiotis-Bateson, "Neural processes underlying perceptual enhancement by visual speech gestures," Journal of Cognitive Neuroscience and Neuropsychology, vol.14, no.17, pp.2213–2218, 2003.

[10] D. Callan, K. Tajima, A. Callan, R. Kubo, and S. Masaki, "Learning-induced neural plasticity associated with improved identification performance after training of a difficult second-language phonetic contrast," NeuroImage, vol.19, pp.113–124, 2003.

[11] F. Metze and A. Waibel, "A flexible stream architecture for ASR using articulatory features," International Conference on Spoken Language Processing, pp.2133–2136, 2002.

[12] K. Kirchhoff, "Combining articulatory and acoustic information for speech recognition in noisy and reverberant environments," International Conference on Spoken Language Processing, vol.3, pp.891–894, 1998.

[13] K. Kirchhoff, Robust speech recognition using articulatory information, PhD Thesis, University of Bielefeld, Germany, 1999.

[14] K. Kirchhoff, G.A. Fink, and G. Sagerer, "Combining acoustic and articulatory feature information for robust speech recognition," Speech Commun., vol.37, pp.303–319, 2002.

[15] K. Kirchhoff, "Syllable-level desynchronization of phonetic features for speech recognition," International Conference on Spoken Language Processing, vol.4, pp.2274–2276, 1996.

[16] M. Wester, "Syllable classification using articulatory-acoustic features," Proc. Eurospeech, vol.1, pp.233–236, 2003.

[17] S. King, T. Stephenson, S. Isard, P. Taylor, and A. Strachan, "Speech recognition via phonetically featured syllables," International Conference on Spoken Language Processing, vol.3, pp.1031–1034, 1998.

[18] S. King and P. Taylor, "Detection of phonological features in continuous speech using neural networks," Comput. Speech Lang., vol.14, pp.333–345, 2000.

[19] N. Chomsky and M. Halle, The sound pattern of English, MIT Press, 1968.

[20] E. Eide, "Distinctive features for use in an automatic speech recognition system," Proc. Eurospeech, 2001.

[21] L. Deng and D.X. Sun, "A statistical approach to automatic speech recognition using the atomic speech units constructed from overlapping articulatory features," J. Acoust. Soc. Am., vol.95, no.4, pp.2702–2719, 1994.

[22] L. Deng and K. Erler, "Structural design of hidden Markov model speech recognizer using multivalued phonetic features: Comparison with segmental speech units," J. Acoust. Soc. Am., vol.92, no.6, pp.3058–3067, Dec. 1992.

[23] L. Deng and D. Sun, "Phonetic classification and recognition using HMM representation of overlapping articulatory features for all classes of English sounds," Proc. IEEE ICASSP, vol.1, pp.45–48, 1994.

[24] D. Sun and L. Deng, "An overlapping-feature based phonological model incorporating linguistic constraints: Applications to speech recognition," J. Acoust. Soc. Am., vol.111, no.2, pp.1086–1101, 2002.

[25] M. Richardson, J. Bilmes, and C. Diorio, "Hidden-articulator Markov models: Performance improvements and robustness to noise," International Conference on Spoken Language Processing, Beijing, China, 1998.

[26] B. Logan and P. Moreno, "Factorial HMMs for acoustic modeling," Proc. IEEE ICASSP, pp.813–816, 1998.

[27] H.J. Nock and S.J. Young, "Loosely coupled HMMs for ASR," International Conference on Spoken Language Processing, Beijing, China, 1998.

[28] A.V. Nefian, L. Liang, P. Xiabo, L. Xiaoxiang, C. Mao, and K. Murphy, "A coupled HMM for audio-visual speech recognition," Proc. IEEE ICASSP, vol.2, pp.2013–2016, 2002.

[29] S. Gurbuz, Robust and efficient techniques for audio-visual speech recognition, PhD Thesis, Clemson University, Dept. of Electrical Engineering, 2002.

[30] K. Murphy, "A brief introduction to graphical models," Technical report, Institute of Phonetics, University of Saarland - www.ai.mit.edu/~murphyk/Bayes/bayes.html, 1998.

[31] J. Bilmes, Natural Statistical Models for Automatic Speech Recognition, PhD Thesis, University of California, Berkeley, Dept. of EECS, CS division, 1999.

[32] G. Zweig, Speech recognition with Dynamic Bayesian Networks, PhD Thesis, University of California, Berkeley, Computer Science, 2002.

[33] M. Ostendorf, "Incorporating linguistic theories of pronunciation variation into speech-recognition models," Phil. Trans. R. Soc. Lond., vol.358, pp.1325–1338, 2000.

[34] K. Markov, J. Dang, Y. Iizuka, and S. Nakamura, "Hybrid HMM/BN ASR system integrating spectrum and articulatory features," Proc. Eurospeech, pp.965–968, 2003.

[35] K. Markov, S. Nakamura, and J. Dang, "Integration of articulatory dynamic parameters in HMM/BN based speech recognition system," International Conference on Spoken Language Processing, 2004.

[36] K. Livescu, J. Glass, and J. Bilmes, "Hidden feature models for speech recognition using dynamic Bayesian networks," Proc. Eurospeech, Sept. 2003.

[37] K. Livescu and J. Glass, "Feature-based pronunciation modeling for speech recognition," Proc. HLT/NAACL, May 2004.

[38] K. Livescu and J. Glass, "Feature-based pronunciation modeling with trainable asynchrony probabilities," International Conference on Spoken Language Processing, Oct. 2004.

[39] V. Digalakis, J.R. Rohlicek, and M. Ostendorf, "A dynamical system approach to continuous speech recognition," Proc. ICASSP '91, pp.289–292, Toronto, Canada, May 1991.

[40] K. Iso and T. Watanabe, "Speaker-independent word recognition using a neural prediction model," Proc. IEEE ICASSP, vol.S8.8, pp.441–444, 1990.

[41] M. Honda, Speech feature extraction based on articulatory modeling, PhD Thesis, Waseda University, Department of Science and Engineering, 1977.

[42] M. Russell and P. Jackson, "The effect of an intermediate articulatory layer on the performance of a segmental HMM," Proc. Eurospeech, pp.2737–2740, 2003.

[43] Z. Ghahramani and S. Roweis, "Learning nonlinear dynamical systems using an EM algorithm," Proc. Conf. Advances in Neural Information Processing Systems, vol.11, pp.431–437, MIT Press, 1999.

[44] L. Deng, G. Ramsay, and D. Sun, "Production models as a structural basis for automatic speech recognition," Speech Commun., vol.22, no.2, pp.93–111, 1997.

[45] E. Saltzman and K. Munhall, "A dynamical approach to gestural patterning in speech production," Ecological Psychology, vol.1, no.4, pp.333–382, 1989.

[46] E. Vatikiotis-Bateson, M. Tiede, Y. Wada, V. Gracco, and M. Kawato, "Phoneme extraction using via point estimation of real speech," International Conference on Spoken Language Processing, pp.631–634, 1994.

[47] S. Suzuki, T. Okadome, and M. Honda, "Determination of articulatory positions from speech acoustics by applying dynamic articulatory constraints," International Conference on Spoken Language Processing, Sydney, Australia, 1998.

[48] J.S. Perkell, M.L. Matthies, M.A. Svirsky, and M.I. Jordan, "Goal-based speech motor control: A theoretical framework and some preliminary data," Journal of Phonetics, vol.23, pp.23–35, 1995.

[49] R. Bakis, "Coarticulation modeling with continuous state HMMs," Proc. IEEE Workshop Automatic Speech Recognition, pp.20–21, Arden House, New York, 1991.

[50] H. Richards and J. Bridle, "The HDM: A segmental hidden dynamic model of coarticulation," Proc. IEEE ICASSP, 1999.

[51] J. Picone, S. Pike, R. Regan, T. Kamm, J. Bridle, L. Deng, Z. Ma, H. Richards, and M. Schuster, "Initial evaluation of hidden dynamic models on conversational speech," Proc. IEEE ICASSP, 1999.

[52] J. Ma and L. Deng, "Optimization of dynamic regimes in a statistical hidden dynamic model for conversational speech recognition," Proc. Eurospeech, vol.3, pp.1339–1342, 1999.

[53] L. Deng, I. Bazzi, and A. Acero, "Tracking vocal tract resonances using an analytical non-linear predictor and a target-guided temporal constraint," Proc. Eurospeech, vol.1, pp.73–76, 2003.

[54] L. Deng, L.J. Lee, H. Attias, and A. Acero, "A structured speech model with continuous hidden dynamics and prediction-residual training for tracking vocal tract resonances," Proc. IEEE ICASSP, vol.1, pp.557–560, 2004.

[55] Y. Gao, R. Bakis, J. Huang, and B. Xiang, "Multistage coarticulation model combining articulatory, formant and cepstral features," International Conference on Spoken Language Processing, Beijing, China, 2000.

[56] C. Blackburn and S. Young, "Towards improved speech recognition using a speech production model," Proc. Eurospeech, pp.1623–1626, 1995.

[57] C.S. Blackburn, Articulatory Methods for Speech Production and Recognition, PhD Thesis, Cambridge University, Engineering Department, 1996.

[58] Z. Ghahramani and G. Hinton, "Variational learning for switching state-space models," Neural Comput., vol.12, no.4, pp.831–864, April 2000.

[59] L.J. Lee, H. Attias, and L. Deng, "Variational inference and learning for segmental switching state space models of hidden speech dynamics," Proc. IEEE ICASSP, vol.1, pp.357–360, 1990.

[60] A. Rosti and M. Gales, "Rao-Blackwellised Gibbs sampling for switching linear dynamical systems," Proc. IEEE ICASSP, vol.1, pp.809–812, 2004.

[61] J. Droppo and A. Acero, "Noise robust speech recognition with a switching linear dynamic model," Proc. IEEE ICASSP, vol.1, pp.953–956, 2004.

[62] L.J. Lee, H. Attias, L. Deng, and P. Fieguth, "A multimodal variational approach to learning and inference in switching state space models," Proc. IEEE ICASSP, vol.5, pp.505–508, 2004.

[63] S. King and A. Wrench, "Dynamical system modelling of articulator movement," International Congress on Phonetic Sciences, pp.2259–2262, 1999.

[64] J. Frankel and S. King, "ASR—Articulatory speech recognition," Proc. Eurospeech, 2001.

[65] A. Wrench and W. Hardcastle, "A multichannel articulatory speech database and its application for automatic speech recognition," Proc. 5th seminar on speech production: models and data, 2000.

[66] J. Sun, L. Deng, and X. Jing, "Data-driven model construction for continuous speech recognition using overlapping articulatory features," International Conference on Spoken Language Processing, vol.1, pp.437–440, 2000.

[67] C.S. Blackburn and S.J. Young, "Pseudo-articulatory speech synthesis for recognition using automatic feature extraction from X-ray data," International Conference on Spoken Language Processing, vol.2, pp.969–972, Philadelphia, PA, 1996.

[68] B. Atal, J. Chang, M. Mathews, and J. Tukey, "Inversion of articulatory-to-acoustic transformation in the vocal tract by a computer sorting technique," J. Acoust. Soc. Am., vol.63, no.5, pp.1535–1556, 1978.

[69] J. Dang and K. Honda, "Estimation of vocal tract shapes from speech sounds with a physiological articulatory model," Journal of Phonetics, vol.30, pp.511–532, 2002.

[70] G. Bailly, C. Abry, L.-J. Boe, R. Laboissiere, P. Perrier, and J.-L. Schwartz, "Inversion and speech recognition," Proc. EUSIPCO-92, vol.1, pp.159–164, 1992.

[71] J. Hogden, "A maximum likelihood approach to estimating speech articulator positions from speech acoustics," J. Acoust. Soc. Am., vol.100, no.4, pp.2663–2664, Oct. 1996.

[72] H. Yehia, A study on the speech acoustic-to-articulatory mapping using morphological constraints, PhD Thesis, Nagoya University, Graduate School of Engineering, 2002.

[73] J. Hogden, P. Valdez, S. Katagiri, and E. McDermott, "Blind inversion of multidimensional functions for speech enhancement," Proc. Eurospeech, pp.1409–1412, 2003.

[74] J. Hogden and P. Valdez, "Bridging the gap between speech production and speech recognition," 5th Seminar on Speech Production: Models and Data, Kloster Seeon, Germany, 2000.

[75] K. Tokuda, T. Kobayashi, and S. Imai, "Speech parameter generation from HMM using dynamic features," Proc. IEEE ICASSP, vol.2, pp.585–588, 1999.

[76] Y. Minami, E. McDermott, A. Nakamura, and S. Katagiri, "A recognition method with parametric trajectory synthesized using direct relations between static and dynamic feature vector time series," Proc. IEEE ICASSP, vol.1, pp.957–960, 2002.

[77] K. Tokuda, H. Zen, and T. Kitamura, "Trajectory modeling based on HMMs with the explicit relationship between static and dynamic features," Proc. Eurospeech, pp.865–868, 2003.

[78] T. Irino, Y. Minami, T. Nakatani, M. Tsuzaki, and H. Tagawa, "Evaluation of a speech recognition/generation method based on HMM and STRAIGHT," International Conference on Spoken Language Processing, pp.2545–2548, 2002.

[79] K. Stevens, "Toward a model for speech recognition," J. Acoust. Soc. Am., vol.32, no.1, pp.47–55, Jan. 1960.

[80] C.-H. Lee, "From knowledge-ignorant to knowledge-rich modeling: A new speech research paradigm for next generation automatic speech recognition," International Conference on Spoken Language Processing, 2004.

[81] E. McDermott and T.J. Hazen, "Minimum classification error training of landmark models for real-time continuous speech recognition," Proc. IEEE ICASSP, vol.1, pp.937–940, 2004.

