
IEICE Transactions on Information and Systems 2007 E90-D(5):835-843; doi:10.1093/ietisy/e90-d.5.835

Copyright © 2007 The Institute of Electronics, Information and Communication Engineers

Regular Section -- Papers -- Speech and Hearing

Word Error Rate Minimization Using an Integrated Confidence Measure

Akio KOBAYASHI, Kazuo ONOE, Shinichi HOMMA, Shoei SATO and Toru IMAI

The authors are with NHK Science and Technical Research Laboratories, Tokyo, 157–8510 Japan. E-mail: kobayashi.a-fs{at}nhk.or.jp

This paper describes a new criterion for speech recognition that uses an integrated confidence measure to minimize the word error rate (WER). Conventional criteria for WER minimization obtain the expected WER of a sentence hypothesis merely by comparing it with the other hypotheses in an n-best list. The proposed criterion instead estimates the expected WER by using an integrated confidence measure together with word posterior probabilities for a given acoustic input. The integrated confidence measure, implemented as a classifier based on maximum entropy (ME) modeling or support vector machines (SVMs), yields probabilities reflecting whether each word hypothesis is correct. The classifier combines a variety of confidence measures and can handle a temporal sequence of them to attain a more reliable confidence. In transcribing Japanese broadcast news in various environments, such as noisy field conditions and spontaneous speech, the proposed criterion achieved a WER of 9.8%, a 3.9% reduction relative to conventional n-best rescoring methods.
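
As a rough illustration of the two ideas contrasted above, the following Python sketch may help. It is not the authors' implementation: the abstract does not give their formulas, so the function names, the toy scoring scheme, and the confidence-based estimate are all assumptions made for this example. The first estimate follows the conventional n-best approach, in which each hypothesis is scored by its posterior-weighted word-level edit distance to the other hypotheses in the list. The second simply sums, over the words of one hypothesis, the probability that each word is wrong, with the per-word probabilities assumed to come from some confidence classifier such as an ME model or a calibrated SVM. This second estimate ignores words that the hypothesis deleted.

```python
import math
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word of ref missing in hyp
                            curr[j - 1] + 1,      # extra word in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def posteriors(log_scores: Sequence[float]) -> List[float]:
    """Normalize log scores over the n-best list into posterior probabilities."""
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]


def nbest_expected_errors(nbest: List[Tuple[List[str], float]]) -> List[float]:
    """Conventional estimate: expected word errors of each hypothesis, treating
    every hypothesis in the list as a possible reference weighted by its posterior."""
    post = posteriors([score for _, score in nbest])
    return [
        sum(p * edit_distance(other, hyp) for (other, _), p in zip(nbest, post))
        for hyp, _ in nbest
    ]


def confidence_expected_errors(word_confidences: Sequence[float]) -> float:
    """Assumed alternative: sum of per-word error probabilities, where each
    confidence is a classifier's probability that the word is correct."""
    return sum(1.0 - c for c in word_confidences)


# Toy n-best list: (word sequence, combined log score). Values are made up.
NBEST = [
    (["a", "b", "c"], -1.0),
    (["a", "x", "c"], -1.2),
    (["a", "b"], -3.0),
]
EXPECTED = nbest_expected_errors(NBEST)
BEST_INDEX = min(range(len(NBEST)), key=EXPECTED.__getitem__)
```

Rescoring then amounts to picking the hypothesis with the smallest expected error (here `BEST_INDEX`) rather than the one with the highest score. In the confidence-based variant, the quality of the estimate depends entirely on how well calibrated the per-word probabilities are.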

Key Words: word error rate minimization, maximum entropy, support vector machines, n-best rescoring


Manuscript received June 30, 2006. Manuscript revised October 23, 2006.


