Copyright © 2007 The Institute of Electronics, Information and Communication Engineers
Regular Section -- Papers -- Speech and Hearing
Word Error Rate Minimization Using an Integrated Confidence Measure
The authors are with NHK Science and Technical Research Laboratories, Tokyo, 157-8510 Japan. E-mail: kobayashi.a-fs{at}nhk.or.jp
This paper describes a new criterion for speech recognition that uses an integrated confidence measure to minimize the word error rate (WER). Conventional criteria for WER minimization obtain the expected WER of a sentence hypothesis merely by comparing it with the other hypotheses in an n-best list. The proposed criterion instead estimates the expected WER by using an integrated confidence measure together with word posterior probabilities for a given acoustic input. The integrated confidence measure, which is implemented as a classifier based on maximum entropy (ME) modeling or support vector machines (SVMs), is used to acquire probabilities reflecting whether the word hypotheses are correct. The classifier combines a variety of confidence measures and can handle a temporal sequence of them to attain a more reliable confidence. In transcribing Japanese broadcast news recorded in various environments, such as noisy field conditions and spontaneous speech, the proposed criterion achieved a WER of 9.8%, a 3.9% reduction relative to conventional n-best rescoring methods.
Key Words: word error rate minimization, maximum entropy, support vector machines, n-best rescoring
Manuscript received June 30, 2006. Manuscript revised October 23, 2006.
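The contrast drawn in the abstract can be made concrete with a small sketch. The first function below illustrates the conventional idea of estimating the expected word error of each hypothesis by comparing it with the other entries of an n-best list, weighted by their posterior probabilities. The second function is only an assumed simplification of a confidence-based estimate, where each word carries a probability of being correct; it is not the criterion defined in the paper, whose exact form is not given here. All function names, the toy n-best list, and the posterior values are hypothetical.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def min_expected_error_nbest(nbest: List[Tuple[List[str], float]]) -> List[str]:
    """Return the hypothesis with the lowest expected word error.

    Every n-best entry is treated as a possible reference, weighted by its
    (renormalized) sentence posterior. This is the conventional n-best
    comparison, not the paper's proposed criterion.
    """
    total = sum(p for _, p in nbest)
    best_hyp: List[str] = []
    best_risk = float("inf")
    for hyp, _ in nbest:
        risk = sum(p / total * edit_distance(ref, hyp) for ref, p in nbest)
        if risk < best_risk:
            best_hyp, best_risk = hyp, risk
    return best_hyp


def expected_errors_from_confidence(word_confidences: Sequence[float]) -> float:
    """ASSUMPTION: approximate the expected number of wrong words in a
    hypothesis as the sum of (1 - confidence) over its words."""
    return sum(1.0 - c for c in word_confidences)


# Hypothetical 3-best list: (word sequence, sentence posterior).
nbest = [
    (["a", "b", "c"], 0.40),
    (["a", "x", "c"], 0.35),
    (["a", "x", "d"], 0.25),
]
chosen = min_expected_error_nbest(nbest)
```

With this toy list, `chosen` is `["a", "x", "c"]` even though `["a", "b", "c"]` has the highest posterior, because the second entry is on average closer to the rest of the list. In the confidence-based variant, the per-word probabilities would come from a classifier such as the ME model or SVM mentioned in the abstract; here they are simply passed in as numbers.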