Skip Navigation

IEICE Transactions on Information and Systems 2006 E89-D(3):998-1005; doi:10.1093/ietisy/e89-d.3.998
This Article
Right arrow Full Text (PDF)
Right arrow References
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Request Permissions
Google Scholar
Right arrow Articles by GOMEZ, R.
Right arrow Articles by SHIKANO, K.
Right arrow Search for Related Content
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Copyright © 2006 The Institute of Electronics, Information and Communication Engineers

Special Section on Statistical Modeling for Speech Processing -- Papers -- Speech Recognition

Improving Rapid Unsupervised Speaker Adaptation Based on HMM-Sufficient Statistics in Noisy Environments Using Multi-Template Models

Randy GOMEZ1, Akinobu LEE2, Tomoki TODA1, Hiroshi SARUWATARI1 and Kiyohiro SHIKANO1

1 The authors are with Nara Institute of Science and Technology, Ikoma-shi, 630–0192 Japan. E-mail: randy-g{at}is.naist.jp, 2 The author is with Nagoya Institute of Technology, Nagoya-shi, 466–8555 Japan.

This paper describes the method of using multi-template unsupervised speaker adaptation based on HMM-Sufficient Statistics to push up the adaptation performance while keeping adaptation time within few seconds with just one arbitrary utterance. This adaptation scheme is mainly composed of two processes. The first part is done offline which involves the training of multiple class-dependent acoustic models and the creation of speakers' HMM-Sufficient Statistics based on gender and age. The second part is performed online where adaptation begins using the single utterance of a test speaker. From this utterance, the system will classify the speaker's class and consequently select the N-best neighbor speakers close to the utterance using Gaussian Mixture Models (GMM). The classified speakers' class template model is then adopted as a base model. From this template model, the adapted model is rapidly constructed using the N-best neighbor speakers' HMM-Sufficient Statistics. Experiments in noisy environment conditions with 20 dB, 15 dB and 10 dB SNR office, crowd, booth, and car noise are performed. The proposed multi-template method achieved 89.5% word accuracy rate compared with 88.1% of the conventional single-template method, while the baseline recognition rate without adaptation is 86.4%. Moreover, experiments using Vocal Tract Length Normalization (VTLN) and supervised Maximum Likelihood Linear Regression (MLLR) are also compared.

Key Words: HMM-Sufficient Statistics, unsupervised, speaker adaptation, noisy environments


Manuscript received July 11, 2005. Manuscript revised September 30, 2005.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
IEICE Trans Inf & SystHome page
R. GOMEZ, T. TODA, H. SARUWATARI, and K. SHIKANO
Reducing Computation Time of the Rapid Unsupervised Speaker Adaptation Based on HMM-Sufficient Statistics
IEICE Trans D: Information, February 1, 2007; E90-D(2): 554 - 561.
[Abstract] [PDF]



Disclaimer:
Please note that abstracts for content published before 1996 were created through digital scanning and may therefore not exactly replicate the text of the original print issues. All efforts have been made to ensure accuracy, but the Publisher will not be held responsible for any remaining inaccuracies. If you require any further clarification, please contact our Customer Services Department.