IEICE Transactions on Information and Systems 2007 E90-D(5):816-824; doi:10.1093/ietisy/e90-d.5.816

Copyright © 2007 The Institute of Electronics, Information and Communication Engineers

Regular Section -- Papers -- Speech and Hearing

A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis

Tomoki TODA (1) and Keiichi TOKUDA (2)

(1) The author is with the Graduate School of Information Science, Nara Institute of Science and Technology, Ikoma-shi, 630–0192 Japan. E-mail: tomoki{at}is.naist.jp
(2) The author is with the Graduate School of Engineering, Nagoya Institute of Technology, Nagoya-shi, 466–8555 Japan. E-mail: tokuda{at}ics.nitech.ac.jp

This paper describes a novel parameter generation algorithm for HMM-based speech synthesis. The conventional algorithm generates the trajectory of static features that maximizes the likelihood of a given HMM for the parameter sequence consisting of static and dynamic features, under an explicit constraint between the two. The generated trajectory is often excessively smoothed by the statistical processing, and speech synthesized from such over-smoothed parameters usually sounds muffled. To alleviate this over-smoothing effect, we propose a generation algorithm that considers not only the HMM likelihood maximized in the conventional algorithm but also a likelihood for the global variance (GV) of the generated trajectory. The latter likelihood acts as a penalty for over-smoothing, i.e., for a reduction of the GV of the generated trajectory. A perceptual evaluation demonstrates that the proposed algorithm yields considerably large improvements in the naturalness of synthetic speech.
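
As a rough illustration of the idea in the abstract, the sketch below computes the global variance of a parameter trajectory and shows that smoothing the trajectory lowers both its GV and a GV log-likelihood. It is not the authors' implementation: the function names, the single diagonal Gaussian over the GV, the moving-average smoother standing in for over-smoothing, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def global_variance(traj):
    """Per-dimension variance over time of a (T, D) trajectory."""
    return np.var(traj, axis=0)


def gv_log_likelihood(traj, gv_mean, gv_var):
    """Log-likelihood of the trajectory's GV under a diagonal Gaussian.

    Assumption for this sketch: the GV is modeled by one Gaussian with
    mean gv_mean and variance gv_var (both of shape (D,)).
    """
    v = global_variance(traj)
    terms = np.log(2.0 * np.pi * gv_var) + (v - gv_mean) ** 2 / gv_var
    return float(-0.5 * np.sum(terms))


def moving_average(traj, width=5):
    """Smooth each dimension with a box filter (stand-in for over-smoothing)."""
    kernel = np.ones(width) / width
    cols = [np.convolve(traj[:, d], kernel, mode="same")
            for d in range(traj.shape[1])]
    return np.stack(cols, axis=1)


# Synthetic stand-in for a natural parameter trajectory (T=200 frames, D=3).
rng = np.random.default_rng(0)
natural = rng.normal(size=(200, 3))
smoothed = moving_average(natural)

# Hypothetical GV model: centered on the natural trajectory's GV.
gv_mean = global_variance(natural)
gv_var = np.full(3, 0.01)

gv_natural = global_variance(natural)
gv_smoothed = global_variance(smoothed)
ll_natural = gv_log_likelihood(natural, gv_mean, gv_var)
ll_smoothed = gv_log_likelihood(smoothed, gv_mean, gv_var)

print("GV natural :", gv_natural)
print("GV smoothed:", gv_smoothed)
print("GV log-likelihood natural :", ll_natural)
print("GV log-likelihood smoothed:", ll_smoothed)
```

In this toy setting the smoothed trajectory has a smaller GV in every dimension and therefore a lower GV log-likelihood, which is the sense in which the GV term penalizes over-smoothing. A generation algorithm along the lines of the abstract would maximize this term jointly with the HMM likelihood; how the two are weighted is not stated here and is left out of the sketch.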

Key Words: HMM-based speech synthesis, speech parameter generation, maximum likelihood criterion, over-smoothing effect, global variance


Manuscript received July 11, 2006. Manuscript revised December 11, 2006.
