Copyright © 2007 The Institute of Electronics, Information and Communication Engineers
Regular Section -- Papers -- Natural Language Processing
An EM-Based Approach for Mining Word Senses from Corpora
1 The authors are with Sirindhorn International Institute of Technology, Thammasat University, Thailand. E-mail: thatsanee{at}tcllab.org
2 The authors are with The NICT Asia Research Center, Thailand.
Manually collecting the contexts of a target word and grouping them by meaning yields a set of word senses, but the task is tedious. Towards automated lexicography, this paper proposes a word-sense discrimination method based on two modern techniques: the EM algorithm and principal component analysis (PCA). A spherical Gaussian EM algorithm, enhanced with PCA for robust initialization, is proposed to cluster the senses of a target word automatically. Three variants of the algorithm, namely PCA, sGEM, and PCA-sGEM, are investigated using a gold-standard dataset of two polysemous words. The clustering results are evaluated using purity and entropy as well as a more recent measure, normalized mutual information (NMI). The experimental results indicate that the proposed algorithms perform promisingly in discriminating word senses and that PCA-sGEM outperforms the other two methods to some extent.
Key Words: corpus-based lexicography, word sense discrimination, clustering, EM algorithm, principal component analysis
Manuscript received May 11, 2006. Manuscript revised August 30, 2006.
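The three evaluation measures named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so the definitions below are assumptions based on common usage: purity as the fraction of contexts lying in the majority sense of their cluster, entropy as the size-weighted per-cluster sense entropy divided by the log of the number of senses, and NMI as mutual information divided by the geometric mean of the two entropies. The function names and the toy labels are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def purity(clusters, senses):
    """Fraction of contexts that belong to the majority sense of their cluster."""
    joint = Counter(zip(clusters, senses))
    best = {}
    for (c, _s), count in joint.items():
        best[c] = max(best.get(c, 0), count)
    return sum(best.values()) / len(clusters)


def entropy(clusters, senses):
    """Size-weighted per-cluster sense entropy, scaled to [0, 1] (assumed normalisation)."""
    n = len(clusters)
    n_senses = len(set(senses))
    if n_senses < 2:
        return 0.0
    joint = Counter(zip(clusters, senses))
    sizes = Counter(clusters)
    total = 0.0
    for (c, _s), count in joint.items():
        p = count / sizes[c]
        total -= (sizes[c] / n) * p * math.log(p)
    return total / math.log(n_senses)


def nmi(clusters, senses):
    """Mutual information over the geometric mean of entropies (assumed normalisation)."""
    n = len(clusters)
    joint = Counter(zip(clusters, senses))
    c_sizes = Counter(clusters)
    s_sizes = Counter(senses)
    mi = sum(
        (nij / n) * math.log(nij * n / (c_sizes[c] * s_sizes[s]))
        for (c, s), nij in joint.items()
    )
    h_c = -sum((m / n) * math.log(m / n) for m in c_sizes.values())
    h_s = -sum((m / n) * math.log(m / n) for m in s_sizes.values())
    denom = math.sqrt(h_c * h_s)
    return mi / denom if denom > 0 else 0.0


# Toy example (hypothetical data): six contexts, two clusters, two gold senses.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_senses = ["a", "a", "b", "b", "b", "b"]
print(purity(toy_clusters, toy_senses))
print(entropy(toy_clusters, toy_senses))
print(nmi(toy_clusters, toy_senses))
```

Higher purity and NMI and lower entropy indicate clusters that agree more closely with the gold-standard senses. Other normalisations of entropy and NMI exist, so values from this sketch are not necessarily comparable with those reported in the paper.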