Copyright © 2007 The Institute of Electronics, Information and Communication Engineers
Regular Section -- Papers -- Natural Language Processing
An EM-Based Approach for Mining Word Senses from Corpora
1 The authors are with Sirindhorn International Institute of Technology, Thammasat University, Thailand. E-mail: thatsanee{at}tcllab.org
2 The authors are with The NICT Asia Research Center, Thailand.
Abstract
Manually collecting the contexts of a target word and grouping them by meaning yields a set of word senses, but the task is tedious. Towards automated lexicography, this paper proposes a word-sense discrimination method based on two modern techniques: the EM algorithm and principal component analysis (PCA). A spherical Gaussian EM algorithm, enhanced with PCA for robust initialization, is proposed to cluster the senses of a target word automatically. Three variants of the algorithm, namely PCA, sGEM, and PCA-sGEM, are investigated using a gold-standard dataset of two polysemous words. The clustering results are evaluated using purity and entropy, as well as a more recent measure, normalized mutual information (NMI). The experimental results indicate that the proposed algorithms achieve promising performance in discriminating word senses, and that PCA-sGEM outperforms the other two methods to some extent.
Key Words: corpus-based lexicography, word sense discrimination, clustering, EM algorithm, principal component analysis
Manuscript received May 11, 2006. Manuscript revised August 30, 2006.
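
The three evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names are hypothetical, and the definitions used here are common textbook ones that the abstract does not spell out. In particular, entropy is assumed to be the size-weighted average of the per-cluster sense entropy (in bits), and NMI is assumed to be mutual information divided by the geometric mean of the two label entropies; other normalizations exist and the paper may use a different one.

```python
import math
from collections import Counter


def purity(clusters, senses):
    """Fraction of contexts that fall in the majority sense of their cluster."""
    joint = Counter(zip(clusters, senses))
    best = {}
    for (k, _), count in joint.items():
        best[k] = max(best.get(k, 0), count)
    return sum(best.values()) / len(clusters)


def clustering_entropy(clusters, senses):
    """Size-weighted average of per-cluster sense entropy (assumed definition)."""
    n = len(clusters)
    joint = Counter(zip(clusters, senses))
    sizes = Counter(clusters)
    total = 0.0
    for (k, _), count in joint.items():
        p = count / sizes[k]  # share of this sense inside cluster k
        total -= (sizes[k] / n) * p * math.log2(p)
    return total


def _label_entropy(labels):
    """Shannon entropy (bits) of a single labelling."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def nmi(clusters, senses):
    """Mutual information over sqrt(H(clusters) * H(senses)) (assumed normalization)."""
    n = len(clusters)
    joint = Counter(zip(clusters, senses))
    k_sizes, s_sizes = Counter(clusters), Counter(senses)
    mi = sum(
        (count / n) * math.log2(count * n / (k_sizes[k] * s_sizes[s]))
        for (k, s), count in joint.items()
    )
    denom = math.sqrt(_label_entropy(clusters) * _label_entropy(senses))
    return mi / denom if denom > 0 else 0.0


# Hypothetical toy labels: cluster ids from a clustering run, gold sense tags.
example_clusters = [0, 0, 0, 1, 1, 1]
example_senses = ["s1", "s1", "s2", "s2", "s2", "s2"]
scores = (
    purity(example_clusters, example_senses),
    clustering_entropy(example_clusters, example_senses),
    nmi(example_clusters, example_senses),
)
```

Under these definitions, higher purity and NMI and lower entropy indicate a clustering that agrees more closely with the gold-standard senses. Purity and entropy tend to improve as the number of clusters grows, which is one reason a normalized measure such as NMI is often reported alongside them.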