THESIS
2002
xiii, 86 leaves : ill. (some col.) ; 30 cm
Abstract
A Chinese sentence is typically written as a sequence of characters. However, a word, a logical semantic and syntactic unit, is what is actually used in language modeling, information retrieval and information querying. Thus, a segmentation algorithm is needed to map a sequence of characters into a sequence of words. One of the simplest methods is forward maximum matching (FMM), which decides each segmentation boundary locally. In this thesis we propose two segmentation methods, the dynamic matching (DM) and the maximum likelihood (ML) algorithms. Both overcome the limitation of FMM by considering the whole sentence when making decisions; ML additionally uses information on word transitions. Both the character perplexity and the recognition accuracy of DM and ML improve compared with FMM.
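The local-decision behavior of FMM described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation; the lexicon, sentence, and the 4-character maximum word length are illustrative assumptions (and Latin letters stand in for Chinese characters).

```python
def fmm_segment(sentence, lexicon, max_word_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest lexicon word, falling back to a single character."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first.
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

lexicon = {"ab", "abc", "cd"}
print(fmm_segment("abcd", lexicon))  # → ['abc', 'd']
```

Note how the greedy local choice of "abc" leaves the stranded single character "d", even though "ab" + "cd" covers the sentence entirely with lexicon words; a whole-sentence method like DM or ML can avoid this.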
A lexicon is required by most segmentation methods. Since it is impossible to include all Chinese words in a lexicon, new word detection is needed to find meaningful out-of-lexicon words in the training data. A statistical measure, mutual information, is used to quantify the tendency of character pairs to form words. We propose position-based mutual information, which also captures the positions characters occupy within candidate words. With this statistical position information, both the character perplexity and the recognition accuracy of DM and ML using the lexicon augmented with new words improve significantly.
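The plain (position-free) pairwise mutual information underlying this idea can be sketched as below: the pointwise mutual information log P(xy) / (P(x)P(y)) of each adjacent character pair, estimated from corpus counts. The corpus string is an illustrative assumption, and the position-based variant proposed in the thesis is not shown.

```python
import math
from collections import Counter

def pair_mutual_information(text):
    """Pointwise mutual information of each adjacent character pair,
    estimated from unigram and bigram relative frequencies."""
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_chars = len(text)
    n_pairs = max(len(text) - 1, 1)
    mi = {}
    for pair, count in pairs.items():
        p_xy = count / n_pairs            # joint (bigram) probability
        p_x = chars[pair[0]] / n_chars    # marginal of first character
        p_y = chars[pair[1]] / n_chars    # marginal of second character
        mi[pair] = math.log(p_xy / (p_x * p_y))
    return mi
```

Pairs that co-occur more often than their characters' independent frequencies predict score high, which is the cue used to promote character pairs to new lexicon words.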
While training a word language model is straightforward once the training sentences have been segmented, it can also be done without segmenting the training data. This can be treated as a maximum likelihood problem with missing information, and the expectation maximization (EM) algorithm is used to solve it. EM estimates the expected counts of word phrases, and the transition probabilities of the language model are then calculated from these counts. The re-estimated language model parameters can in turn be used for segmentation with the ML algorithm. This soft counting method can also be applied to estimate the position-based mutual information, and better results are achieved with it.
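One EM iteration of the soft counting idea can be sketched as follows. This is a simplified unigram version under illustrative assumptions (the thesis estimates transition probabilities, i.e. a bigram model): every possible segmentation of each unsegmented sentence contributes fractional word counts weighted by its likelihood under the current model, and the counts are then renormalized.

```python
import math

def segmentations(sentence, lexicon):
    """All segmentations of `sentence` into lexicon words
    (single characters are always allowed)."""
    if not sentence:
        return [[]]
    results = []
    for length in range(1, len(sentence) + 1):
        word = sentence[:length]
        if length == 1 or word in lexicon:
            for rest in segmentations(sentence[length:], lexicon):
                results.append([word] + rest)
    return results

def em_step(sentences, lexicon, probs):
    """One EM iteration with soft counts (unigram sketch)."""
    counts = {}
    for sentence in sentences:
        segs = segmentations(sentence, lexicon)
        # E-step: weight each segmentation by its likelihood under
        # the current word probabilities (small floor for unseen words).
        weights = [math.prod(probs.get(w, 1e-6) for w in seg) for seg in segs]
        total = sum(weights)
        for seg, w in zip(segs, weights):
            for word in seg:
                counts[word] = counts.get(word, 0.0) + w / total
    # M-step: re-estimate word probabilities from the fractional counts.
    z = sum(counts.values())
    return {word: c / z for word, c in counts.items()}
```

Iterating `em_step` until the probabilities converge yields model parameters estimated without ever committing to a single hard segmentation of the training data.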