THESIS
2002
xiii, 86 leaves : ill. (some col.) ; 30 cm
Abstract
A Chinese sentence is typically written as a sequence of characters. However, a word, a logical semantic and syntactic unit, is what is actually used in language modeling, information retrieval and information querying. Thus, a segmentation algorithm is needed to map a sequence of characters into a sequence of words. One of the simplest methods is forward maximum matching (FMM), which decides each segmentation boundary locally. In this thesis we propose two segmentation methods, the dynamic matching (DM) and the maximum likelihood (ML) algorithms. Both overcome the limitation of FMM by considering the whole sentence when making decisions; ML additionally uses information on word transitions. Both the character perplexity and the recognition accuracy of DM and ML improve compared with FMM.
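The local-decision behavior of FMM described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation; the lexicon, sentence, and the 4-character maximum word length are illustrative assumptions (and Latin letters stand in for Chinese characters).

```python
def fmm_segment(sentence, lexicon, max_word_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest lexicon word, falling back to a single character."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first.
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

lexicon = {"ab", "abc", "cd"}
print(fmm_segment("abcd", lexicon))  # → ['abc', 'd']
```

Note how the greedy local choice of "abc" leaves the stranded single character "d", even though "ab" + "cd" covers the sentence entirely with lexicon words; a whole-sentence method like DM or ML can avoid this.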
A lexicon is required by most segmentation methods. Since it is impossible to include all Chinese words in a lexicon, new word detection is needed to find meaningful out-of-lexicon words in the training data. A statistical measure, mutual information, is used to quantify the tendency of character pairs to form words. We propose position-based mutual information, which also captures the positions characters occupy within candidate words. With this statistical position information, both the character perplexity and the recognition accuracy of DM and ML using the lexicon augmented with new words improve significantly.
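The plain (position-free) pairwise mutual information underlying this idea can be sketched as below: the pointwise mutual information log P(xy) / (P(x)P(y)) of each adjacent character pair, estimated from corpus counts. The corpus string is an illustrative assumption, and the position-based variant proposed in the thesis is not shown.

```python
import math
from collections import Counter

def pair_mutual_information(text):
    """Pointwise mutual information of each adjacent character pair,
    estimated from unigram and bigram relative frequencies."""
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_chars = len(text)
    n_pairs = max(len(text) - 1, 1)
    mi = {}
    for pair, count in pairs.items():
        p_xy = count / n_pairs            # joint (bigram) probability
        p_x = chars[pair[0]] / n_chars    # marginal of first character
        p_y = chars[pair[1]] / n_chars    # marginal of second character
        mi[pair] = math.log(p_xy / (p_x * p_y))
    return mi
```

Pairs that co-occur more often than their characters' independent frequencies predict score high, which is the cue used to promote character pairs to new lexicon words.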
While training a word language model is straightforward once the training sentences have been segmented, it can also be done without segmenting the training data. This can be treated as a maximum likelihood problem with missing information, and the expectation maximization (EM) algorithm is used to solve it. EM estimates the expected counts of word phrases, and the transition probabilities of the language model are then calculated from these counts. The re-estimated language model parameters can in turn be used for segmentation with the ML algorithm. This soft counting method can also be applied to estimate the position-based mutual information, and better results are achieved with it.
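One EM iteration of the soft counting idea can be sketched as follows. This is a simplified unigram version under illustrative assumptions (the thesis estimates transition probabilities, i.e. a bigram model): every possible segmentation of each unsegmented sentence contributes fractional word counts weighted by its likelihood under the current model, and the counts are then renormalized.

```python
import math

def segmentations(sentence, lexicon):
    """All segmentations of `sentence` into lexicon words
    (single characters are always allowed)."""
    if not sentence:
        return [[]]
    results = []
    for length in range(1, len(sentence) + 1):
        word = sentence[:length]
        if length == 1 or word in lexicon:
            for rest in segmentations(sentence[length:], lexicon):
                results.append([word] + rest)
    return results

def em_step(sentences, lexicon, probs):
    """One EM iteration with soft counts (unigram sketch)."""
    counts = {}
    for sentence in sentences:
        segs = segmentations(sentence, lexicon)
        # E-step: weight each segmentation by its likelihood under
        # the current word probabilities (small floor for unseen words).
        weights = [math.prod(probs.get(w, 1e-6) for w in seg) for seg in segs]
        total = sum(weights)
        for seg, w in zip(segs, weights):
            for word in seg:
                counts[word] = counts.get(word, 0.0) + w / total
    # M-step: re-estimate word probabilities from the fractional counts.
    z = sum(counts.values())
    return {word: c / z for word, c in counts.items()}
```

Iterating `em_step` until the probabilities converge yields model parameters estimated without ever committing to a single hard segmentation of the training data.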