THESIS
1997
xiii, 80 leaves : ill. ; 30 cm
Abstract
Transcription of Chinese syllables such as Pinyin to the corresponding Chinese characters (Hanzi) is an important problem in Chinese information processing. To date, n-gram probabilistic models outperform models that try to capture linguistic structures. n-gram models capture the correlation between neighboring words by estimating the probability of co-occurrence between words in a language. Owing to the sparse data distribution, it is difficult to make good estimates from a training corpus, which is often small with respect to the language. Various smoothing techniques for estimating n-gram statistics in the English language have been proposed and compared. It is known that how well these techniques actually perform depends on the problem domain in which the probabilistic model is applied. In this thesis, we consider various smoothing techniques to estimate word-based bigram probabilities in Chinese, and specifically their application to the problem of Pinyin to Hanzi transcription. Techniques based on the backoff idea are also proposed for this application domain. A working prototype system that demonstrates the techniques discussed has been implemented. The system follows the ANSI specification for the C++ language and is designed to facilitate easy enhancements and to increase reusability and portability. Experiments on data from different context domains have been performed. Results based on three test sets with a total of 10,391 characters show significant and consistent improvements over the MLE method and the widely used backoff model of Katz in this application domain. The techniques developed in this research may be useful for other applications such as keyboard input of Chinese characters, Chinese speech recognition, optical Chinese character recognition, and related applications.
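To make the contrast between the MLE method and the backoff idea concrete, the following is a minimal sketch in Python, not the thesis's C++ prototype. It assumes a toy three-sentence corpus and uses a simple absolute-discounting backoff with a hypothetical discount of 0.5; this is neither the Katz model (which discounts with Good-Turing estimates) nor any of the techniques proposed in the thesis. The names `train`, `p_mle`, and `p_backoff` are illustrative only.

```python
from collections import Counter, defaultdict


def train(sentences):
    """Count unigrams and, for each history word, the words that follow it."""
    unigrams = Counter()
    followers = defaultdict(Counter)
    for words in sentences:
        unigrams.update(words)
        for w1, w2 in zip(words, words[1:]):
            followers[w1][w2] += 1
    return unigrams, followers


def p_mle(w1, w2, followers):
    """Maximum-likelihood estimate: c(w1, w2) / c(w1 as a history)."""
    seen = followers.get(w1)
    if not seen:
        return 0.0
    return seen[w2] / sum(seen.values())


def p_backoff(w1, w2, unigrams, followers, discount=0.5):
    """Backoff bigram estimate with absolute discounting (an assumption).

    Seen bigrams give up `discount` counts each; the reserved mass is
    shared among unseen successors in proportion to unigram frequency.
    """
    total = sum(unigrams.values())
    seen = followers.get(w1)
    if not seen:
        # Unknown history: fall back to the unigram estimate.
        return unigrams[w2] / total
    history = sum(seen.values())
    if w2 in seen:
        return (seen[w2] - discount) / history
    reserved = discount * len(seen) / history
    unseen_mass = sum(c for w, c in unigrams.items() if w not in seen) / total
    if unseen_mass == 0:
        return 0.0
    return reserved * (unigrams[w2] / total) / unseen_mass


# Toy, already segmented corpus (an assumption for illustration).
corpus = [["我", "爱", "北京"], ["我", "爱", "中国"], ["我", "在", "北京"]]
unigrams, followers = train(corpus)

# The bigram (在, 中国) never occurs in the corpus.
unseen_mle = p_mle("在", "中国", followers)
unseen_backoff = p_backoff("在", "中国", unigrams, followers)
```

Under MLE the unseen bigram receives probability zero, which would rule out any Hanzi sequence containing it; the backoff estimate instead assigns it a small non-zero share of the mass taken from seen bigrams, while the estimates for each history still sum to one over the vocabulary.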