Smoothing techniques in probabilistic models for Pinyin to Hanzi transcription

HKUST Electronic Theses

by Cheung Hon-kit

THESIS 1997

M.Phil. Computer Science

xiii, 80 leaves : ill. ; 30 cm

Abstract

Transcription of Chinese syllables such as Pinyin to the corresponding Chinese characters (Hanzi) is an important problem in Chinese information processing. To date, n-gram probabilistic models outperform models that try to capture linguistic structures. n-gram models capture the correlation between neighboring words by estimating the probability of co-occurrence between words in a language. Owing to the sparse data distribution, it is difficult to make good estimates from a training corpus which is often small with respect to the language. Various smoothing techniques for estimating n-gram statistics in the English language have been proposed and compared. It is known that how well various techniques actually perform depends on the problem domain in which the probabilistic model is appl...

Copyrighted to the author. Reproduction is prohibited without the author’s prior written consent.

Details

Collection: HKUST Electronic Theses
Degree: M.Phil.
Department: Computer Science
Authors: Cheung, Hon-kit
Subjects: Chinese characters; Data processing; Chinese language
Language: English
Call number: Thesis COMP 1997 Cheung
DOI: 10.14711/thesis-b564583
