THESIS
1993
xi, 110 leaves : ill. ; 30 cm
Abstract
Conversion of Chinese phonetic symbols to the corresponding Chinese characters is one of the most important topics currently being pursued in the field of Chinese information processing. In Mandarin Chinese, the character-to-syllable and syllable-to-character mappings are both many-to-many. There are around 1,300 different syllables in Mandarin, but more than thirteen thousand commonly used Chinese characters. Some syllables can be mapped to more than 100 characters (e.g. yi). In this research, the conversion of Chinese phonetic symbols to Chinese characters is based on linguistic and statistical techniques. The phonetic symbols are first segmented into a list of syllable words by the Augmented Maximal Matching method developed in this thesis. A syllable word is a sequence of syllables that can be transcribed to one or more valid Chinese words. Augmented Maximal Matching uses Maximal Matching as its backbone and integrates it with special techniques that identify derived words and with modules that use both linguistic and statistical methods to determine the final segmentation. The ambiguities in syllable words are then resolved by idiomatic phrase matching, adjacency constraint rules, and statistical methods. A working prototype system that demonstrates the techniques developed in the project, together with compilers for the linguistic information, has been implemented. Extensive experiments have been done on data from a range of domains, from science articles and linguistic text to political articles, and the sources of error have been identified and analyzed. Experimental results based on 264 sentences, with 3,001 characters, show that the error rate of transcription is 1.3%. The techniques and linguistic knowledge developed in this research may be useful for many applications, such as keyboard input of Chinese characters, Chinese speech recognition, and optical Chinese character recognition.
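To illustrate the segmentation step described in the abstract, the following is a minimal sketch of plain (unaugmented) Maximal Matching over a syllable sequence. It is not the thesis's Augmented Maximal Matching: the derived-word handling and the linguistic and statistical modules are omitted. The toy lexicon, the use of toneless pinyin as the phonetic symbols, the `max_len` limit, and the function name `maximal_matching` are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy lexicon: each syllable word (a tuple of syllables)
# maps to the Chinese words it could be transcribed to.
LEXICON: Dict[Tuple[str, ...], List[str]] = {
    ("zhong", "guo", "ren"): ["中国人"],
    ("zhong", "guo"): ["中国"],
    ("zhong",): ["中", "种", "重"],
    ("guo",): ["国", "过", "果"],
    ("ren",): ["人", "认"],
    ("hen",): ["很", "恨"],
    ("duo",): ["多", "朵"],
}


def maximal_matching(
    syllables: Sequence[str],
    lexicon: Dict[Tuple[str, ...], List[str]],
    max_len: int = 4,
) -> List[Tuple[str, ...]]:
    """Greedily split a syllable sequence into the longest syllable words
    found in the lexicon, scanning left to right."""
    words: List[Tuple[str, ...]] = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink one syllable at a time.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = tuple(syllables[i:i + n])
            if candidate in lexicon:
                break
        else:
            # No lexicon entry matched: keep the single syllable as-is.
            candidate = (syllables[i],)
        words.append(candidate)
        i += len(candidate)
    return words


segments = maximal_matching(["zhong", "guo", "ren", "hen", "duo"], LEXICON)
for seg in segments:
    # Each syllable word may still map to several Chinese words.
    print(seg, LEXICON.get(seg, []))
```

In this sketch each resulting syllable word can still correspond to several Chinese words (for example, `("hen",)` maps to two candidates), which is the ambiguity the abstract says is resolved afterwards by idiomatic phrase matching, adjacency constraint rules, and statistical methods.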