THESIS
2000
x, 56 leaves : ill. ; 30 cm
Abstract
This thesis presents a colloquial language modeling technique for spontaneous Cantonese speech recognition. Cantonese is linguistically distinct from standard written Chinese, differing in certain vocabulary and in word order. Cantonese large vocabulary continuous speech recognition (LVCSR) systems therefore need to be trained with a Cantonese language model. Since Cantonese is not a written language, Cantonese text corpora are scarce. Moreover, spoken Cantonese tends to be more spontaneous and colloquial than spoken Mandarin. To underline the language difference between Cantonese and Mandarin (standard Chinese), we quantify the similarity between Cantonese and Mandarin texts by applying Zipf's Law of Language Distance and R² regression scores to measure common words between the two languages. These scores show that Taiwan and Chinese newsgroup articles are more similar to each other (with a 0.83 R² regression score) than either is to Hong Kong articles (0.79 and 0.67). Any similarity between Cantonese and Mandarin is mostly due to an overlap in content word vocabulary. To collect a spontaneous Cantonese speech corpus, we set up a Wizard-of-Oz database collection system. Spontaneous Cantonese speech consists of colloquial and filler phrases together with keywords that are shared between Cantonese and Mandarin. We use a statistical tool to extract colloquial and filler phrases from this database. Additional filler phrases are collected from written Cantonese downloaded from online newsgroups. The extracted filler and colloquial phrases are used for keyword spotting and for a colloquial Cantonese dictation system. By applying garbage and filler phrase modeling, we obtain 82.5% keyword spotting accuracy, a 33% improvement over our baseline system. To train a Cantonese language model for our colloquial Cantonese LVCSR system, we adapt our baseline Mandarin language model to a Cantonese language model using linear interpolation between a large Mandarin text corpus and a small Cantonese text corpus. This gives a 30% improvement in character accuracy over our baseline system.
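The linear interpolation mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes unigram maximum-likelihood models, toy character sequences, and an arbitrary interpolation weight `lam`, none of which are specified in the abstract. The function names `unigram_model` and `interpolate` are hypothetical.

```python
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram probabilities from a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def interpolate(p_cantonese, p_mandarin, lam):
    """Mix two models: lam * P_cantonese(w) + (1 - lam) * P_mandarin(w).

    lam is an assumed weight in [0, 1]; in practice it would be tuned
    on held-out Cantonese data.
    """
    vocab = set(p_cantonese) | set(p_mandarin)
    return {
        w: lam * p_cantonese.get(w, 0.0) + (1.0 - lam) * p_mandarin.get(w, 0.0)
        for w in vocab
    }


# Toy data (assumption): stand-ins for a large Mandarin corpus
# and a small Cantonese corpus.
mandarin_tokens = ["我", "們", "是", "學", "生", "我", "是"]
cantonese_tokens = ["我", "哋", "係", "學", "生"]

p_m = unigram_model(mandarin_tokens)
p_c = unigram_model(cantonese_tokens)
p_mix = interpolate(p_c, p_m, lam=0.5)
```

Because both inputs are proper distributions, the mixture is too. Words seen only in the small Cantonese corpus (such as 係) keep nonzero probability, while shared content words draw support from both corpora.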