Building phrase based language model from large corpus
by Tang Haijiang
M.Phil. Electrical and Electronic Engineering
x, 79 leaves : ill. ; 30 cm
Statistical language models (SLMs) encode linguistic information as estimates of the probability distribution of natural language, and have been applied successfully in a variety of language processing applications.
Currently, most SLMs are based on words: a language model is trained on text using a pre-defined lexicon. However, even language models trained on huge bodies of text with very large lexicons yield a significant number of unreliable estimates, owing to the lack of linguistic information and the inaccurate independence assumption of language modeling. In fact, the word-based n-gram language model, the most commonly used SLM, uses so little linguistic knowledge that it may be applied to any sequence of symbols with no deep structure or meaning behind them. One solution is to encode linguistic information into lexical units that span a longer context, that is, to include phrases as linguistic units for language modeling. The research presented in this thesis focuses on using automatically extracted phrases for language model training.
In this thesis, we investigate techniques for building phrase-based language models. We compare phrase extraction approaches that use different statistical information obtained from the training data. The experimental results show that the phrase-based language model addresses the main problems of the word-based n-gram model and hence systematically and significantly improves perplexity and recognition accuracy. We also propose a new approach that outperforms the previous methods.
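As an illustration of what statistics-driven phrase extraction can look like, the sketch below selects adjacent word pairs by pointwise mutual information (PMI) and rewrites them as single tokens before language model training. PMI is only one possible criterion and is an assumption here, not necessarily a measure used in the thesis; the function names, thresholds, and toy corpus are hypothetical.

```python
import math
from collections import Counter


def extract_phrases(sentences, min_count=2, min_pmi=1.0):
    """Return adjacent word pairs whose PMI (in bits) is at least min_pmi.

    Assumption: PMI is used as the selection statistic purely for illustration.
    """
    unigrams = Counter()
    bigrams = Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    total_words = sum(unigrams.values())
    total_pairs = sum(bigrams.values())

    phrases = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose estimates are unreliable
        p_pair = count / total_pairs
        p1 = unigrams[w1] / total_words
        p2 = unigrams[w2] / total_words
        pmi = math.log2(p_pair / (p1 * p2))
        if pmi >= min_pmi:
            phrases[(w1, w2)] = pmi
    return phrases


def merge_phrases(words, phrases):
    """Greedily rewrite selected pairs, left to right, as single tokens."""
    merged = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in phrases:
            merged.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged


# Hypothetical toy corpus, already tokenized into words.
corpus = [
    "i live in hong kong".split(),
    "hong kong is a city".split(),
    "i live in a city".split(),
]
phrases = extract_phrases(corpus)
merged_first = merge_phrases(corpus[0], phrases)
print(merged_first)
```

The merged token stream can then be fed to an ordinary n-gram trainer, so that each phrase is treated as one lexical unit with a longer effective context.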
Another contribution of this thesis is robust Chinese syllable-to-word decoding. Syllable-to-word decoding is very important for Chinese keyboard input and is also a main part of Chinese CSR. However, the inherent ambiguities of the Chinese language hamper accurate decoding. We present a multi-path search algorithm that addresses this problem and significantly improves recognition accuracy.
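To make the idea of keeping several candidate paths concrete, here is a minimal sketch of a beam-style search over a syllable sequence. It is not the thesis's algorithm: the lexicon, its probabilities, the unigram scoring, and the names `LEXICON` and `decode` are all made up for illustration.

```python
import heapq
import math

# Hypothetical toy lexicon: syllable tuple -> list of (word, probability).
# The probabilities are invented for the example.
LEXICON = {
    ("bei",): [("北", 0.04), ("被", 0.06)],
    ("jing",): [("京", 0.03), ("经", 0.05)],
    ("bei", "jing"): [("北京", 0.02), ("背景", 0.01)],
}


def decode(syllables, lexicon, beam=3, max_len=4):
    """Keep up to `beam` best partial paths ending at each syllable position.

    Assumption: paths are scored with unigram word log-probabilities only.
    """
    # paths[i] holds (log_prob, words) pairs covering syllables[:i].
    paths = [[] for _ in range(len(syllables) + 1)]
    paths[0] = [(0.0, [])]

    for end in range(1, len(syllables) + 1):
        candidates = []
        for start in range(max(0, end - max_len), end):
            key = tuple(syllables[start:end])
            for word, prob in lexicon.get(key, []):
                for score, words in paths[start]:
                    candidates.append((score + math.log(prob), words + [word]))
        # Retain several hypotheses instead of committing to a single best one.
        paths[end] = heapq.nlargest(beam, candidates, key=lambda c: c[0])

    return paths[-1]


results = decode(["bei", "jing"], LEXICON)
for score, words in results:
    print(round(score, 2), "".join(words))
```

Keeping more than one path per position means a homophone that looks unlikely early on can still win once later context is scored, which is the kind of ambiguity the abstract describes.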