THESIS
2002
xiii, 71 leaves : ill. ; 30 cm
Abstract
Most Web search engines are content word based, but it is not natural for users to translate their queries into content words (CWs). Searching with spontaneous natural language queries is therefore desirable. However, most popular content-word-based search engines, such as Yahoo and AltaVista, do not handle natural language queries well: the filler words (FWs) in such a query introduce a great deal of noise, resulting in low precision and recall for the search results.
Mixed-language use is common in Asian societies. A mixed-language query consists of words in a primary language and a secondary language. For example, when searching the Web in Chinese (the primary language), we may embed some English (the secondary language) words in the query. For cross-language information retrieval, we may want to translate the secondary language into the primary language, or vice versa. In many situations, however, the CWs are ambiguous and difficult to translate, leading to poor search precision and recall.
This thesis aims to solve these problems by investigating the use of natural language.
In the first part of the thesis, we propose an HMM-based method for extracting CWs from a natural language query for Web search. The most intuitive way to identify the CWs in a natural language query is to filter out the FWs, but whether a word is a CW or an FW is context dependent. The proposed HMM-based method takes the context into account, and the performance of the CW extractor is comparable to that of a human.
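As a rough illustration of how an HMM can label query words as CW or FW in context, the following is a minimal sketch of two-state Viterbi decoding. It is not the thesis's model: the probability tables, the `UNSEEN` smoothing floor, and the function names `viterbi` and `extract_content_words` are all assumptions invented for this example. The word "can" is deliberately given both CW and FW emission probabilities so that its label depends on its neighbours.

```python
import math

STATES = ("CW", "FW")

# Toy parameters invented for illustration only (not taken from the thesis).
# Emission tables list only a few words; the remaining mass is left implicit.
START = {"CW": 0.3, "FW": 0.7}
TRANS = {
    "CW": {"CW": 0.8, "FW": 0.2},
    "FW": {"CW": 0.3, "FW": 0.7},
}
EMIT = {
    "CW": {"hotels": 0.2, "cheap": 0.15, "paris": 0.2,
           "aluminium": 0.15, "recycling": 0.15, "can": 0.05},
    "FW": {"where": 0.15, "can": 0.15, "i": 0.2, "find": 0.1,
           "in": 0.1, "to": 0.1, "would": 0.1},
}
UNSEEN = 1e-4  # assumed smoothing floor for words a state never emitted


def viterbi(words):
    """Return the most likely CW/FW tag sequence for a list of words."""
    if not words:
        return []

    def emit(state, word):
        return math.log(EMIT[state].get(word, UNSEEN))

    # score[s] = best log probability of any tag path ending in state s
    score = {s: math.log(START[s]) + emit(s, words[0]) for s in STATES}
    back = []  # back[i][s] = best previous state for state s at word i + 1
    for word in words[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + emit(s, word)
            pointers[s] = prev
        score = new_score
        back.append(pointers)

    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: score[s])
    tags = [state]
    for pointers in reversed(back):
        state = pointers[state]
        tags.append(state)
    tags.reverse()
    return tags


def extract_content_words(query):
    """Keep only the words that the decoder tags as CW."""
    words = query.lower().split()
    return [w for w, tag in zip(words, viterbi(words)) if tag == "CW"]


print(extract_content_words("Where can I find cheap hotels in Paris"))
print(extract_content_words("aluminium can recycling"))
```

With these toy numbers, "can" is dropped as a filler word in the first query and kept as a content word in the second, which a fixed stop-word list could not do. In practice the tables would be estimated from labelled queries rather than set by hand.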
In the second part of the thesis, we propose a context-dependent statistical method for translating a mixed-language query into a monolingual query in either language. Two novel features for disambiguation are introduced, and the disambiguator is trained using co-occurrence information from monolingual data only. The average query translation accuracy is 83.72%, compared with a baseline accuracy of 75.5%.
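The general idea of disambiguating a translation with monolingual co-occurrence counts can be sketched as follows. This is only an assumed, simplified scoring scheme: the abstract does not describe the thesis's two disambiguation features, so none of this should be read as the author's method. The tiny corpus, the `LEXICON` entries, and the function names are hypothetical. Each embedded English word is replaced by the candidate Chinese translation that co-occurs most often with the other query words in a Chinese-only corpus.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tokenised sentences in the primary language (Chinese) only.
CORPUS = [
    ["银行", "贷款", "利率"],   # bank, loan, interest rate
    ["银行", "利率", "上调"],   # bank, interest rate, raise
    ["河岸", "钓鱼", "风景"],   # riverbank, fishing, scenery
    ["河岸", "散步"],           # riverbank, stroll
]

# Hypothetical bilingual lexicon: secondary-language word -> candidates.
LEXICON = {"bank": ["银行", "河岸"]}


def build_cooccurrence(sentences):
    """Count how often each unordered word pair shares a sentence."""
    counts = Counter()
    for words in sentences:
        for a, b in combinations(sorted(set(words)), 2):
            counts[(a, b)] += 1
    return counts


def cooccur(counts, a, b):
    """Look up the count for an unordered pair of words."""
    return counts[tuple(sorted((a, b)))]


def translate_mixed_query(query, lexicon, counts):
    """Replace each lexicon word with its best-supported candidate."""
    context = [w for w in query if w not in lexicon]
    result = []
    for word in query:
        if word not in lexicon:
            result.append(word)
            continue
        # Score = total co-occurrence with the primary-language context.
        # Ties fall back to the first candidate listed in the lexicon.
        best = max(
            lexicon[word],
            key=lambda cand: sum(cooccur(counts, cand, c) for c in context),
        )
        result.append(best)
    return result


COUNTS = build_cooccurrence(CORPUS)
print(translate_mixed_query(["bank", "贷款", "利率"], LEXICON, COUNTS))
print(translate_mixed_query(["bank", "钓鱼"], LEXICON, COUNTS))
```

Because the counts come from Chinese text alone, no parallel bilingual corpus is needed for training, which is the property the abstract highlights. A context-independent baseline would instead always pick the same candidate for "bank".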