The development of an effective co-training framework for adapting metasearch engine rankers

HKUST Electronic Theses

by Qingzhao Tan

THESIS 2004

M.Phil. Computer Science

xi, 73 leaves : ill. ; 30 cm

Abstract

A metasearch engine consists of a number of component search engines, which typically have different strengths in terms of coverage and search quality. The function used to combine and re-rank the results retrieved by the underlying search engines is critical, since it determines the metasearch engine's performance. This thesis presents a technique for training a metasearch engine to adapt to the specific needs of an individual or a user community with similar interests. The training data is clickthrough data captured by a search engine: a set of queries and, for each query, the links returned by the search engine and the links that users clicked on. Clickthrough data is desirable for training because it is a form of implicit user feedback and can be obtained from the search engine server promptly and inexpensively.
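The clickthrough data described above can be pictured as one record per query. The sketch below is illustrative only: the names `ClickthroughRecord` and `preference_pairs` are hypothetical, and the rule used to derive preferences (a clicked link is preferred over unclicked links ranked above it) is a common heuristic from the clickthrough literature, assumed here because the abstract does not state how the thesis extracts preferences.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class ClickthroughRecord:
    """One logged query: what the engine returned and what the user clicked."""
    query: str
    returned: List[str]   # links in the order the search engine ranked them
    clicked: Set[str]     # subset of `returned` that the user clicked on


def preference_pairs(record: ClickthroughRecord) -> List[Tuple[str, str]]:
    """Derive (preferred, less_preferred) link pairs from one record.

    Assumed heuristic (not taken from the thesis): a clicked link is
    preferred over every unclicked link that was ranked above it.
    """
    pairs = []
    for position, link in enumerate(record.returned):
        if link in record.clicked:
            for skipped in record.returned[:position]:
                if skipped not in record.clicked:
                    pairs.append((link, skipped))
    return pairs
```

For a record with links `a, b, c, d` returned in that order and only `c` clicked, this yields the pairs `(c, a)` and `(c, b)`. Pairwise preferences of this kind are the usual input to a ranking learner.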

Since the clickthrough data available from an individual or a user community with similar interests is typically sparse, we propose to apply co-training and a Ranking Support Vector Machine (RSVM) to train the metasearch engine. RSVM can efficiently combine and re-rank the results returned by the underlying search engines to meet the needs of an individual or a user community with similar interests. In addition, co-training is known to be effective when the amount of training data is small, which makes RSVM more effective in this setting. We call this technique RSCF (Ranking SVM in a Co-training Framework). Essentially, it takes a set of clickthrough data as input and, through a learning process, generates adaptive rankers as output. By analyzing the clickthrough data, RSCF first divides it into the labeled data set, which contains the search items that users have already scanned, and the unlabeled data set, which contains the items that users have not yet scanned. We then augment the labeled data with unlabeled data and re-rank the results according to their relevance. Finally, we obtain a larger data set for training the final rankers in RSCF.
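The general idea of co-training two rankers can be sketched as follows. This is a minimal illustration under stated assumptions, not the RSCF algorithm itself: the abstract does not specify how features are split into views, how confidence is measured, or which solver is used. Here each item is assumed to carry two feature views, `train_ranker` is a simple hinge-loss SGD stand-in for a Ranking SVM solver, and `threshold` is a hypothetical confidence cutoff.

```python
import random


def train_ranker(pairs, dim, epochs=50, lr=0.1, reg=0.01, seed=0):
    """Linear pairwise ranker trained by SGD on a hinge loss.

    A simplified stand-in for a Ranking SVM solver. Each pair is
    (better_features, worse_features).
    """
    rng = random.Random(seed)
    w = [0.0] * dim
    pairs = list(pairs)
    for _ in range(epochs):
        rng.shuffle(pairs)
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            w = [wi * (1.0 - lr * reg) for wi in w]       # L2 shrinkage
            if margin < 1.0:                              # pair violates margin
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def score_diff(w, a, b):
    """Positive when the ranker with weights w places a above b."""
    return sum(wi * (ai - bi) for wi, ai, bi in zip(w, a, b))


def co_train(labeled, unlabeled, dim_a, dim_b, rounds=3, threshold=1.0):
    """Co-train two rankers on two feature views of the same items.

    Each item is a tuple (view_a_features, view_b_features).
    `labeled` holds (better_item, worse_item) pairs; `unlabeled` holds
    pairs whose order is unknown. When one ranker is confident about an
    unlabeled pair, the pair is added, with that ranker's ordering, to
    the other ranker's training set.
    """
    train_a, train_b = list(labeled), list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        w_a = train_ranker([(x[0], y[0]) for x, y in train_a], dim_a)
        w_b = train_ranker([(x[1], y[1]) for x, y in train_b], dim_b)
        remaining = []
        for x, y in pool:
            d_a = score_diff(w_a, x[0], y[0])
            d_b = score_diff(w_b, x[1], y[1])
            moved = False
            if abs(d_a) >= threshold:        # ranker A teaches ranker B
                train_b.append((x, y) if d_a > 0 else (y, x))
                moved = True
            if abs(d_b) >= threshold:        # ranker B teaches ranker A
                train_a.append((x, y) if d_b > 0 else (y, x))
                moved = True
            if not moved:
                remaining.append((x, y))
        if len(remaining) == len(pool):      # nothing new was labeled
            break
        pool = remaining
    # Final rankers are trained on the augmented training sets.
    w_a = train_ranker([(x[0], y[0]) for x, y in train_a], dim_a)
    w_b = train_ranker([(x[1], y[1]) for x, y in train_b], dim_b)
    return w_a, w_b, pool
```

The function returns the two final weight vectors and whatever unlabeled pairs neither ranker was confident about. The point of the design is that each ranker's confident predictions enlarge the other's training set, which is how co-training compensates for a small labeled set.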

We first carry out experiments to demonstrate that the RSCF algorithm produces better ranking results than the standard Ranking SVM algorithm in terms of prediction error. We also apply RSCF to build two metasearch engines: one comprises MSNsearch, Wisenut and Overture, and the other adds Google to these three. We show that, in general, the metasearch engines have better search quality than their underlying search engines. In particular, the first metasearch engine, which does not include Google as a component, can produce better search quality than Google, which is generally considered the most powerful search engine. The second metasearch engine demonstrates that while Google performs excellently, it does not excel in every query category. These metasearch engines can be adapted to bring out the strengths of the underlying search engines.


Copyrighted to the author. Reproduction is prohibited without the author’s prior written consent.

Details

Collection: HKUST Electronic Theses
Degree: M.Phil.
Department: Computer Science
Authors: Tan, Qingzhao
Subjects: Search engines Programming Computer algorithms Metadata
Language: English
Call number: Thesis COMP 2004 Tan
DOI: 10.14711/thesis-b837458
