THESIS
2002
xiv, 144 leaves : ill. ; 30 cm
Abstract
The Web, which has become one of the major information resources nowadays, contains billions of web pages, not including the huge amount of information encapsulated in web databases. Due to this huge and dynamic Web, it is difficult for a single search engine to index all of the web pages and yet keep its index database up-to-date. Meta-search and distributed search systems incorporating many single search engines can alleviate the problems associated with a single search engine....[
Read more ]
The Web, which has become one of the major information resources nowadays, contains billions of web pages, not including the huge amount of information encapsulated in web databases. Due to this huge and dynamic Web, it is difficult for a single search engine to index all of the web pages and yet keep its index database up-to-date. Meta-search and distributed search systems incorporating many single search engines can alleviate the problems associated with a single search engine.
A meta-search system is a middleware for a large number of underlying search engines. It receives queries from users and redirects them to the most relevant underlying search engines for processing. We propose two cluster based server selection methods (i.e., GSE and DE) in this thesis although they can also be adopted without clustering. The GSE server selection estimates each underlying server’s relevance according to the general similarity of the clusters in the server. Only the most similar clusters are counted in the server selection since these clusters tend to contain the majority of the relevant documents in the server. The DE server selection estimates the distance of some optimal documents in the server. These optimal documents, which are believed to be inherently more relevant to the query, contain the query terms in a balanced manner and are also very close to the query point. Experimental results show that both GSE and DE can achieve very good and steady performance.
We also propose a peer-to-peer distributed search system where a large number of autonomous search engines are logically connected in a flat non-hierarchical architecture (network). We model the distributed search process as Markov Decision Processes (MDPs). The estimated relevance of a server to a query is regarded as the reward in the MDP model. Once the MDP policies representing the global knowledge are obtained at each server through the asynchronous value iteration, the most relevant servers to a given query can be efficiently identified despite the lack of centralized control and global knowledge at each autonomous server.
For our specific MDP model, we design a passive asynchronous MDP value iteration based on the propagation of reward messages. The properties of the reward propagation are studied and the value iteration algorithm is optimized correspondingly. In the optimized value iteration, the MDP rewards are propagated along the optimal directions and the unnecessary rewards are discarded (i.e., the reward selection and the propagation are combined). In order to adapt to the dynamic web environment, we also solve the problem of the revision of the MDP policies upon either content or structural adjustments. The effectiveness of our MDP model is verified by extensive experiments.
Post a Comment