THESIS
2007
xiii, 94 leaves : ill. ; 30 cm
Abstract
Due to the rapid growth of the World Wide Web, applications running on centralized systems are no longer able to handle the large volume of data that is geographically distributed all over the world. Distributed systems, such as peer-to-peer (P2P) systems and meta-search systems, have become a popular and, to some extent, revolutionary solution to large-scale data sharing and retrieval. They offer advantages such as autonomy and flexibility for peers to join and leave the system, scalability through the addition of inexpensive peers, and robustness against single-peer failures. However, the "open nature" of P2P systems and their lack of centralized control pose difficult challenges for full-text search, which has been implemented successfully in centralized systems with powerful search ability and high precision.
In this thesis, we study keyword search methods in meta-search and P2P networks. For meta-search, we propose a new server ranking approach in which each search engine's document collection is divided into clusters based on the index terms, and the term correlation information of the clusters is used to improve the server ranking quality. We develop two methods for deriving term correlation information from a cluster. The first method records term correlation for each pair of words found in a document cluster. The second method applies Latent Semantic Indexing (LSI) to map a query into a semantic vector for each cluster and judges the relevance of a cluster based on the properties of the semantic vectors.
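To make the first method concrete, the following is a minimal sketch of pairwise term correlation used for cluster ranking. It is an illustration under assumptions, not the thesis's actual formulation: the abstract does not specify how correlation is measured or combined, so the sketch simply counts how many documents in a cluster contain both words of a pair and scores a cluster by summing those counts over the query's word pairs. The function names (pair_counts, cluster_score, rank_clusters) and the whitespace tokenization are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(cluster_docs):
    """For each unordered word pair, count the documents that contain both words."""
    counts = Counter()
    for doc in cluster_docs:
        # Sorting makes every pair appear in one canonical order.
        terms = sorted(set(doc.lower().split()))
        counts.update(combinations(terms, 2))
    return counts


def cluster_score(query, counts):
    """Assumed scoring rule: sum of co-occurrence counts over the query's word pairs."""
    terms = sorted(set(query.lower().split()))
    return sum(counts[pair] for pair in combinations(terms, 2))


def rank_clusters(query, clusters):
    """Return cluster names ordered from highest to lowest score (ties by name)."""
    scored = {
        name: cluster_score(query, pair_counts(docs))
        for name, docs in clusters.items()
    }
    return sorted(scored, key=lambda name: (-scored[name], name))


# Toy clusters, invented purely for illustration.
clusters = {
    "a": ["peer network search", "peer search index"],
    "b": ["latent semantic indexing", "peer review"],
}
ranking = rank_clusters("peer search", clusters)
```

In this toy data, the words "peer" and "search" appear together in two documents of cluster "a" and in none of cluster "b", so "a" is ranked first. A per-term statistic alone would not separate a cluster where the query words co-occur from one where each appears only in unrelated documents.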
For P2P networks, we propose an efficient and scalable technique to support partial-match queries. A distributed index structure, called the distributed pattern tree (DPTree), is developed to record frequent query patterns, i.e., combinations of keywords, learnt from the query history at each node in the network. Using DPTree, a query can quickly identify its best-matching patterns, and data lookup can be done in logarithmic time with respect to the network size.
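The idea of learning frequent query patterns and matching a new query against them can be sketched on a single node as follows. This is not the DPTree itself: the abstract does not describe the tree layout or the distributed lookup, so the sketch uses a flat set of patterns and makes no claim about logarithmic lookup time. The names frequent_patterns and best_match, the min_support and max_size parameters, and the "largest contained pattern" matching rule are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_patterns(query_history, min_support=2, max_size=3):
    """Collect keyword combinations that occur in at least min_support past queries."""
    counts = Counter()
    for query in query_history:
        keywords = sorted(set(query))
        for size in range(1, min(max_size, len(keywords)) + 1):
            counts.update(frozenset(c) for c in combinations(keywords, size))
    return {pattern for pattern, n in counts.items() if n >= min_support}


def best_match(query, patterns):
    """Assumed rule: pick the largest recorded pattern fully contained in the query."""
    query_set = set(query)
    candidates = [p for p in patterns if p <= query_set]
    if not candidates:
        return None
    # Ties between equally large patterns are broken alphabetically.
    return max(candidates, key=lambda p: (len(p), sorted(p)))


# Toy query history, invented purely for illustration.
history = [
    ["p2p", "search"],
    ["p2p", "search", "index"],
    ["index", "tree"],
]
patterns = frequent_patterns(history)
match = best_match(["p2p", "search", "ranking"], patterns)
```

Here the combination {"p2p", "search"} appears in two past queries, so it is kept as a pattern and is returned as the best match for a new query that contains both keywords plus an unseen one. This is the partial-match behaviour the paragraph describes, reduced to one node.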