THESIS
2002
xiii, 106 leaves : ill. (some col.) ; 30 cm
Abstract
Information retrieval refers to the representation, storage, organization, and accessing of information items. It has become one of the most important subjects in Information and Computer Science. Statistical physics, applied to many branches of information processing, has led to fruitful results. However, little attention has been paid to the combination of the two disciplines. This thesis is devoted to studying the application of statistical physics to information retrieval.
I investigate a probabilistic model for information retrieval. The documents, queries and relevancy assessments are modeled with explicit parametric distributions. The total probability distribution is then obtained according to a probability meta-structure. Two variations of the meta-structure are discussed, based on annealed and quenched interactions between documents and queries, respectively. The physical pictures of the two variations are discussed and compared.
I apply statistical physics to analyze the quenched model. The zeroth order of the partition function gives a set of mean field equations. Two approaches are used to solve these equations. One is the Augmented Lagrangian Method from constrained optimization; the other is a direct iterative algorithm that exploits the special form of the mean field equations. The complexity of the two algorithms is analyzed, and the analysis suggests that the iterative algorithm is superior.
Our model and the mean field algorithm are evaluated on benchmark test collections. They are compared with the standard tf-idf scheme, latent semantic indexing and Rocchio heuristics in information retrieval. Significant improvement in retrieval precision is obtained in most cases. The empirical results of the experiments are discussed and we draw conclusions about the contexts that most favor our model.
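For readers unfamiliar with the tf-idf baseline named above, the following is a minimal sketch of one common variant: raw term frequency times log inverse document frequency, ranked by cosine similarity. It is not taken from the thesis. The weighting variant, the toy documents, and all function names (`tokenize`, `build_idf`, `tfidf_vector`, `cosine`, `rank`) are assumptions made for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stop words.
    return text.lower().split()


def build_idf(docs_tokens):
    # idf(t) = log(N / df(t)), where df(t) counts documents containing t.
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    # Sparse vector: raw term count times idf; terms unseen in the collection are dropped.
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    # Return document indices sorted by decreasing cosine similarity to the query
    # (ties broken by document index).
    docs_tokens = [tokenize(d) for d in docs]
    idf = build_idf(docs_tokens)
    q_vec = tfidf_vector(tokenize(query), idf)
    scored = [(cosine(q_vec, tfidf_vector(t, idf)), i)
              for i, t in enumerate(docs_tokens)]
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Toy collection, invented for this example.
docs = [
    "statistical physics of spin glasses",
    "probabilistic model for information retrieval",
    "mean field equations in physics",
]
ranking = rank("information retrieval model", docs)
```

In this toy collection only the second document shares terms with the query, so it is ranked first. A baseline of this kind scores each document independently of any relevancy assessments.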
Our model has three hyperparameters, and good estimation of them is of key importance. First, we use a brute-force cross-validation approach. Second, we consider the second-order expansion of the partition function, which is equivalent to the Bayesian evidence framework. The interpretation of the hyperparameters is also discussed.