THESIS
2012
xii, 120 p. : ill. ; 30 cm
Abstract
As matrix is a common data presentation in many applications, the submatrix mining problem has been extensively studied to reveal interesting correlations among a set of rows and a set of columns. Various submatrix models have been proposed to describe different local correlations, among which, the order-preserving submatrix (OPSM) model is proposed to capture the fact that the entry values of a set of rows follow the same trend under a set of columns. In this thesis, I systematically study the problem of mining OPSM patterns to address various real application needs....[
Read more ]
As matrix is a common data presentation in many applications, the submatrix mining problem has been extensively studied to reveal interesting correlations among a set of rows and a set of columns. Various submatrix models have been proposed to describe different local correlations, among which, the order-preserving submatrix (OPSM) model is proposed to capture the fact that the entry values of a set of rows follow the same trend under a set of columns. In this thesis, I systematically study the problem of mining OPSM patterns to address various real application needs.
In gene expression analysis, as the OPSM model is regarded to be restrictive and not practical when sample contamination inevitably exists, I consider two relaxation strategies to relax the strict OPSM model. Three relaxed OPSM models and the corresponding pattern mining methods have been proposed. Experimental studies on real biological data show that my relaxed OPSM models lead to the discovery of interesting correlations with high biological significance. The experiments also justify the efficiency of my proposed pattern mining methods.
Due to the prevalence of uncertain data, I define new probabilistic matrix representations, and formalize a novel probabilistic OPSM (POPSM) model. I propose an efficient POPSM mining framework called PROBAPRI. I demonstrate the superiority of my approach by two applications of POPSM. Experiments on biological datasets show that the POPSM model better captures the characteristics of the expression levels of biologically correlated genes. Experiments on a RFID trace dataset demonstrate the effectiveness of my POPSM model in capturing the common visiting behavior among users.
Data in applications like recommender systems can also be organized as matrices, which however usually have extremely large size and high degree of sparsity. I propose a sparse OPSM (SOPSM) model for sparse matrices, and adopt two criteria to control the number and distribution of patterns mined from large matrices. Accordingly, I formulate a constrained SOPSM mining problem, and develop an effective approach called QUICK PROBING (QP) and its parallelization scheme. Experiments on two datasets from real recommender systems show that the QP framework achieves high scalability and is efficient for handling matrices having millions of rows and columns. Besides, QP predicts the ordered preference of users over a set of items with high accuracy.
Post a Comment