Efficient correlated pattern discovery in databases

HKUST Electronic Theses

Efficient correlated pattern discovery in databases

by Yiping Ke

THESIS 2008

Ph.D. Computer Science and Engineering

xi, 120 leaves : ill. ; 30 cm

Abstract

Correlation mining has gained great success in many application domains for its ability to capture the underlying dependency between objects. However, existing research on correlation mining is mainly conducted on boolean databases, despite that other complex data, especially in various scientific and business domains, proliferates in recent years. In this thesis, we study the problem of correlated pattern discovery from two types of prevalently-used databases named quantitative databases and graph databases.

In mining correlations from quantitative databases, we propose a novel notion of Quantitative Correlated Patterns (QCPs), which is founded on two correlation measures, normalized mutual information and all-confidence. We also develop an algorithm, QCoMine, to efficiently mine QCPs by utilizing a supervised interval combining method and performing a bi-level pruning. We also identify the redundancy existing in the set of QCPs and propose effective techniques to eliminate the redundancy. Our extensive experiments on both real and synthetic datasets verify the efficiency of QCoMine and the quality of the QCPs. The experimental results also justify the effectiveness of our proposed techniques for redundancy elimination. To further demonstrate the usefulness and the quality of QCPs, we study an application of QCPs to the problem of classification. We demonstrate that the classifier built on QCPs achieves higher classification accuracy than the state-of-the-art classifiers built on association rules.

In mining graph databases, we formalize a new problem of Correlated Graph Search (CGS). CGS adopts Pearson’s correlation coefficient as a correlation measure to take into account the occurrence distributions of graphs. We derive two necessary conditions that set bounds on the occurrence probability of a candidate graph in the database. With this result, we devise an efficient algorithm that mines the candidate set from a much smaller projected database and thus we are able to obtain a significantly smaller set of candidates. Three heuristic rules are further developed to refine the candidate set. We also make use of the bounds to directly answer high-support queries without mining the candidates. We conduct a comprehensive set of experiments to justify the efficiency of our algorithm. The results show that the candidate generation from the projected database and the three heuristic rules are very effective in reducing the large search space. Compared with the approach that generates the candidates from the whole database, our algorithm is several orders of magnitude faster and consumes up to 40 times less memory. Finally, we prove that our algorithm provides a general solution when most of the correlation measures are used to generalize the CGS problem.

[ Hide abstract ]

View Copyrighted to the author. Reproduction is prohibited without the author’s prior written consent.

Details

Collection HKUST Electronic Theses Degree Ph.D. Department Computer Science and Engineering Supervisors Ng, Wilfred Authors Ke, Yiping Subjects Data mining Pattern recognition systems Correlation (Statistics) Language English Call number Thesis CSED 2008 Ke DOI 10.14711/thesis-b1023179

Full record

Efficient correlated pattern discovery in databases

by Yiping Ke

Post a Comment Cancel reply