THESIS
2005
xi, 134 leaves : ill. ; 30 cm
Abstract
With the large amount of data collected in various applications, data mining has become an essential tool for data analysis and decision support. Frequent itemset mining is an important problem in the data mining area with a wide range of applications. In this dissertation, we investigate several techniques to support efficient and scalable frequent itemset mining....[
Read more ]
With the large amount of data collected in various applications, data mining has become an essential tool for data analysis and decision support. Frequent itemset mining is an important problem in the data mining area with a wide range of applications. In this dissertation, we investigate several techniques to support efficient and scalable frequent itemset mining.
We first identify the key factors of a frequent itemset mining algorithm, and propose an algorithm AFOPT for efficient mining of frequent itemsets. The AFOPT algorithm adopts the pattern growth approach. It uses a new struc-ture, called Ascending Frequency Ordered Prefix-Trie (or AFOPT for short), and two other structures to store conditional databases. An adaptive strategy is em-ployed to choose proper structures according to the density of the conditional databases, which makes the AFOPT algorithm perform well on both sparse and dense databases.
To deal with very large databases with millions of transactions, we propose a Search Space based Partitioning approach SSP for scalable out-of-core mining. The novel idea of SSP is to partition the database based on the search space. Dif-ferent partitions share data but not frequent itemsets. As such, frequent itemsets can be mined from each partition without the need to scan the whole database to verify their support. Furthermore, techniques are developed so that the avail-able memory can be utilized during the mining process to minimize disk I/O. Our experiment results show that the proposed algorithms achieve significant performance improvement over existing out-of-core mining algorithms.
Many decision support systems need to support online interactive frequent itemset mining, which is very challenging because frequent itemset mining is a computation intensive repetitive process. We propose a compact disk-based data structure---CFP-tree (Condensed Frequent Pattern tree) for storing precomputed frequent itemsets to support online mining requests. Efficient algorithms for retrieving frequent itemsets and association rules from a CFP-tree, as well as efficient algorithms to construct and update a CFP-tree, are developed. One obstacle to online association rule mining is the undesirably large result size. We introduce the concept of frequent itemset class and the class association rule to reduce the result size without information loss. The benefits of organizing itemsets into classes are two-fold. On the one hand, the output size is reduced and it is easier for users to browse the results. On the other hand, the computation cost is saved because the rule generation cost of the itemsets in the same class is shared. Our performance study demonstrates that with CFP-tree and the concept of frequent itemset class, frequent itemset and association rule mining requests can be responded to promptly.
Post a Comment