THESIS
2013
xi, 130 pages : illustrations ; 30 cm
Abstract
With the broad usage of Internet of Things (IoT), pervasive computing, and
crowdsourcing techniques in the modern society, a growing number of data monitoring and collection systems lead to the uncertainty in massive data. For example,
data integration of multiple data sources causes uncertainty. Thus, mining probabilistic data (or called uncertain data in this thesis) has become a hot topic
in recent years. In particular, mining frequent patterns is one of the most fundamental problems in traditional data mining researches, thus mining frequent
patterns over probabilistic databases has attracted much attention in the database
and data mining communities. In the scenario of probabilistic data, the support
of an itemset is a discrete random variable rather than a frequency of this...[
Read more ]
With the broad usage of Internet of Things (IoT), pervasive computing, and
crowdsourcing techniques in the modern society, a growing number of data monitoring and collection systems lead to the uncertainty in massive data. For example,
data integration of multiple data sources causes uncertainty. Thus, mining probabilistic data (or called uncertain data in this thesis) has become a hot topic
in recent years. In particular, mining frequent patterns is one of the most fundamental problems in traditional data mining researches, thus mining frequent
patterns over probabilistic databases has attracted much attention in the database
and data mining communities. In the scenario of probabilistic data, the support
of an itemset is a discrete random variable rather than a frequency of this itemset.
Hence, the frequent itemset under uncertain data has different definitions
due to variation of probabilistic semantics, which even generates inconsistent results
in current studies. However, the relationship between different definitions
and the inconsistent results has not yet been thoroughly identified and explored.
Furthermore, like its counterpart, mining frequent patterns in deterministic data,
mining frequent patterns over uncertain data cannot avoid an exponential number
of frequent patterns which cause the mining results less useful. In this thesis, we demonstrate how our solution can clarify the problems in existing studies and
address the challenge of an exponential number of mining results. Moreover, our
solution is also well applied to constructing effective indexes for query processing
over other types of uncertain data, i.e., querying uncertain graphs. In addition,
with the advent of big data era, we also propose the solution of finding frequent
patterns in distributed uncertain data. To summarize, our study covers the following
four aspects:
1) We conduct a comprehensive experimental study of existing representative frequent
itemset mining algorithms over probabilistic databases and clarify several
existing inconsistent conclusions;
2) We formalize a novel problem of mining probabilistic frequent closed itemsets
in an uncertain transaction database and design an efficient solution, which includes
a series of pruning techniques and an effective sampling algorithm.
3) We study the problem of efficient probabilistic supergraph containment query
and provide the efficient solution, which integrates probabilistic frequent pattern
mining technique for constructing the index.
4) We propose the problem of finding frequent items in distributed probabilistic
data and provide a local thresholds-based deterministic algorithm and a sketch-based
sampling method to reduce both communication and computation costs.
We validate our solutions through extensive experiments and discuss several
potential research directions of mining frequent pattern over probabilistic databases.
Post a Comment