THESIS
2017
Abstract
Data fusion plays an important role in data mining because many applications require
high-quality data. Because online data may be out of date and errors may propagate as
sources copy from and refer to one another, merely applying existing data fusion methods
to Web data rarely yields satisfactory results.
To place our work in context, we first present an extensive survey of the data fusion
field. Since we use crowdsourcing as a tool to solve the problem, we then survey
crowdsourcing research as well.
In this thesis, we make use of the crowd to achieve high-quality data fusion. We design
a framework that selects a set of tasks to ask the crowd in order to improve confidence in
the data. Because data are correlated and crowd workers may provide incorrect answers,
selecting a proper set of tasks to ask the crowd is very challenging. We prove that the
problem is NP-hard and therefore design an approximation solution to address it. To
further improve efficiency, we design a pruning strategy and a preprocessing method,
which effectively improve the performance of the approximation solution.
Furthermore, we find that in certain scenarios we are interested not in all the facts
but only in a specific set of facts. For these scenarios, we develop another
approximation solution that is much faster than the general approximation solution.
Finally, we verify the solutions with extensive experiments on a real crowdsourcing
platform. We apply multiple existing machine-based data fusion methods and then apply
our refinement method to their results, showing that our method is general enough to
work with many methods.