THESIS
2009
i, x, 64 p. : ill. ; 30 cm
Abstract
Duplicate entities are common on the Web, where structured XML data are increasingly prevalent. Duplicate detection, an important data cleaning task, consists of detecting different representations of the same real-world object. Detecting and resolving duplicate entities benefits Web users, so improving Web data quality requires algorithms for detecting duplicates.
In this thesis, we present a feature-dependent algorithm that efficiently identifies duplicates in XML Web data. First, we generate features related to the targeted duplicates. Then, based on the generated features, we create a function for measuring similarity. A threshold on the similarity score is used to decide whether candidate duplicates are real duplicates. We also introduce a further step, similarity function learning, to improve the duplicate detection results.
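The pipeline described above (feature generation, a similarity function over the features, and a threshold) can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not give the feature set, the similarity function, or the threshold value, so the element names (`cd`, `title`, `artist`, `year`), the weights, the 0.75 threshold, and the sample records below are all hypothetical. The sketch uses Python's standard `difflib.SequenceMatcher` as a stand-in string similarity, and it omits the similarity function learning step.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample data; element names and values are illustrative only.
SAMPLE_XML = """
<catalog>
  <cd id="1"><title>Abbey Road</title><artist>The Beatles</artist><year>1969</year></cd>
  <cd id="2"><title>Abbey Rd.</title><artist>Beatles</artist><year>1969</year></cd>
  <cd id="3"><title>Kind of Blue</title><artist>Miles Davis</artist><year>1959</year></cd>
</catalog>
"""

# Assumed feature weights and decision threshold (not taken from the thesis).
WEIGHTS = {"title": 0.5, "artist": 0.3, "year": 0.2}
THRESHOLD = 0.75


def extract_features(elem):
    """Step 1: turn an XML element into a dict of normalized text features."""
    return {name: (elem.findtext(name) or "").strip().lower() for name in WEIGHTS}


def similarity(a, b):
    """Step 2: weighted average of per-feature string similarities.

    Features missing on either side are skipped, and the remaining
    weights are renormalized so that missing data does not count as
    a mismatch.
    """
    total = 0.0
    weight_sum = 0.0
    for name, weight in WEIGHTS.items():
        if not a[name] or not b[name]:
            continue
        total += weight * SequenceMatcher(None, a[name], b[name]).ratio()
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def find_duplicates(xml_text, threshold=THRESHOLD):
    """Step 3: report every pair of records whose similarity meets the threshold."""
    root = ET.fromstring(xml_text)
    records = [(cd.get("id"), extract_features(cd)) for cd in root.iter("cd")]
    return [
        (id_a, id_b)
        for (id_a, feat_a), (id_b, feat_b) in combinations(records, 2)
        if similarity(feat_a, feat_b) >= threshold
    ]


if __name__ == "__main__":
    print(find_duplicates(SAMPLE_XML))
```

With the sample records, only the two Abbey Road entries score above the assumed threshold. Comparing all pairs is quadratic in the number of records, which is acceptable for a sketch but would need some form of candidate pruning on larger datasets.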
To show that this methodology can be broadly applied, we apply the algorithm to different kinds of XML Web data that are readily available on websites. We also use various entity types as duplicate targets in the experiments, such as CD names and authors. Moreover, we manually generate dirty data to show that our algorithm works well even when the datasets contain errors or missing information.