THESIS
2012
xiii, 123 p. : ill. ; 30 cm
Abstract
Entity resolution (ER) identifies and merges records judged to represent the same real-world
entity. With the development of the Internet, ER for hidden Web data has become
increasingly important in real-world applications such as online search engines and Web
data integration. Hidden Web data often originates from different data sources that
usually have different schemas. As a consequence, there is no single most efficient way to
compare and merge records from different schemas. Moreover, existing
techniques that put all records together under a unified schema are often not suitable.
In this thesis, we investigate ER methods for hidden Web data using a multi-schema
approach. That is, we keep the data under the original schemas instead of placing them under
a unified schema. Based on the multi-schema structure, a pairwise ER method called
Validity-Ensured and Order-Sensitive (VEOS) has been previously proposed. Here,
we first propose two techniques for improving the performance of the VEOS method. Since
duplicates that exist in the same data source may adversely affect recall performance, the first
technique applies an expanding window to VEOS to enhance the recall performance. To
reduce the number of record pair comparisons, our second technique separates the records in
large data sources into several blocks, so that only records in the blocks with the same key
values need to be compared. Then, we propose an efficient ER method for online query data
integration, which self-trains weights for the schema fields (attributes),
such that more representative attributes will be used for the ER process.
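
The blocking idea described above (comparing only records whose blocks share a key value) can be illustrated with a minimal sketch. This is not the VEOS implementation from the thesis: the record layout, the `title` field, and the `blocking_key` function (first three characters of the lower-cased title) are assumptions chosen purely for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key (an assumption, not from the thesis):
    # the first three characters of the lower-cased title.
    return record["title"].strip().lower()[:3]


def candidate_pairs(records, key_fn=blocking_key):
    """Return index pairs of records that fall into the same block."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[key_fn(record)].append(idx)

    pairs = []
    for members in blocks.values():
        # Only records inside one block are ever compared with each other.
        pairs.extend(combinations(members, 2))
    return pairs


# Illustrative records only; not data from the thesis experiments.
records = [
    {"title": "Entity Resolution Basics"},
    {"title": "entity resolution basics (2nd ed.)"},
    {"title": "Hidden Web Crawling"},
    {"title": "Hidden web crawling methods"},
    {"title": "Schema Matching"},
]

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

With a key this coarse, true duplicates whose titles start differently would land in different blocks and never be compared, so the choice of key trades fewer comparisons against possible loss of recall.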
Extensive experiments on real online data sets from different domains and on synthetic
data sets demonstrate the scalability of the ER algorithms, the efficiency of the
advanced VEOS approaches, and the effectiveness of our proposed ER method for online
querying.