THESIS
2008
ix, 56 leaves : ill. ; 30 cm
Abstract
Entity resolution (ER) is the problem of identifying and merging records judged to represent the same real-world entity. Most previous ER approaches assume a unified (or global) schema under which all records are compared and merged on a field-by-field basis. We consider the multi-schema ER problem, in which records come from multiple sources with different schemas. A prime example of multi-schema ER is information integration over the deep web, where the goal is to integrate data from heterogeneous sources.
In this thesis, we formalize the multi-schema ER problem, investigate properties that hold in a unified-schema setting but not in a multi-schema setting, and identify the resolution conflicts that can arise when previous ER approaches are applied in a multi-schema setting. We then propose the validity-ensured and order-sensitive (VEOS) algorithm, which is free from such conflicts and can also take advantage of order scheduling to improve accuracy.
We identify schema-level and data-level criteria for distinguishing the more reliable comparisons, so that performing those comparisons first yields a more accurate result. To leverage this information, we construct a confidence graph, on which our scheduling algorithm is built. Our experiments on real online shopping data show that (1) our scheduling algorithm is very effective in improving accuracy, and (2) VEOS with scheduling outperforms other methods in both accuracy and efficiency.
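The idea of performing the more reliable comparisons first can be illustrated with a small sketch. This is not the VEOS algorithm or the confidence graph from the thesis, whose details the abstract does not give. It is a toy example under stated assumptions: records are dictionaries whose keys may differ by source, two records match when all of their shared fields agree, and the hypothetical `confidence` score simply counts shared fields as a stand-in for a schema-level criterion. All function names are illustrative.

```python
from itertools import combinations

def shared_fields(a, b):
    """Fields present in both records (the only ones that can be compared)."""
    return set(a["data"]) & set(b["data"])

def confidence(a, b):
    """Hypothetical schema-level score: more shared fields, more reliable."""
    return len(shared_fields(a, b))

def is_match(a, b):
    """Toy match rule: every shared field agrees, ignoring case and spacing."""
    fields = shared_fields(a, b)
    norm = lambda v: str(v).strip().lower()
    return bool(fields) and all(
        norm(a["data"][f]) == norm(b["data"][f]) for f in fields
    )

def merge(a, b):
    """Union of fields; on overlapping fields the first record's value is kept."""
    return {"ids": a["ids"] | b["ids"], "data": {**b["data"], **a["data"]}}

def resolve(records):
    """Repeatedly merge the highest-confidence matching pair until none is left."""
    pool = [{"ids": frozenset([rid]), "data": dict(data)}
            for rid, data in records.items()]
    while True:
        candidates = [(confidence(a, b), i, j)
                      for (i, a), (j, b) in combinations(enumerate(pool), 2)
                      if is_match(a, b)]
        if not candidates:
            return pool
        _, i, j = max(candidates, key=lambda c: c[0])
        merged = merge(pool[i], pool[j])
        pool = [r for k, r in enumerate(pool) if k not in (i, j)] + [merged]
```

Because a merged record carries the union of its sources' fields, each merge can change which later comparisons are possible, which is why the order of comparisons matters in a multi-schema setting.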