A constrained clustering approach to duplicate detection among relational data

Publisher:
Springer Berlin / Heidelberg
Publication Type:
Conference Proceeding
Citation:
Advances in Knowledge Discovery and Data Mining (Lecture Notes in Computer Science (4426)), 2007, pp. 308-319
Issue Date:
2007-01
This paper proposes an approach to detecting duplicates among relational data. Traditional methods for record linkage or duplicate detection operate on a set of records that have no explicit relations with each other, so the records can be formatted into a single database table for processing. However, there are situations in which records from different sources cannot be flattened into one table, and records within one source have certain (semantic) relations between them. Duplicate detection for such relational data records/instances can be handled by formatting them into several tables and applying traditional methods to each table. However, because this ignores the relations among the original data records, it generates poor or inconsistent results. This paper analyzes the characteristics of relational data and proposes a constrained clustering approach to duplicate detection. The approach incorporates constraint rules derived from the characteristics of relational data and therefore yields better and more consistent results, as our experiments show.
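The general idea of clustering under constraint rules can be illustrated with a minimal sketch. It is not the algorithm from the paper, whose details this record does not give. It assumes a greedy single-linkage merge, string similarity from Python's difflib, a threshold of 0.8, and one hypothetical "cannot-link" rule: two records that appear together in the same source (for example, co-authors of one publication) are treated as distinct entities. The names similarity, constrained_dedup, and cannot_link are illustrative only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (an assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def constrained_dedup(records, cannot_link=frozenset(), threshold=0.8):
    """Greedily merge similar records unless a cannot-link rule forbids it.

    records: dict mapping record id -> text value
    cannot_link: set of frozenset({id1, id2}) pairs that must stay apart
    Returns a sorted list of clusters, each a sorted list of record ids.
    """
    clusters = {rid: {rid} for rid in records}  # cluster label -> members
    owner = {rid: rid for rid in records}       # record id -> cluster label

    # Consider the most similar pairs first.
    pairs = sorted(
        ((similarity(records[a], records[b]), a, b)
         for a, b in combinations(sorted(records), 2)),
        reverse=True,
    )
    for score, a, b in pairs:
        if score < threshold:
            break  # all remaining pairs are less similar
        ca, cb = owner[a], owner[b]
        if ca == cb:
            continue  # already in the same cluster
        # Reject the merge if any cross-cluster pair is forbidden.
        if any(frozenset((x, y)) in cannot_link
               for x in clusters[ca] for y in clusters[cb]):
            continue
        clusters[ca] |= clusters[cb]
        for member in clusters[cb]:
            owner[member] = ca
        del clusters[cb]
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical data: ids a1, a2 come from one source, b1, b2 from another.
records = {"a1": "J. Smith", "a2": "J. Smyth", "b1": "J. Smith", "b2": "Wei Zhang"}
rules = {frozenset(("a1", "a2")), frozenset(("b1", "b2"))}

print(constrained_dedup(records))         # no relational constraints
print(constrained_dedup(records, rules))  # with cannot-link constraints
```

On this toy data, the unconstrained run groups a1, a2, and b1 together because "J. Smith" and "J. Smyth" are textually close, while the constrained run keeps a2 separate because it co-occurs with a1 in the same source. This mirrors the abstract's point that ignoring relations among records can produce inconsistent results.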