A constrained clustering approach to duplicate detection among relational data

Chao, W; Jie, L; Guangquan, Z

Publication Type: Conference Proceeding
Citation: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2007, 4426 LNAI, pp. 308-319
Issue Date: 2007-12-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Chao, W	en_US
dc.contributor.author	Jie, L https://orcid.org/0000-0003-0690-4732	en_US
dc.contributor.author	Guangquan, Z	en_US
dc.date.issued	2007-12-01	en_US
dc.identifier.citation	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2007, 4426 LNAI pp. 308 - 319	en_US
dc.identifier.isbn	9783540717003	en_US
dc.identifier.issn	0302-9743	en_US
dc.identifier.uri	http://hdl.handle.net/10453/1890
dc.relation.ispartof	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)	en_US
dc.subject.classification	Artificial Intelligence & Image Processing	en_US
dc.title	A constrained clustering approach to duplicate detection among relational data	en_US
dc.type	Conference Proceeding
utslib.citation.volume	4426 LNAI	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
dc.location.activity	Nanjing, China	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
utslib.copyright.status	open_access
pubs.publication-status	Published	en_US
pubs.volume	4426 LNAI	en_US

Abstract:

This paper proposes an approach to detect duplicates among relational data. Traditional methods for record linkage or duplicate detection work on a set of records that have no explicit relations with each other. These records can be formatted into a single database table for processing. However, there are situations in which records from different sources cannot be flattened into one table, and records within one source have certain (semantic) relations between them. Duplicate detection for such relational data records/instances can be handled by formatting them into several tables and applying traditional methods to each table. However, because the relations among the original data records are ignored, this approach generates poor or inconsistent results. This paper analyzes the characteristics of relational data and proposes a particular clustering approach to perform duplicate detection. This approach incorporates constraint rules derived from the characteristics of relational data and therefore yields better and more consistent results, as our experiments show. © Springer-Verlag Berlin Heidelberg 2007.
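
The abstract does not spell out the paper's constraint rules or clustering algorithm, so the following is only a minimal sketch of the general idea of constrained clustering for duplicate detection, not the authors' method. It assumes a hypothetical rule that two related records within the same source are known to be distinct (a "cannot-link" constraint), uses a simple string similarity from Python's standard library as a stand-in comparator, and merges clusters greedily. The record ids, the threshold value, and the function names are all illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for any record comparator."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def constrained_clusters(records, cannot_link, threshold=0.8):
    """Greedy single-link clustering that never merges across a cannot-link pair.

    records:     dict mapping record id -> text
    cannot_link: set of frozenset({id_a, id_b}) pairs that must stay apart
    threshold:   minimum similarity for two records to be merge candidates
    """
    clusters = {rid: {rid} for rid in records}  # cluster label -> member ids
    label = {rid: rid for rid in records}       # record id -> cluster label

    # Score every pair once and visit the most similar pairs first.
    pairs = sorted(
        ((similarity(records[a], records[b]), a, b)
         for a, b in combinations(sorted(records), 2)),
        key=lambda t: -t[0],
    )
    for score, a, b in pairs:
        if score < threshold:
            break  # remaining pairs are even less similar
        la, lb = label[a], label[b]
        if la == lb:
            continue  # already in the same cluster
        # Reject the merge if it would put a forbidden pair together.
        if any(frozenset((x, y)) in cannot_link
               for x in clusters[la] for y in clusters[lb]):
            continue
        clusters[la] |= clusters[lb]
        for rid in clusters.pop(lb):
            label[rid] = la
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical records: "s1" and "s2" are two sources.
records = {"s1:a": "Jie Lu", "s1:b": "Jie Liu", "s2:a": "Jie Lu"}
# Assumed rule: the two related records inside source s1 are distinct entities.
cannot_link = {frozenset(("s1:a", "s1:b"))}

print(constrained_clusters(records, cannot_link))  # [['s1:a', 's2:a'], ['s1:b']]
print(constrained_clusters(records, set()))        # [['s1:a', 's1:b', 's2:a']]
```

In this toy example, similarity alone would collapse all three records into one cluster, while the assumed cannot-link constraint keeps the two records from the same source apart. That is the kind of inconsistency the abstract attributes to methods that ignore relations among records.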

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/1890