Coupled clustering ensemble by exploring data interdependence

Wang, C; Chi, CH; She, Z; Cao, L; Stantic, B

Publication Type: Journal Article
Citation: ACM Transactions on Knowledge Discovery from Data, 2018, 12 (6)
Issue Date: 2018-08-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Wang, C https://orcid.org/0000-0002-2890-0057	en_US
dc.contributor.author	Chi, CH	en_US
dc.contributor.author	She, Z	en_US
dc.contributor.author	Cao, L https://orcid.org/0000-0003-1562-9429	en_US
dc.contributor.author	Stantic, B	en_US
dc.date.issued	2018-08-01	en_US
dc.identifier.citation	ACM Transactions on Knowledge Discovery from Data, 2018, 12 (6)	en_US
dc.identifier.issn	1556-4681	en_US
dc.identifier.uri	http://hdl.handle.net/10453/131349
dc.relation.ispartof	ACM Transactions on Knowledge Discovery from Data	en_US
dc.relation.isbasedon	10.1145/3230967	en_US
dc.subject.classification	Artificial Intelligence & Image Processing	en_US
dc.title	Coupled clustering ensemble by exploring data interdependence	en_US
dc.type	Journal Article
utslib.citation.volume	6	en_US
utslib.citation.volume	12	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
utslib.for	0806 Information Systems	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
utslib.copyright.status	open_access
pubs.issue	6	en_US
pubs.publication-status	Published	en_US
pubs.volume	12	en_US

Abstract:

© 2018 ACM. Clustering ensembles combine multiple partitions of data into a single clustering solution, an effective technique for improving the quality of clustering results. Current clustering ensemble algorithms are usually built on pairwise agreements of three kinds: between clusterings, focusing on similarity via consensus functions; between data objects, inducing similarity measures from partitions and re-clustering the objects; and between clusters, collapsing groups of clusters into meta-clusters. Most of these models make a strong IID assumption (i.e., independent and identically distributed), under which base clusterings perform independently of one another and all objects are also independent. In the real world, however, objects are generally related to each other through features, whether explicitly or implicitly. There is also a latent but definite relationship among intermediate base clusterings because they are derived from the same set of data. All of this demands further investigation of clustering ensembles that explore the interdependence characteristics of data. To address this problem, this article proposes a new coupled clustering ensemble (CCE) framework that works on the interdependent nature of objects and intermediate base clusterings. The main idea is to model the coupling relationship between objects by aggregating the similarity of base clusterings, and the interactive relationship among objects by addressing their neighborhood domains. Once discovered, these interdependence relationships act as critical supplements to clustering ensembles. We verified the proposed framework using three types of consensus function: clustering-based, object-based, and cluster-based.
Substantial experiments on multiple synthetic and real-life benchmark datasets indicate that CCE effectively captures the implicit interdependence relationships among base clusterings and among objects, with higher clustering accuracy, stability, and robustness than 14 state-of-the-art techniques, as supported by statistical analysis. In addition, sensitivity analysis shows that the final clustering quality depends on the data characteristics (e.g., quality and consistency) of the base clusterings. Finally, applications in document clustering, as well as on datasets of much larger size and dimensionality, further demonstrate the effectiveness, efficiency, and scalability of the proposed models.
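
For readers unfamiliar with the object-based consensus functions the abstract refers to, the following is a minimal sketch of the conventional co-association approach: count how often each pair of objects shares a cluster across the base clusterings, then re-cluster on that similarity. It is not the CCE framework itself. It treats the base clusterings as independent, which is the very assumption the article questions. The function names, the toy label lists, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from itertools import combinations


def co_association(base_clusterings):
    """Return an n x n matrix whose (i, j) entry is the fraction of base
    clusterings that place objects i and j in the same cluster."""
    n = len(base_clusterings[0])
    m = len(base_clusterings)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sim[i][i] = 1.0
    for i, j in combinations(range(n), 2):
        agree = sum(1 for labels in base_clusterings if labels[i] == labels[j])
        sim[i][j] = sim[j][i] = agree / m
    return sim


def consensus_by_threshold(sim, threshold=0.5):
    """Re-cluster objects: link every pair whose similarity exceeds the
    threshold and return the connected components as consensus labels.
    (An illustrative re-clustering step, chosen for brevity.)"""
    n = len(sim)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(n):
                if labels[v] == -1 and sim[u][v] > threshold:
                    labels[v] = next_label
                    stack.append(v)
        next_label += 1
    return labels


# Three toy base clusterings of six objects (hypothetical data).
# Label values are arbitrary: the third is the first with labels swapped.
base_clusterings = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
]

sim = co_association(base_clusterings)
consensus = consensus_by_threshold(sim)
print(consensus)
```

Because the matrix depends only on whether two objects share a label, the arbitrary numbering of clusters within each base clustering does not matter. The abstract's proposal is to supplement this kind of similarity with coupling among base clusterings and among objects' neighborhoods, which this sketch does not attempt.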

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/131349