Sanitized clustering against confounding bias

Yao, Y; Pan, Y; Li, J; Tsang, IW; Yao, X

Publisher: Springer Nature
Publication Type: Journal Article
Citation: Machine Learning, 2023, pp. 1-20
Issue Date: 2023-01-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Yao, Y
dc.contributor.author	Pan, Y
dc.contributor.author	Li, J
dc.contributor.author	Tsang, IW
dc.contributor.author	Yao, X
dc.date.accessioned	2024-02-26T06:09:38Z
dc.date.available	2024-02-26T06:09:38Z
dc.date.issued	2023-01-01
dc.identifier.citation	Machine Learning, 2023, pp. 1-20
dc.identifier.issn	0885-6125
dc.identifier.issn	1573-0565
dc.identifier.uri	http://hdl.handle.net/10453/175894
dc.description.abstract	Real-world datasets inevitably contain biases that arise from differing sources or conditions during data collection. This inconsistency acts as a confounding factor that disturbs cluster analysis. Existing methods eliminate the biases by projecting the data onto the orthogonal complement of the subspace spanned by the confounding factor before clustering. In these methods, the clustering factor of interest and the confounding factor are treated coarsely in the raw feature space, where the correlation between the data and the confounding factor is idealized as linear to allow convenient solutions. These approaches are thus limited in scope, as data in real applications are usually complex and non-linearly correlated with the confounding factor. This paper presents a new clustering framework named Sanitized Clustering Against confounding Bias (SCAB), which removes the confounding factor in the semantic latent space of complex data through a non-linear dependence measure. Specifically, we eliminate the bias information in the latent space by minimizing the mutual information between the confounding factor and the latent representation produced by a variational auto-encoder. Meanwhile, a clustering module is introduced to cluster over the purified latent representations. Extensive experiments on complex datasets demonstrate that SCAB achieves a significant gain in clustering performance by removing the confounding bias.
dc.language	en
dc.publisher	Springer Nature
dc.relation.ispartof	Machine Learning
dc.relation.isbasedon	10.1007/s10994-023-06451-5
dc.rights	info:eu-repo/semantics/openAccess
dc.subject	0801 Artificial Intelligence and Image Processing, 0806 Information Systems, 1702 Cognitive Sciences
dc.subject.classification	Artificial Intelligence & Image Processing
dc.subject.classification	4611 Machine learning
dc.title	Sanitized clustering against confounding bias
dc.type	Journal Article
utslib.for	0801 Artificial Intelligence and Image Processing
utslib.for	0806 Information Systems
utslib.for	1702 Cognitive Sciences
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
utslib.copyright.status	open_access	*
dc.date.updated	2024-02-26T06:09:37Z
pubs.publication-status	Published

Abstract:

Real-world datasets inevitably contain biases that arise from differing sources or conditions during data collection. This inconsistency acts as a confounding factor that disturbs cluster analysis. Existing methods eliminate the biases by projecting the data onto the orthogonal complement of the subspace spanned by the confounding factor before clustering. In these methods, the clustering factor of interest and the confounding factor are treated coarsely in the raw feature space, where the correlation between the data and the confounding factor is idealized as linear to allow convenient solutions. These approaches are thus limited in scope, as data in real applications are usually complex and non-linearly correlated with the confounding factor. This paper presents a new clustering framework named Sanitized Clustering Against confounding Bias (SCAB), which removes the confounding factor in the semantic latent space of complex data through a non-linear dependence measure. Specifically, we eliminate the bias information in the latent space by minimizing the mutual information between the confounding factor and the latent representation produced by a variational auto-encoder. Meanwhile, a clustering module is introduced to cluster over the purified latent representations. Extensive experiments on complex datasets demonstrate that SCAB achieves a significant gain in clustering performance by removing the confounding bias.
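
The linear baseline that the abstract contrasts with SCAB can be illustrated with a minimal NumPy sketch. This is not the authors' code. The function name `project_out_confounder`, the synthetic data, and the choice of least-squares residualization as the way to project onto the orthogonal complement are all assumptions made for illustration.

```python
import numpy as np


def project_out_confounder(X, C):
    """Project X onto the orthogonal complement of the confounder subspace.

    Hypothetical helper, not taken from the paper.
    X: (n_samples, n_features) data matrix.
    C: (n_samples,) or (n_samples, n_confounders) confounding factor,
       e.g. a numeric encoding of the data source.
    """
    X = np.asarray(X, dtype=float)
    C = np.asarray(C, dtype=float)
    if C.ndim == 1:
        C = C[:, None]
    # Centre both matrices so that the mean offset is removed as well.
    Xc = X - X.mean(axis=0)
    Cc = C - C.mean(axis=0)
    # Least-squares fit of X on C; the residual is orthogonal to span(Cc).
    B, *_ = np.linalg.lstsq(Cc, Xc, rcond=None)
    return Xc - Cc @ B


# Synthetic example (assumed data): a binary source label shifts every feature.
rng = np.random.default_rng(0)
c = rng.integers(0, 2, size=200).astype(float)
signal = rng.normal(size=(200, 3))
X = signal + 5.0 * c[:, None]

X_clean = project_out_confounder(X, c)
```

After this step the columns of `X_clean` are linearly uncorrelated with `c`, and a standard clustering algorithm could be run on them. The projection removes only linear dependence, which is the limitation the abstract attributes to this family of methods; a non-linear relationship between the data and the confounding factor would survive it.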

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/175894