Unsupervised Part Discovery via Dual Representation Alignment.

Xia, J; Huang, W; Xu, M; Zhang, J; Zhang, H; Sheng, Z; Xu, D

Unsupervised Part Discovery via Dual Representation Alignment.

Xia, J

Huang, W Xu, M Zhang, J Zhang, H

Sheng, Z Xu, D

Permalink

Publisher:: Institute of Electrical and Electronics Engineers (IEEE)
Publication Type:: Journal Article
Citation:: IEEE Trans Pattern Anal Mach Intell, 2024, PP, pp. 1-18
Issue Date:: 2024-08-19

Closed Access

	Filename	Description	Size
	1746491.pdf	Published version	9.03 MB	Adobe PDF	View/Open

Copyright Clearance Process

Recently Added
In Progress
Closed Access

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Xia, J https://orcid.org/0000-0001-9628-9563
dc.contributor.author	Huang, W
dc.contributor.author	Xu, M
dc.contributor.author	Zhang, J
dc.contributor.author	Zhang, H https://orcid.org/0000-0002-0021-3634
dc.contributor.author	Sheng, Z
dc.contributor.author	Xu, D
dc.date.accessioned	2024-09-05T11:07:20Z
dc.date.available	2024-09-05T11:07:20Z
dc.date.issued	2024-08-19
dc.identifier.citation	IEEE Trans Pattern Anal Mach Intell, 2024, PP, pp. 1-18
dc.identifier.issn	0162-8828
dc.identifier.issn	1939-3539
dc.identifier.uri	http://hdl.handle.net/10453/180718
dc.description.abstract	Object parts serve as crucial intermediate representations in various downstream tasks, but part-level representation learning still has not received as much attention as other vision tasks. Previous research has established that Vision Transformer can learn instance-level attention without labels, extracting high-quality instance-level representations for boosting downstream tasks. In this paper, we achieve unsupervised part-specific attention learning using a novel paradigm and further employ the part representations to improve part discovery performance. Specifically, paired images are generated from the same image with different geometric transformations, and multiple part representations are extracted from these paired images using a novel module, named PartFormer. These part representations from the paired images are then exchanged to improve geometric transformation invariance. Subsequently, the part representations are aligned with the feature map extracted by a feature map encoder, achieving high similarity with the pixel representations of the corresponding part regions and low similarity in irrelevant regions. Finally, the geometric and semantic constraints are applied to the part representations through the intermediate results in alignment for part-specific attention learning, encouraging the PartFormer to focus locally and the part representations to explicitly include the information of the corresponding parts. Moreover, the aligned part representations can further serve as a series of reliable detectors in the testing phase, predicting pixel masks for part discovery. Extensive experiments are carried out on four widely used datasets, and our results demonstrate that the proposed method achieves competitive performance and robustness due to its part-specific attention. The code will be released upon paper acceptance.
dc.language	eng
dc.publisher	Institute of Electrical and Electronics Engineers (IEEE)
dc.relation.ispartof	IEEE Trans Pattern Anal Mach Intell
dc.relation.isbasedon	10.1109/TPAMI.2024.3445582
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	0801 Artificial Intelligence and Image Processing, 0806 Information Systems, 0906 Electrical and Electronic Engineering
dc.subject.classification	Artificial Intelligence & Image Processing
dc.subject.classification	4603 Computer vision and multimedia computation
dc.subject.classification	4611 Machine learning
dc.title	Unsupervised Part Discovery via Dual Representation Alignment.
dc.type	Journal Article
utslib.citation.volume	PP
utslib.location.activity	United States
utslib.for	0801 Artificial Intelligence and Image Processing
utslib.for	0806 Information Systems
utslib.for	0906 Electrical and Electronic Engineering
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology/School of Electrical and Data Engineering
utslib.copyright.status	closed_access	*
dc.date.updated	2024-09-05T11:07:07Z
pubs.publication-status	Published online
pubs.volume	PP

Abstract:

Object parts serve as crucial intermediate representations in various downstream tasks, but part-level representation learning still has not received as much attention as other vision tasks. Previous research has established that Vision Transformer can learn instance-level attention without labels, extracting high-quality instance-level representations for boosting downstream tasks. In this paper, we achieve unsupervised part-specific attention learning using a novel paradigm and further employ the part representations to improve part discovery performance. Specifically, paired images are generated from the same image with different geometric transformations, and multiple part representations are extracted from these paired images using a novel module, named PartFormer. These part representations from the paired images are then exchanged to improve geometric transformation invariance. Subsequently, the part representations are aligned with the feature map extracted by a feature map encoder, achieving high similarity with the pixel representations of the corresponding part regions and low similarity in irrelevant regions. Finally, the geometric and semantic constraints are applied to the part representations through the intermediate results in alignment for part-specific attention learning, encouraging the PartFormer to focus locally and the part representations to explicitly include the information of the corresponding parts. Moreover, the aligned part representations can further serve as a series of reliable detectors in the testing phase, predicting pixel masks for part discovery. Extensive experiments are carried out on four widely used datasets, and our results demonstrate that the proposed method achieves competitive performance and robustness due to its part-specific attention. The code will be released upon paper acceptance.

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/180718