Vision transformers are active learners for image copy detection

Publisher: Elsevier
Publication Type: Journal Article
Citation: Neurocomputing, 2024, vol. 587
Issue Date: 2024-06-28
Image Copy Detection (ICD) aims to identify and track duplicated or manipulated images. Most existing methods rely on Convolutional Neural Networks (CNNs) trained with unsupervised learning techniques, which leads to subpar performance. We discover that, with a carefully designed training process, Vision Transformer (ViT) backbones yield superior results. Specifically, directly training a ViT for ICD often leads to overfitting on the training images, which in turn results in poor generalization to unseen (test) images. Consequently, we first train a CNN (such as ResNet-50), and during ViT training we regularize the distance between the CNN and ViT features. We also incorporate an active learning method to further enhance performance. Notably, because of the visual discrepancy between auto-generated transformations and those in the query set, we incorporate a small number (approximately 0.5% of the unlabeled training images) of manually produced and labeled positive pairs. Training on these pairs yields a significant performance boost at little cost. Experimental findings demonstrate the effectiveness of our approach, which achieves state-of-the-art performance. Our code is available at: https://github.com/WangWenhao0716/ViT4ICD.
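The feature-distance regularization described in the abstract can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' released implementation (see the linked repository for that): the projection layer, the MSE form of the distance term, the loss weight lambda_reg, and the ViT-B/16 and ResNet-50 backbones are all illustrative assumptions. In the paper the CNN is first trained for ICD; ImageNet-pretrained weights stand in for it here.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50, vit_b_16

class ViTWithCNNRegularizer(nn.Module):
    def __init__(self, lambda_reg=1.0):
        super().__init__()
        # Frozen CNN reference model. In the paper this CNN is trained for
        # ICD first; ImageNet weights are used here only as a stand-in.
        cnn = resnet50(weights="IMAGENET1K_V2")
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # pooled 2048-d features
        self.cnn.eval()
        for p in self.cnn.parameters():
            p.requires_grad = False
        # ViT student being trained for ICD.
        self.vit = vit_b_16(weights=None)
        self.vit.heads = nn.Identity()  # expose the 768-d CLS feature
        # Hypothetical projection aligning ViT features with the CNN feature space.
        self.proj = nn.Linear(768, 2048)
        self.lambda_reg = lambda_reg

    def forward(self, images):
        vit_feat = self.vit(images)                 # (B, 768)
        with torch.no_grad():
            cnn_feat = self.cnn(images).flatten(1)  # (B, 2048)
        # Penalize the distance between ViT and CNN features, discouraging
        # the ViT from overfitting to the training images.
        reg_loss = F.mse_loss(self.proj(vit_feat), cnn_feat)
        return vit_feat, self.lambda_reg * reg_loss

During training, the returned regularization term would be added to the main ICD objective (for example, a contrastive loss over positive pairs), so the ViT's feature space stays anchored to the CNN's while still learning copy-specific invariances.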