Local deep descriptors in bag-of-words for image retrieval

Cao, J; Huang, Z; Shen, HT

Publication Type: Conference Proceeding
Citation: Thematic Workshops 2017 - Proceedings of the Thematic Workshops of ACM Multimedia 2017, co-located with MM 2017, 2017, pp. 52 - 58
Issue Date: 2017-10-23

Closed Access

	Filename	Description	Size	Format
	cnn_bow_cam.pdf	Published version	3.06 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Cao, J	en_US
dc.contributor.author	Huang, Z	en_US
dc.contributor.author	Shen, HT	en_US
dc.date.issued	2017-10-23	en_US
dc.identifier.citation	Thematic Workshops 2017 - Proceedings of the Thematic Workshops of ACM Multimedia 2017, co-located with MM 2017, 2017, pp. 52 - 58	en_US
dc.identifier.isbn	9781450354165	en_US
dc.identifier.uri	http://hdl.handle.net/10453/126315
dc.description.abstract	© 2017 Association for Computing Machinery. Bag-of-Words (BoW) models using SIFT descriptors have achieved great success in content-based image retrieval over the past decade. Recent studies show that the neuron activations of convolutional neural networks (CNNs) can be viewed as local descriptors, which can be aggregated into effective global descriptors for image retrieval. However, little work has been done on using these local deep descriptors in BoW models, especially in the case of large visual vocabularies. In this paper, we provide the key ingredients to build an effective BoW model using deep descriptors. Specifically, we show how to use the CNN as a combined local feature detector and extractor, without the need to feed multiple image patches to the network. Moreover, we revisit the classic issues of BoW, including burstiness and quantization error, in our scenario and improve the retrieval accuracy by addressing these problems. Lastly, we demonstrate that our model can scale up to large visual vocabularies, enjoying the advantages of both the sparseness of the visual word histogram and the discriminative power of deep descriptors. Experiments show that our model achieves state-of-the-art performance on different datasets without re-ranking.	en_US
dc.relation.ispartof	Thematic Workshops 2017 - Proceedings of the Thematic Workshops of ACM Multimedia 2017, co-located with MM 2017	en_US
dc.relation.isbasedon	10.1145/3126686.3127018	en_US
dc.title	Local deep descriptors in bag-of-words for image retrieval	en_US
dc.type	Conference Proceeding
utslib.for	0802 Computation Theory and Mathematics	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Software
utslib.copyright.status	closed_access
pubs.publication-status	Published	en_US

Abstract:

© 2017 Association for Computing Machinery. Bag-of-Words (BoW) models using SIFT descriptors have achieved great success in content-based image retrieval over the past decade. Recent studies show that the neuron activations of convolutional neural networks (CNNs) can be viewed as local descriptors, which can be aggregated into effective global descriptors for image retrieval. However, little work has been done on using these local deep descriptors in BoW models, especially in the case of large visual vocabularies. In this paper, we provide the key ingredients to build an effective BoW model using deep descriptors. Specifically, we show how to use the CNN as a combined local feature detector and extractor, without the need to feed multiple image patches to the network. Moreover, we revisit the classic issues of BoW, including burstiness and quantization error, in our scenario and improve the retrieval accuracy by addressing these problems. Lastly, we demonstrate that our model can scale up to large visual vocabularies, enjoying the advantages of both the sparseness of the visual word histogram and the discriminative power of deep descriptors. Experiments show that our model achieves state-of-the-art performance on different datasets without re-ranking.
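
Illustrative sketch (not taken from the paper): the abstract does not specify how descriptors are detected, quantized, or weighted, so the following is only a generic reading of the idea. It assumes that every spatial position of a convolutional feature map yields one local descriptor, that descriptors are hard-assigned to the nearest visual word, and that burstiness is damped by taking the square root of the word counts. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def feature_map_to_descriptors(feature_map):
    """Treat each spatial cell of an H x W x C feature map as one C-dim local descriptor."""
    return [l2_normalize(cell) for row in feature_map for cell in row]

def nearest_word(descriptor, vocabulary):
    """Hard assignment: index of the closest visual word (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(vocabulary)), key=lambda i: sq_dist(descriptor, vocabulary[i]))

def bow_histogram(descriptors, vocabulary):
    """Sparse BoW histogram stored as {word_index: weight}."""
    counts = Counter(nearest_word(d, vocabulary) for d in descriptors)
    # Assumed burstiness remedy: square-root damping of repeated words.
    damped = {w: math.sqrt(c) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in damped.values()))
    return {w: v / norm for w, v in damped.items()}

def similarity(h1, h2):
    """Dot product of two sparse, L2-normalized histograms."""
    return sum(v * h2.get(w, 0.0) for w, v in h1.items())

# Toy 2 x 2 x 2 "feature map" and a 3-word vocabulary (made-up values).
feature_map = [[[1.0, 0.0], [0.9, 0.1]],
               [[0.0, 1.0], [1.0, 0.1]]]
vocabulary = [[1.0, 0.0], [0.0, 1.0], [0.7071, 0.7071]]

hist = bow_histogram(feature_map_to_descriptors(feature_map), vocabulary)
print(hist)
print(similarity(hist, hist))
```

Only the visual words that actually occur are stored, which is the sparseness the abstract refers to: with a large vocabulary, most entries are absent and similarity reduces to a sum over shared words.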

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/126315