Local deep descriptors in bag-of-words for image retrieval

Publication Type:
Conference Proceeding
Thematic Workshops 2017 - Proceedings of the Thematic Workshops of ACM Multimedia 2017, co-located with MM 2017, 2017, pp. 52 - 58
Issue Date:
Filename: cnn_bow_cam.pdf
Description: Published version
Size: 3.06 MB
Format: Adobe PDF
© 2017 Association for Computing Machinery. Bag-of-Words (BoW) models using SIFT descriptors have achieved great success in content-based image retrieval over the past decade. Recent studies show that the neuron activations of convolutional neural networks (CNNs) can be viewed as local descriptors, which can be aggregated into effective global descriptors for image retrieval. However, little work has been done on using these local deep descriptors in BoW models, especially with large visual vocabularies. In this paper, we provide the key ingredients for building an effective BoW model using deep descriptors. Specifically, we show how to use the CNN as a combined local feature detector and extractor, without needing to feed multiple image patches to the network. Moreover, we revisit the classic issues of BoW, including burstiness and quantization error, in our scenario and improve retrieval accuracy by addressing them. Lastly, we demonstrate that our model scales up to large visual vocabularies, enjoying the advantages of both the sparseness of the visual word histogram and the discriminative power of deep descriptors. Experiments show that our model achieves state-of-the-art performance on different datasets without re-ranking.
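The pipeline outlined in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. It assumes that each spatial position of a convolutional activation map (height x width x channels) is taken as one local descriptor, that descriptors are hard-assigned to their nearest visual word, and that burstiness is damped by taking the square root of word counts, a common remedy that may differ from the one used in the paper. All function names are hypothetical, and the vocabulary is supplied by hand, whereas a real one would be large and learned, for example by k-means.

```python
import math
from collections import Counter


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def local_descriptors(feature_map):
    """Treat each spatial cell of an H x W x C activation map as one C-dim descriptor."""
    return [l2_normalize(cell) for row in feature_map for cell in row]


def nearest_word(descriptor, vocabulary):
    """Hard-assign a descriptor to the index of its closest visual word."""
    def sq_dist(word):
        return sum((a - b) ** 2 for a, b in zip(descriptor, word))
    return min(range(len(vocabulary)), key=lambda i: sq_dist(vocabulary[i]))


def bow_histogram(feature_map, vocabulary):
    """Build a sparse BoW vector {word_id: weight} from one activation map.

    Assumption: square-root damping of counts stands in for burstiness handling.
    """
    counts = Counter(nearest_word(d, vocabulary) for d in local_descriptors(feature_map))
    damped = {w: math.sqrt(c) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in damped.values()))
    if norm == 0:
        return {}
    return {w: v / norm for w, v in damped.items()}


def similarity(hist_a, hist_b):
    """Cosine similarity of two sparse, unit-norm histograms (a sparse dot product)."""
    return sum(v * hist_b[w] for w, v in hist_a.items() if w in hist_b)


# Toy data: a 2 x 2 x 2 activation map and a 3-word vocabulary (both made up).
toy_vocabulary = [[1.0, 0.0], [0.0, 1.0], [0.7071, 0.7071]]
toy_feature_map = [[[2.0, 0.0], [3.0, 0.1]],
                   [[0.0, 1.0], [1.0, 1.0]]]
toy_histogram = bow_histogram(toy_feature_map, toy_vocabulary)
```

Only the visual words that actually occur are stored, so the histogram stays sparse as the vocabulary grows. This sketch does nothing about quantization error: each descriptor contributes to exactly one word.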
Please use this identifier to cite or link to this item: