Modality-invariant image-text embedding for image-sentence matching

Liu, R; Zhao, Y; Wei, S; Zheng, L; Yang, Y

Publication Type: Journal Article
Citation: ACM Transactions on Multimedia Computing, Communications and Applications, 2019, 15 (1)
Issue Date: 2019-02-01

Closed Access

	Filename	Description	Size	Format
	3300939.pdf	Published Version	2.19 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Liu, R	en_US
dc.contributor.author	Zhao, Y	en_US
dc.contributor.author	Wei, S	en_US
dc.contributor.author	Zheng, L	en_US
dc.contributor.author	Yang, Y https://orcid.org/0000-0001-5528-0546	en_US
dc.date.accessioned	2020-04-22T06:47:19Z
dc.date.available	2020-04-22T06:47:19Z
dc.date.issued	2019-02-01	en_US
dc.identifier.citation	ACM Transactions on Multimedia Computing, Communications and Applications, 2019, 15 (1)	en_US
dc.identifier.issn	1551-6857	en_US
dc.identifier.uri	http://hdl.handle.net/10453/140194
dc.description.abstract	© 2019 Association for Computing Machinery. Direct matching across modalities (such as image and text) can benefit many tasks in computer vision, multimedia, information retrieval, and information fusion. Most existing works focus on class-level image-text matching, called cross-modal retrieval, which aims at a uniform model for matching images with all types of text, for example, tags, sentences, and articles (long texts). Although cross-modal retrieval narrows the heterogeneity gap between visual and textual information, it provides only a rough correspondence between the two modalities. In this article, we propose a more precise image-text embedding method, image-sentence matching, which provides heterogeneous matching at the instance level. The key issue for image-text embedding is how to make the distributions of the two modalities consistent in the embedding space. To address this problem, some previous works on the cross-modal retrieval task have attempted to pull the distributions closer by employing adversarial learning. However, the effectiveness of adversarial learning for image-sentence matching has not been demonstrated, and an effective method is still lacking. Inspired by previous works, we propose to learn a modality-invariant image-text embedding for image-sentence matching by incorporating adversarial learning. On top of a triplet loss-based baseline, we design a modality classification network with an adversarial loss, which classifies an embedding as belonging to either the image or the text modality. In addition, the multi-stage training procedure is carefully designed so that the proposed network not only imposes the image-text similarity constraints given by ground-truth labels, but also enforces similarity between the image and text embedding distributions through adversarial learning. Experiments on two public datasets (Flickr30k and MSCOCO) demonstrate that our method yields stable accuracy improvement over the baseline model and that our results compare favorably to state-of-the-art methods.	en_US
dc.relation.ispartof	ACM Transactions on Multimedia Computing, Communications and Applications	en_US
dc.relation.isbasedon	10.1145/3300939	en_US
dc.rights	info:eu-repo/semantics/restrictedAccess
dc.subject.classification	Artificial Intelligence & Image Processing	en_US
dc.title	Modality-invariant image-text embedding for image-sentence matching	en_US
dc.type	Journal Article
utslib.citation.volume	1	en_US
utslib.citation.volume	15	en_US
utslib.for	0803 Computer Software	en_US
utslib.for	0805 Distributed Computing	en_US
utslib.for	0806 Information Systems	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
utslib.copyright.status	closed_access	*
pubs.issue	1	en_US
pubs.publication-status	Published	en_US
pubs.volume	15	en_US

Abstract:

© 2019 Association for Computing Machinery. Direct matching across modalities (such as image and text) can benefit many tasks in computer vision, multimedia, information retrieval, and information fusion. Most existing works focus on class-level image-text matching, called cross-modal retrieval, which aims at a uniform model for matching images with all types of text, for example, tags, sentences, and articles (long texts). Although cross-modal retrieval narrows the heterogeneity gap between visual and textual information, it provides only a rough correspondence between the two modalities. In this article, we propose a more precise image-text embedding method, image-sentence matching, which provides heterogeneous matching at the instance level. The key issue for image-text embedding is how to make the distributions of the two modalities consistent in the embedding space. To address this problem, some previous works on the cross-modal retrieval task have attempted to pull the distributions closer by employing adversarial learning. However, the effectiveness of adversarial learning for image-sentence matching has not been demonstrated, and an effective method is still lacking. Inspired by previous works, we propose to learn a modality-invariant image-text embedding for image-sentence matching by incorporating adversarial learning. On top of a triplet loss-based baseline, we design a modality classification network with an adversarial loss, which classifies an embedding as belonging to either the image or the text modality. In addition, the multi-stage training procedure is carefully designed so that the proposed network not only imposes the image-text similarity constraints given by ground-truth labels, but also enforces similarity between the image and text embedding distributions through adversarial learning. Experiments on two public datasets (Flickr30k and MSCOCO) demonstrate that our method yields stable accuracy improvement over the baseline model and that our results compare favorably to state-of-the-art methods.
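
The two loss terms named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the bidirectional hinge form of the triplet loss, the linear modality classifier, the margin of 0.2, the weight `lam`, and all function names below are assumptions chosen for illustration. The multi-stage training procedure is not reproduced here.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def triplet_loss(img, txt, margin=0.2):
    """Bidirectional hinge triplet loss over in-batch negatives (assumed form).

    Row i of img and row i of txt are a ground-truth image-sentence pair.
    """
    sim = l2_normalize(img) @ l2_normalize(txt).T  # sim[i, j] = cos(image i, sentence j)
    pos = np.diag(sim)                             # similarity of matching pairs
    # Image i as anchor, sentence j as negative.
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])
    # Sentence j as anchor, image i as negative.
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])
    mask = 1.0 - np.eye(sim.shape[0])              # ignore the positive pairs themselves
    return float(((cost_i2t + cost_t2i) * mask).sum() / sim.shape[0])

def modality_classifier_loss(img, txt, w, b):
    """Binary cross-entropy of a linear modality classifier (hypothetical).

    Label 1 means "image embedding", label 0 means "text embedding".
    """
    emb = np.vstack([img, txt])
    labels = np.concatenate([np.ones(len(img)), np.zeros(len(txt))])
    logits = emb @ w + b
    # Numerically stable cross-entropy with logits.
    bce = np.maximum(logits, 0.0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))
    return float(bce.mean())

def embedding_objective(img, txt, w, b, margin=0.2, lam=0.1):
    """Objective for the embedding networks (assumed combination).

    The classifier minimizes modality_classifier_loss; the embedding networks
    minimize the triplet loss while trying to increase the classifier's loss,
    which pushes the two embedding distributions toward each other.
    """
    return triplet_loss(img, txt, margin) - lam * modality_classifier_loss(img, txt, w, b)
```

In this sketch, a classifier that cannot tell the modalities apart has a loss of about log 2, so the adversarial term rewards embeddings whose modality is hard to predict, while the triplet term keeps matching image-sentence pairs closer than mismatched ones.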

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/140194