Forward and Backward Multimodal NMT for Improved Monolingual and Multilingual Cross-Modal Retrieval

Publisher:
Association for Computing Machinery (ACM)
Publication Type:
Conference Proceeding
Citation:
ICMR 2020 - Proceedings of the 2020 International Conference on Multimedia Retrieval, 2020, pp. 53-62
Issue Date:
2020-06-08
File:
3372278.3390674.pdf (Published version, Adobe PDF, 2.57 MB)
Abstract:
We explore methods to enrich the diversity of captions associated with pictures for learning improved visual-semantic embeddings (VSE) in cross-modal retrieval. In the spirit of "a picture is worth a thousand words", it would take dozens of sentences to adequately describe each picture's content; in practice, however, real-world multimodal datasets provide only a few (typically five) descriptions per image. For cross-modal retrieval, the resulting lack of diversity and coverage prevents systems from capturing fine-grained inter-modal dependencies and intra-modal diversity in the shared VSE space. Exploiting the capacity of encoder-decoder architectures in neural machine translation (NMT) to enrich both monolingual and multilingual textual diversity, we propose a novel framework that leverages multimodal neural machine translation (MMT) to perform forward and backward translations grounded in salient visual objects. The generated additional text-image pairs enable training improved monolingual cross-modal retrieval (English-Image) and multilingual cross-modal retrieval (English-Image and German-Image) models. Experimental results show that the proposed framework substantially and consistently improves the performance of state-of-the-art models on multiple datasets. The results also suggest that models with multilingual VSE outperform models with monolingual VSE.
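To make the augmentation idea concrete, the sketch below outlines one plausible reading of the pipeline: round-trip (forward and backward) translation of English captions to produce extra German-Image and paraphrased English-Image pairs, followed by VSE training with a bidirectional hinge-based triplet ranking loss. This is a minimal illustration, not the authors' implementation: the translate_en_to_de / translate_de_to_en functions are hypothetical stand-ins for an MMT system, and the loss follows the standard hard-negative VSE formulation with an assumed margin of 0.2.

```python
# Illustrative sketch (not the paper's code): augmenting image-caption pairs with
# forward/backward translations, then scoring a batch with a VSE triplet ranking loss.
import torch
import torch.nn.functional as F


def translate_en_to_de(caption, image_features):
    """Hypothetical forward MMT pass (English -> German), conditioned on the image."""
    return caption  # placeholder: a real MMT system would return a German caption


def translate_de_to_en(caption, image_features):
    """Hypothetical backward MMT pass (German -> English), yielding a paraphrase."""
    return caption  # placeholder: a real MMT system would return a new English caption


def augment_pairs(pairs):
    """Expand (image, caption) pairs with round-trip translations for extra diversity."""
    augmented = list(pairs)
    for image, caption in pairs:
        de = translate_en_to_de(caption, image)      # multilingual pair (German-Image)
        en_back = translate_de_to_en(de, image)      # monolingual paraphrase (English-Image)
        augmented.append((image, de))
        augmented.append((image, en_back))
    return augmented


def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional max-margin loss over a batch of matched image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    scores = img_emb @ txt_emb.t()                   # cosine similarity matrix
    diag = scores.diag().view(-1, 1)                 # matched-pair scores
    cost_txt = (margin + scores - diag).clamp(min=0)      # image -> negative captions
    cost_img = (margin + scores - diag.t()).clamp(min=0)  # caption -> negative images
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    # Hardest-negative mining in both retrieval directions
    return cost_txt.max(1)[0].sum() + cost_img.max(0)[0].sum()
```

Under these assumptions, the augmented pairs from augment_pairs would simply be mixed into the training batches, so the same loss drives both the monolingual and the multilingual retrieval models.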