Forward and Backward Multimodal NMT for Improved Monolingual and Multilingual Cross-Modal Retrieval

Publisher:
Association for Computing Machinery (ACM)
Publication Type:
Conference Proceeding
Citation:
ICMR 2020 - Proceedings of the 2020 International Conference on Multimedia Retrieval, 2020, pp. 53-62
Issue Date:
2020-06-08
File:
3372278.3390674.pdf (Published version, Adobe PDF, 2.57 MB)
Abstract:
We explore methods to enrich the diversity of captions associated with pictures for learning improved visual-semantic embeddings (VSE) in cross-modal retrieval. In the spirit of "a picture is worth a thousand words", it would take dozens of sentences to adequately describe each picture's content; in practice, however, real-world multimodal datasets provide only a few (typically five) descriptions per image. For cross-modal retrieval, the resulting lack of diversity and coverage prevents systems from capturing fine-grained inter-modal dependencies and intra-modal diversity in the shared VSE space. Exploiting the capacity of encoder-decoder architectures in neural machine translation (NMT) to enrich both monolingual and multilingual textual diversity, we propose a novel framework that leverages multimodal neural machine translation (MMT) to perform forward and backward translations grounded in salient visual objects. The generated additional text-image pairs enable training improved monolingual cross-modal retrieval (English-Image) and multilingual cross-modal retrieval (English-Image and German-Image) models. Experimental results show that the proposed framework substantially and consistently improves the performance of state-of-the-art models on multiple datasets. The results also suggest that models with multilingual VSE outperform models with monolingual VSE.
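To make the augmentation idea concrete, the sketch below outlines one plausible reading of the pipeline: round-trip (forward and backward) translation of English captions to produce extra German-Image and paraphrased English-Image pairs, followed by VSE training with a bidirectional hinge-based triplet ranking loss. This is a minimal illustration, not the authors' implementation: the translate_en_to_de / translate_de_to_en functions are hypothetical stand-ins for an MMT system, and the loss follows the standard hard-negative VSE formulation with an assumed margin of 0.2.

```python
# Illustrative sketch (not the paper's code): augmenting image-caption pairs with
# forward/backward translations, then scoring a batch with a VSE triplet ranking loss.
import torch
import torch.nn.functional as F


def translate_en_to_de(caption, image_features):
    """Hypothetical forward MMT pass (English -> German), conditioned on the image."""
    return caption  # placeholder: a real MMT system would return a German caption


def translate_de_to_en(caption, image_features):
    """Hypothetical backward MMT pass (German -> English), yielding a paraphrase."""
    return caption  # placeholder: a real MMT system would return a new English caption


def augment_pairs(pairs):
    """Expand (image, caption) pairs with round-trip translations for extra diversity."""
    augmented = list(pairs)
    for image, caption in pairs:
        de = translate_en_to_de(caption, image)      # multilingual pair (German-Image)
        en_back = translate_de_to_en(de, image)      # monolingual paraphrase (English-Image)
        augmented.append((image, de))
        augmented.append((image, en_back))
    return augmented


def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional max-margin loss over a batch of matched image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    scores = img_emb @ txt_emb.t()                   # cosine similarity matrix
    diag = scores.diag().view(-1, 1)                 # matched-pair scores
    cost_txt = (margin + scores - diag).clamp(min=0)      # image -> negative captions
    cost_img = (margin + scores - diag.t()).clamp(min=0)  # caption -> negative images
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    # Hardest-negative mining in both retrieval directions
    return cost_txt.max(1)[0].sum() + cost_img.max(0)[0].sum()
```

Under these assumptions, the augmented pairs from augment_pairs would simply be mixed into the training batches, so the same loss drives both the monolingual and the multilingual retrieval models.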