Multi-head attention with diversity for learning grounded multilingual multimodal representations

Publisher:
ASSOC COMPUTATIONAL LINGUISTICS-ACL
Publication Type:
Conference Proceeding
Citation:
EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 2019, pp. 1461-1467
Issue Date:
2019-01-01
Filename:
1910.00058.pdf
Description:
Published version
Size:
4.24 MB
With the aim of promoting and understanding multilingual image search, we leverage visual object detection and propose a model with diverse multi-head attention to learn grounded multilingual multimodal representations. Specifically, our model attends to different types of textual semantics in two languages and to visual objects, yielding fine-grained alignments between sentences and images. We introduce a new objective function that explicitly encourages attention diversity, learning an improved visual-semantic embedding space. We evaluate our model on the German-Image and English-Image matching tasks of the Multi30K dataset, and on the Semantic Textual Similarity task with English descriptions of visual content. Results show that our model yields a significant performance gain over other methods in all three tasks.
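The attention-diversity idea summarized above can be illustrated with a minimal sketch. The code below is a hypothetical stand-in, not the authors' exact formulation: each head produces an attention distribution over a set of keys (e.g. detected visual objects), and a penalty term measures the mean pairwise similarity between heads' distributions, so minimizing it encourages heads to attend to different inputs.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention_weights(query, keys, num_heads, rng):
    """Project the query and keys into `num_heads` subspaces and return
    one attention distribution over the keys per head. Random projections
    stand in for learned parameters (illustrative only)."""
    d = query.shape[-1]
    weights = []
    for _ in range(num_heads):
        Wq = rng.standard_normal((d, d)) / np.sqrt(d)
        Wk = rng.standard_normal((d, d)) / np.sqrt(d)
        # Scaled dot-product scores between projected keys and query.
        scores = (keys @ Wk) @ (query @ Wq) / np.sqrt(d)
        weights.append(softmax(scores))
    return np.stack(weights)  # shape: (num_heads, num_keys)

def diversity_penalty(attn):
    """Mean pairwise dot product between heads' attention distributions.
    Lower values mean the heads spread their attention over different
    keys; adding this term to the loss penalizes redundant heads."""
    gram = attn @ attn.T                      # (H, H) head similarities
    h = attn.shape[0]
    off_diag = gram.sum() - np.trace(gram)    # drop self-similarities
    return off_diag / (h * (h - 1))

rng = np.random.default_rng(0)
query = rng.standard_normal(16)               # e.g. a sentence embedding
keys = rng.standard_normal((5, 16))           # e.g. 5 detected visual objects
attn = multi_head_attention_weights(query, keys, num_heads=4, rng=rng)
penalty = diversity_penalty(attn)
```

Because each row of `attn` is a probability distribution, the penalty lies in [0, 1]; a training loop would add it (weighted) to the matching loss so that heads specialize on distinct objects or semantic types.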