Multi-head attention with diversity for learning grounded multilingual multimodal representations
- Publisher: Association for Computational Linguistics (ACL)
- Publication Type: Conference Proceeding
- Citation: EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 2019, pp. 1461-1467
- Issue Date: 2019-01-01
Closed Access
Filename | Description | Size
---|---|---
1910.00058.pdf | Published version | 4.24 MB
This item is closed access and not available.
With the aim of promoting and understanding the multilingual version of image search, we leverage visual object detection and propose a model with diverse multi-head attention to learn grounded multilingual multimodal representations. Specifically, our model attends to different types of textual semantics in two languages and to visual objects, producing fine-grained alignments between sentences and images. We introduce a new objective function that explicitly encourages attention diversity in order to learn an improved visual-semantic embedding space. We evaluate our model on the German-image and English-image matching tasks of the Multi30K dataset, and on the Semantic Textual Similarity task with English descriptions of visual content. Results show that our model yields a significant performance gain over other methods on all three tasks.
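The abstract mentions multi-head attention combined with an objective that encourages the heads to attend differently. The paper's actual formulation is not reproduced here; the following is a minimal NumPy sketch of one plausible reading, in which per-head attention maps are computed and a penalty measures the mean pairwise cosine similarity between heads' attention distributions (all function names and the specific penalty are illustrative assumptions, not the authors' method).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, num_heads):
    """Scaled dot-product attention, splitting the feature dim into heads.

    Q: (n_q, d) queries; K, V: (n_k, d) keys and values.
    Returns the concatenated head outputs (n_q, d) and the
    per-head attention maps (num_heads, n_q, n_k).
    """
    n_q, d = Q.shape
    d_h = d // num_heads
    outputs, attn_maps = [], []
    for h in range(num_heads):
        s = slice(h * d_h, (h + 1) * d_h)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_h)
        A = softmax(scores, axis=-1)          # rows sum to 1
        attn_maps.append(A)
        outputs.append(A @ V[:, s])
    return np.concatenate(outputs, axis=-1), np.stack(attn_maps)

def diversity_penalty(attn_maps):
    """Mean pairwise cosine similarity between heads' attention maps.

    attn_maps: (H, n_q, n_k). A lower value means more diverse heads,
    so this term would be *minimized* alongside the main loss.
    """
    H = attn_maps.shape[0]
    flat = attn_maps.reshape(H, -1)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    G = flat @ flat.T                         # pairwise cosine similarities
    off_diag = G - np.eye(H)
    return off_diag.sum() / (H * (H - 1))
```

In a full model, `diversity_penalty` would be added (with some weight) to the ranking loss over image-sentence pairs, pushing each head toward a distinct region of the input; this sketch only shows the shape of that computation, not the paper's exact loss.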