Multi-head attention with diversity for learning grounded multilingual multimodal representations

Huang, PY; Chang, X; Hauptmann, A

Publisher: ASSOC COMPUTATIONAL LINGUISTICS-ACL
Publication Type: Conference Proceeding
Citation: EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 2019, pp. 1461-1467
Issue Date: 2019-01-01

Closed Access

Filename	Description	Size
1910.00058.pdf	Published version	4.24 MB

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Huang, PY
dc.contributor.author	Chang, X https://orcid.org/0000-0002-7778-8807
dc.contributor.author	Hauptmann, A
dc.date	2019-11-03
dc.date.accessioned	2023-03-31T10:46:38Z
dc.date.available	2023-03-31T10:46:38Z
dc.date.issued	2019-01-01
dc.identifier.citation	EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 2019, pp. 1461-1467
dc.identifier.isbn	9781950737901
dc.identifier.uri	http://hdl.handle.net/10453/168998
dc.description.abstract	With the aim of promoting and understanding the multilingual version of image search, we leverage visual object detection and propose a model with diverse multi-head attention to learn grounded multilingual multimodal representations. Specifically, our model attends to different types of textual semantics in two languages and visual objects for fine-grained alignments between sentences and images. We introduce a new objective function which explicitly encourages attention diversity to learn an improved visual-semantic embedding space. We evaluate our model in the German-Image and English-Image matching tasks on the Multi30K dataset, and in the Semantic Textual Similarity task with the English descriptions of visual content. Results show that our model yields a significant performance gain over other methods in all of the three tasks.
dc.language	en
dc.publisher	ASSOC COMPUTATIONAL LINGUISTICS-ACL
dc.relation.ispartof	EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference
dc.relation.ispartof	Conference on Empirical Methods in Natural Language Processing / 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
dc.relation.isbasedon	10.18653/v1/d19-1154
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	Multi-head attention with diversity for learning grounded multilingual multimodal representations
dc.type	Conference Proceeding
utslib.location.activity	Hong Kong, HONG KONG
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
utslib.copyright.status	closed_access	*
dc.date.updated	2023-03-31T10:46:35Z
pubs.finish-date	2019-11-07
pubs.publication-status	Published
pubs.start-date	2019-11-03

Abstract:

With the aim of promoting and understanding the multilingual version of image search, we leverage visual object detection and propose a model with diverse multi-head attention to learn grounded multilingual multimodal representations. Specifically, our model attends to different types of textual semantics in two languages and visual objects for fine-grained alignments between sentences and images. We introduce a new objective function which explicitly encourages attention diversity to learn an improved visual-semantic embedding space. We evaluate our model in the German-Image and English-Image matching tasks on the Multi30K dataset, and in the Semantic Textual Similarity task with the English descriptions of visual content. Results show that our model yields a significant performance gain over other methods in all of the three tasks.
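
The abstract's idea of several attention heads that are pushed to attend differently can be illustrated with a small sketch. This record gives only the abstract, so the exact objective used in the paper is not reproduced here. The code below is a minimal illustration under stated assumptions: each head is a hypothetical learned query vector that pools a set of feature vectors (word or object features), and the diversity term is assumed to be a hinge on the pairwise cosine similarity between head outputs. The names `multi_head_pool`, `diversity_penalty`, and `margin` are illustrative and do not come from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity, with a small epsilon to avoid division by zero."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + 1e-12)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_pool(features, head_queries):
    """Pool n feature vectors into one vector per head.

    features:     list of n vectors of dimension d (e.g. word or object features)
    head_queries: list of K query vectors (hypothetical learned parameters)
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    pooled = []
    for query in head_queries:
        # Scaled dot-product scores of this head against every feature.
        weights = softmax([dot(query, f) / scale for f in features])
        # Attention-weighted sum of the features.
        pooled.append(
            [sum(w * f[j] for w, f in zip(weights, features)) for j in range(dim)]
        )
    return pooled


def diversity_penalty(heads, margin=0.0):
    """Assumed diversity term: mean hinge on pairwise cosine similarity.

    The penalty is zero when every pair of head outputs has cosine
    similarity at or below `margin`, and grows as heads become similar.
    """
    pairs = [(i, j) for i in range(len(heads)) for j in range(i + 1, len(heads))]
    if not pairs:
        return 0.0
    return sum(
        max(0.0, cosine(heads[i], heads[j]) - margin) for i, j in pairs
    ) / len(pairs)


# Toy input: two 2-dimensional features and two heads.
features = [[1.0, 0.0], [0.0, 1.0]]
same_heads = multi_head_pool(features, [[5.0, 0.0], [5.0, 0.0]])
diverse_heads = multi_head_pool(features, [[5.0, 0.0], [0.0, 5.0]])

penalty_same = diversity_penalty(same_heads)
penalty_diverse = diversity_penalty(diverse_heads)
```

In this toy setup the two identical heads receive a larger penalty than the two heads that focus on different features. Adding such a term to a matching loss would discourage heads from collapsing onto the same content, which is the general effect the abstract describes.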

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/168998