Annotation efficient cross-modal retrieval with adversarial attentive alignment

Huang, PY; Kang, G; Liu, W; Chang, X; Hauptmann, AG

Publisher: Association for Computing Machinery
Publication Type: Conference Proceeding
Citation: MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 1758-1767
Issue Date: 2019-10-15

Closed Access

Filename	Description	Size
3343031.3350894.pdf	Published version	14.24 MB

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Huang, PY
dc.contributor.author	Kang, G
dc.contributor.author	Liu, W
dc.contributor.author	Chang, X https://orcid.org/0000-0002-7778-8807
dc.contributor.author	Hauptmann, AG
dc.date	2019-10-21
dc.date.accessioned	2023-03-31T10:35:53Z
dc.date.available	2023-03-31T10:35:53Z
dc.date.issued	2019-10-15
dc.identifier.citation	MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 1758-1767
dc.identifier.isbn	9781450368896
dc.identifier.uri	http://hdl.handle.net/10453/168993
dc.description.abstract	Visual-semantic embeddings are central to many multimedia applications such as cross-modal retrieval between visual data and natural language descriptions. Conventionally, learning a joint embedding space relies on large parallel multimodal corpora. Since massive human annotation is expensive to obtain, there is a strong motivation in developing versatile algorithms to learn from large corpora with fewer annotations. In this paper, we propose a novel framework to leverage automatically extracted regional semantics from un-annotated images as additional weak supervision to learn visual-semantic embeddings. The proposed model employs adversarial attentive alignments to close the inherent heterogeneous gaps between annotated and un-annotated portions of visual and textual domains. To demonstrate its superiority, we conduct extensive experiments on sparsely annotated multimodal corpora. The experimental results show that the proposed model outperforms state-of-the-art visual-semantic embedding models by a significant margin for cross-modal retrieval tasks on the sparse Flickr30k and MS-COCO datasets. It is also worth noting that, despite using only 20% of the annotations, the proposed model can achieve competitive performance (Recall at 10 > 80.0% for 1K and > 70.0% for 5K text-to-image retrieval) compared to the benchmarks trained with the complete annotations.
dc.language	en
dc.publisher	ASSOC COMPUTING MACHINERY
dc.relation	http://purl.org/au-research/grants/arc/DE190100626
dc.relation.ispartof	MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia
dc.relation.ispartof	27th ACM International Conference on Multimedia (MM)
dc.relation.isbasedon	10.1145/3343031.3350894
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	Annotation efficient cross-modal retrieval with adversarial attentive alignment
dc.type	Conference Proceeding
utslib.location.activity	Nice, FRANCE
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
utslib.copyright.status	closed_access	*
dc.date.updated	2023-03-31T10:35:46Z
pubs.finish-date	2019-10-25
pubs.publication-status	Published
pubs.start-date	2019-10-21
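
The abstract reports results as Recall at 10 for text-to-image retrieval. As a reading aid, the following is a minimal sketch of how Recall at K is commonly computed from a caption-to-image similarity matrix. It is not the authors' evaluation code: the function name `recall_at_k`, the one-ground-truth-image-per-caption assumption, and the toy scores are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def recall_at_k(
    scores: Sequence[Sequence[float]],
    gt_index: Sequence[int],
    k: int,
) -> float:
    """Fraction of queries whose ground-truth item is in the top-k candidates.

    scores[q][c] is an assumed similarity between query q (a caption) and
    candidate c (an image); gt_index[q] is the index of the matching image.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = 0
    for row, gt in zip(scores, gt_index):
        # Rank candidate indices by descending similarity.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if gt in ranked[:k]:
            hits += 1
    return hits / len(gt_index)


# Toy data (made-up numbers): 3 captions scored against 4 images.
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],  # ground truth is image 0, ranked 1st
    [0.2, 0.4, 0.8, 0.6],  # ground truth is image 1, ranked 3rd
    [0.5, 0.7, 0.1, 0.6],  # ground truth is image 3, ranked 2nd
]
toy_gt = [0, 1, 3]

r1 = recall_at_k(toy_scores, toy_gt, 1)
r2 = recall_at_k(toy_scores, toy_gt, 2)
r3 = recall_at_k(toy_scores, toy_gt, 3)
```

Under this definition, the "1K" and "5K" figures in the abstract would correspond to running the same computation over candidate pools of roughly 1,000 and 5,000 test images; a larger pool makes a top-10 hit harder, which is consistent with the lower 5K threshold quoted.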

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/168993