Entangled transformer for image captioning

Li, G; Zhu, L; Liu, P; Yang, Y

Publisher: IEEE
Publication Type: Conference Proceeding
Citation: Proceedings of the IEEE International Conference on Computer Vision, 2020, 2019-October, pp. 8927-8936
Issue Date: 2020

Closed Access

	Filename	Description	Size	Format
	09008532.pdf	Published version	6.43 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Li, G
dc.contributor.author	Zhu, L https://orcid.org/0000-0002-4093-7557
dc.contributor.author	Liu, P https://orcid.org/0000-0002-3170-3783
dc.contributor.author	Yang, Y https://orcid.org/0000-0002-0512-880X
dc.date	2019-10-27
dc.date.accessioned	2021-04-12T10:52:26Z
dc.date.available	2021-04-12T10:52:26Z
dc.date.issued	2020
dc.identifier.citation	Proceedings of the IEEE International Conference on Computer Vision, 2020, 2019-October, pp. 8927-8936
dc.identifier.isbn	9781728148038
dc.identifier.issn	1550-5499
dc.identifier.uri	http://hdl.handle.net/10453/148039
dc.description.abstract	© 2019 IEEE. In image captioning, typical attention mechanisms struggle to identify the corresponding visual signals, especially when predicting highly abstract words. This phenomenon is known as the semantic gap between vision and language. The problem can be overcome by providing semantic attributes that are homologous to language. Thanks to their inherent recurrent nature and gated operating mechanism, Recurrent Neural Networks (RNNs) and their variants are the dominant architectures in image captioning. However, when elaborate attention mechanisms are designed to integrate visual inputs and semantic attributes, RNN-like variants become inflexible due to their complexity. In this paper, we investigate a Transformer-based sequence modeling framework built only with attention layers and feedforward layers. To bridge the semantic gap, we introduce EnTangled Attention (ETA), which enables the Transformer to exploit semantic and visual information simultaneously. Furthermore, a Gated Bilateral Controller (GBC) is proposed to guide the interactions between the multimodal information. We name our model ETA-Transformer. ETA-Transformer achieves state-of-the-art performance on the MSCOCO image captioning dataset. The ablation studies validate the improvements from our proposed modules.
dc.language	en
dc.publisher	IEEE
dc.relation.ispartof	Proceedings of the IEEE International Conference on Computer Vision
dc.relation.ispartof	IEEE/CVF International Conference on Computer Vision
dc.relation.isbasedon	10.1109/ICCV.2019.00902
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	Entangled transformer for image captioning
dc.type	Conference Proceeding
utslib.citation.volume	2019-October
utslib.location.activity	Seoul, Korea (South)
utslib.for	0801 Artificial Intelligence and Image Processing
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	closed_access	*
pubs.consider-herdc	false
dc.date.updated	2021-04-12T10:52:22Z
pubs.finish-date	2019-11-02
pubs.place-of-publication	Piscataway, USA
pubs.publication-status	Published
pubs.start-date	2019-10-27
pubs.volume	2019-October
dc.location	Piscataway, USA

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/148039