Visual Relationship Attention for Image Captioning

Zhang, Z; Wang, Y; Wu, Q; Chen, F

Publication Type: Conference Proceeding
Citation: Proceedings of the International Joint Conference on Neural Networks, 2019, 2019-July
Issue Date: 2019-07-01

Closed Access

	Filename	Description	Size	Format
	08851832.pdf	Published version	1.07 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Zhang, Z	en_US
dc.contributor.author	Wang, Y	en_US
dc.contributor.author	Wu, Q https://orcid.org/0000-0001-5641-2483	en_US
dc.contributor.author	Chen, F https://orcid.org/0000-0003-4971-8729	en_US
dc.date.issued	2019-07-01	en_US
dc.identifier.citation	Proceedings of the International Joint Conference on Neural Networks, 2019, 2019-July	en_US
dc.identifier.isbn	9781728119854	en_US
dc.identifier.uri	http://hdl.handle.net/10453/139133
dc.description.abstract	© 2019 IEEE. Visual attention mechanisms have been broadly used by image captioning models to attend to related visual information dynamically, allowing fine-grained image understanding and reasoning. However, they are only designed to discover the region-level alignment between visual features and the language feature. The exploration of higher-level visual relationship information between image regions, which is rarely studied in recent work, is beyond their capabilities. To fill this gap, we propose a novel visual relationship attention model based on the parallel attention mechanism under learnt spatial constraints. It can extract relationship information from visual regions and language and then achieve the relationship-level alignment between them. Using combined visual relationship attention and visual region attention to attend to related visual relationships and regions respectively, our image captioning model achieves state-of-the-art performance on the MSCOCO dataset. Both quantitative and qualitative analyses demonstrate that our visual relationship attention model can capture related visual relationships and further improve caption quality.	en_US
dc.relation.ispartof	Proceedings of the International Joint Conference on Neural Networks	en_US
dc.relation.isbasedon	10.1109/IJCNN.2019.8851832	en_US
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	Visual Relationship Attention for Image Captioning	en_US
dc.type	Conference Proceeding
utslib.citation.volume	2019-July	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Electrical and Data Engineering
pubs.organisational-group	/University of Technology Sydney/Strength - GBDTC - Global Big Data Technologies
pubs.organisational-group	/University of Technology Sydney/Strength - INEXT - Innovation in IT Services and Applications
utslib.copyright.status	closed_access	*
pubs.publication-status	Published	en_US
pubs.volume	2019-July	en_US

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/139133