Exploring region relationships implicitly: Image captioning with visual relationship attention

Publisher:
ELSEVIER
Publication Type:
Journal Article
Citation:
Image and Vision Computing, 2021, 109
Issue Date:
2021-05-01
Filename: 1-s2.0-S0262885621000512-main.pdf
Description: Published version
Size: 3.33 MB
Format: Adobe PDF
Abstract:
Visual attention mechanisms have been widely used in image captioning models to dynamically attend to related visual regions based on the given language information. This capability allows a trained model to carry out fine-grained image understanding and reasoning. However, existing visual attention models focus only on individual visual regions in the image and on the alignment between the language representation and those individual regions; they do not fully explore the relationships/interactions between visual regions. Furthermore, they do not analyze or explore the alignment for related words/phrases (e.g., verbs or phrasal verbs), which may best describe the relationships/interactions between these visual regions. As a result, current image captioning models can produce inaccurate or inappropriate descriptions. Instead of the visual region attention commonly adopted by existing visual attention mechanisms, this paper proposes a novel visual relationship attention via contextualized embeddings of individual regions. It can dynamically explore a related visual relationship between multiple regions when generating interaction words. This relationship-exploring process is constrained by spatial relationships and driven by the linguistic context of the language decoder. In this work, the new visual relationship attention is designed as a parallel attention mechanism under a learned spatial constraint, in order to map visual relationship information more precisely to the semantic description of that relationship in language. Different from existing methods for exploring visual relationships, it is trained implicitly in an unsupervised manner, without any explicit visual relationship annotations. By integrating the newly proposed visual relationship attention with existing visual region attention, our image captioning model can generate high-quality captions. Extensive experiments on the MSCOCO dataset demonstrate that the proposed visual relationship attention effectively boosts captioning performance by capturing related visual relationships and generating accurate interaction descriptions.
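The abstract does not give implementation details, so the following PyTorch-style sketch is only a rough illustration of the general idea, not the paper's actual architecture: a decoder-conditioned attention over individual regions runs in parallel with an attention over region pairs, where a learned bias computed from pairwise box geometry acts as a stand-in for the spatial constraint. All names (RelationshipAttention, spatial_bias, etc.) and design choices here are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationshipAttention(nn.Module):
    """Illustrative sketch: region attention in parallel with pairwise
    (relationship) attention, both conditioned on the decoder state."""

    def __init__(self, feat_dim: int, hid_dim: int, att_dim: int = 512):
        super().__init__()
        # Standard region-attention projections
        self.region_q = nn.Linear(hid_dim, att_dim)
        self.region_k = nn.Linear(feat_dim, att_dim)
        self.region_score = nn.Linear(att_dim, 1)
        # Relationship (pairwise) attention projections
        self.pair_q = nn.Linear(hid_dim, att_dim)
        self.pair_k = nn.Linear(2 * feat_dim, att_dim)
        self.pair_score = nn.Linear(att_dim, 1)
        # Assumed form of the learned spatial constraint:
        # pairwise box geometry -> additive attention bias
        self.spatial_bias = nn.Sequential(
            nn.Linear(4, att_dim), nn.ReLU(), nn.Linear(att_dim, 1))

    def forward(self, regions, boxes, h_t):
        # regions: (N, feat_dim) region features
        # boxes:   (N, 4) normalized box coordinates
        # h_t:     (hid_dim,) current decoder hidden state (linguistic context)
        N = regions.size(0)

        # --- region attention over individual regions ---
        e_r = self.region_score(
            torch.tanh(self.region_k(regions) + self.region_q(h_t))).squeeze(-1)
        alpha_r = F.softmax(e_r, dim=0)              # (N,)
        region_ctx = alpha_r @ regions               # (feat_dim,)

        # --- relationship attention over ordered region pairs ---
        pair_feats = torch.cat(
            [regions.unsqueeze(1).expand(N, N, -1),
             regions.unsqueeze(0).expand(N, N, -1)], dim=-1)   # (N, N, 2*feat_dim)
        geom = boxes.unsqueeze(1) - boxes.unsqueeze(0)          # crude pairwise geometry
        e_p = self.pair_score(
            torch.tanh(self.pair_k(pair_feats) + self.pair_q(h_t))).squeeze(-1)
        e_p = e_p + self.spatial_bias(geom).squeeze(-1)         # spatial constraint as a bias
        alpha_p = F.softmax(e_p.view(-1), dim=0).view(N, N)
        rel_ctx = torch.einsum('ij,ijd->d', alpha_p, pair_feats)  # (2*feat_dim,)

        # A caption decoder would consume both contexts when predicting the
        # next word, e.g. region_ctx for object words, rel_ctx for interaction words.
        return region_ctx, rel_ctx

In this sketch the spatial constraint is modeled as an additive bias on the pairwise attention logits, which keeps the relationship branch unsupervised: the only training signal is the captioning loss back-propagated through the decoder, matching the abstract's claim that no explicit visual relationship annotations are used.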