Position-aware image captioning with spatial relation

Duan, Y; Wang, Z; Wang, J; Wang, YK; Lin, CT

Publisher: ELSEVIER
Publication Type: Journal Article
Citation: Neurocomputing, 2022, 497, pp. 28-38
Issue Date: 2022-08-01

Closed Access

Filename	Description	Size	Format
Position-aware image captioning with spatial relation.pdf	Published version	1.77 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Duan, Y https://orcid.org/0000-0003-1517-994X
dc.contributor.author	Wang, Z
dc.contributor.author	Wang, J
dc.contributor.author	Wang, YK
dc.contributor.author	Lin, CT
dc.date.accessioned	2023-03-23T04:36:10Z
dc.date.available	2023-03-23T04:36:10Z
dc.date.issued	2022-08-01
dc.identifier.citation	Neurocomputing, 2022, 497, pp. 28-38
dc.identifier.issn	0925-2312
dc.identifier.issn	1872-8286
dc.identifier.uri	http://hdl.handle.net/10453/168203
dc.description.abstract	Image captioning aims to generate a natural-language description of a given image. The problem can be approached by learning semantic information about visual objects and generating descriptions from the extracted embeddings. However, the spatial relationships between visual objects and their static positions are not fully explored by existing methods. In this work, we propose a Position-Aware Transformer (PAT) model that extracts both regional and static global visual features and unifies them by incorporating spatial information aligned to each visual feature. To better represent spatial information and the correlation between extracted visual features, we propose and compare three approaches that explicitly explore position embedding with spatial relation information. Moreover, we jointly consider the static global and regional embeddings for spatial modeling. Experimental results show that the proposed model achieves competitive performance on the COCO image captioning dataset, where PAT reaches 38.7, 28.6, and 58.6 on BLEU-4, METEOR, and ROUGE-L, respectively. Extensive experiments suggest that PAT also reaches competitive performance on related vision-language tasks, including visual question answering (VQA) and multi-modal retrieval. Detailed ablation studies report how each component contributes to the final performance, which may serve as a reference for follow-up work on spatial information representation.
dc.language	English
dc.publisher	ELSEVIER
dc.relation.ispartof	Neurocomputing
dc.relation.isbasedon	10.1016/j.neucom.2022.05.003
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	08 Information and Computing Sciences, 09 Engineering, 17 Psychology and Cognitive Sciences
dc.subject.classification	Artificial Intelligence & Image Processing
dc.title	Position-aware image captioning with spatial relation
dc.type	Journal Article
utslib.citation.volume	497
utslib.for	08 Information and Computing Sciences
utslib.for	09 Engineering
utslib.for	17 Psychology and Cognitive Sciences
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	closed_access	*
dc.date.updated	2023-03-23T04:36:09Z
pubs.publication-status	Published
pubs.volume	497
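
Illustrative sketch (not part of the repository record): the abstract in the metadata record describes attaching spatial relation information to regional and global visual features, but the item is closed access and the record does not specify the three position-embedding approaches. The Python below is therefore only a rough illustration under stated assumptions, not the authors' method. It assumes regions are axis-aligned boxes given as (x_min, y_min, x_max, y_max) with positive width and height, treats the global feature as a box covering the whole image, and uses a log-ratio relative-geometry encoding chosen purely for illustration. The names `relative_geometry` and `pairwise_relations` are hypothetical.

```python
import math


def relative_geometry(box_a, box_b, eps=1e-3):
    """Return a 4-value relative geometry feature of box_b as seen from box_a.

    Assumption: boxes are (x_min, y_min, x_max, y_max) with positive
    width and height. The log-ratio form is an illustrative choice.
    """
    # Box centres
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    # Box widths and heights
    aw, ah = box_a[2] - box_a[0], box_a[3] - box_a[1]
    bw, bh = box_b[2] - box_b[0], box_b[3] - box_b[1]
    return (
        # Centre offsets, clamped by eps so log() never sees zero
        math.log(max(abs(bx - ax), eps) / aw),
        math.log(max(abs(by - ay), eps) / ah),
        # Relative scale of the two boxes
        math.log(bw / aw),
        math.log(bh / ah),
    )


def pairwise_relations(region_boxes, image_size):
    """Build an (N+1) x (N+1) grid of relation features.

    Index 0 stands for the global feature, modelled here (an assumption)
    as a box spanning the full image; indices 1..N are the regions.
    """
    width, height = image_size
    global_box = (0.0, 0.0, float(width), float(height))
    boxes = [global_box] + list(region_boxes)
    return [[relative_geometry(a, b) for b in boxes] for a in boxes]


# Two hypothetical region boxes in a 640 x 480 image
regions = [(10, 20, 110, 220), (200, 50, 400, 250)]
relations = pairwise_relations(regions, image_size=(640, 480))
print(len(relations), len(relations[0]), len(relations[0][0]))
```

In a Transformer-style model, features like these would typically be projected and used to bias attention weights or be added to the visual embeddings. Which of those options corresponds to the approaches compared in the paper cannot be determined from this record.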

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/168203