High-Quality Image Captioning with Fine-Grained and Semantic-Guided Visual Attention

Zhang, Z; Wu, Q; Wang, Y; Chen, F


Publication Type: Journal Article
Citation: IEEE Transactions on Multimedia, 2019, 21 (7), pp. 1681 - 1693
Issue Date: 2019-07-01


This item is open access.


Full metadata record

Field	Value	Language
dc.contributor.author	Zhang, Z https://orcid.org/0000-0001-9309-1826	en_US
dc.contributor.author	Wu, Q https://orcid.org/0000-0001-5641-2483	en_US
dc.contributor.author	Wang, Y	en_US
dc.contributor.author	Chen, F https://orcid.org/0000-0003-4971-8729	en_US
dc.date.available	2021-04-06T19:00:55Z
dc.date.issued	2019-07-01	en_US
dc.identifier.citation	IEEE Transactions on Multimedia, 2019, 21 (7), pp. 1681 - 1693	en_US
dc.identifier.issn	1520-9210	en_US
dc.identifier.uri	http://hdl.handle.net/10453/132536
dc.relation.ispartof	IEEE Transactions on Multimedia	en_US
dc.relation.isbasedon	10.1109/TMM.2018.2888822	en_US
dc.rights	info:eu-repo/semantics/openAccess
dc.subject.classification	Artificial Intelligence & Image Processing	en_US
dc.title	High-Quality Image Captioning with Fine-Grained and Semantic-Guided Visual Attention	en_US
dc.type	Journal Article
utslib.citation.volume	7	en_US
utslib.citation.volume	21	en_US
utslib.for	0804 Data Format	en_US
utslib.for	0906 Electrical and Electronic Engineering	en_US
utslib.for	08 Information and Computing Sciences	en_US
utslib.for	09 Engineering	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Electrical and Data Engineering
pubs.organisational-group	/University of Technology Sydney/Strength - GBDTC - Global Big Data Technologies
pubs.organisational-group	/University of Technology Sydney/Strength - INEXT - Innovation in IT Services and Applications
utslib.copyright.status	open_access	*
pubs.issue	7	en_US
pubs.publication-status	Published	en_US
pubs.volume	21	en_US

Abstract:

© 1999-2012 IEEE. The soft-attention mechanism is regarded as one of the representative methods for image captioning. Built on the end-to-end convolutional neural network (CNN)-long short-term memory (LSTM) framework, the soft-attention mechanism was the first attempt to link the semantic representation in text (i.e., captioning) with the relevant visual information in the image. Motivated by this approach, several state-of-the-art attention methods have been proposed. However, due to the constraints of the CNN architecture, the given image is segmented only into a fixed-resolution grid at a coarse level. The visual feature extracted from each grid cell indiscriminately fuses all the objects inside it and/or their portions, and there is no semantic link between grid cells. In addition, large-area 'stuff' (e.g., the sky or a beach) cannot be represented using the current methods. To address these problems, this paper proposes a new model based on the fully convolutional network (FCN)-LSTM framework, which can generate an attention map at a fine-grained grid-wise resolution. Moreover, the visual feature of each grid cell is contributed only by its principal object. By adopting grid-wise labels (i.e., semantic segmentation), the visual representations of different grid cells are correlated with each other. With the ability to attend to large-area 'stuff,' our method can further summarize an additional semantic context from the semantic labels, which provides comprehensive context information to the language LSTM decoder. In this way, a mechanism of fine-grained and semantic-guided visual attention is created, which can accurately link the relevant visual information with each semantic meaning inside the text. As demonstrated by three experiments including both qualitative and quantitative analyses, our model can generate captions of high quality, specifically with high levels of accuracy, completeness, and diversity. Moreover, our model significantly outperforms all other methods that use VGG-based CNN encoders without fine-tuning.
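
For readers unfamiliar with the soft-attention step the abstract refers to, the following is a minimal sketch of the general idea: each grid cell's feature vector receives a score, the scores are normalized with a softmax, and the context passed to the decoder is the weighted average of the cell features. It is not the paper's model. The function names, the dot-product scoring, the 2x2 grid, and the toy feature values are all assumptions made for illustration; the record does not give the paper's FCN features, semantic labels, or scoring function, and none of them are reproduced here.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1 (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_scores(grid_features, query):
    """Hypothetical scoring: dot product of each cell feature with a query.

    In a captioning model the query would be derived from the decoder state;
    here it is just a fixed vector (an assumption for this sketch).
    """
    return [sum(f_d * q_d for f_d, q_d in zip(f, query)) for f in grid_features]


def soft_attention(grid_features, query):
    """Return per-cell attention weights and the weighted-average context."""
    weights = softmax(attention_scores(grid_features, query))
    dim = len(grid_features[0])
    context = [
        sum(w * f[d] for w, f in zip(weights, grid_features)) for d in range(dim)
    ]
    return weights, context


# Toy 2x2 grid flattened to 4 cells, each with a 3-dimensional feature.
grid_features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
query = [2.0, 0.0, 0.0]

weights, context = soft_attention(grid_features, query)
print("attention weights:", weights)
print("context vector:", context)
```

In this toy input, the first and last cells align best with the query and therefore receive the largest weights. A finer grid simply means more cells in `grid_features`, which is the resolution aspect the abstract emphasizes.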

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/132536