Fine-grained and semantic-guided visual attention for image captioning

Zhang, Z; Wu, Q; Wang, Y; Chen, F

Publication Type: Conference Proceeding
Citation: Proceedings - 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, 2018, 2018-January, pp. 1709-1717
Issue Date: 2018-05-03

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Zhang, Z https://orcid.org/0000-0001-9309-1826	en_US
dc.contributor.author	Wu, Q https://orcid.org/0000-0001-5641-2483	en_US
dc.contributor.author	Wang, Y	en_US
dc.contributor.author	Chen, F https://orcid.org/0000-0003-4971-8729	en_US
dc.date.available	2020-05-25T19:02:46Z
dc.date.issued	2018-05-03	en_US
dc.identifier.citation	Proceedings - 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, 2018, 2018-January pp. 1709 - 1717	en_US
dc.identifier.isbn	9781538648865	en_US
dc.identifier.uri	http://hdl.handle.net/10453/132534
dc.relation.ispartof	Proceedings - 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018	en_US
dc.relation.isbasedon	10.1109/WACV.2018.00190	en_US
dc.rights	info:eu-repo/semantics/openAccess
dc.title	Fine-grained and semantic-guided visual attention for image captioning	en_US
dc.type	Conference Proceeding
utslib.citation.volume	2018-January	en_US
utslib.for	0803 Computer Software	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Electrical and Data Engineering
pubs.organisational-group	/University of Technology Sydney/Strength - GBDTC - Global Big Data Technologies
pubs.organisational-group	/University of Technology Sydney/Strength - INEXT - Innovation in IT Services and Applications
utslib.copyright.status	open_access	*
pubs.publication-status	Published	en_US
pubs.volume	2018-January	en_US

Abstract:

© 2018 IEEE. Soft attention is regarded as one of the representative methods for image captioning. Built on the end-to-end CNN-LSTM framework, it was the first approach to try to link the relevant visual information in the image with the semantic representation in the text (i.e. the caption). In recent years, several state-of-the-art methods motivated by this approach have been published, and they include more elegant fine-tuning operations. However, due to the constraints of the CNN architecture, the given image is only segmented into a fixed-resolution grid at a coarse level. The overall visual feature created for each grid cell indiscriminately fuses all the objects, or portions of objects, inside that cell. There is no semantic link among grid cells, although an object may be split across different grid cells. In addition, large-area stuff (e.g. sky and beach) cannot be represented by current methods. To tackle these problems, this paper proposes a new model based on the FCN-LSTM framework, which can segment the input image into a fine-grained grid. Moreover, the visual feature representing each grid cell is contributed only by the principal object, or its portion, in the corresponding cell. By adopting pixel-wise labels (i.e. semantic segmentation), the visual representations of different grid cells are correlated with each other. In this way, a mechanism of fine-grained and semantic-guided visual attention is created, which can better link the relevant visual information with each semantic meaning inside the text through the LSTM. Without using elegant fine-tuning, comprehensive experiments show promising performance consistently across different evaluation metrics.
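The idea of building each grid cell's feature only from its principal object can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the majority-label rule for choosing the "principal" object, the mean pooling, and the dot-product attention score are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np

def principal_cell_features(features, labels, cell):
    """Pool one feature vector per grid cell from the cell's dominant label only.

    features: (H, W, C) per-pixel feature map (stand-in for an FCN output).
    labels:   (H, W) non-negative integer semantic-segmentation labels.
    cell:     side length of a square grid cell; assumed to divide H and W.
    Returns (cell_feats, cell_labels) with shapes (N, C) and (N,).
    """
    H, W, C = features.shape
    feats, principal = [], []
    for y in range(0, H, cell):
        for x in range(0, W, cell):
            f = features[y:y + cell, x:x + cell].reshape(-1, C)
            l = labels[y:y + cell, x:x + cell].reshape(-1)
            top = np.bincount(l).argmax()           # most frequent label in the cell
            feats.append(f[l == top].mean(axis=0))  # pixels of other labels are ignored
            principal.append(top)
    return np.stack(feats), np.array(principal)

def soft_attention(cell_feats, query):
    """Softmax-weighted sum of cell features (dot-product scoring is an assumption)."""
    scores = cell_feats @ query
    weights = np.exp(scores - scores.max())  # subtract max for numerical stability
    weights /= weights.sum()
    return weights, weights @ cell_feats

# Toy input: a 4x4 "image" with 3 feature channels, split into 2x2 cells.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 3))
labels = np.array([[0, 0, 1, 1],
                   [0, 2, 1, 1],
                   [3, 3, 1, 1],
                   [3, 3, 1, 4]])
cell_feats, cell_labels = principal_cell_features(features, labels, cell=2)
# `query` stands in for a decoder (LSTM) state; here it is just a vector of ones.
weights, context = soft_attention(cell_feats, query=np.ones(3))
```

In this toy input the second and fourth cells share the same principal label, which is one simple way cells covering the same object could be related to each other through the segmentation labels.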

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/132534