Size-Invariant Attention Accuracy Metric for Image Captioning with High-Resolution Residual Attention

Zhang, Z; Wu, Q; Wang, Y; Chen, F

Publication Type: Conference Proceeding
Citation: 2018 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2018, 2019
Issue Date: 2019-01-16

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Zhang, Z https://orcid.org/0000-0001-9309-1826	en_US
dc.contributor.author	Wu, Q https://orcid.org/0000-0001-5641-2483	en_US
dc.contributor.author	Wang, Y	en_US
dc.contributor.author	Chen, F https://orcid.org/0000-0003-4971-8729	en_US
dc.date.available	2021-01-01T18:04:55Z
dc.date.issued	2019-01-16	en_US
dc.identifier.citation	2018 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2018, 2019	en_US
dc.identifier.isbn	9781538666029	en_US
dc.identifier.uri	http://hdl.handle.net/10453/132535
dc.relation.ispartof	2018 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2018	en_US
dc.relation.isbasedon	10.1109/DICTA.2018.8615788	en_US
dc.rights	info:eu-repo/semantics/openAccess
dc.title	Size-Invariant Attention Accuracy Metric for Image Captioning with High-Resolution Residual Attention	en_US
dc.type	Conference Proceeding
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Electrical and Data Engineering
pubs.organisational-group	/University of Technology Sydney/Strength - GBDTC - Global Big Data Technologies
pubs.organisational-group	/University of Technology Sydney/Strength - INEXT - Innovation in IT Services and Applications
utslib.copyright.status	open_access	*
pubs.publication-status	Published	en_US

Abstract:

© 2018 IEEE. Spatial visual attention mechanisms have achieved significant performance improvements for image captioning. To quantitatively evaluate the performance of attention mechanisms, the 'attention correctness' metric has been proposed, which calculates the sum of the attention weights generated for ground-truth regions. However, this metric cannot consistently measure attention accuracy across element regions with large size variance. Moreover, its evaluations are inconsistent with captioning performance across different fine-grained attention resolutions. To address these problems, this paper proposes a size-invariant evaluation metric that normalizes the 'attention correctness' metric by the size percentage of the attended region. To demonstrate the efficiency of the size-invariant metric, this paper further proposes a high-resolution residual attention model that uses RefineNet as the Fully Convolutional Network (FCN) encoder. Using the COCO-Stuff dataset, we achieve pixel-level evaluations on both object and 'stuff' regions. We use our metric to evaluate the proposed attention model at four high, fine-grained resolutions (i.e., 27×27, 40×40, 60×60, 80×80). The results demonstrate that, compared with the 'attention correctness' metric, our size-invariant metric is more consistent with captioning performance and more efficient for evaluating attention accuracy.
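
The two metrics described in the abstract can be illustrated with a minimal sketch. This is one possible reading of the abstract, not the authors' implementation: the record does not give the exact formula, so dividing by the region's fraction of grid cells is an assumption, and the function names, the binary-mask representation, and the 4×4 grid are all hypothetical.

```python
from typing import Sequence

Grid = Sequence[Sequence[float]]  # attention weights, one per grid cell
Mask = Sequence[Sequence[int]]    # 1 inside the ground-truth region, else 0


def attention_correctness(attention: Grid, mask: Mask) -> float:
    """Sum of the attention weights that fall inside the ground-truth region."""
    return float(sum(
        weight
        for att_row, mask_row in zip(attention, mask)
        for weight, inside in zip(att_row, mask_row)
        if inside
    ))


def region_size_fraction(mask: Mask) -> float:
    """Fraction of grid cells covered by the region (its 'size percentage')."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask has no cells")
    covered = sum(1 for row in mask for inside in row if inside)
    return covered / total


def size_normalized_correctness(attention: Grid, mask: Mask) -> float:
    """ASSUMED reading: attention correctness divided by the size fraction."""
    fraction = region_size_fraction(mask)
    if fraction == 0:
        raise ValueError("ground-truth region is empty")
    return attention_correctness(attention, mask) / fraction


# Uniform attention over a hypothetical 4x4 grid: every cell gets weight 1/16.
uniform = [[1 / 16] * 4 for _ in range(4)]
small_region = [[1, 0, 0, 0]] + [[0, 0, 0, 0] for _ in range(3)]         # 1 cell
large_region = [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]  # 8 cells

print(attention_correctness(uniform, small_region))        # 0.0625
print(attention_correctness(uniform, large_region))        # 0.5
print(size_normalized_correctness(uniform, small_region))  # 1.0
print(size_normalized_correctness(uniform, large_region))  # 1.0
```

In this sketch, uniform attention scores eight times higher on the large region than on the small one under the raw sum, which shows the size dependence the abstract describes. After normalization both regions score 1.0, so a value above 1.0 would indicate attention more concentrated on the region than a uniform spread.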

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/132535