Size-Invariant Attention Accuracy Metric for Image Captioning with High-Resolution Residual Attention

Publication Type:
Conference Proceeding
Citation:
2018 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2018, 2019
Issue Date:
2019-01-16
File:
dicta2018.pdf (Accepted Manuscript, 736.9 kB, Adobe PDF)
© 2018 IEEE. Spatial visual attention mechanisms have achieved significant performance improvements in image captioning. To quantitatively evaluate attention mechanisms, the 'attention correctness' metric was proposed to calculate the sum of the attention weights generated for ground-truth regions. However, this metric cannot consistently measure attention accuracy across regions with large size variance. Moreover, its evaluations are inconsistent with captioning performance across different fine-grained attention resolutions. To address these problems, this paper proposes a size-invariant evaluation metric that normalizes the 'attention correctness' metric by the size percentage of the attended region. To demonstrate the effectiveness of our size-invariant metric, this paper further proposes a high-resolution residual attention model that uses RefineNet as the Fully Convolutional Network (FCN) encoder. By using the COCO-Stuff dataset, we can achieve pixel-level evaluations on both object and 'stuff' regions. We use our metric to evaluate the proposed attention model at four fine-grained resolutions (27×27, 40×40, 60×60, 80×80). The results demonstrate that, compared with the 'attention correctness' metric, our size-invariant metric is more consistent with captioning performance and more effective for evaluating attention accuracy.
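The abstract describes the size-invariant metric only at a high level: sum the attention weights over the ground-truth region ('attention correctness'), then normalize by the size percentage of that region. A minimal sketch of that idea, assuming a normalized 2-D attention map and a boolean ground-truth mask (the exact normalization in the paper may differ):

```python
import numpy as np

def attention_correctness(attn, gt_mask):
    """'Attention correctness': sum of the attention weights that fall
    inside the ground-truth region."""
    attn = attn / attn.sum()          # ensure the weights form a distribution
    return float(attn[gt_mask].sum())

def size_invariant_correctness(attn, gt_mask):
    """Hypothetical sketch of the size-invariant metric: attention
    correctness divided by the size percentage of the attended region."""
    size_pct = gt_mask.mean()         # fraction of the map the region covers
    return attention_correctness(attn, gt_mask) / size_pct
```

For a uniform attention map, attention correctness equals the region's size fraction, so small regions score low and large regions score high even though the attention is equally uninformative for both; the size-normalized score is 1 in either case, which is the size invariance the abstract argues for.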