Visual Relationship Attention for Image Captioning

Publication Type: Conference Proceeding
Proceedings of the International Joint Conference on Neural Networks, 2019, 2019-July
Filename | Description | Size
08851832.pdf | Published version | 1.07 MB (Adobe PDF)
© 2019 IEEE. Visual attention mechanisms are widely used in image captioning models to attend dynamically to relevant visual information, enabling fine-grained image understanding and reasoning. However, they are designed only to discover region-level alignment between visual features and the language features. Exploring higher-level visual relationship information between image regions, which has rarely been studied in recent work, is beyond their capabilities. To fill this gap, we propose a novel visual relationship attention model based on a parallel attention mechanism under learnt spatial constraints. It extracts relationship information from visual regions and language and then achieves relationship-level alignment between them. By combining visual relationship attention and visual region attention to attend to relevant visual relationships and regions respectively, our image captioning model achieves state-of-the-art performance on the MSCOCO dataset. Both quantitative and qualitative analyses demonstrate that our visual relationship attention model captures relevant visual relationships and further improves caption quality.