Exploring Pairwise Relationships Adaptively from Linguistic Context in Image Captioning

Publisher:
Institute of Electrical and Electronics Engineers
Publication Type:
Journal Article
Citation:
IEEE Transactions on Multimedia, 2022, 24, pp. 3101-3113
Issue Date:
2022-01-01
Abstract:
For image captioning, recent works have begun to explore visual relationships in order to generate high-quality interactive words (i.e. verbs and prepositions). However, many existing works operate only at the semantic level, analysing feature similarity between objects in the visual domain while ignoring the linguistic context contained in the caption decoder. During caption generation, entity words can be inferred from the visual information of objects, whereas interactive words, which express the relationships between entity words, can only be inferred from the high-level language meaning produced during caption decoding. We call this high-level language meaning the linguistic context; it refers to the relational context between words or phrases in the caption sentence and can serve as strong guidance for effectively exploring related visual relationships between objects. To this end, we propose a novel context-adaptive attention module that is driven by the linguistic context from the caption decoder. Within this module, a visual relationship attention mechanism built on a bilinear self-attention model explores related visual relationships and encodes more discriminative features under the linguistic context; it works in parallel with visual region attention. To adaptively attend to visual relationships when generating interactive words and to visual objects when generating entity words, an attention modulator is integrated as an attention channel controller that responds dynamically to the changing linguistic context of the caption decoder. To take full advantage of the linguistic context in the caption, an additional interaction dataset is extracted from the COCO caption dataset and the COCO Entities dataset to explicitly supervise the training of the proposed context-adaptive attention module. Experiments on the MSCOCO caption dataset demonstrate that leveraging linguistic context as strong guidance is effective for exploring visual relationships and generating high-quality captions. Our model achieves state-of-the-art performance compared with counterpart models that explore visual relationships.
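To make the described architecture concrete, below is a minimal sketch, in PyTorch, of a context-adaptive attention module with two channels (region attention and bilinear relationship attention) mixed by a modulator gate driven by the decoder hidden state. This is not the authors' implementation; all class, layer, and parameter names, the additive form of the region channel, and the sigmoid gating are assumptions for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextAdaptiveAttention(nn.Module):
    """Hypothetical sketch: two attention channels (visual regions and pairwise
    relationships) gated by the decoder's linguistic context."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 512):
        super().__init__()
        # Channel 1: additive attention over region features (assumed form).
        self.region_q = nn.Linear(hidden_dim, attn_dim)
        self.region_k = nn.Linear(feat_dim, attn_dim)
        self.region_score = nn.Linear(attn_dim, 1)
        # Channel 2: bilinear attention over pairwise relationship features,
        # conditioned on the linguistic context (decoder hidden state).
        self.bilinear = nn.Bilinear(feat_dim, hidden_dim, attn_dim)
        self.rel_score = nn.Linear(attn_dim, 1)
        # Attention modulator: a gate driven by the linguistic context that
        # decides how much to rely on relationships vs. regions at each step.
        self.modulator = nn.Linear(hidden_dim, 1)

    def forward(self, regions, rel_feats, h_lang):
        # regions:   (B, N, feat_dim)  visual region features
        # rel_feats: (B, M, feat_dim)  pairwise relationship features
        # h_lang:    (B, hidden_dim)   decoder hidden state (linguistic context)

        # Region attention channel.
        q = self.region_q(h_lang).unsqueeze(1)                       # (B, 1, attn_dim)
        k = self.region_k(regions)                                   # (B, N, attn_dim)
        a_reg = F.softmax(self.region_score(torch.tanh(q + k)).squeeze(-1), dim=-1)
        ctx_reg = torch.bmm(a_reg.unsqueeze(1), regions).squeeze(1)  # (B, feat_dim)

        # Relationship attention channel (bilinear with linguistic context).
        h_exp = h_lang.unsqueeze(1).expand(-1, rel_feats.size(1), -1).contiguous()
        scores = self.rel_score(torch.tanh(self.bilinear(rel_feats, h_exp))).squeeze(-1)
        a_rel = F.softmax(scores, dim=-1)
        ctx_rel = torch.bmm(a_rel.unsqueeze(1), rel_feats).squeeze(1)  # (B, feat_dim)

        # Modulator: per-step gate between the two channels; intuitively the gate
        # opens toward the relationship channel when an interactive word is due.
        g = torch.sigmoid(self.modulator(h_lang))                    # (B, 1)
        return g * ctx_rel + (1.0 - g) * ctx_reg

In this sketch the fused context vector would be fed back to the caption decoder at every time step; the scalar gate is one simple way to realise an "attention channel controller", and a per-dimension gate or learned softmax over channels would be equally plausible.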