Exploring Pairwise Relationships Adaptively from Linguistic Context in Image Captioning

Publisher:
Institute of Electrical and Electronics Engineers
Publication Type:
Journal Article
Citation:
IEEE Transactions on Multimedia, 2022, 24, pp. 3101-3113
Issue Date:
2022-01-01
Abstract:
For image captioning, recent works have begun to explore visual relationships in order to generate high-quality interactive words (i.e. verbs and prepositions). However, many existing works operate only at the semantic level, analysing feature similarity between objects in the visual domain while ignoring the linguistic context contained in the caption decoder. During caption generation, entity words can be inferred from the visual information of objects, whereas interactive words, which express the relationships between entity words, can only be inferred from the high-level language meaning produced during caption decoding. We call this high-level language meaning the linguistic context; it refers to the relational context between words or phrases in the caption sentence and can serve as strong guidance for effectively exploring related visual relationships between objects. To this end, we propose a novel context-adaptive attention module that is driven by the linguistic context from the caption decoder. Within this module, a visual relationship attention mechanism built on a bilinear self-attention model explores related visual relationships and encodes more discriminative features under the linguistic context; it works in parallel with visual region attention. To adaptively attend to visual relationships when generating interactive words and to visual objects when generating entity words, an attention modulator is integrated as an attention channel controller that responds dynamically to the changing linguistic context of the caption decoder. To take full advantage of the linguistic context in the caption, an additional interaction dataset is extracted from the COCO caption dataset and the COCO Entities dataset to explicitly supervise the training of the proposed context-adaptive attention module. Experiments on the MSCOCO caption dataset demonstrate that leveraging linguistic context as strong guidance is effective for exploring visual relationships and generating high-quality captions. Our model achieves state-of-the-art performance compared with counterpart models that explore visual relationships.
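To make the described architecture concrete, below is a minimal sketch, in PyTorch, of a context-adaptive attention module with two channels (region attention and bilinear relationship attention) mixed by a modulator gate driven by the decoder hidden state. This is not the authors' implementation; all class, layer, and parameter names, the additive form of the region channel, and the sigmoid gating are assumptions for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextAdaptiveAttention(nn.Module):
    """Hypothetical sketch: two attention channels (visual regions and pairwise
    relationships) gated by the decoder's linguistic context."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 512):
        super().__init__()
        # Channel 1: additive attention over region features (assumed form).
        self.region_q = nn.Linear(hidden_dim, attn_dim)
        self.region_k = nn.Linear(feat_dim, attn_dim)
        self.region_score = nn.Linear(attn_dim, 1)
        # Channel 2: bilinear attention over pairwise relationship features,
        # conditioned on the linguistic context (decoder hidden state).
        self.bilinear = nn.Bilinear(feat_dim, hidden_dim, attn_dim)
        self.rel_score = nn.Linear(attn_dim, 1)
        # Attention modulator: a gate driven by the linguistic context that
        # decides how much to rely on relationships vs. regions at each step.
        self.modulator = nn.Linear(hidden_dim, 1)

    def forward(self, regions, rel_feats, h_lang):
        # regions:   (B, N, feat_dim)  visual region features
        # rel_feats: (B, M, feat_dim)  pairwise relationship features
        # h_lang:    (B, hidden_dim)   decoder hidden state (linguistic context)

        # Region attention channel.
        q = self.region_q(h_lang).unsqueeze(1)                       # (B, 1, attn_dim)
        k = self.region_k(regions)                                   # (B, N, attn_dim)
        a_reg = F.softmax(self.region_score(torch.tanh(q + k)).squeeze(-1), dim=-1)
        ctx_reg = torch.bmm(a_reg.unsqueeze(1), regions).squeeze(1)  # (B, feat_dim)

        # Relationship attention channel (bilinear with linguistic context).
        h_exp = h_lang.unsqueeze(1).expand(-1, rel_feats.size(1), -1).contiguous()
        scores = self.rel_score(torch.tanh(self.bilinear(rel_feats, h_exp))).squeeze(-1)
        a_rel = F.softmax(scores, dim=-1)
        ctx_rel = torch.bmm(a_rel.unsqueeze(1), rel_feats).squeeze(1)  # (B, feat_dim)

        # Modulator: per-step gate between the two channels; intuitively the gate
        # opens toward the relationship channel when an interactive word is due.
        g = torch.sigmoid(self.modulator(h_lang))                    # (B, 1)
        return g * ctx_rel + (1.0 - g) * ctx_reg

In this sketch the fused context vector would be fed back to the caption decoder at every time step; the scalar gate is one simple way to realise an "attention channel controller", and a per-dimension gate or learned softmax over channels would be equally plausible.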