CMGNet: Collaborative multi-modal graph network for video captioning

Publisher:
ACADEMIC PRESS INC ELSEVIER SCIENCE
Publication Type:
Journal Article
Citation:
Computer Vision and Image Understanding, 2024, 238
Issue Date:
2024-01-01
File:
CMGNet Collaborative multi-modal graph network for video captioning.pdf (Accepted version, 15.13 MB, Adobe PDF)
Abstract:
In video captioning, it is challenging to comprehensively describe the multi-modal content of a video, such as appearance, motion, and objects. Prior works often neglect the interactions among multiple modalities, so their video representations may not fully depict scene content. In this paper, we propose a Collaborative Multi-modal Graph Network (CMGNet) to explore the interactions among multi-modal features for video captioning. CMGNet follows an encoder–decoder structure, consisting of a Compression-driven Intra-inter Attentive Graph (CIAG) encoder and an Adaptive Multi-modal Selection (AMS) decoder. In the CIAG encoder, we first design a Basis Vector Compression (BVC) module that reduces redundant graph nodes and thus improves efficiency when handling a large number of nodes. We then propose an Intra-inter Attentive Graph (IAG) that improves the graph representation by sharing information across intra- and inter-modality nodes. Finally, the AMS decoder generates video captions from the encoded video representations; it learns to produce each word by adaptively focusing on information from different modalities, leading to comprehensive and accurate captions. Extensive experiments on the large-scale benchmarks MSR-VTT and TGIF demonstrate that the proposed CMGNet achieves state-of-the-art performance.
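Since only the abstract is available here, the following is a minimal sketch, assuming a PyTorch implementation, of what the "adaptive multi-modal selection" step in the AMS decoder could look like: the decoder attends separately to appearance, motion, and object features and fuses the per-modality contexts with gates conditioned on the current decoding state. All module names, dimensions, and the gating rule are illustrative assumptions, not the paper's actual code.

```python
# Hypothetical sketch of adaptive multi-modal selection (AMS-style fusion).
# The attention/gating design below is a generic stand-in, not CMGNet's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveMultiModalSelection(nn.Module):
    def __init__(self, hidden_dim: int, num_modalities: int = 3):
        super().__init__()
        # One cross-attention module per modality (appearance / motion / object).
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
            for _ in range(num_modalities)
        )
        # Scalar gate per modality, conditioned on the decoder state and contexts.
        self.gate = nn.Linear(hidden_dim * (num_modalities + 1), num_modalities)

    def forward(self, decoder_state, modality_feats):
        # decoder_state: (B, H) current word-generation state.
        # modality_feats: list of (B, N_m, H) encoded node features per modality.
        query = decoder_state.unsqueeze(1)                      # (B, 1, H)
        contexts = []
        for attn, feats in zip(self.attn, modality_feats):
            ctx, _ = attn(query, feats, feats)                  # (B, 1, H)
            contexts.append(ctx.squeeze(1))                     # (B, H)
        gate_in = torch.cat([decoder_state] + contexts, dim=-1)
        weights = F.softmax(self.gate(gate_in), dim=-1)         # (B, M)
        fused = sum(w.unsqueeze(-1) * c
                    for w, c in zip(weights.unbind(-1), contexts))
        return fused, weights                                   # fused context, modality weights


# Toy usage: three modalities with different numbers of graph nodes.
if __name__ == "__main__":
    B, H = 2, 256
    ams = AdaptiveMultiModalSelection(H)
    feats = [torch.randn(B, n, H) for n in (20, 16, 36)]
    state = torch.randn(B, H)
    context, modality_weights = ams(state, feats)
    print(context.shape, modality_weights.shape)  # (2, 256) and (2, 3)
```

The gating weights give a per-word indication of which modality the (hypothetical) decoder relies on, which is one plausible reading of "adaptively focusing on different modality information" in the abstract.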