CMGNet: Collaborative multi-modal graph network for video captioning

Publisher:
ACADEMIC PRESS INC ELSEVIER SCIENCE
Publication Type:
Journal Article
Citation:
Computer Vision and Image Understanding, 2024, 238
Issue Date:
2024-01-01
File:
CMGNet Collaborative multi-modal graph network for video captioning.pdf (Accepted version, 15.13 MB, Adobe PDF)
Abstract:
In video captioning, it is challenging to comprehensively describe the multi-modal content of a video, such as appearance, motion, and objects. Prior works often neglect the interactions among multiple modalities, so their video representations may not fully depict scene content. In this paper, we propose a Collaborative Multi-modal Graph Network (CMGNet) to explore the interactions among multi-modal features for video captioning. CMGNet follows an encoder–decoder structure, consisting of a Compression-driven Intra-inter Attentive Graph (CIAG) encoder and an Adaptive Multi-modal Selection (AMS) decoder. In the CIAG encoder, we first design a Basis Vector Compression (BVC) module that reduces redundant graph nodes and thus improves efficiency when handling a large number of nodes. We then propose an Intra-inter Attentive Graph (IAG) that improves the graph representation by sharing information across intra- and inter-modality nodes. Finally, the AMS decoder generates video captions from the encoded video representations; it learns to produce each word by adaptively focusing on information from different modalities, leading to comprehensive and accurate captions. Extensive experiments on the large-scale benchmarks MSR-VTT and TGIF demonstrate that the proposed CMGNet achieves state-of-the-art performance.
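Since only the abstract is available here, the following is a minimal sketch, assuming a PyTorch implementation, of what the "adaptive multi-modal selection" step in the AMS decoder could look like: the decoder attends separately to appearance, motion, and object features and fuses the per-modality contexts with gates conditioned on the current decoding state. All module names, dimensions, and the gating rule are illustrative assumptions, not the paper's actual code.

```python
# Hypothetical sketch of adaptive multi-modal selection (AMS-style fusion).
# The attention/gating design below is a generic stand-in, not CMGNet's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveMultiModalSelection(nn.Module):
    def __init__(self, hidden_dim: int, num_modalities: int = 3):
        super().__init__()
        # One cross-attention module per modality (appearance / motion / object).
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
            for _ in range(num_modalities)
        )
        # Scalar gate per modality, conditioned on the decoder state and contexts.
        self.gate = nn.Linear(hidden_dim * (num_modalities + 1), num_modalities)

    def forward(self, decoder_state, modality_feats):
        # decoder_state: (B, H) current word-generation state.
        # modality_feats: list of (B, N_m, H) encoded node features per modality.
        query = decoder_state.unsqueeze(1)                      # (B, 1, H)
        contexts = []
        for attn, feats in zip(self.attn, modality_feats):
            ctx, _ = attn(query, feats, feats)                  # (B, 1, H)
            contexts.append(ctx.squeeze(1))                     # (B, H)
        gate_in = torch.cat([decoder_state] + contexts, dim=-1)
        weights = F.softmax(self.gate(gate_in), dim=-1)         # (B, M)
        fused = sum(w.unsqueeze(-1) * c
                    for w, c in zip(weights.unbind(-1), contexts))
        return fused, weights                                   # fused context, modality weights


# Toy usage: three modalities with different numbers of graph nodes.
if __name__ == "__main__":
    B, H = 2, 256
    ams = AdaptiveMultiModalSelection(H)
    feats = [torch.randn(B, n, H) for n in (20, 16, 36)]
    state = torch.randn(B, H)
    context, modality_weights = ams(state, feats)
    print(context.shape, modality_weights.shape)  # (2, 256) and (2, 3)
```

The gating weights give a per-word indication of which modality the (hypothetical) decoder relies on, which is one plausible reading of "adaptively focusing on different modality information" in the abstract.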