Convolutional Reconstruction-to-Sequence for Video Captioning

Publisher: Institute of Electrical and Electronics Engineers
Publication Type: Journal Article
IEEE Transactions on Circuits and Systems for Video Technology, 2020, 30(11), pp. 4299-4308
Recent advances in video captioning mainly follow an encoder-decoder (sequence-to-sequence) framework and generate captions with a recurrent neural network (RNN). However, an RNN decoder (generator) is prone to diluting long-term information, which weakens its ability to capture long-term dependencies. Recent work has demonstrated that convolutional neural networks (CNNs) can also model sequential information. Despite their strengths in representation ability and computational efficiency, CNNs have not been well exploited in video captioning, partly because of the difficulty of modeling multi-modal sequences with a CNN. In this paper, we devise a novel CNN-based encoder-decoder framework for video captioning. Specifically, we first append inter-frame differences to each CNN-extracted frame feature to obtain a more discriminative representation; taking these as input, we then encode each frame into a more compact feature with a one-layer convolutional mapping, which can be viewed as a reconstruction network. In the decoding stage, we first fuse visual and lexical features, then stack multiple dilated convolutional layers to form a hierarchical decoder. Because long-term dependencies can be captured along a shorter path through the hierarchical structure, the decoder alleviates the loss of long-term information. Experiments on two benchmark datasets show that our method obtains state-of-the-art performance.
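The encoder input construction and the decoder's motivation can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the dimensions, the ReLU activation, and the doubling dilation schedule are assumptions; the abstract specifies only that inter-frame differences are appended to the frame features, that a one-layer convolutional mapping produces a compact per-frame feature, and that the decoder stacks dilated convolutional layers.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, d_compact = 8, 16, 4   # number of frames, feature dim, compact dim (assumed values)

# 1) Append inter-frame differences to each CNN-extracted frame feature.
feats = rng.standard_normal((T, d))              # per-frame CNN features
diffs = np.vstack([np.zeros((1, d)),             # frame 0 has no predecessor
                   np.diff(feats, axis=0)])      # f_t - f_{t-1}
enc_in = np.concatenate([feats, diffs], axis=1)  # shape (T, 2d)

# 2) One-layer convolutional mapping to a compact per-frame feature
#    (a kernel-size-1 temporal convolution is a linear map shared across frames).
W = 0.1 * rng.standard_normal((2 * d, d_compact))
encoded = np.maximum(enc_in @ W, 0.0)            # shape (T, d_compact)

# 3) Receptive field of stacked dilated conv layers (kernel size k,
#    dilation doubling per layer): it grows exponentially with depth,
#    so distant tokens are reached through a short path.
def receptive_field(k, num_layers):
    return 1 + (k - 1) * sum(2 ** l for l in range(num_layers))

print(enc_in.shape, encoded.shape, receptive_field(3, 4))  # (8, 32) (8, 4) 31
```

With kernel size 3 and four layers, the receptive field already spans 31 time steps, which illustrates why the hierarchical decoder can capture long-term dependencies along a shorter path than a step-by-step recurrent decoder.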