Convolutional Reconstruction-to-Sequence for Video Captioning

Publisher:
Institute of Electrical and Electronics Engineers
Publication Type:
Journal Article
Citation:
IEEE Transactions on Circuits and Systems for Video Technology, 2020, 30(11), pp. 4299-4308
Issue Date:
2020
Filename: 08917665.pdf
Description: Published version
Size: 3.4 MB
Format: Adobe PDF
Abstract:
Recent advances in video captioning mainly follow an encoder-decoder (sequence-to-sequence) framework and generate captions with a recurrent neural network (RNN). However, using an RNN as the decoder (generator) tends to dilute long-term information, which weakens its ability to capture long-term dependencies. Recently, some work has demonstrated that convolutional neural networks (CNNs) can also model sequential information. Despite their strengths in representation ability and computational efficiency, CNNs have not been well exploited in video captioning, partly because of the difficulty of modeling multi-modal sequences with a CNN. In this paper, we devise a novel CNN-based encoder-decoder framework for video captioning. In particular, we first append inter-frame differences to each CNN-extracted frame feature to obtain a more discriminative representation; with that as the input, we then encode each frame into a more compact feature through a one-layer convolutional mapping, which can be viewed as a reconstruction network. In the decoding stage, we first fuse visual and lexical features and then stack multiple dilated convolutional layers to form a hierarchical decoder. Because long-term dependencies can be captured along a shorter path through the hierarchical structure, the decoder alleviates the loss of long-term information. Experiments on two benchmark datasets show that our method achieves state-of-the-art performance.
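
The record does not include code, but a rough sketch of the two components the abstract describes may help readers place them: an encoder that appends inter-frame differences before a one-layer convolutional mapping, and a decoder that fuses visual and word features and passes them through stacked causal dilated convolutions. The sketch below is written with PyTorch; the class names ConvEncoder and DilatedConvDecoder, all layer sizes, and the use of a single pooled visual vector in the decoder are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvEncoder(nn.Module):
    """Append inter-frame differences, then compress each frame with a
    one-layer convolutional mapping (illustrative sizes)."""
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.proj = nn.Conv1d(2 * feat_dim, hidden_dim, kernel_size=1)

    def forward(self, frames):                      # frames: (B, T, feat_dim)
        diff = frames[:, 1:] - frames[:, :-1]       # inter-frame differences
        diff = torch.cat([torch.zeros_like(frames[:, :1]), diff], dim=1)
        x = torch.cat([frames, diff], dim=-1)       # (B, T, 2*feat_dim)
        x = self.proj(x.transpose(1, 2))            # one-layer conv mapping
        return torch.relu(x).transpose(1, 2)        # (B, T, hidden_dim)

class DilatedConvDecoder(nn.Module):
    """Fuse visual and lexical features, then stack causal dilated
    convolutions so long-range context travels a short path."""
    def __init__(self, hidden_dim=512, vocab_size=10000, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=2, dilation=2 ** i)
            for i in range(n_layers))
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, words, visual):               # words: (B, L), visual: (B, T, H)
        v = visual.mean(dim=1, keepdim=True)        # pooled video vector (assumption)
        w = self.embed(words)                       # (B, L, H)
        h = self.fuse(torch.cat([w, v.expand_as(w)], dim=-1)).transpose(1, 2)
        for conv in self.convs:                     # hierarchical dilated stack
            pad = (conv.kernel_size[0] - 1) * conv.dilation[0]
            h = h + torch.relu(conv(F.pad(h, (pad, 0))))   # causal, residual
        return self.out(h.transpose(1, 2))          # (B, L, vocab_size)

Under this sketch, training would amount to feeding the shifted ground-truth caption as the word input (teacher forcing) and applying a cross-entropy loss over the predicted logits; doubling the dilation at each layer gives the decoder an exponentially growing receptive field, which is what lets it reach distant words through a short path.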