Video Captioning by Adversarial LSTM

Yang, Y; Zhou, J; Ai, J; Bin, Y; Hanjalic, A; Shen, HT; Ji, Y

Publication Type: Journal Article
Citation: IEEE Transactions on Image Processing, 2018, 27 (11), pp. 5600-5611
Issue Date: 2018-11-01

Closed Access

	Filename	Description	Size	Format
	08410586.pdf	Published Version	2.13 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Yang, Y	en_US
dc.contributor.author	Zhou, J	en_US
dc.contributor.author	Ai, J	en_US
dc.contributor.author	Bin, Y	en_US
dc.contributor.author	Hanjalic, A	en_US
dc.contributor.author	Shen, HT	en_US
dc.contributor.author	Ji, Y	en_US
dc.date.issued	2018-11-01	en_US
dc.identifier.citation	IEEE Transactions on Image Processing, 2018, 27 (11), pp. 5600 - 5611	en_US
dc.identifier.issn	1057-7149	en_US
dc.identifier.uri	http://hdl.handle.net/10453/131424
dc.relation.ispartof	IEEE Transactions on Image Processing	en_US
dc.relation.isbasedon	10.1109/TIP.2018.2855422	en_US
dc.subject.classification	Artificial Intelligence & Image Processing	en_US
dc.title	Video Captioning by Adversarial LSTM	en_US
dc.type	Journal Article
utslib.citation.volume	11	en_US
utslib.citation.volume	27	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
utslib.for	0906 Electrical and Electronic Engineering	en_US
utslib.for	1702 Cognitive Sciences	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Software
utslib.copyright.status	closed_access
pubs.issue	11	en_US
pubs.publication-status	Published	en_US
pubs.volume	27	en_US

Abstract:

© 1992-2012 IEEE. In this paper, we propose a novel approach to video captioning based on adversarial learning and long short-term memory (LSTM). With this approach, we aim to compensate for the deficiencies of LSTM-based video captioning methods, which generally show potential to handle the temporal nature of video data effectively when generating captions but also typically suffer from exponential error accumulation. Specifically, we adopt a standard generative adversarial network (GAN) architecture, characterized by an interplay of two competing processes: a 'generator' that generates textual sentences given the visual content of a video, and a 'discriminator' that controls the accuracy of the generated sentences. The discriminator acts as an 'adversary' toward the generator, and through its controlling mechanism it helps the generator become more accurate. For the generator module, we take an existing video captioning concept using an LSTM network. For the discriminator, we propose a novel realization specifically tuned for the video captioning problem that takes both the sentences and the video features as input. This leads to our proposed LSTM-GAN system architecture, which we show experimentally to significantly outperform existing methods on standard public datasets.
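
The generator/discriminator interplay described in the abstract can be illustrated with a minimal sketch of the standard GAN objective. This is not the authors' implementation: the discriminator below is a toy logistic scorer over a concatenated (video, sentence) feature vector, and every name, feature vector and weight in it (`discriminator`, `video`, `human_caption`, `weights`, and so on) is hypothetical. The generator loss uses the commonly used non-saturating form, which is an assumption of this sketch rather than something the record states.

```python
import math

EPS = 1e-12  # guards against log(0); a detail of this sketch only


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def discriminator(video_feat, sent_feat, weights, bias=0.0):
    """Toy stand-in: probability that a (video, sentence) pair is real.

    It sees both the video features and the sentence features, mirroring the
    abstract's description of the discriminator's inputs. The linear scorer
    itself is hypothetical and much simpler than a real model.
    """
    joint = list(video_feat) + list(sent_feat)
    if len(weights) != len(joint):
        raise ValueError("weights must match len(video_feat) + len(sent_feat)")
    logit = sum(w * x for w, x in zip(weights, joint)) + bias
    return sigmoid(logit)


def discriminator_loss(p_real: float, p_fake: float) -> float:
    """Standard GAN discriminator loss: -[log D(real) + log(1 - D(fake))]."""
    return -(math.log(p_real + EPS) + math.log(1.0 - p_fake + EPS))


def generator_loss(p_fake: float) -> float:
    """Non-saturating generator loss: -log D(fake)."""
    return -math.log(p_fake + EPS)


# Hypothetical features: one video vector, one human caption, one generated caption.
video = [0.2, -0.1, 0.4]
human_caption = [1.0, 0.0]
generated_caption = [0.0, 1.0]
weights = [0.0, 0.0, 0.0, 2.0, -2.0]  # made-up discriminator parameters

p_real = discriminator(video, human_caption, weights)
p_fake = discriminator(video, generated_caption, weights)
print(f"D(real)={p_real:.3f}  D(fake)={p_fake:.3f}")
print(f"discriminator loss={discriminator_loss(p_real, p_fake):.3f}")
print(f"generator loss={generator_loss(p_fake):.3f}")
```

In this toy setting the generator's loss falls as the discriminator assigns its captions a higher probability of being real. That is the sense in which the adversary pushes the generator toward more accurate sentences.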

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/131424