Describing video with attention-based bidirectional LSTM

Bin, Y; Yang, Y; Shen, F; Xie, N; Shen, HT; Li, X

Publication Type: Journal Article
Citation: IEEE Transactions on Cybernetics, 2019, 49 (7), pp. 2631-2641
Issue Date: 2019-07-01

Closed Access

Filename	Description	Size
Describing Video With Attention-Based Bidirectional LSTM.pdf	Published Version	1.56 MB

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Bin, Y	en_US
dc.contributor.author	Yang, Y	en_US
dc.contributor.author	Shen, F	en_US
dc.contributor.author	Xie, N	en_US
dc.contributor.author	Shen, HT https://orcid.org/0000-0002-2999-2088	en_US
dc.contributor.author	Li, X	en_US
dc.date.issued	2019-07-01	en_US
dc.identifier.citation	IEEE Transactions on Cybernetics, 2019, 49 (7), pp. 2631 - 2641	en_US
dc.identifier.issn	2168-2267	en_US
dc.identifier.uri	http://hdl.handle.net/10453/135069
dc.description.abstract	© 2013 IEEE. Video captioning has been attracting broad research attention in the multimedia community. However, most existing approaches rely heavily on static visual information or only partially capture local temporal knowledge (e.g., within 16 frames), and thus can hardly describe motions accurately from a global view. In this paper, we propose a novel video captioning framework that integrates a bidirectional long short-term memory (BiLSTM) network and a soft attention mechanism to generate better global representations for videos and to enhance the recognition of lasting motions in videos. To generate video captions, we exploit another long short-term memory (LSTM) network as a decoder to fully explore global contextual information. The benefits of our proposed method are twofold: 1) the BiLSTM structure comprehensively preserves global temporal and visual information and 2) the soft attention mechanism enables the language decoder to recognize and focus on principal targets in the complex content. We verify the effectiveness of our proposed video captioning framework on two widely used benchmarks, the Microsoft Video Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT), and the experimental results demonstrate the superiority of the proposed approach over several state-of-the-art methods.	en_US
dc.relation.ispartof	IEEE Transactions on Cybernetics	en_US
dc.relation.isbasedon	10.1109/TCYB.2018.2831447	en_US
dc.subject.classification	Artificial Intelligence & Image Processing	en_US
dc.title	Describing video with attention-based bidirectional LSTM	en_US
dc.type	Journal Article
utslib.citation.volume	7	en_US
utslib.citation.volume	49	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
utslib.for	0102 Applied Mathematics	en_US
utslib.for	0906 Electrical and Electronic Engineering	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Software
utslib.copyright.status	closed_access
pubs.issue	7	en_US
pubs.publication-status	Published	en_US
pubs.volume	49	en_US

Abstract:

© 2013 IEEE. Video captioning has been attracting broad research attention in the multimedia community. However, most existing approaches rely heavily on static visual information or only partially capture local temporal knowledge (e.g., within 16 frames), and thus can hardly describe motions accurately from a global view. In this paper, we propose a novel video captioning framework that integrates a bidirectional long short-term memory (BiLSTM) network and a soft attention mechanism to generate better global representations for videos and to enhance the recognition of lasting motions in videos. To generate video captions, we exploit another long short-term memory (LSTM) network as a decoder to fully explore global contextual information. The benefits of our proposed method are twofold: 1) the BiLSTM structure comprehensively preserves global temporal and visual information and 2) the soft attention mechanism enables the language decoder to recognize and focus on principal targets in the complex content. We verify the effectiveness of our proposed video captioning framework on two widely used benchmarks, the Microsoft Video Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT), and the experimental results demonstrate the superiority of the proposed approach over several state-of-the-art methods.
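
As a rough illustration of the soft attention step mentioned in the abstract, the sketch below computes attention weights over per-frame encoder states and forms a weighted context vector for one decoding step. It is not the authors' implementation: the function names (softmax, soft_attention), the dot-product scoring, and the toy 2-dimensional states are all assumptions chosen for brevity, and the abstract does not specify how scores are computed. In a full model, the per-frame states would come from the BiLSTM encoder and the context vector would feed the LSTM decoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(encoder_states, decoder_state):
    """Return (weights, context) for a single decoding step.

    encoder_states: list of per-frame vectors, standing in for the
        concatenated forward and backward BiLSTM states (hypothetical values).
    decoder_state: the decoder's current hidden vector.
    Scoring uses a plain dot product, which is an assumption of this sketch.
    """
    # One relevance score per frame.
    scores = [
        sum(h_k * s_k for h_k, s_k in zip(h, decoder_state))
        for h in encoder_states
    ]
    # Normalise scores so the weights are positive and sum to 1.
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the frame states.
    dim = len(encoder_states[0])
    context = [
        sum(w * h[k] for w, h in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


# Toy example: three frames with 2-dimensional states.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = soft_attention(frames, state)
```

Because the weights are a softmax over all frames, every frame contributes to the context vector, and frames whose states align with the decoder state contribute most. This is what makes the attention "soft" rather than a hard selection of a single frame.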

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/135069