Hierarchical recurrent neural encoder for video representation with application to captioning

Pan, P; Xu, Z; Yang, Y; Wu, F; Zhuang, Y

Publication Type: Conference Proceeding
Citation: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016, 2016-December, pp. 1029-1038
Issue Date: 2016-12-09

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Pan, P	en_US
dc.contributor.author	Xu, Z	en_US
dc.contributor.author	Yang, Y https://orcid.org/0000-0001-5528-0546	en_US
dc.contributor.author	Wu, F	en_US
dc.contributor.author	Zhuang, Y	en_US
dc.date.issued	2016-12-09	en_US
dc.identifier.citation	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016, 2016-December pp. 1029 - 1038	en_US
dc.identifier.isbn	9781467388504	en_US
dc.identifier.issn	1063-6919	en_US
dc.identifier.uri	http://hdl.handle.net/10453/121790
dc.description.abstract	© 2016 IEEE. Recently, deep learning approaches, especially deep Convolutional Neural Networks (ConvNets), have achieved overwhelming accuracy with fast processing speed for image classification. Incorporating temporal structure with deep ConvNets for video representation becomes a fundamental problem for video content analysis. In this paper, we propose a new approach, namely Hierarchical Recurrent Neural Encoder (HRNE), to exploit temporal information of videos. Compared to recent video representation inference approaches, this paper makes the following three contributions. First, our HRNE is able to efficiently exploit video temporal structure in a longer range by reducing the length of input information flow, and compositing multiple consecutive inputs at a higher level. Second, computation operations are significantly lessened while attaining more non-linearity. Third, HRNE is able to uncover temporal transitions between frame chunks with different granularities, i.e., it can model the temporal transitions between frames as well as the transitions between segments. We apply the new method to video captioning where temporal information plays a crucial role. Experiments demonstrate that our method outperforms the state-of-the-art on video captioning benchmarks.	en_US
dc.relation	http://purl.org/au-research/grants/arc/DP150103008
dc.relation.ispartof	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition	en_US
dc.relation.isbasedon	10.1109/CVPR.2016.117	en_US
dc.title	Hierarchical recurrent neural encoder for video representation with application to captioning	en_US
dc.type	Conference Proceeding
utslib.citation.volume	2016-December	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
pubs.organisational-group	/University of Technology Sydney/Students
utslib.copyright.status	open_access
pubs.publication-status	Published	en_US
pubs.volume	2016-December	en_US

Abstract:

© 2016 IEEE. Recently, deep learning approaches, especially deep Convolutional Neural Networks (ConvNets), have achieved overwhelming accuracy with fast processing speed for image classification. Incorporating temporal structure with deep ConvNets for video representation becomes a fundamental problem for video content analysis. In this paper, we propose a new approach, namely Hierarchical Recurrent Neural Encoder (HRNE), to exploit temporal information of videos. Compared to recent video representation inference approaches, this paper makes the following three contributions. First, our HRNE is able to efficiently exploit video temporal structure in a longer range by reducing the length of input information flow, and compositing multiple consecutive inputs at a higher level. Second, computation operations are significantly lessened while attaining more non-linearity. Third, HRNE is able to uncover temporal transitions between frame chunks with different granularities, i.e., it can model the temporal transitions between frames as well as the transitions between segments. We apply the new method to video captioning where temporal information plays a crucial role. Experiments demonstrate that our method outperforms the state-of-the-art on video captioning benchmarks.
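
Illustrative sketch (not part of the repository record or the paper's code): the first contribution in the abstract, shortening the input information flow by compositing consecutive inputs at a higher level, can be pictured with a two-level recurrent encoder. The sketch below is a minimal illustration under stated assumptions: it uses a plain tanh recurrence as a stand-in for whatever recurrent unit the paper uses, random untrained weights, and hypothetical names (`rnn_step`, `run_rnn`, `hierarchical_encode`, `chunk_len`) that do not come from the source.

```python
import math
import random


def rnn_step(h, x, w_h, w_x):
    """One plain tanh recurrent step: h' = tanh(w_h @ h + w_x @ x)."""
    return [
        math.tanh(
            sum(w * v for w, v in zip(row_h, h))
            + sum(w * v for w, v in zip(row_x, x))
        )
        for row_h, row_x in zip(w_h, w_x)
    ]


def run_rnn(inputs, w_h, w_x):
    """Run the recurrence over a sequence and return the final hidden state."""
    h = [0.0] * len(w_h)
    for x in inputs:
        h = rnn_step(h, x, w_h, w_x)
    return h


def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def hierarchical_encode(frames, chunk_len, low, high):
    """Encode each chunk of frames with the lower RNN, then run the
    higher RNN over the per-chunk summaries."""
    chunks = [frames[i:i + chunk_len] for i in range(0, len(frames), chunk_len)]
    summaries = [run_rnn(chunk, *low) for chunk in chunks]
    return run_rnn(summaries, *high), len(chunks)


# Hypothetical sizes chosen only for illustration.
rng = random.Random(0)
feat_dim, hid_dim, num_frames, chunk_len = 4, 3, 16, 4
frames = [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(num_frames)]

# Each level is a (recurrent weights, input weights) pair.
low = (random_matrix(hid_dim, hid_dim, rng), random_matrix(hid_dim, feat_dim, rng))
high = (random_matrix(hid_dim, hid_dim, rng), random_matrix(hid_dim, hid_dim, rng))

video_vec, num_chunks = hierarchical_encode(frames, chunk_len, low, high)

flat_path = num_frames              # recurrent steps from first frame to output in a flat RNN
hier_path = chunk_len + num_chunks  # steps inside one chunk plus steps across chunks
print(len(video_vec), flat_path, hier_path)  # prints: 3 16 8
```

With 16 frames and chunks of 4, the longest chain of recurrent steps between the first frame and the final representation drops from 16 to 8 in this toy setup, which is one way to read the abstract's claim about modeling transitions both between frames and between segments.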

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/121790