Beyond Frame-level CNN: Saliency-Aware 3-D CNN with LSTM for Video Action Recognition

Wang, X; Gao, L; Song, J; Shen, H

Publication Type: Journal Article
Citation: IEEE Signal Processing Letters, 2017, 24 (4), pp. 510-514
Issue Date: 2017-04-01

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Wang, X	en_US
dc.contributor.author	Gao, L	en_US
dc.contributor.author	Song, J	en_US
dc.contributor.author	Shen, H	en_US
dc.date.issued	2017-04-01	en_US
dc.identifier.citation	IEEE Signal Processing Letters, 2017, 24 (4), pp. 510 - 514	en_US
dc.identifier.issn	1070-9908	en_US
dc.identifier.uri	http://hdl.handle.net/10453/123637
dc.relation.ispartof	IEEE Signal Processing Letters	en_US
dc.relation.isbasedon	10.1109/LSP.2016.2611485	en_US
dc.subject.classification	Networking & Telecommunications	en_US
dc.title	Beyond Frame-level CNN: Saliency-Aware 3-D CNN with LSTM for Video Action Recognition	en_US
dc.type	Journal Article
utslib.citation.volume	4	en_US
utslib.citation.volume	24	en_US
utslib.for	0906 Electrical and Electronic Engineering	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
utslib.for	1005 Communications Technologies	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Software
utslib.copyright.status	closed_access
pubs.issue	4	en_US
pubs.publication-status	Published	en_US
pubs.volume	24	en_US

Abstract:

© 2016 IEEE. Human activity recognition in videos with convolutional neural network (CNN) features has received increasing attention in multimedia understanding. Treating a video as a sequence of frames, recent work set new records on several benchmark datasets by feeding frame-level CNN sequence features to a long short-term memory (LSTM) model for video activity recognition. This recurrent model-based visual recognition pipeline is a natural choice for perceptual problems with time-varying visual input or sequential outputs. However, because the pipeline feeds frame-level CNN sequence features to the LSTM, it may fail to capture the rich motion information in adjacent frames or across multiple clips. Furthermore, an activity is performed by one or more subjects, so it is important to attend to salient features instead of mapping an entire frame into a static representation. To tackle these issues, we propose a novel pipeline, saliency-aware three-dimensional (3-D) CNN with LSTM, for video action recognition by integrating LSTM with saliency-aware deep 3-D CNN features on video shots. Specifically, we first apply saliency-aware methods to generate saliency-aware videos. Then, we design an end-to-end pipeline by integrating 3-D CNN with LSTM, followed by a time series pooling layer and a softmax layer to predict the activities. Notably, we set a new record on two benchmark datasets, i.e., UCF101 with 13,320 videos and HMDB-51 with 6,766 videos. Our method outperforms the state-of-the-art end-to-end methods of action recognition by 3.8% and 3.2%, respectively, on these two datasets.
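
The data flow described in the abstract (saliency-aware frames, video shots, per-shot recurrent outputs, time series pooling, softmax) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The 3-D CNN and the LSTM are not modelled at all: the input to `predict` merely stands in for per-shot LSTM output vectors. All function names are hypothetical, and the 16-frame shot length, the per-pixel saliency weighting, and the use of mean pooling over time are assumptions, since the abstract does not specify them.

```python
import math
from typing import List, Sequence, Tuple


def split_into_clips(num_frames: int, clip_len: int = 16) -> List[range]:
    """Split frame indices into non-overlapping shots.

    Assumption: fixed-length shots of `clip_len` frames; an incomplete
    trailing shot is dropped.
    """
    return [range(start, start + clip_len)
            for start in range(0, num_frames - clip_len + 1, clip_len)]


def apply_saliency(frame: Sequence[Sequence[float]],
                   saliency: Sequence[Sequence[float]]) -> List[List[float]]:
    """Weight every pixel by its saliency value in [0, 1].

    Assumption: a simple element-wise product stands in for the
    "saliency-aware methods" mentioned in the abstract.
    """
    return [[pixel * weight for pixel, weight in zip(pixel_row, weight_row)]
            for pixel_row, weight_row in zip(frame, saliency)]


def mean_pool_over_time(step_outputs: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-shot output vectors (assumed form of time series pooling)."""
    count = len(step_outputs)
    return [sum(column) / count for column in zip(*step_outputs)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over class scores."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def predict(step_outputs: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Pool the per-shot outputs over time, then return (class index, probabilities)."""
    probs = softmax(mean_pool_over_time(step_outputs))
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


if __name__ == "__main__":
    # A 40-frame video yields two full 16-frame shots; 8 frames are left over.
    print(len(split_into_clips(40)))
    # Two placeholder per-shot score vectors over three activity classes.
    label, probs = predict([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    print(label, probs)
```

In a real system the placeholder score vectors would come from a trained 3-D CNN followed by an LSTM; pooling over shots before the softmax lets a video of any length produce a single prediction.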

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/123637