Watching a small portion could be as good as watching all: Towards efficient video classification

Fan, H; Xu, Z; Zhu, L; Yan, C; Ge, J; Yang, Y

Watching a small portion could be as good as watching all: Towards efficient video classification

Fan, H Xu, Z Zhu, L

Yan, C Ge, J Yang, Y

Permalink

Publication Type:: Conference Proceeding
Citation:: IJCAI International Joint Conference on Artificial Intelligence, 2018, 2018-July pp. 705 - 711
Issue Date:: 2018-01-01

Open Access

Copyright Clearance Process

Recently Added
In Progress
Open Access

This item is open access.

Download Accepted ManuscriptAdobe PDF (521.49 kB)

View statistics

Full metadata record

Field	Value	Language
dc.contributor.author	Fan, H	en_US
dc.contributor.author	Xu, Z	en_US
dc.contributor.author	Zhu, L https://orcid.org/0000-0002-4093-7557	en_US
dc.contributor.author	Yan, C	en_US
dc.contributor.author	Ge, J	en_US
dc.contributor.author	Yang, Y https://orcid.org/0000-0001-5528-0546	en_US
dc.date.issued	2018-01-01	en_US
dc.identifier.citation	IJCAI International Joint Conference on Artificial Intelligence, 2018, 2018-July pp. 705 - 711	en_US
dc.identifier.isbn	9780999241127	en_US
dc.identifier.issn	1045-0823	en_US
dc.identifier.uri	http://hdl.handle.net/10453/130103
dc.description.abstract	© 2018 International Joint Conferences on Artificial Intelligence. All right reserved. We aim to significantly reduce the computational cost for classification of temporally untrimmed videos while retaining similar accuracy. Existing video classification methods sample frames with a predefined frequency over entire video. Differently, we propose an end-to-end deep reinforcement approach which enables an agent to classify videos by watching a very small portion of frames like what we do. We make two main contributions. First, information is not equally distributed in video frames along time. An agent needs to watch more carefully when a clip is informative and skip the frames if they are redundant or irrelevant. The proposed approach enables the agent to adapt sampling rate to video content and skip most of the frames without the loss of information. Second, in order to have a confident decision, the number of frames that should be watched by an agent varies greatly from one video to another. We incorporate an adaptive stop network to measure confidence score and generate timely trigger to stop the agent watching videos, which improves efficiency without loss of accuracy. Our approach reduces the computational cost significantly for the large-scale YouTube-8M dataset, while the accuracy remains the same.	en_US
dc.relation.ispartof	IJCAI International Joint Conference on Artificial Intelligence	en_US
dc.title	Watching a small portion could be as good as watching all: Towards efficient video classification	en_US
dc.type	Conference Proceeding
utslib.citation.volume	2018-July	en_US
utslib.for	0802 Computation Theory and Mathematics	en_US
utslib.for	0806 Information Systems	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
pubs.organisational-group	/University of Technology Sydney/Students
utslib.copyright.status	open_access
pubs.publication-status	Published	en_US
pubs.volume	2018-July	en_US

Abstract:

© 2018 International Joint Conferences on Artificial Intelligence. All right reserved. We aim to significantly reduce the computational cost for classification of temporally untrimmed videos while retaining similar accuracy. Existing video classification methods sample frames with a predefined frequency over entire video. Differently, we propose an end-to-end deep reinforcement approach which enables an agent to classify videos by watching a very small portion of frames like what we do. We make two main contributions. First, information is not equally distributed in video frames along time. An agent needs to watch more carefully when a clip is informative and skip the frames if they are redundant or irrelevant. The proposed approach enables the agent to adapt sampling rate to video content and skip most of the frames without the loss of information. Second, in order to have a confident decision, the number of frames that should be watched by an agent varies greatly from one video to another. We incorporate an adaptive stop network to measure confidence score and generate timely trigger to stop the agent watching videos, which improves efficiency without loss of accuracy. Our approach reduces the computational cost significantly for the large-scale YouTube-8M dataset, while the accuracy remains the same.

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/130103