Dynamic Inference: A New Approach Toward Efficient Video Action Recognition

Wu, W; He, D; Tan, X; Chen, S; Yang, Y; Wen, S

Publisher: IEEE
Publication Type: Conference Proceeding
Citation: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020, 2020-June, pp. 2890-2898
Issue Date: 2020-07-28

This item is open access.

The embargo period expires on 28 Jul 2022

Full metadata record

Field	Value	Language
dc.contributor.author	Wu, W
dc.contributor.author	He, D
dc.contributor.author	Tan, X
dc.contributor.author	Chen, S
dc.contributor.author	Yang, Y https://orcid.org/0000-0002-0512-880X
dc.contributor.author	Wen, S
dc.date	2020-06-14
dc.date.accessioned	2021-05-05T08:40:54Z
dc.date.available	2021-05-05T08:40:54Z
dc.date.issued	2020-07-28
dc.identifier.citation	2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020, 2020-June, pp. 2890-2898
dc.identifier.isbn	9781728193601
dc.identifier.issn	2160-7508
dc.identifier.issn	2160-7516
dc.identifier.uri	http://hdl.handle.net/10453/148728
dc.language	en
dc.publisher	IEEE
dc.relation.ispartof	2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
dc.relation.ispartof	2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops
dc.relation.isbasedon	10.1109/cvprw50498.2020.00346
dc.rights	info:eu-repo/semantics/embargoedAccess
dc.title	Dynamic Inference: A New Approach Toward Efficient Video Action Recognition
dc.type	Conference Proceeding
utslib.citation.volume	2020-June
utslib.location.activity	Seattle, WA, USA
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
utslib.copyright.status	open_access	*
utslib.copyright.embargo	2022-07-28T00:00:00+1000Z
dc.date.updated	2021-05-05T08:40:53Z
pubs.finish-date	2020-06-19
pubs.publication-status	Published
pubs.start-date	2020-06-14
pubs.volume	2020-June

Abstract:

Though action recognition in videos has achieved great success recently, it remains a challenging task due to its massive computational cost. Designing lightweight networks is a possible solution, but it may degrade recognition performance. In this paper, we propose a general dynamic inference idea that improves inference efficiency by leveraging the variation in the distinguishability of different videos. Dynamic inference can be applied along the network depth, along the number of input video frames, or jointly in an input-wise and network depth-wise manner. In a nutshell, we treat the input frames and the network depth of the computational graph as a 2-dimensional grid, and several checkpoints, each with a prediction module, are placed on this grid in advance. Inference is carried out progressively on the grid by following a predefined route. Whenever the inference process reaches a checkpoint, an early prediction can be made if the early-stop criterion is met. As a proof of concept, we instantiate several dynamic inference frameworks. In these instances, we overcome the limited temporal coverage that results from an early prediction with a novel frame permutation scheme, and we alleviate the conflict between progressive computation and video temporal relation modeling by introducing the online temporal shift module. Extensive experiments are conducted to thoroughly analyze the effectiveness of our ideas and to inspire future research efforts. Results on various datasets also demonstrate the superiority of our approach.
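The checkpoint-and-route idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `spread_order`, `dynamic_inference`, and `toy_predict` are hypothetical, the early-stop criterion is assumed to be a threshold on the maximum softmax probability, and the frame ordering is an assumed coarse-to-fine permutation chosen only so that every prefix of frames spans the whole video. For simplicity the sketch also recomputes each checkpoint from scratch, whereas a progressive scheme would reuse earlier computation.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

# A checkpoint is a (num_frames, depth) position on the frames-by-depth grid.
Checkpoint = Tuple[int, int]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def spread_order(num_frames: int) -> List[int]:
    """Assumed permutation: visit frames coarse-to-fine (halving strides),
    so any prefix of the result covers the full temporal extent."""
    order: List[int] = []
    seen = set()
    stride = 1
    while stride < num_frames:
        stride *= 2
    while stride >= 1:
        for i in range(0, num_frames, stride):
            if i not in seen:
                seen.add(i)
                order.append(i)
        stride //= 2
    return order


def dynamic_inference(
    frames: Sequence[object],
    route: Sequence[Checkpoint],
    predict: Callable[[Sequence[object], int], Sequence[float]],
    threshold: float = 0.9,
) -> Tuple[int, float, Optional[Checkpoint]]:
    """Walk the predefined route and stop at the first checkpoint whose
    confidence meets the (assumed) max-probability threshold."""
    order = spread_order(len(frames))
    label, conf, used = -1, 0.0, None
    for n, depth in route:
        # Take the first n permuted frames, restored to temporal order.
        subset = [frames[i] for i in sorted(order[:n])]
        probs = softmax(predict(subset, depth))
        conf = max(probs)
        label = probs.index(conf)
        used = (n, depth)
        if conf >= threshold:
            break  # early prediction: skip the remaining checkpoints
    return label, conf, used


def toy_predict(subset: Sequence[object], depth: int) -> List[float]:
    """Stand-in for a real network: evidence for class 0 grows with the
    number of frames seen and the depth used."""
    return [0.5 * len(subset) * depth, 0.0]


if __name__ == "__main__":
    route = [(2, 1), (4, 1), (4, 2), (8, 2)]
    print(spread_order(8))
    print(dynamic_inference(list(range(8)), route, toy_predict))
```

In this toy setup, an easily distinguishable video stops at an early checkpoint and never pays for the full frame count or depth, while a video that never reaches the threshold falls through to the last checkpoint on the route.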

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/148728