FASTER recurrent networks for efficient video classification

Zhu, L; Tran, D; Sevilla-Lara, L; Yang, Y; Feiszli, M; Wang, H

Publication Type: Conference Proceeding
Citation: AAAI 2020 - 34th AAAI Conference on Artificial Intelligence, 2020, pp. 13098-13105
Issue Date: 2020-01-01

Closed Access

Filename	Description	Size	Format
7012-Article Text-10241-1-10-20200525.pdf	Published version	703.88 kB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Zhu, L https://orcid.org/0000-0002-4093-7557
dc.contributor.author	Tran, D
dc.contributor.author	Sevilla-Lara, L
dc.contributor.author	Yang, Y https://orcid.org/0000-0002-0512-880X
dc.contributor.author	Feiszli, M
dc.contributor.author	Wang, H
dc.date.accessioned	2021-06-25T05:10:09Z
dc.date.available	2021-06-25T05:10:09Z
dc.date.issued	2020-01-01
dc.identifier.citation	AAAI 2020 - 34th AAAI Conference on Artificial Intelligence, 2020, pp. 13098-13105
dc.identifier.isbn	9781577358350
dc.identifier.uri	http://hdl.handle.net/10453/149756
dc.description.abstract	Typical video classification methods often divide a video into short clips, do inference on each clip independently, then aggregate the clip-level predictions to generate the video-level results. However, processing visually similar clips independently ignores the temporal structure of the video sequence, and increases the computational cost at inference time. In this paper, we propose a novel framework named FASTER, i.e., Feature Aggregation for Spatio-TEmporal Redundancy. FASTER aims to leverage the redundancy between neighboring clips and reduce the computational cost by learning to aggregate the predictions from models of different complexities. The FASTER framework can integrate high quality representations from expensive models to capture subtle motion information and lightweight representations from cheap models to cover scene changes in the video. A new recurrent network (i.e., FAST-GRU) is designed to aggregate the mixture of different representations. Compared with existing approaches, FASTER can reduce the FLOPs by over 10× while maintaining the state-of-the-art accuracy across popular datasets, such as Kinetics, UCF-101 and HMDB-51.
dc.language	en
dc.relation.ispartof	AAAI 2020 - 34th AAAI Conference on Artificial Intelligence
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	FASTER recurrent networks for efficient video classification
dc.type	Conference Proceeding
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	closed_access	*
dc.date.updated	2021-06-25T05:10:08Z
pubs.publication-status	Published

Abstract:

Typical video classification methods often divide a video into short clips, do inference on each clip independently, then aggregate the clip-level predictions to generate the video-level results. However, processing visually similar clips independently ignores the temporal structure of the video sequence, and increases the computational cost at inference time. In this paper, we propose a novel framework named FASTER, i.e., Feature Aggregation for Spatio-TEmporal Redundancy. FASTER aims to leverage the redundancy between neighboring clips and reduce the computational cost by learning to aggregate the predictions from models of different complexities. The FASTER framework can integrate high quality representations from expensive models to capture subtle motion information and lightweight representations from cheap models to cover scene changes in the video. A new recurrent network (i.e., FAST-GRU) is designed to aggregate the mixture of different representations. Compared with existing approaches, FASTER can reduce the FLOPs by over 10× while maintaining the state-of-the-art accuracy across popular datasets, such as Kinetics, UCF-101 and HMDB-51.
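The cost argument in the abstract can be illustrated with a small sketch. It assumes a fixed schedule in which every `period`-th clip goes to an expensive model and the remaining clips go to a cheap one, and it folds the per-clip features into one video-level vector with a simple recurrent update. All names (`clip_schedule`, `total_cost`, `aggregate`), the cost units, and the schedule are hypothetical choices for illustration. The aggregator is a plain exponential moving average standing in for a learned recurrent unit; it is not the FAST-GRU described in the paper, whose details this record does not give.

```python
from typing import List, Sequence

Vector = List[float]


def clip_schedule(num_clips: int, period: int) -> List[str]:
    """Mark every `period`-th clip (starting at clip 0) for the expensive model.

    The fixed period is an assumption made for this sketch.
    """
    return ["expensive" if i % period == 0 else "cheap" for i in range(num_clips)]


def total_cost(schedule: Sequence[str], cost_expensive: float, cost_cheap: float) -> float:
    """Sum per-clip cost in arbitrary units (hypothetical, not measured FLOPs)."""
    return sum(cost_expensive if kind == "expensive" else cost_cheap for kind in schedule)


def aggregate(features: Sequence[Vector], keep: float = 0.5) -> Vector:
    """Blend each clip feature into a running state, one clip at a time.

    This is an exponential moving average used as a stand-in for a learned
    recurrent aggregator. It is NOT the paper's FAST-GRU.
    """
    state = list(features[0])
    for feat in features[1:]:
        state = [keep * s + (1.0 - keep) * f for s, f in zip(state, feat)]
    return state


if __name__ == "__main__":
    # Hypothetical costs: the expensive model is 10x the cheap one.
    schedule = clip_schedule(num_clips=8, period=4)
    baseline = total_cost(["expensive"] * 8, cost_expensive=10.0, cost_cheap=1.0)
    mixed = total_cost(schedule, cost_expensive=10.0, cost_cheap=1.0)
    print(schedule)
    print(f"baseline={baseline}, mixed={mixed}, ratio={baseline / mixed:.2f}")
```

In this toy setting the saving depends only on how often the expensive model runs and on the cost gap between the two models. The reduction of over 10× reported in the abstract comes from the authors' own models and measurements, which this sketch does not reproduce.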

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/149756