Action Keypoint Network for Efficient Video Recognition.

Chen, X; Han, Y; Wang, X; Sun, Y; Yang, Y

Action Keypoint Network for Efficient Video Recognition.

Chen, X Han, Y Wang, X Sun, Y Yang, Y

Permalink

Publisher:: Institute of Electrical and Electronics Engineers (IEEE)
Publication Type:: Journal Article
Citation:: IEEE Trans Image Process, 2022, PP, pp. 4980-4993
Issue Date:: 2022-07-21

Closed Access

	Filename	Description	Size
	Action Keypoint Network for Efficient Video Recognition..pdf	Published version	3.45 MB	Adobe PDF	View/Open

Copyright Clearance Process

Recently Added
In Progress
Closed Access

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Chen, X
dc.contributor.author	Han, Y
dc.contributor.author	Wang, X
dc.contributor.author	Sun, Y
dc.contributor.author	Yang, Y https://orcid.org/0000-0002-0512-880X
dc.date.accessioned	2023-03-20T00:08:22Z
dc.date.available	2023-03-20T00:08:22Z
dc.date.issued	2022-07-21
dc.identifier.citation	IEEE Trans Image Process, 2022, PP, pp. 4980-4993
dc.identifier.issn	1057-7149
dc.identifier.issn	1941-0042
dc.identifier.uri	http://hdl.handle.net/10453/167647
dc.description.abstract	Reducing redundancy is crucial for improving the efficiency of video recognition models. An effective approach is to select informative content from the holistic video, yielding a popular family of dynamic video recognition methods. However, existing dynamic methods focus on either temporal or spatial selection independently while neglecting a reality that the redundancies are usually spatial and temporal, simultaneously. Moreover, their selected content is usually cropped with fixed shapes (e.g., temporally-cropped frames, spatially-cropped patches), while the realistic distribution of informative content can be much more diverse. With these two insights, this paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net). From different frames and positions, AK-Net selects some informative points scattered in arbitrary-shaped regions as a set of "action keypoints" and then transforms the video recognition into point cloud classification. More concretely, AK-Net has two steps, i.e., the keypoint selection and the point cloud classification. First, it inputs the video into a baseline network and outputs a feature map from an intermediate layer. We view each pixel on this feature map as a spatial-temporal point and select some informative keypoints using self-attention. Second, AK-Net devises a ranking criterion to arrange the keypoints into an ordered 1D sequence. Since the video is represented with a 1D sequence after the specified layer, AK-Net transforms the subsequent layers into a point cloud classification sub-net by compacting the original 2D convolutional kernels into 1D kernels. Consequentially, AK-Net brings two-fold benefits for efficiency: The keypoint selection step collects informative content within arbitrary shapes and increases the efficiency for modeling spatial-temporal dependencies, while the point cloud classification step further reduces the computational cost by compacting the convolutional kernels. Experimental results show that AK-Net can consistently improve the efficiency and performance of baseline methods on several video recognition benchmarks.
dc.format	Print-Electronic
dc.language	eng
dc.publisher	Institute of Electrical and Electronics Engineers (IEEE)
dc.relation.ispartof	IEEE Trans Image Process
dc.relation.isbasedon	10.1109/TIP.2022.3191461
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	0801 Artificial Intelligence and Image Processing, 0906 Electrical and Electronic Engineering, 1702 Cognitive Sciences
dc.subject.classification	Artificial Intelligence & Image Processing
dc.title	Action Keypoint Network for Efficient Video Recognition.
dc.type	Journal Article
utslib.citation.volume	PP
utslib.location.activity	United States
utslib.for	0801 Artificial Intelligence and Image Processing
utslib.for	0906 Electrical and Electronic Engineering
utslib.for	1702 Cognitive Sciences
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
utslib.copyright.status	closed_access	*
dc.date.updated	2023-03-20T00:08:20Z
pubs.publication-status	Published online
pubs.volume	PP

Abstract:

Reducing redundancy is crucial for improving the efficiency of video recognition models. An effective approach is to select informative content from the holistic video, yielding a popular family of dynamic video recognition methods. However, existing dynamic methods focus on either temporal or spatial selection independently while neglecting a reality that the redundancies are usually spatial and temporal, simultaneously. Moreover, their selected content is usually cropped with fixed shapes (e.g., temporally-cropped frames, spatially-cropped patches), while the realistic distribution of informative content can be much more diverse. With these two insights, this paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net). From different frames and positions, AK-Net selects some informative points scattered in arbitrary-shaped regions as a set of "action keypoints" and then transforms the video recognition into point cloud classification. More concretely, AK-Net has two steps, i.e., the keypoint selection and the point cloud classification. First, it inputs the video into a baseline network and outputs a feature map from an intermediate layer. We view each pixel on this feature map as a spatial-temporal point and select some informative keypoints using self-attention. Second, AK-Net devises a ranking criterion to arrange the keypoints into an ordered 1D sequence. Since the video is represented with a 1D sequence after the specified layer, AK-Net transforms the subsequent layers into a point cloud classification sub-net by compacting the original 2D convolutional kernels into 1D kernels. Consequentially, AK-Net brings two-fold benefits for efficiency: The keypoint selection step collects informative content within arbitrary shapes and increases the efficiency for modeling spatial-temporal dependencies, while the point cloud classification step further reduces the computational cost by compacting the convolutional kernels. Experimental results show that AK-Net can consistently improve the efficiency and performance of baseline methods on several video recognition benchmarks.

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/167647