Point 4D Transformer Networks for Spatio-Temporal Modeling in Point Cloud Videos

Publisher:
IEEE
Publication Type:
Conference Proceeding
Citation:
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 14199-14208
Issue Date:
2021-11-13
Point cloud videos exhibit irregularities and lack of order along the spatial dimension, where points emerge inconsistently across different frames. To capture the dynamics in point cloud videos, point tracking is usually employed. However, as points may flow in and out across frames, computing accurate point trajectories is extremely difficult. Moreover, tracking usually relies on point colors and thus may fail to handle colorless point clouds. In this paper, to avoid point tracking, we propose a novel Point 4D Transformer (P4Transformer) network to model raw point cloud videos. Specifically, P4Transformer consists of (i) a point 4D convolution to embed the spatio-temporal local structures present in a point cloud video and (ii) a transformer to capture appearance and motion information across the entire video by performing self-attention on the embedded local features. In this fashion, related or similar local areas are merged by attention weights rather than by explicit tracking. Extensive experiments on four benchmarks, covering 3D action recognition and 4D semantic segmentation, demonstrate the effectiveness of P4Transformer for point cloud video modeling.
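The core idea of step (ii), merging related local areas via attention weights instead of point trajectories, can be illustrated with a minimal single-head self-attention sketch over embedded local features. This is a generic NumPy illustration under assumed shapes, not the paper's implementation; the function names and toy dimensions are hypothetical.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over N local spatio-temporal tokens.

    X: (N, d) embedded local features, one token per 4D anchor point
    (here produced by a stand-in random embedding, not the paper's
    point 4D convolution). Related local areas are merged via the
    (N, N) attention weights rather than by explicit point tracking.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])            # pairwise similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # row-wise softmax
    return weights @ V                                # attention-weighted merge

# Hypothetical toy setup: 6 local features of dimension 8.
rng = np.random.default_rng(0)
N, d = 6, 8
X = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (6, 8): each token is now a weighted mix of all tokens
```

Because every token attends to every other token across both space and time, information from a region that "flows out" of one frame can still influence features in later frames without any correspondence being computed.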