Deep Hierarchical Representation of Point Cloud Videos via Spatio-Temporal Decomposition.
- Publisher:
- Institute of Electrical and Electronics Engineers
- Publication Type:
- Journal Article
- Citation:
- IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, PP, (12), pp. 9918-9930
- Issue Date:
- 2022-12-14
Closed Access
Filename | Description | Size
---|---|---
Deep Hierarchical Representation of Point Cloud Videos via Spatio-Temporal Decomposition..pdf | Published version | 3.83 MB
This item is closed access and not available.
In point cloud videos, point coordinates are irregular and unordered, but point timestamps exhibit regularity and order. Grid-based networks for conventional video processing cannot be directly used to model raw point cloud videos. Therefore, in this work, we propose a point-based network that directly handles raw point cloud videos. First, to preserve the spatio-temporal local structure of point cloud videos, we design a point tube covering a local range along the spatial and temporal dimensions. By progressively subsampling frames and points and enlarging the spatial radius as point features are fed into higher-level layers, the point tube captures video structure in a spatio-temporally hierarchical manner. Second, to reduce the impact of spatial irregularity on temporal modeling, we decompose space and time when extracting point tube representations. Specifically, a spatial operation is employed to capture the local structure of each spatial region in a tube, and a temporal operation is used to model the dynamics of the spatial regions along the tube. Empirically, the proposed network shows strong performance on 3D action recognition and 4D semantic segmentation. Theoretically, we analyse why it is necessary to decompose space and time in point cloud video modeling and why the proposed network outperforms existing methods.
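To make the decomposition concrete, below is a minimal, illustrative PyTorch sketch (not the authors' implementation) of one decomposed "point tube" layer. The class name `DecomposedTubeLayer`, the tensor shapes, and the hyper-parameters (`radius`, `n_samples`, `t_window`) are assumptions chosen for readability; the neighbour grouping uses a simple k-nearest-neighbour query as a stand-in for the paper's radius-based grouping.

```python
# Illustrative sketch only: one spatio-temporally decomposed point-tube layer.
# All names, shapes, and hyper-parameters here are assumptions, not the paper's code.
import torch
import torch.nn as nn


class DecomposedTubeLayer(nn.Module):
    def __init__(self, in_dim, mid_dim, out_dim, radius=0.5, n_samples=16, t_window=3):
        super().__init__()
        self.radius = radius        # nominal spatial radius of the tube (grows at higher layers)
        self.n_samples = n_samples  # neighbours gathered per spatial region
        self.t_window = t_window    # temporal extent of the tube (odd number of frames)
        # Spatial operation: shared MLP over (relative xyz + feature) of each neighbour
        self.spatial_mlp = nn.Sequential(
            nn.Linear(3 + in_dim, mid_dim), nn.ReLU(),
            nn.Linear(mid_dim, mid_dim), nn.ReLU(),
        )
        # Temporal operation: 1D convolution over the per-frame region descriptors
        self.temporal_conv = nn.Conv1d(mid_dim, out_dim, kernel_size=t_window)

    def forward(self, xyz, feat):
        """
        xyz:  (T, N, 3)  point coordinates of a point cloud video
        feat: (T, N, C)  per-point features, C == in_dim
        returns: (T - t_window + 1, N, out_dim) features, one tube per anchor point
                 of every valid centre frame (no frame/point subsampling in this sketch).
        """
        T, N, _ = xyz.shape
        half = self.t_window // 2
        outputs = []
        for t in range(half, T - half):                  # centre frame of the tube
            anchors = xyz[t]                             # (N, 3) tube anchor points
            frame_descs = []
            for dt in range(-half, half + 1):            # frames covered by the tube
                pts, fts = xyz[t + dt], feat[t + dt]     # (N, 3), (N, C)
                # kNN grouping (stand-in for radius-based grouping around each anchor)
                dist = torch.cdist(anchors, pts)                         # (N, N)
                idx = dist.topk(self.n_samples, largest=False).indices   # (N, K)
                nb_xyz = pts[idx] - anchors.unsqueeze(1)                 # relative coords (N, K, 3)
                nb_feat = fts[idx]                                       # (N, K, C)
                # Spatial operation: encode each region, max-pool over neighbours
                region = self.spatial_mlp(torch.cat([nb_xyz, nb_feat], dim=-1))
                frame_descs.append(region.max(dim=1).values)             # (N, mid_dim)
            # Temporal operation: model dynamics of the region descriptors along the tube
            stack = torch.stack(frame_descs, dim=-1)                     # (N, mid_dim, t_window)
            outputs.append(self.temporal_conv(stack).squeeze(-1))        # (N, out_dim)
        return torch.stack(outputs, dim=0)
```

Stacking several such layers while progressively subsampling frames and points and enlarging the spatial radius would give the hierarchical behaviour described in the abstract; the key point the sketch illustrates is that the spatial MLP never mixes features across frames, and the temporal convolution only sees per-region descriptors, so spatial irregularity does not leak into temporal modeling.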