Optimizing sparse tensor times matrix on GPUs

Ma, Y; Li, J; Wu, X; Yan, C; Sun, J; Vuduc, R

Publisher: Elsevier
Publication Type: Journal Article
Citation: Journal of Parallel and Distributed Computing, 2019, 129, pp. 99-109
Issue Date: 2019

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Ma, Y https://orcid.org/0000-0001-8884-6278
dc.contributor.author	Li, J
dc.contributor.author	Wu, X
dc.contributor.author	Yan, C
dc.contributor.author	Sun, J
dc.contributor.author	Vuduc, R
dc.date.accessioned	2020-05-07T04:42:05Z
dc.date.available	2020-05-07T04:42:05Z
dc.date.issued	2019
dc.identifier.citation	Journal of Parallel and Distributed Computing, 2019, 129, pp. 99-109
dc.identifier.issn	0743-7315
dc.identifier.issn	1096-0848
dc.identifier.uri	http://hdl.handle.net/10453/140538
dc.language	en
dc.publisher	Elsevier
dc.relation.ispartof	Journal of Parallel and Distributed Computing
dc.relation.isbasedon	10.1016/j.jpdc.2018.07.018
dc.rights	info:eu-repo/semantics/restrictedAccess
dc.subject	0803 Computer Software, 0805 Distributed Computing
dc.subject.classification	Distributed Computing
dc.title	Optimizing sparse tensor times matrix on GPUs
dc.type	Journal Article
utslib.citation.volume	129
utslib.for	0803 Computer Software
utslib.for	0805 Distributed Computing
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Software
pubs.organisational-group	/University of Technology Sydney
utslib.copyright.status	closed_access	*
pubs.consider-herdc	false
dc.date.updated	2020-05-07T04:42:03Z
pubs.publication-status	Accepted
pubs.volume	129
utslib.start-page	99

Abstract:

© 2018 Elsevier Inc. This work optimizes tensor-times-dense-matrix multiply (Ttm) for general sparse and semi-sparse tensors on CPU and NVIDIA GPU platforms. Ttm is a computational kernel in tensor-based data analytics and data mining applications, such as the popular Tucker decomposition. We first design an in-place sequential SpTtm that avoids the explicit data reorganization between a tensor and a matrix required by the conventional approach. We further optimize SpTtm on NVIDIA GPU platforms. Five approaches are developed for GPU-SpTtm, including employing fine thread granularity, arranging coalesced memory access, rank blocking, and using fast GPU shared memory. We also optimize semi-sparse tensor-times-dense-matrix multiply (SspTtm) to take advantage of the dense sub-structures inside the tensor. The optimized SpTtm and SspTtm are applied to Tucker decomposition to improve its overall performance. Our sequential SpTtm is 3–120× faster than the SpTtm from the Tensor Toolbox library. GPU-SpTtm obtains 6–19× speedup on NVIDIA K40c and 23–67× speedup on NVIDIA P100 over CPU-SpTtm. Our GPU-SpTtm is 3.9× faster than the state-of-the-art GPU implementation. Our SspTtm implementations are up to 4.5× faster than SpTtm, which handles the input semi-sparse tensor in a general way. Tucker decomposition achieves up to 3.2× speedup after applying the optimized Ttms. The code will be publicly released in the ParTI! library: https://github.com/hpcgarage/ParTI.
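
To make the kernel described in the abstract concrete, the following is a minimal sequential sketch of a sparse tensor times dense matrix product computed directly on the nonzeros, without first unfolding the tensor into a matrix. It is not the authors' ParTI! implementation. The function name `sp_ttm`, the coordinate (COO) input layout, the dictionary-of-fibers output, and the assumption that the matrix has one row per index of the multiplied mode are all illustrative choices made here.

```python
from collections import defaultdict


def sp_ttm(indices, values, matrix, mode):
    """Multiply a COO sparse tensor by a dense matrix along `mode`.

    indices: list of index tuples, one per nonzero (hypothetical layout)
    values:  list of nonzero values, aligned with `indices`
    matrix:  dense matrix as a list of rows; row i corresponds to index i
             of the multiplied mode, and each row has R entries (assumed)
    Returns a dict that maps each index tuple with `mode` removed to a
    dense list of R values, i.e. the result is dense along `mode` only.
    """
    rank = len(matrix[0])
    fibers = defaultdict(lambda: [0.0] * rank)
    for idx, val in zip(indices, values):
        # Nonzeros that agree on every index except `mode` share a fiber.
        key = idx[:mode] + idx[mode + 1:]
        row = matrix[idx[mode]]
        out = fibers[key]
        for r in range(rank):
            out[r] += val * row[r]
    return dict(fibers)


# A 2 x 3 x 2 tensor with three nonzeros, multiplied along its last mode.
indices = [(0, 0, 0), (0, 0, 1), (1, 2, 0)]
values = [1.0, 2.0, 3.0]
matrix = [[1.0, 10.0], [100.0, 1000.0]]
result = sp_ttm(indices, values, matrix, mode=2)
print(result)
```

Each output fiber is dense along the multiplied mode while the remaining modes stay sparse, which is one way to picture the semi-sparse tensors that the abstract's SspTtm kernel takes as input.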

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/140538