Efficient matrix sketching over distributed data

Huang, Z; Lin, X; Zhang, W; Zhang, Y

Publication Type: Conference Proceeding
Citation: Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 2017, Part F127745, pp. 347-359
Issue Date: 2017-05-09

Closed Access

	Filename	Description	Size	Format
	p347-huang.pdf	Published version	639.41 kB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Huang, Z	en_US
dc.contributor.author	Lin, X	en_US
dc.contributor.author	Zhang, W https://orcid.org/0000-0001-6572-2600	en_US
dc.contributor.author	Zhang, Y https://orcid.org/0000-0002-2674-1638	en_US
dc.date.issued	2017-05-09	en_US
dc.identifier.citation	Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 2017, Part F127745 pp. 347 - 359	en_US
dc.identifier.isbn	9781450341981	en_US
dc.identifier.uri	http://hdl.handle.net/10453/127276
dc.description.abstract	© 2017 ACM. A sketch or synopsis of a large dataset captures vital properties of the original data while typically occupying much less space. In this paper, we consider the problem of computing a sketch of a massive data matrix $A \in \mathbb{R}^{n \times d}$, which is distributed across a large number $s$ of servers. Our goal is to output a matrix $B \in \mathbb{R}^{\ell \times d}$ which is significantly smaller than $A$ but still approximates it well in terms of covariance error, i.e., $\|A^T A - B^T B\|_2$. Here, for a matrix $A$, $\|A\|_2$ is the spectral norm of $A$, which is defined as the largest singular value of $A$. Following previous works, we call $B$ a covariance sketch of $A$. We are mainly focused on minimizing the communication cost, which is arguably the most valuable resource in distributed computations. We show a gap between deterministic and randomized communication complexity for computing a covariance sketch. More specifically, we first prove a tight deterministic lower bound, then show how to bypass this lower bound using randomization. In Principal Component Analysis (PCA), the goal is to find a low-dimensional subspace that captures as much of the variance of a dataset as possible. Based on a well-known connection between covariance sketch and PCA, we give a new algorithm for distributed PCA with improved communication cost. Moreover, in our algorithms, each server only needs to make one pass over the data with limited working space.	en_US
dc.relation.ispartof	Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems	en_US
dc.relation.isbasedon	10.1145/3034786.3056119	en_US
dc.title	Efficient matrix sketching over distributed data	en_US
dc.type	Conference Proceeding
utslib.citation.volume	Part F127745	en_US
utslib.for	0806 Information Systems	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney/Faculty of Science
pubs.organisational-group	/University of Technology Sydney/Faculty of Science/School of Life Sciences
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
utslib.copyright.status	closed_access
pubs.publication-status	Published	en_US
pubs.volume	Part F127745	en_US

Abstract:

© 2017 ACM. A sketch or synopsis of a large dataset captures vital properties of the original data while typically occupying much less space. In this paper, we consider the problem of computing a sketch of a massive data matrix $A \in \mathbb{R}^{n \times d}$, which is distributed across a large number $s$ of servers. Our goal is to output a matrix $B \in \mathbb{R}^{\ell \times d}$ which is significantly smaller than $A$ but still approximates it well in terms of covariance error, i.e., $\|A^T A - B^T B\|_2$. Here, for a matrix $A$, $\|A\|_2$ is the spectral norm of $A$, which is defined as the largest singular value of $A$. Following previous works, we call $B$ a covariance sketch of $A$. We are mainly focused on minimizing the communication cost, which is arguably the most valuable resource in distributed computations. We show a gap between deterministic and randomized communication complexity for computing a covariance sketch. More specifically, we first prove a tight deterministic lower bound, then show how to bypass this lower bound using randomization. In Principal Component Analysis (PCA), the goal is to find a low-dimensional subspace that captures as much of the variance of a dataset as possible. Based on a well-known connection between covariance sketch and PCA, we give a new algorithm for distributed PCA with improved communication cost. Moreover, in our algorithms, each server only needs to make one pass over the data with limited working space.
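
To make the covariance error in the abstract concrete, the following is a minimal sketch in plain Python. It is not the algorithm from the paper: as a stand-in it uses norm-squared row sampling, a generic randomized baseline, on each of s simulated servers and stacks the local sketches at a coordinator. All function names (`gram`, `spectral_norm_sym`, `covariance_error`, `sample_rows`) and the toy sizes are assumptions made for illustration.

```python
import math
import random


def gram(rows, d):
    """Return the d x d Gram matrix A^T A of a matrix given as a list of rows."""
    g = [[0.0] * d for _ in range(d)]
    for r in rows:
        for i in range(d):
            for j in range(d):
                g[i][j] += r[i] * r[j]
    return g


def spectral_norm_sym(m, iters=500):
    """Estimate the spectral norm of a symmetric matrix by power iteration.

    Illustrative only: the estimate approaches the true norm from below and
    assumes the fixed start vector is not orthogonal to the top eigenvector.
    """
    d = len(m)
    v = [1.0 / (i + 1) for i in range(d)]
    scale = math.sqrt(sum(x * x for x in v))
    v = [x / scale for x in v]
    est = 0.0
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
        est = math.sqrt(sum(x * x for x in w))  # ||M v|| with ||v|| = 1
        if est == 0.0:
            return 0.0
        v = [x / est for x in w]
    return est


def covariance_error(a, b, d):
    """Covariance error ||A^T A - B^T B||_2 of a candidate sketch B."""
    ga, gb = gram(a, d), gram(b, d)
    diff = [[ga[i][j] - gb[i][j] for j in range(d)] for i in range(d)]
    return spectral_norm_sym(diff)


def sample_rows(rows, ell, rng):
    """Norm-squared row sampling (assumed baseline, not the paper's method).

    Picks ell rows with probability proportional to their squared norm and
    rescales them so that E[B^T B] = A^T A.
    """
    weights = [sum(x * x for x in r) for r in rows]
    total = sum(weights)
    if total == 0.0:
        return [[0.0] * len(rows[0]) for _ in range(ell)]
    picks = rng.choices(range(len(rows)), weights=weights, k=ell)
    return [
        [x / math.sqrt(ell * weights[j] / total) for x in rows[j]]
        for j in picks
    ]


# Toy setup: s servers, each holding 20 rows of a d-column matrix.
rng = random.Random(0)
d, s, ell = 3, 3, 10
parts = [
    [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(20)]
    for _ in range(s)
]

# Each server makes one pass over its rows and sends an ell x d sketch.
local = [sample_rows(p, ell, rng) for p in parts]

# The coordinator stacks the local sketches into B with s * ell rows.
a = [r for p in parts for r in p]
b = [r for sk in local for r in sk]

err = covariance_error(a, b, d)
local_errs = [covariance_error(p, sk, d) for p, sk in zip(parts, local)]
print("covariance error of stacked sketch:", err)
print("sum of per-server errors:", sum(local_errs))
```

Stacking works because Gram matrices add across row partitions, so the triangle inequality bounds the error of the stacked sketch by the sum of the per-server errors. In this baseline the communication grows as s·ℓ·d numbers, which is the kind of cost the paper aims to reduce.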

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/127276