Distributed streaming set similarity join

Yang, J; Zhang, W; Wang, X; Zhang, Y; Lin, X

Publisher: IEEE
Publication Type: Conference Proceeding
Citation: Proceedings - International Conference on Data Engineering, 2020, 2020-April, pp. 565-576
Issue Date: 2020-04-01

Closed Access

	Filename	Description	Size	Format
	ICDE_join.pdf	Accepted version	509.26 kB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Yang, J
dc.contributor.author	Zhang, W
dc.contributor.author	Wang, X
dc.contributor.author	Zhang, Y https://orcid.org/0000-0002-2674-1638
dc.contributor.author	Lin, X
dc.date	2020-04-20
dc.date.accessioned	2020-11-03T01:39:00Z
dc.date.available	2020-11-03T01:39:00Z
dc.date.issued	2020-04-01
dc.identifier.citation	Proceedings - International Conference on Data Engineering, 2020, 2020-April, pp. 565-576
dc.identifier.isbn	9781728129037
dc.identifier.issn	1084-4627
dc.identifier.uri	http://hdl.handle.net/10453/143719
dc.language	en
dc.publisher	IEEE
dc.relation.ispartof	Proceedings - International Conference on Data Engineering
dc.relation.ispartof	2020 IEEE 36th International Conference on Data Engineering (ICDE)
dc.relation.isbasedon	10.1109/ICDE48307.2020.00055
dc.rights	© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.	en_US
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	Distributed streaming set similarity join
dc.type	Conference Proceeding
utslib.citation.volume	2020-April
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney
utslib.copyright.status	closed_access	*
dc.date.updated	2020-11-03T01:38:51Z
pubs.finish-date	2020-04-24
pubs.publication-status	Published
pubs.start-date	2020-04-20
pubs.volume	2020-April

Abstract:

With the prevalence of Internet access and user-generated content, large numbers of documents and records, such as news articles and web pages, are being generated continuously at an unprecedented rate. In this paper, we study the problem of efficient streaming set similarity join over distributed systems, which has broad applications in data cleaning and data integration tasks such as online near-duplicate detection. In contrast to the prefix-based distribution strategy that is widely adopted in offline distributed processing, we propose a simple yet efficient length-based distribution framework that dispatches incoming records by their length. A load-aware length partition method is developed that estimates the local join cost in order to find a partition with good load balance. Our length-based scheme is surprisingly superior to its competitors because it has no replication, small communication cost, and high throughput. We further observe that the join results for the current incoming record can be used to guide index construction, which in turn facilitates the join processing of future records. Inspired by this observation, we propose a novel bundle-based join algorithm that groups similar records on the fly to reduce filtering cost. A by-product of this algorithm is an efficient verification technique that verifies a batch of records together, using their token differences to share verification costs rather than verifying each record individually. Extensive experiments conducted on Storm, a popular distributed stream processing system, suggest that our methods can achieve up to one order of magnitude throughput improvement over baselines.
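
The length-based dispatching idea in the abstract can be illustrated with a minimal sketch. Since the full text is closed access, everything below is an assumption rather than the authors' implementation: it assumes Jaccard similarity, uses fixed length boundaries instead of the load-aware partition, and replaces the bundle-based join with a brute-force local comparison. The names `length_bounds`, `owner`, `probe_targets`, and `stream_join` are hypothetical. The sketch relies on the standard length filter: if two sets have Jaccard similarity at least t, a partner of a record with length l must have length between t*l and l/t.

```python
import math
from bisect import bisect_right


def jaccard(r, s):
    """Jaccard similarity of two sets."""
    union = len(r | s)
    return len(r & s) / union if union else 1.0


def length_bounds(length, threshold):
    """Smallest and largest partner length allowed by the length filter.

    The small epsilon guards against floating-point rounding at exact bounds.
    """
    lo = math.ceil(threshold * length - 1e-9)
    hi = math.floor(length / threshold + 1e-9)
    return lo, hi


def owner(length, boundaries):
    """Index of the worker that stores records of this length.

    Assumption: boundaries is a sorted list, and boundaries[i] is the
    smallest length owned by worker i + 1.
    """
    return bisect_right(boundaries, length)


def probe_targets(length, boundaries, threshold):
    """Workers whose length intervals can hold a similar record."""
    lo, hi = length_bounds(length, threshold)
    return list(range(owner(lo, boundaries), owner(hi, boundaries) + 1))


def stream_join(records, boundaries, threshold):
    """Simulate the stream on one machine; one list stands in for each worker."""
    workers = [[] for _ in range(len(boundaries) + 1)]
    results = []
    for rid, tokens in enumerate(records):
        rec = frozenset(tokens)
        # Probe only the workers that pass the length filter.
        for w in probe_targets(len(rec), boundaries, threshold):
            for other_id, other in workers[w]:
                if jaccard(rec, other) >= threshold:
                    results.append((other_id, rid))
        # Store the record at exactly one worker, so nothing is replicated.
        workers[owner(len(rec), boundaries)].append((rid, rec))
    return results


if __name__ == "__main__":
    stream = [{"a", "b", "c", "d"}, {"a", "b", "c", "d", "e"},
              {"x", "y"}, {"a", "b", "c", "e"}]
    print(stream_join(stream, boundaries=[3, 5], threshold=0.7))
```

In this sketch each record is stored once, at the worker owning its length, and is sent as a probe only to workers whose length intervals overlap its admissible length range. That is one plausible reading of "no replication" and "small communication cost"; how the paper chooses the boundaries and performs the local join is not specified in this record.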

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/143719