Leveraging set relations in exact and dynamic set similarity join

Wang, X; Qin, L; Lin, X; Zhang, Y; Chang, L

Publication Type: Journal Article
Citation: VLDB Journal, 2019, 28 (2), pp. 267-292
Issue Date: 2019-04-11

Closed Access

	Filename	Description	Size	Format
	[2019 VLDBJ] Leveraging set relations in exact and dynamic set similarity join.pdf	Published Version	1.67 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Wang, X	en_US
dc.contributor.author	Qin, L https://orcid.org/0000-0001-6068-5062	en_US
dc.contributor.author	Lin, X	en_US
dc.contributor.author	Zhang, Y https://orcid.org/0000-0002-2674-1638	en_US
dc.contributor.author	Chang, L	en_US
dc.date.issued	2019-04-11	en_US
dc.identifier.citation	VLDB Journal, 2019, 28 (2), pp. 267 - 292	en_US
dc.identifier.issn	1066-8888	en_US
dc.identifier.uri	http://hdl.handle.net/10453/131175
dc.description.abstract	© 2018, Springer-Verlag GmbH Germany, part of Springer Nature. Set similarity join, which finds all the similar set pairs from two collections of sets, is a fundamental problem with a wide range of applications. Existing works study both exact set similarity join and approximate similarity join problems. In this paper, we focus on the exact set similarity join problem. The existing solutions for exact set similarity join follow a filtering-verification framework, which generates a list of candidate pairs through scanning indexes in the filtering phase and reports those similar pairs in the verification phase. Though much research has been conducted on this problem, set relations have not been well studied on improving the algorithm efficiency through computational cost sharing. Therefore, in this paper, we explore the set relations in different levels to reduce the overall computational cost. First, it has been shown that most of the computational time is spent on the filtering phase, which can be quadratic to the number of sets in the worst case for the existing solutions. Thus, we explore index-level set relations to reduce the filtering cost while keeping the same filtering power. We achieve this by grouping related sets into blocks in the index and skipping useless index probes in joins. Second, we explore answer-level set relations to further improve the algorithm based on the intuition that if two sets are similar, their answers may have a large overlap. We derive an algorithm which incrementally generates the answer of one set from an already computed answer of another similar set rather than compute the answer from scratch to reduce the computational cost. In addition, considering that in real applications, the data are usually updated dynamically, we extend our techniques and design efficient algorithms to incrementally update the join result when any element in the sets is updated. 
Finally, we conduct extensive performance studies using 21 real datasets with various data properties from a wide range of domains. The experimental results demonstrate that our algorithm outperforms all the existing algorithms across all datasets.	en_US
dc.relation	http://purl.org/au-research/grants/arc/DE140100999
dc.relation	http://purl.org/au-research/grants/arc/DP160101513
dc.relation	http://purl.org/au-research/grants/arc/DE140100679
dc.relation	http://purl.org/au-research/grants/arc/DP170103710
dc.relation.ispartof	VLDB Journal	en_US
dc.relation.isbasedon	10.1007/s00778-018-0529-2	en_US
dc.subject.classification	Information Systems	en_US
dc.title	Leveraging set relations in exact and dynamic set similarity join	en_US
dc.type	Journal Article
utslib.citation.volume	2	en_US
utslib.citation.volume	28	en_US
utslib.for	0804 Data Format	en_US
utslib.for	0805 Distributed Computing	en_US
utslib.for	0806 Information Systems	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
utslib.copyright.status	closed_access
pubs.issue	2	en_US
pubs.publication-status	Published	en_US
pubs.volume	28	en_US

Abstract:

© 2018, Springer-Verlag GmbH Germany, part of Springer Nature.

Set similarity join, which finds all the similar set pairs from two collections of sets, is a fundamental problem with a wide range of applications. Existing work studies both the exact set similarity join and the approximate similarity join problems. In this paper, we focus on the exact set similarity join problem. The existing solutions for exact set similarity join follow a filtering-verification framework, which generates a list of candidate pairs by scanning indexes in the filtering phase and reports the similar pairs in the verification phase. Though much research has been conducted on this problem, set relations have not been well studied as a means of improving algorithm efficiency through computational cost sharing. Therefore, in this paper, we explore set relations at different levels to reduce the overall computational cost. First, it has been shown that most of the computational time is spent on the filtering phase, which can be quadratic in the number of sets in the worst case for the existing solutions. Thus, we explore index-level set relations to reduce the filtering cost while keeping the same filtering power. We achieve this by grouping related sets into blocks in the index and skipping useless index probes in joins. Second, we explore answer-level set relations to further improve the algorithm, based on the intuition that if two sets are similar, their answers may have a large overlap. We derive an algorithm that incrementally generates the answer of one set from the already computed answer of another similar set, rather than computing the answer from scratch, to reduce the computational cost. In addition, considering that in real applications the data are usually updated dynamically, we extend our techniques and design efficient algorithms to incrementally update the join result when any element in the sets is updated.
Finally, we conduct extensive performance studies using 21 real datasets with various data properties from a wide range of domains. The experimental results demonstrate that our algorithm outperforms all the existing algorithms across all datasets.
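
The filtering-verification framework that the abstract describes can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes Jaccard similarity, a self-join over one collection of non-empty sets, and the classic prefix filter as the filtering step, none of which the abstract specifies. The names `jaccard` and `similarity_self_join` are hypothetical, and the index-level and answer-level optimizations of the paper are not shown.

```python
from collections import defaultdict
from math import ceil


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def similarity_self_join(sets, threshold):
    """Return index pairs (j, i), j < i, whose Jaccard similarity >= threshold.

    Assumption: prefix filtering stands in for the filtering phase.
    """
    # Order tokens by ascending global frequency so prefixes hold rare tokens.
    freq = defaultdict(int)
    for s in sets:
        for tok in s:
            freq[tok] += 1
    records = [sorted(s, key=lambda t: (freq[t], t)) for s in sets]

    index = defaultdict(list)  # token -> ids of sets whose prefix contains it
    results = []
    for i, rec in enumerate(records):
        # Jaccard prefix length: |s| - ceil(t * |s|) + 1.
        # The small epsilon guards against floating-point overshoot in ceil.
        prefix_len = len(rec) - ceil(threshold * len(rec) - 1e-9) + 1
        prefix = rec[:prefix_len]

        # Filtering phase: probe the inverted index to collect candidates.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])

        # Verification phase: compute the exact similarity of each candidate.
        for j in sorted(candidates):
            if jaccard(sets[i], sets[j]) >= threshold:
                results.append((j, i))

        # Index the current set only after probing, so each pair is seen once.
        for tok in prefix:
            index[tok].append(i)
    return results


if __name__ == "__main__":
    data = [{"a", "b", "c", "d"}, {"a", "b", "c", "e"}, {"x", "y", "z"}]
    print(similarity_self_join(data, 0.6))
```

In this sketch every set probes the index independently, which is the per-set filtering cost the abstract identifies as the bottleneck. Sharing that work across related sets is the part the paper addresses and this example leaves out.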

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/131175