Leveraging set relations in exact set similarity join

Wang, X; Qin, L; Lin, X; Zhang, Y; Chang, L

Publication Type: Journal Article
Citation: Proceedings of the VLDB Endowment, 2017, 10 (9), pp. 925-936
Issue Date: 2017-05-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Wang, X	en_US
dc.contributor.author	Qin, L https://orcid.org/0000-0001-6068-5062	en_US
dc.contributor.author	Lin, X https://orcid.org/0000-0003-2396-7225	en_US
dc.contributor.author	Zhang, Y https://orcid.org/0000-0002-2674-1638	en_US
dc.contributor.author	Chang, L	en_US
dc.date.issued	2017-05-01	en_US
dc.identifier.citation	Proceedings of the VLDB Endowment, 2017, 10 (9), pp. 925 - 936	en_US
dc.identifier.uri	http://hdl.handle.net/10453/127415
dc.description.abstract	© 2017 VLDB. Exact set similarity join, which finds all the similar set pairs from two collections of sets, is a fundamental problem with a wide range of applications. The existing solutions for set similarity join follow a filtering-verification framework, which generates a list of candidate pairs by scanning indexes in the filtering phase and reports the similar pairs in the verification phase. Though much research has been conducted on this problem, set relations, which we find to be quite effective at improving algorithm efficiency through computational cost sharing, have never been studied. Therefore, in this paper, instead of considering each set individually, we explore set relations at different levels to reduce the overall computational cost. First, it has been shown that most of the computational time is spent in the filtering phase, which can be quadratic in the number of sets in the worst case for the existing solutions. Thus we explore index-level set relations to reduce the filtering cost to be linear in the size of the input while keeping the same filtering power. We achieve this by grouping related sets into blocks in the index and skipping useless index probes in joins. Second, we explore answer-level set relations to further improve the algorithm, based on the intuition that if two sets are similar, their answers may have a large overlap. We derive an algorithm that incrementally generates the answer of one set from an already computed answer of another similar set, rather than computing the answer from scratch, to reduce the computational cost. Finally, we conduct extensive performance studies using 21 real datasets with various data properties from a wide range of domains. The experimental results demonstrate that our algorithm outperforms all the existing algorithms across all datasets and can achieve more than an order of magnitude speedup over the state-of-the-art algorithms.	en_US
dc.relation	http://purl.org/au-research/grants/arc/DE140100999
dc.relation	http://purl.org/au-research/grants/arc/DP170103710
dc.relation	http://purl.org/au-research/grants/arc/DE140100679
dc.relation	http://purl.org/au-research/grants/arc/FT170100128
dc.relation	http://purl.org/au-research/grants/arc/DP180103096
dc.relation.ispartof	Proceedings of the VLDB Endowment	en_US
dc.relation.isbasedon	10.14778/3099622.3099624	en_US
dc.title	Leveraging set relations in exact set similarity join	en_US
dc.type	Journal Article
utslib.citation.volume	9	en_US
utslib.citation.volume	10	en_US
utslib.for	0802 Computation Theory and Mathematics	en_US
utslib.for	0806 Information Systems	en_US
utslib.for	0807 Library and Information Studies	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
utslib.copyright.status	open_access
pubs.issue	9	en_US
pubs.publication-status	Published	en_US
pubs.volume	10	en_US

Abstract:

© 2017 VLDB. Exact set similarity join, which finds all the similar set pairs from two collections of sets, is a fundamental problem with a wide range of applications. The existing solutions for set similarity join follow a filtering-verification framework, which generates a list of candidate pairs by scanning indexes in the filtering phase and reports the similar pairs in the verification phase. Though much research has been conducted on this problem, set relations, which we find to be quite effective at improving algorithm efficiency through computational cost sharing, have never been studied. Therefore, in this paper, instead of considering each set individually, we explore set relations at different levels to reduce the overall computational cost. First, it has been shown that most of the computational time is spent in the filtering phase, which can be quadratic in the number of sets in the worst case for the existing solutions. Thus we explore index-level set relations to reduce the filtering cost to be linear in the size of the input while keeping the same filtering power. We achieve this by grouping related sets into blocks in the index and skipping useless index probes in joins. Second, we explore answer-level set relations to further improve the algorithm, based on the intuition that if two sets are similar, their answers may have a large overlap. We derive an algorithm that incrementally generates the answer of one set from an already computed answer of another similar set, rather than computing the answer from scratch, to reduce the computational cost. Finally, we conduct extensive performance studies using 21 real datasets with various data properties from a wide range of domains. The experimental results demonstrate that our algorithm outperforms all the existing algorithms across all datasets and can achieve more than an order of magnitude speedup over the state-of-the-art algorithms.
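
The filtering-verification framework mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a self-join under Jaccard similarity (the abstract names no specific similarity function) and uses generic prefix filtering over an inverted index, without the block-based index or the answer-level sharing the authors describe. The names `jaccard` and `similarity_self_join` are hypothetical, and sets are assumed to be non-empty sets of strings.

```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed similarity function)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_self_join(sets, threshold):
    """Return all index pairs (i, j), i < j, with Jaccard >= threshold.

    Hypothetical helper: generic prefix filtering plus verification,
    not the block-based index described in the paper.
    """
    # Global token order: rarest tokens first, so prefixes hold rare tokens.
    freq = defaultdict(int)
    for s in sets:
        for tok in s:
            freq[tok] += 1
    records = [sorted(s, key=lambda t: (freq[t], t)) for s in sets]

    index = defaultdict(list)  # token -> ids of sets whose prefix contains it
    results = []
    for i, rec in enumerate(records):
        # Two sets with Jaccard >= threshold must share a prefix token.
        # The small epsilon guards against floating-point overshoot in ceil.
        min_overlap = math.ceil(threshold * len(rec) - 1e-9)
        prefix = rec[: len(rec) - min_overlap + 1]

        # Filtering phase: probe the index to collect candidate ids.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])

        # Verification phase: compute the exact similarity of each candidate.
        for j in candidates:
            if jaccard(sets[i], sets[j]) >= threshold:
                results.append((j, i))

        # Index the current set so that later sets can find it.
        for tok in prefix:
            index[tok].append(i)
    return sorted(results)


if __name__ == "__main__":
    data = [{"a", "b", "c", "d"}, {"a", "b", "c", "e"},
            {"x", "y"}, {"a", "b", "c", "d", "e"}]
    print(similarity_self_join(data, 0.6))
```

In this sketch every set probes the index independently, which is the per-set cost that the abstract's index-level and answer-level relations aim to share across related sets.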

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/127415