Leveraging set relations in exact set similarity join

Wang, X; Qin, L; Lin, X; Zhang, Y; Chang, L

Publication Type: Conference Proceeding
Citation: Proceedings of the VLDB Endowment, 2017, 10 (9), pp. 925 - 936
Issue Date: 2017-05-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Wang, X	en_US
dc.contributor.author	Qin, L https://orcid.org/0000-0001-6068-5062	en_US
dc.contributor.author	Lin, X	en_US
dc.contributor.author	Zhang, Y https://orcid.org/0000-0002-2674-1638	en_US
dc.contributor.author	Chang, L	en_US
dc.date.issued	2017-05-01	en_US
dc.identifier.citation	Proceedings of the VLDB Endowment, 2017, 10 (9), pp. 925 - 936	en_US
dc.identifier.uri	http://hdl.handle.net/10453/123053
dc.relation	http://purl.org/au-research/grants/arc/FT170100128
dc.relation	http://purl.org/au-research/grants/arc/DP180103096
dc.relation.ispartof	Proceedings of the VLDB Endowment	en_US
dc.relation.isbasedon	10.14778/3099622.3099624	en_US
dc.title	Leveraging set relations in exact set similarity join	en_US
dc.type	Conference Proceeding
utslib.citation.volume	10	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Software
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
utslib.copyright.status	open_access
pubs.issue	9	en_US
pubs.publication-status	Published	en_US
pubs.volume	10	en_US

Abstract:

© 2017 VLDB. Exact set similarity join, which finds all similar set pairs from two collections of sets, is a fundamental problem with a wide range of applications. Existing solutions follow a filtering-verification framework, which generates a list of candidate pairs by scanning indexes in the filtering phase and reports the similar pairs in the verification phase. Although much research has been conducted on this problem, set relations, which we find to be quite effective in improving algorithm efficiency through computational cost sharing, have never been studied. Therefore, in this paper, instead of considering each set individually, we explore set relations at different levels to reduce the overall computational cost. First, it has been shown that most of the computational time is spent in the filtering phase, which can be quadratic in the number of sets in the worst case for existing solutions. We therefore explore index-level set relations to reduce the filtering cost to be linear in the size of the input while keeping the same filtering power. We achieve this by grouping related sets into blocks in the index and skipping useless index probes in joins. Second, we explore answer-level set relations to further improve the algorithm, based on the intuition that if two sets are similar, their answers may have a large overlap. We derive an algorithm that incrementally generates the answer of one set from the already computed answer of another similar set, rather than computing the answer from scratch, to reduce the computational cost. Finally, we conduct extensive performance studies using 21 real datasets with various data properties from a wide range of domains. The experimental results demonstrate that our algorithm outperforms all existing algorithms across all datasets and can achieve more than an order of magnitude speedup over the state-of-the-art algorithms.
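
For readers unfamiliar with the filtering-verification framework the abstract refers to, the following is a minimal sketch of the generic baseline, not the algorithm proposed in the paper. It assumes Jaccard similarity and the standard prefix filter, neither of which is named in the abstract. The function names (`jaccard`, `prefix_length`, `similarity_join`) are hypothetical and chosen for illustration only. The index-level and answer-level optimizations described above are not implemented here.

```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def prefix_length(size, threshold):
    """Prefix length for the standard prefix filter under Jaccard.

    Two sets with Jaccard >= threshold must share a token within these
    prefixes when both are sorted by the same global token order.
    The small epsilon guards against floating-point round-up.
    """
    return size - math.ceil(threshold * size - 1e-9) + 1


def similarity_join(left, right, threshold):
    """Return sorted (i, j) pairs with jaccard(left[i], right[j]) >= threshold.

    Assumes 0 < threshold <= 1 and tokens of one comparable type.
    """
    # Global order: rare tokens first, ties broken by the token itself.
    freq = defaultdict(int)
    for s in list(left) + list(right):
        for tok in s:
            freq[tok] += 1

    def ordered(s):
        return sorted(s, key=lambda tok: (freq[tok], tok))

    # Build an inverted index over the prefixes of the right collection.
    index = defaultdict(list)
    for j, s in enumerate(right):
        toks = ordered(s)
        for tok in toks[:prefix_length(len(toks), threshold)]:
            index[tok].append(j)

    results = []
    for i, r in enumerate(left):
        toks = ordered(r)
        # Filtering phase: probe the index with each prefix token.
        candidates = set()
        for tok in toks[:prefix_length(len(toks), threshold)]:
            candidates.update(index.get(tok, ()))
        # Verification phase: compute the exact similarity per candidate.
        for j in candidates:
            if jaccard(r, right[j]) >= threshold - 1e-9:
                results.append((i, j))
    return sorted(results)


if __name__ == "__main__":
    left = [{"a", "b", "c", "d"}, {"x", "y"}]
    right = [{"a", "b", "c", "e"}, {"x", "y", "z"}, {"p", "q"}]
    print(similarity_join(left, right, 0.6))
```

In this baseline every set is probed and verified independently, which is the per-set cost that the abstract proposes to share across related sets.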

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/123053