Efficient set containment join

Yang, J; Zhang, W; Yang, S; Zhang, Y; Lin, X; Yuan, L



Publication Type: Journal Article
Citation: VLDB Journal, 2018, 27 (4), pp. 471 - 495
Issue Date: 2018-08-01

Closed Access

Filename	Description	Size	Format
Yang2018_Article_EfficientSetContainmentJoin.pdf	Published Version	1.42 MB	Adobe PDF


This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Yang, J	en_US
dc.contributor.author	Zhang, W https://orcid.org/0000-0001-6572-2600	en_US
dc.contributor.author	Yang, S	en_US
dc.contributor.author	Zhang, Y https://orcid.org/0000-0002-2674-1638	en_US
dc.contributor.author	Lin, X https://orcid.org/0000-0003-2396-7225	en_US
dc.contributor.author	Yuan, L	en_US
dc.date.issued	2018-08-01	en_US
dc.identifier.citation	VLDB Journal, 2018, 27 (4), pp. 471 - 495	en_US
dc.identifier.issn	1066-8888	en_US
dc.identifier.uri	http://hdl.handle.net/10453/131763
dc.relation.ispartof	VLDB Journal	en_US
dc.relation.isbasedon	10.1007/s00778-018-0505-x	en_US
dc.subject.classification	Information Systems	en_US
dc.title	Efficient set containment join	en_US
dc.type	Journal Article
utslib.citation.volume	4	en_US
utslib.citation.volume	27	en_US
utslib.for	0806 Information Systems	en_US
utslib.for	0804 Data Format	en_US
utslib.for	0805 Distributed Computing	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney/Faculty of Science
pubs.organisational-group	/University of Technology Sydney/Faculty of Science/School of Life Sciences
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
utslib.copyright.status	closed_access
pubs.issue	4	en_US
pubs.publication-status	Published	en_US
pubs.volume	27	en_US

Abstract:

© 2018, Springer-Verlag GmbH Germany, part of Springer Nature. In this paper, we study the problem of set containment join. Given two collections R and S of records, the set containment join R ⋈⊆ S retrieves all record pairs (r, s) ∈ R × S such that r ⊆ s. This problem has been extensively studied in the literature and has many important applications in commercial and scientific fields. Recent research focuses on in-memory set containment join algorithms, and several techniques have been developed following the intersection-oriented or union-oriented computing paradigms. Nevertheless, we observe that both computing paradigms have their limits due to the nature of the intersection and union operators. In particular, the intersection-oriented method relies on the intersection of the relevant inverted lists built on the elements of S. A nice property of the intersection-oriented method is that the join computation is verification-free. However, the number of records explored during the join process may be large because there are multiple replicas of each record in S. On the other hand, the union-oriented method generates a signature for each record in R, and the candidate pairs are obtained by the union of the inverted lists of the relevant signatures. The candidate size of the union-oriented method is usually small because each record contributes only one replica to the index. Unfortunately, the union-oriented method needs to verify the candidate pairs, which may be expensive, especially when the join result size is large. As a matter of fact, the state-of-the-art union-oriented solution is not competitive with the intersection-oriented ones. In this paper, we propose a new union-oriented method, namely TT-Join, which not only enhances the advantage of the previous union-oriented methods but also integrates the strengths of intersection-oriented methods by imposing a variant of the prefix tree structure.
We conduct extensive experiments on 20 real-life datasets and synthetic datasets, comparing our method with 7 existing methods. The experimental results demonstrate that TT-Join significantly outperforms the existing algorithms on most of the datasets and can achieve a speedup of up to two orders of magnitude. Furthermore, to support large-scale datasets, we extend our techniques to distributed systems on top of the MapReduce framework. With the help of carefully designed load-aware distribution mechanisms, our distributed join algorithm can achieve a speedup of up to an order of magnitude over the baseline methods.
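
As a rough illustration of the problem definition and of the intersection-oriented paradigm described in the abstract, the following Python sketch may help. It is not the paper's TT-Join algorithm, whose details are not given in this record. The function names, the toy collections R and S, and the choice to report results as index pairs are all assumptions made for this example.

```python
from collections import defaultdict


def naive_containment_join(R, S):
    """Return all (i, j) with R[i] a subset of S[j], by checking every pair."""
    return [(i, j) for i, r in enumerate(R) for j, s in enumerate(S) if r <= s]


def intersection_join(R, S):
    """Intersection-oriented sketch: intersect the inverted lists of r's elements.

    Illustrative only; this is not the TT-Join method from the paper.
    """
    # Inverted index on S: element -> ids of the records in S that contain it.
    # Each record of S appears once per element, hence the "multiple replicas".
    inverted = defaultdict(set)
    for j, s in enumerate(S):
        for element in s:
            inverted[element].add(j)

    all_ids = set(range(len(S)))
    result = []
    for i, r in enumerate(R):
        if not r:
            # The empty set is contained in every record of S.
            matches = all_ids
        else:
            # Start from the shortest list so the intersection shrinks quickly.
            lists = sorted((inverted.get(e, set()) for e in r), key=len)
            matches = set.intersection(*lists)
        # No verification step is needed: every surviving id is a true match.
        result.extend((i, j) for j in sorted(matches))
    return result


# Hypothetical toy data, chosen only for illustration.
R = [frozenset({"a", "b"}), frozenset({"c"}), frozenset({"a", "d"})]
S = [frozenset({"a", "b", "c"}), frozenset({"a", "b", "d"}), frozenset({"c", "d"})]

pairs = intersection_join(R, S)
print(pairs)
```

On this toy input both functions return the same pairs. The intersection-based version avoids comparing every record of R against every record of S, at the cost of storing each record of S once per element in the index.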

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/131763