Scalable distributed subgraph enumeration

Lai, L; Qin, L; Lin, X; Zhang, Y; Chang, L; Yang, S

Publication Type: Conference Proceeding
Citation: Proceedings of the VLDB Endowment, 2016, 10 (3), pp. 217 - 228
Issue Date: 2016-01-01

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Lai, L	en_US
dc.contributor.author	Qin, L https://orcid.org/0000-0001-6068-5062	en_US
dc.contributor.author	Lin, X	en_US
dc.contributor.author	Zhang, Y https://orcid.org/0000-0002-2674-1638	en_US
dc.contributor.author	Chang, L	en_US
dc.contributor.author	Yang, S	en_US
dc.date.issued	2016-01-01	en_US
dc.identifier.citation	Proceedings of the VLDB Endowment, 2016, 10 (3), pp. 217 - 228	en_US
dc.identifier.uri	http://hdl.handle.net/10453/92132
dc.relation	http://purl.org/au-research/grants/arc/DE140100679
dc.relation.ispartof	Proceedings of the VLDB Endowment	en_US
dc.relation.isbasedon	10.14778/3021924.3021937	en_US
dc.title	Scalable distributed subgraph enumeration	en_US
dc.type	Conference Proceeding
utslib.citation.volume	3	en_US
utslib.citation.volume	10	en_US
utslib.for	080101 Adaptive Agents and Intelligent Robotics	en_US
utslib.for	080109 Pattern Recognition and Data Mining	en_US
utslib.for	0802 Computation Theory and Mathematics	en_US
utslib.for	0806 Information Systems	en_US
utslib.for	0807 Library and Information Studies	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
utslib.copyright.status	closed_access
pubs.issue	3	en_US
pubs.publication-status	Published	en_US
pubs.volume	10	en_US

Abstract:

© 2016 VLDB Endowment. Subgraph enumeration aims to find all the subgraphs of a large data graph that are isomorphic to a given pattern graph. As the subgraph isomorphism operation is computationally intensive, researchers have recently focused on solving this problem in distributed environments, such as MapReduce and Pregel. Among them, the state-of-the-art algorithm, Twin Twig Join, is proven to be instance optimal based on a left-deep join framework. However, it still does not scale to large graphs because of the constraints of the left-deep join framework and the requirement that each decomposed component (join unit) be a star. In this paper, we propose SEED, a scalable subgraph enumeration approach for the distributed environment. Compared to Twin Twig Join, SEED returns an optimal solution in a generalized join framework without the constraints of Twin Twig Join. We use both stars and cliques as join units, and design an effective distributed graph storage mechanism to support this extension. We develop a comprehensive cost model that estimates the number of matches of any given pattern graph by considering the power-law degree distribution of the data graph. We then generalize the left-deep join framework and develop a dynamic-programming algorithm to compute an optimal bushy join plan. We also consider overlaps among the join units. Finally, we propose clique compression to further improve the algorithm by reducing the number of intermediate results. Extensive performance studies are conducted on several real graphs, one containing billions of edges. The results demonstrate that our algorithm outperforms all other state-of-the-art algorithms by more than one order of magnitude.
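To make the idea of enumerating a pattern by joining the matches of its join units concrete, here is a minimal single-machine sketch in Python. It is not the SEED algorithm: it uses only star join units and a plain in-memory hash join, and it leaves out the distributed storage, the cost model, the bushy join plan and clique compression. The pattern (a 4-cycle a-b-c-d), the decomposition into two 2-leaf stars centred at a and c, and all function names are assumptions chosen for illustration. Matching is non-induced, so extra edges among the matched vertices are allowed.

```python
from collections import defaultdict
from itertools import permutations


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def star_matches(adj):
    """Yield matches (center, leaf1, leaf2) of a star with two distinct leaves.

    Hypothetical join unit for this sketch; leaves are ordered, so each
    unordered pair of neighbours appears twice.
    """
    for center, neighbours in adj.items():
        for leaf1, leaf2 in permutations(neighbours, 2):
            yield center, leaf1, leaf2


def enumerate_squares(edges):
    """Enumerate 4-cycles a-b-c-d by joining two star matches on (b, d).

    One star is centred at a with leaves b and d, the other at c with the
    same leaves. Each result is a frozenset of its four undirected edges,
    so every subgraph is reported once regardless of pattern automorphisms.
    """
    adj = build_adjacency(edges)

    # Hash the star matches on the join key: the ordered pair of leaves.
    centers_by_leaves = defaultdict(list)
    for center, leaf1, leaf2 in star_matches(adj):
        centers_by_leaves[(leaf1, leaf2)].append(center)

    squares = set()
    for (b, d), centers in centers_by_leaves.items():
        # Join two stars that share both leaves but have different centers.
        for a, c in permutations(centers, 2):
            squares.add(frozenset({
                frozenset({a, b}), frozenset({b, c}),
                frozenset({c, d}), frozenset({d, a}),
            }))
    return squares


if __name__ == "__main__":
    cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
    print(len(enumerate_squares(cycle)))
```

Even in this toy version, the number of intermediate star matches grows with the square of a vertex's degree, which hints at why high-degree vertices in power-law graphs make the choice of join units and join plan matter.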

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/92132