Ten thousand sqls: Parallel keyword queries computing

Qin, L; Yu, JX; Chang, L

Ten thousand sqls: Parallel keyword queries computing

Qin, L

Yu, JX

Chang, L

Permalink

Publication Type:: Conference Proceeding
Citation:: Proceedings of the VLDB Endowment, 2010, 3 (1), pp. 58 - 69
Issue Date:: 2010-01-01

Closed Access

	Filename	Description	Size
	2013005187OK.pdf		709.97 kB	Adobe PDF	View/Open

Copyright Clearance Process

Recently Added
In Progress
Closed Access

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Qin, L https://orcid.org/0000-0001-6068-5062	en_US
dc.contributor.author	Yu, JX https://orcid.org/0000-0002-9738-827X	en_US
dc.contributor.author	Chang, L	en_US
dc.date.issued	2010-01-01	en_US
dc.identifier.citation	Proceedings of the VLDB Endowment, 2010, 3 (1), pp. 58 - 69	en_US
dc.identifier.uri	http://hdl.handle.net/10453/28934
dc.description.abstract	Keyword search in relational databases has been extensively studied. Given a relational database, a keyword query finds a set of interconnected tuple structures connected by foreign key references. On rdbms, a keyword query is processed in two steps, namely, candidate networks (CNs) generation and CNs evaluation, where a CN is an sql. In common, a keyword query needs to be processed using over 10,000 sqls. There are several approaches to process a keyword query on rdbms, but there is a limit to achieve high performance on a uniprocessor architecture. In this paper, we study parallel computing keyword queries on a multicore architecture. We give three observations on keyword query computing, namely, a large number of sqls that needs to be processed, high sharing possibility among sqls, and large intermediate results with small number of final results. All make it challenging for parallel keyword queries computing. We investigate three approaches. We first study the query level parallelism, where each sql is processed by one core. We distribute the sqls into different cores based on three objectives, regarding minimizing workload skew, minimizing intercore sharing and maximizing intra-core sharing respectively. Such an approach has the potential risk of load unbalancing through accumulating errors of cost estimation. We then study the operation level parallelism, where each operation of an sql is processed by one core. All operations are processed in stages, where in each stage the costs of operations are reestimated to reduce the accumulated error. Such operation level parallelism still has drawbacks of workload skew when large operations are involved and a large number of cores are used. Finally, we propose a new algorithm that partitions relations adaptively in order to minimize the extra cost of partitioning and at the same time reduce workload skew. We conducted extensive performance studies using two large real datasets, DBLP and IMDB, and we report the efficiency of our approaches in this paper. © 2010 VLDB Endowment.	en_US
dc.relation.ispartof	Proceedings of the VLDB Endowment	en_US
dc.relation.isbasedon	10.14778/1920841.1920854	en_US
dc.title	Ten thousand sqls: Parallel keyword queries computing	en_US
dc.type	Conference Proceeding
utslib.citation.volume	1	en_US
utslib.citation.volume	3	en_US
utslib.for	0806 Information Systems	en_US
utslib.for	0802 Computation Theory and Mathematics	en_US
utslib.for	0807 Library and Information Studies	en_US
dc.location.activity	Singapore	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
utslib.copyright.status	closed_access
pubs.issue	1	en_US
pubs.publication-status	Published	en_US
pubs.volume	3	en_US

Abstract:

Keyword search in relational databases has been extensively studied. Given a relational database, a keyword query finds a set of interconnected tuple structures connected by foreign key references. On rdbms, a keyword query is processed in two steps, namely, candidate networks (CNs) generation and CNs evaluation, where a CN is an sql. In common, a keyword query needs to be processed using over 10,000 sqls. There are several approaches to process a keyword query on rdbms, but there is a limit to achieve high performance on a uniprocessor architecture. In this paper, we study parallel computing keyword queries on a multicore architecture. We give three observations on keyword query computing, namely, a large number of sqls that needs to be processed, high sharing possibility among sqls, and large intermediate results with small number of final results. All make it challenging for parallel keyword queries computing. We investigate three approaches. We first study the query level parallelism, where each sql is processed by one core. We distribute the sqls into different cores based on three objectives, regarding minimizing workload skew, minimizing intercore sharing and maximizing intra-core sharing respectively. Such an approach has the potential risk of load unbalancing through accumulating errors of cost estimation. We then study the operation level parallelism, where each operation of an sql is processed by one core. All operations are processed in stages, where in each stage the costs of operations are reestimated to reduce the accumulated error. Such operation level parallelism still has drawbacks of workload skew when large operations are involved and a large number of cores are used. Finally, we propose a new algorithm that partitions relations adaptively in order to minimize the extra cost of partitioning and at the same time reduce workload skew. We conducted extensive performance studies using two large real datasets, DBLP and IMDB, and we report the efficiency of our approaches in this paper. © 2010 VLDB Endowment.

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/28934