Efficient top-k similarity join processing over multi-valued objects

Zhang, W; Zhan, L; Zhang, Y; Cheema, MA; Lin, X

Publication Type: Journal Article
Citation: World Wide Web, 2014, 17 (3), pp. 285-309
Issue Date: 2014-05-01

Closed Access

Filename	Description	Size	Format
WWWJ_topk_sim.pdf	Published Version	1.03 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Zhang, W https://orcid.org/0000-0001-6572-2600	en_US
dc.contributor.author	Zhan, L	en_US
dc.contributor.author	Zhang, Y https://orcid.org/0000-0002-2674-1638	en_US
dc.contributor.author	Cheema, MA	en_US
dc.contributor.author	Lin, X https://orcid.org/0000-0003-2396-7225	en_US
dc.date.issued	2014-05-01	en_US
dc.identifier.citation	World Wide Web, 2014, 17 (3), pp. 285 - 309	en_US
dc.identifier.issn	1386-145X	en_US
dc.identifier.uri	http://hdl.handle.net/10453/32987
dc.description.abstract	© 2013, Springer Science+Business Media New York. Top-k similarity joins have been extensively studied and used in a wide spectrum of applications such as information retrieval, decision making, spatial data analysis and data mining. Given two sets of objects $\mathcal U$ and $\mathcal V$, a top-k similarity join returns the k most similar pairs of objects from $\mathcal U \times \mathcal V$. In the conventional model of top-k similarity join processing, an object is usually regarded as a point in a multi-dimensional space and similarity is measured by a simple distance metric such as Euclidean distance. However, in many applications an object may be described by multiple values (instances), and the conventional model is not applicable since it does not address the distribution of object instances. In this paper, we study top-k similarity joins over multi-valued objects. We apply two types of quantile-based distance measures, ϕ-quantile distance and ϕ-quantile group-base distance, to explore the relative instance distribution among the multiple instances of objects. Efficient and effective techniques to process top-k similarity joins over multi-valued objects are developed following a filtering-refinement framework. Novel distance-, statistic- and weight-based pruning techniques are proposed. Comprehensive experiments on both real and synthetic datasets demonstrate the efficiency and effectiveness of our techniques.	en_US
dc.relation.ispartof	World Wide Web	en_US
dc.relation.isbasedon	10.1007/s11280-012-0201-5	en_US
dc.subject.classification	Information Systems	en_US
dc.title	Efficient top-k similarity join processing over multi-valued objects	en_US
dc.type	Journal Article
utslib.citation.volume	3	en_US
utslib.citation.volume	17	en_US
utslib.for	0805 Distributed Computing	en_US
utslib.for	0806 Information Systems	en_US
utslib.for	0804 Data Format	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney/Faculty of Science
pubs.organisational-group	/University of Technology Sydney/Faculty of Science/School of Life Sciences
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
utslib.copyright.status	closed_access
pubs.issue	3	en_US
pubs.publication-status	Published	en_US
pubs.volume	17	en_US

Abstract:

© 2013, Springer Science+Business Media New York. Top-k similarity joins have been extensively studied and used in a wide spectrum of applications such as information retrieval, decision making, spatial data analysis and data mining. Given two sets of objects $\mathcal U$ and $\mathcal V$, a top-k similarity join returns the k most similar pairs of objects from $\mathcal U \times \mathcal V$. In the conventional model of top-k similarity join processing, an object is usually regarded as a point in a multi-dimensional space and similarity is measured by a simple distance metric such as Euclidean distance. However, in many applications an object may be described by multiple values (instances), and the conventional model is not applicable since it does not address the distribution of object instances. In this paper, we study top-k similarity joins over multi-valued objects. We apply two types of quantile-based distance measures, ϕ-quantile distance and ϕ-quantile group-base distance, to explore the relative instance distribution among the multiple instances of objects. Efficient and effective techniques to process top-k similarity joins over multi-valued objects are developed following a filtering-refinement framework. Novel distance-, statistic- and weight-based pruning techniques are proposed. Comprehensive experiments on both real and synthetic datasets demonstrate the efficiency and effectiveness of our techniques.
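
The abstract names the ϕ-quantile distance and a filtering-refinement framework without defining them here, so the following Python sketch is illustrative only and rests on stated assumptions. It assumes that every instance carries a uniform weight and that the ϕ-quantile distance between two objects is the smallest pairwise instance distance at which the cumulative pair weight reaches ϕ. The filter step uses a plain bounding-rectangle lower bound, which stands in for (and is not) the paper's distance, statistic and weight based pruning rules. All function names (`quantile_distance`, `mbr`, `mbr_min_dist`, `topk_join`) are hypothetical.

```python
import heapq
import math
from itertools import product


def quantile_distance(u, v, phi):
    """Assumed phi-quantile distance between two multi-valued objects.

    u and v are lists of instances (coordinate tuples) with uniform weights;
    phi is expected to lie in (0, 1].
    """
    dists = sorted(math.dist(a, b) for a, b in product(u, v))
    # Smallest rank whose cumulative (uniform) pair weight reaches phi.
    rank = max(1, math.ceil(phi * len(dists)))
    return dists[rank - 1]


def mbr(obj):
    """Minimum bounding rectangle of an object's instances: (lows, highs)."""
    return tuple(map(min, zip(*obj))), tuple(map(max, zip(*obj)))


def mbr_min_dist(box_a, box_b):
    """Smallest possible Euclidean distance between points of two rectangles."""
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    gaps = (max(la - hb, lb - ha, 0.0)
            for la, ha, lb, hb in zip(lo_a, hi_a, lo_b, hi_b))
    return math.sqrt(sum(g * g for g in gaps))


def topk_join(U, V, k, phi):
    """Return the k closest (distance, i, j) pairs, sorted by distance."""
    boxes_u = [mbr(u) for u in U]
    boxes_v = [mbr(v) for v in V]
    heap = []  # max-heap via negated distances; holds the best k found so far
    for i, j in product(range(len(U)), range(len(V))):
        # Filtering: the rectangle distance lower-bounds every instance-pair
        # distance, hence also the quantile distance, so the pair can be
        # skipped when it cannot beat the current k-th best.
        if len(heap) == k and mbr_min_dist(boxes_u[i], boxes_v[j]) >= -heap[0][0]:
            continue
        # Refinement: compute the exact quantile distance for the survivor.
        d = quantile_distance(U[i], V[j], phi)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i, j))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, i, j))
    return sorted((-nd, i, j) for nd, i, j in heap)
```

A call such as `topk_join(U, V, k=2, phi=0.5)` takes two lists of objects, each object being a list of coordinate tuples, and returns tuples of the form `(distance, index_in_U, index_in_V)`. The nested loop still visits every object pair; the filter only avoids the more expensive exact distance computation.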

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/32987