High-Dimensional Similarity Query Processing for Data Science

Qin, J; Wang, W; Xiao, C; Zhang, Y; Wang, Y

Publisher: ACM
Publication Type: Conference Proceeding
Citation: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2021, pp. 4062-4063
Issue Date: 2021-08-14

Closed Access

Filename	Description	Size	Format
KDD_2021.pdf	Published version	757.78 kB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Qin, J
dc.contributor.author	Wang, W
dc.contributor.author	Xiao, C
dc.contributor.author	Zhang, Y https://orcid.org/0000-0002-2674-1638
dc.contributor.author	Wang, Y
dc.date.accessioned	2021-12-01T23:29:15Z
dc.date.available	2021-12-01T23:29:15Z
dc.date.issued	2021-08-14
dc.identifier.citation	Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2021, pp. 4062-4063
dc.identifier.isbn	9781450383325
dc.identifier.uri	http://hdl.handle.net/10453/152016
dc.description.abstract	Similarity query (a.k.a. nearest neighbor query) processing has been an active research topic for several decades. It is an essential procedure in a wide range of applications (e.g., classification & regression, deduplication, image retrieval, and recommender systems). Recently, representation learning and auto-encoding methods as well as pre-trained models have gained popularity. They basically deal with dense high-dimensional data, and this trend brings new opportunities and challenges to similarity query processing. Meanwhile, new techniques have emerged to tackle this long-standing problem theoretically and empirically. This tutorial aims to provide a comprehensive review of high-dimensional similarity query processing for data science. It introduces solutions from a variety of research communities, including data mining (DM), database (DB), machine learning (ML), computer vision (CV), natural language processing (NLP), and theoretical computer science (TCS), thereby highlighting the interplay between modern computer science and artificial intelligence technologies. We first discuss the importance of high-dimensional similarity query processing in data science applications, and then review query processing algorithms such as cover tree, locality sensitive hashing, product quantization, proximity graphs, as well as recent advancements such as learned indexes. We analyze their strengths and weaknesses and discuss the selection of algorithms in various application scenarios. Moreover, we consider the selectivity estimation of high-dimensional similarity queries, and show how researchers are bringing in state-of-the-art ML techniques to address this problem. We expect that this tutorial will provide an impetus towards new technologies for data science.
dc.language	en
dc.publisher	ACM
dc.relation	http://purl.org/au-research/grants/arc/FT170100128
dc.relation	http://purl.org/au-research/grants/arc/DP210101393
dc.relation.ispartof	Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
dc.relation.ispartof	KDD '21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
dc.relation.isbasedon	10.1145/3447548.3470811
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	High-Dimensional Similarity Query Processing for Data Science
dc.type	Conference Proceeding
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	closed_access	*
dc.date.updated	2021-12-01T23:29:14Z
pubs.publication-status	Published

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/152016