Nearest Neighbor Search in High Dimensional Space

Publication Type: Thesis
Issue Date: 2020
Abstract:
Nearest neighbor search (NNS) in high dimensional space is a fundamental operation in applications from many domains, such as machine learning, databases, multimedia, and computer vision, to name a few. In this thesis, we investigate both exact and approximate NNS in high dimensional space.

For exact NNS, we propose an efficient technique that achieves a significant speedup over the state-of-the-art exact solutions. Specifically, we first propose a novel compounded embedding technique that yields a tight lower bound on the Euclidean distance: each point in the high dimensional space is embedded into a low dimensional space such that the distance between two embedded points lower bounds their distance in the original space. Following the filter-and-verify paradigm, we develop an efficient exact NNS algorithm that prunes a large number of candidates using the new lower bounding technique. Comprehensive experiments on real-world datasets demonstrate the effectiveness and efficiency of the new algorithm.

For approximate NNS, we propose an external memory-based algorithm built on learning to hash. Specifically, we introduce a novel data-sensitive indexing and query processing framework with an emphasis on I/O efficiency, in particular sequential I/O. The proposed index consists of several lists of point IDs, each ordered by the values that a learned hash function assigns to the corresponding data points. The functions are learned from the data and approximately preserve the proximity order of the high dimensional space. We consider two instantiations of the functions, linear and non-linear, both learned from the data with novel objective functions. Comprehensive experiments on six large scale high dimensional datasets show that our methods with the learned index structure substantially outperform the state-of-the-art external memory-based approximate NNS methods in both I/O efficiency and search accuracy.

Finally, although new approximate NNS algorithms are proposed in the literature every year, there has been no comprehensive evaluation and analysis of their performance. We therefore conduct a comprehensive and systematic experimental evaluation of the state-of-the-art approximate methods. Our study (1) is cross-disciplinary, covering 19 algorithms from different research domains as well as from practitioners, and (2) evaluates a diverse range of settings, including 20 datasets, several evaluation metrics, and different query workloads. The experimental results are carefully reported and analyzed to understand the performance of each method. Furthermore, we propose a new method that empirically achieves both high query efficiency and high recall on the majority of the datasets under a wide range of settings.
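To make the filter-and-verify paradigm from the exact-NNS part concrete, here is a minimal sketch. It does not reproduce the thesis's compounded embedding; as a stand-in lower bound it projects points onto orthonormal directions, which is valid because ||P(x - y)|| <= ||x - y|| whenever the rows of P are orthonormal. All names (embed_dim, filter_and_verify, and so on) are illustrative assumptions, not the thesis's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def orthonormal_projection(dim, embed_dim):
    """Random orthonormal rows: distances between embedded points
    lower-bound Euclidean distances in the original space."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, embed_dim)))
    return q.T  # shape (embed_dim, dim), rows orthonormal

def filter_and_verify(data, embedded, P, query):
    """Exact 1-NN: prune with the lower bound, verify the survivors."""
    q_emb = P @ query
    lower_bounds = np.linalg.norm(embedded - q_emb, axis=1)
    best_dist, best_id = np.inf, -1
    # Visit candidates in increasing lower-bound order; once a bound
    # alone exceeds the best verified distance, the rest are pruned.
    for i in np.argsort(lower_bounds):
        if lower_bounds[i] >= best_dist:
            break  # every remaining candidate is filtered out
        d = np.linalg.norm(data[i] - query)  # verify in the full space
        if d < best_dist:
            best_dist, best_id = d, i
    return best_id, best_dist

dim, n, embed_dim = 128, 10_000, 16
data = rng.standard_normal((n, dim)).astype(np.float32)
P = orthonormal_projection(dim, embed_dim)
embedded = data @ P.T
query = rng.standard_normal(dim).astype(np.float32)
print(filter_and_verify(data, embedded, P, query))
```

The early-exit loop is what makes the result exact: a candidate is skipped only when its lower bound already exceeds a verified true distance, so the returned neighbor is guaranteed optimal.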
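The index layout from the approximate-NNS part can be sketched in the same spirit: each list stores point IDs sorted by the value of one hash function, and a query reads a contiguous run of IDs around its own position in each list, which translates to sequential I/O on disk. The thesis learns the functions with novel objectives; as a stand-in this sketch uses top PCA directions for the linear instantiation, and every name here (num_lists, probe_width, and so on) is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n, dim, num_lists, probe_width = 10_000, 64, 4, 200

data = rng.standard_normal((n, dim)).astype(np.float32)

# Stand-in "learned" linear functions: top PCA directions as a proxy
# for functions trained to preserve high-dimensional proximity order.
centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
directions = vt[:num_lists]              # (num_lists, dim)

values = data @ directions.T             # one hash value per point per list
sorted_ids = np.argsort(values, axis=0)  # each column: one sorted ID list

def query_candidates(q):
    """Gather IDs from a contiguous window around the query's
    position in every sorted list (the sequential read)."""
    qv = directions @ q
    cands = set()
    for j in range(num_lists):
        col = values[sorted_ids[:, j], j]          # hash values in list order
        pos = np.searchsorted(col, qv[j])
        lo = max(0, pos - probe_width)
        cands.update(sorted_ids[lo:pos + probe_width, j].tolist())
    return cands

def ann_search(q):
    """Approximate 1-NN: verify only the gathered candidates."""
    cands = np.fromiter(query_candidates(q), dtype=np.int64)
    dists = np.linalg.norm(data[cands] - q, axis=1)
    return cands[np.argmin(dists)]

q = rng.standard_normal(dim).astype(np.float32)
print(ann_search(q))
```

Because each list is consumed as one contiguous range rather than scattered point lookups, the dominant disk cost is sequential rather than random reads, which is the I/O behavior the framework is designed around.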