Efficient Query Processing and Analytics on High Dimensional Data

Publication Type:
Thesis
Issue Date:
2020
Full metadata record
As a fundamental problem in query processing, similarity search has been applied in many fields including multimedia, machine learning, database, recommendation systems and so on. Generally, it will be challengeable when it comes to the high-dimensional space due to “the curse of dimensionality”. Since it would be too expensive to find exact results, approximate solutions have been studied in many papers. There are various distance metrics to evaluate the similarity between the query object and other points in a dataset. In this thesis, we focus on some well-known distance metrics including Euclidean distance and inner product. Except for Euclidean space, we also study query processing on graphs and propose a novel distance metric on graph. The thesis contains four similarity search problems regarding to different distance metrics, which are Approximate Nearest Neighbour, Approximate Inner Product Search, Approximate Furthest Neighbour and Skyline Nearest Neighbour. Given a query point, the four problems all focus on retrieving a set of “similar” points from the dataset. Given a set of 𝘥-dimensional data points, and a query point 𝘲, Approximate Nearest Neighbour Search (ANNS) aims to find the approximate closest object to 𝘲 in the set. More specifically, in this thesis, we focus on 𝘤-ANNS problem, which means given a constant c, the purpose is to find a result whose distance is not larger than 𝘤 times of the exact smallest distance with a certain possibility. Even though this problem has been researched for a long time, there are still some shortage of current algorithms. We studied the existed works, and proposed a novel I/O efficient algorithm to solve 𝘤-approximate nearest neighbour problem in external memory, which can dramatically reduce I/O cost and provide rigorous proof of its correctness. Maximum Inner Product Search (MIPS) is another valuable problem. It returns an object with maximum inner product value to query point 𝘲. There are hundreds of solutions for MIPS but still short of comprehensive evaluation and analysis of these methods’ performance. In this thesis, we chose several state-of-the-art algorithms of MIPS using different techniques, and conducted a set of comprehensive experiments to evaluate their performance fairly. Approximate Furthest Neighbour Search is an opposite problem of Nearest Neighbour Search. It finds the furthest object to query point 𝘲 in a dataset instead of the closest one. Since most recent works for approximate furthest neighbour search in external memory are only suitable for low-dimensional data, we proposed a new I/O efficient technique to achieve a better performance on I/O cost. In addition to Euclidian space, similarity search is also a fundamental problem in other spaces like graphs. Considering real-world applications, the multi-layer graph model is extensively studied to reveal the multi-dimensional relations between the graph entities. In this thesis, we formulated a new problem called skyline nearest neighbor search on multi-layer graphs, and proposed a baseline algorithm, and two optimizations instead of naively adopting the traditional skyline procedure as a subroutine. We also investigated the rule to optimize search order in the algorithm so that further improve the algorithmic efficiency.
Please use this identifier to cite or link to this item: