String similarity search: A hash-based approach

Publication Type:
Journal Article
IEEE Transactions on Knowledge and Data Engineering, 2018, 30 (1), pp. 170 - 184
Issue Date:
© 2017 IEEE. String similarity search is a fundamental query that has been widely used in DNA sequencing, error-tolerant query autocompletion, and the data cleaning required in databases, data warehouses, and data mining. In this paper, we study string similarity search based on edit distance, which is supported by many database management systems such as Oracle and PostgreSQL. Given the edit distance ed(s, t) between two strings s and t, string similarity search finds every string t in a string database D that is similar to a query string s, that is, every t such that ed(s, t) ≤ τ for a given threshold τ. In the literature, most existing work takes a filter-and-verify approach, where a filter step, using an index built offline for D, is introduced to avoid the high cost of verifying pairs of strings. The two state-of-the-art approaches are prefix filtering and local filtering. We study string similarity search where strings can be either short or long. Our approach supports long strings, which existing approaches do not handle well because of the size of the index they build and the time needed to build it. We propose two new hash-based labeling techniques, named OX label and XX label, for string similarity search. We assign a hash-label, Hs, to a string s, and in the filter step prune dissimilar strings by comparing the two hash-labels, Hs and Ht, of two strings s and t. The key idea is to use the dissimilar bit-patterns between two hash-labels. We discuss our hash-based approaches, analyze their pruning power, and give the algorithms. In our experiments, our hash-based approaches achieve high efficiency while keeping their index size and index construction time one order of magnitude smaller than those of the existing approaches.
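To make the query definition and the filter-and-verify pattern concrete, the following is a minimal Python sketch. It is not the OX label or XX label technique, whose details the abstract does not give. As a stand-in filter it uses the simple length check (if the lengths of s and t differ by more than τ, then ed(s, t) > τ), followed by a standard dynamic-programming verification. The function names `edit_distance_within` and `similarity_search` are hypothetical and chosen only for illustration.

```python
from typing import List


def edit_distance_within(s: str, t: str, tau: int) -> bool:
    """Return True if the edit (Levenshtein) distance ed(s, t) is at most tau."""
    # Filter step (illustrative stand-in, not the paper's hash-labels):
    # each edit changes the length by at most one, so a length gap
    # larger than tau already rules the pair out.
    if abs(len(s) - len(t)) > tau:
        return False

    # Verify step: row-by-row dynamic programming over the edit-distance table.
    prev = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        cur = [i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # delete s[i-1]
                         cur[j - 1] + 1,      # insert t[j-1]
                         prev[j - 1] + cost)  # substitute or match
        # The row minimum never decreases from one row to the next,
        # so once it exceeds tau the final distance must exceed tau too.
        if min(cur) > tau:
            return False
        prev = cur
    return prev[-1] <= tau


def similarity_search(query: str, database: List[str], tau: int) -> List[str]:
    """Return every string t in database with ed(query, t) <= tau."""
    return [t for t in database if edit_distance_within(query, t, tau)]


if __name__ == "__main__":
    D = ["sitting", "kitchen", "mitten", "kit"]
    print(similarity_search("kitten", D, 2))
```

This baseline scans every string in D and builds no index, so it only illustrates what the query returns and where a filter sits in the pipeline. An indexed method such as those discussed in the abstract would replace the length check with a stronger, precomputed filter.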