Canonical consistent weighted sampling for real-value weighted min-hash

Publication Type:
Conference Proceeding
Citation:
Proceedings - IEEE International Conference on Data Mining, ICDM, 2017, pp. 1287 - 1292
Issue Date:
2017-01-31
Filename: 07837987.pdf
Description: Published version
Size: 255.55 kB
Format: Adobe PDF
© 2016 IEEE. Min-Hash, as a member of the Locality Sensitive Hashing (LSH) family for sketching sets, plays an important role in the big data era. It is widely used for efficiently estimating similarities of data in bag-of-words representation and has been extended to deal with multi-sets and real-value weighted sets. Improved Consistent Weighted Sampling (ICWS) has been recognized as the state of the art for real-value weighted Min-Hash. However, the algorithmic implementation of ICWS is flawed because it violates the uniformity of the Min-Hash scheme. In this paper, we propose a Canonical Consistent Weighted Sampling (CCWS) algorithm, which not only retains the same theoretical complexity as ICWS but also strictly complies with the definition of Min-Hash. The experimental results demonstrate that the proposed CCWS algorithm runs faster than the state-of-the-art methods while achieving similar classification performance on a number of real-world text data sets.
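For readers unfamiliar with real-value weighted Min-Hash, the following is a minimal sketch of the ICWS baseline that the abstract refers to, written in the form in which ICWS is commonly described (Gamma(2, 1) draws for r and c, a uniform draw for beta). It is not the CCWS algorithm proposed in the paper, whose details are not given in this record. The function names, the dictionary representation of weighted sets, and the per-element string seeding are assumptions made purely for illustration.

```python
import math
import random


def generalized_jaccard(s, t):
    """Exact generalized Jaccard similarity: sum of mins over sum of maxes."""
    keys = set(s) | set(t)
    num = sum(min(s.get(k, 0.0), t.get(k, 0.0)) for k in keys)
    den = sum(max(s.get(k, 0.0), t.get(k, 0.0)) for k in keys)
    return num / den if den > 0 else 0.0


def icws_sketch(weights, num_hashes=256):
    """Illustrative ICWS sketch of a weighted set given as {element: weight}.

    Each of the num_hashes samples is a pair (element, t). The random
    draws depend only on (hash index, element), so different sets see the
    same randomness for the same element (the "consistent" part).
    """
    sketch = []
    for i in range(num_hashes):
        best = None
        for k, w in weights.items():
            if w <= 0:
                continue  # only positive weights take part in sampling
            rng = random.Random(f"{i}:{k}")  # hypothetical seeding scheme
            r = rng.gammavariate(2.0, 1.0)
            c = rng.gammavariate(2.0, 1.0)
            beta = rng.random()
            t = math.floor(math.log(w) / r + beta)
            y = math.exp(r * (t - beta))
            a = c / (y * math.exp(r))
            if best is None or a < best[0]:
                best = (a, k, t)
        sketch.append(None if best is None else (best[1], best[2]))
    return sketch


def estimate_similarity(sketch_a, sketch_b):
    """Fraction of positions where the two sketches hold the same sample."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)
```

A possible usage: build sketches for two weighted bag-of-words vectors with the same `num_hashes`, then compare `estimate_similarity` of the sketches against `generalized_jaccard` of the original vectors. The estimate should approach the exact value as `num_hashes` grows.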
Please use this identifier to cite or link to this item: