Canonical Consistent Weighted Sampling for Real-Value Weighted Min-Hash
- Publication Type: Conference Proceeding
- Proceedings of the 2016 IEEE 16th International Conference on Data Mining, 2016, pp. 1287 - 1292
Min-Hash, a member of the Locality Sensitive Hashing (LSH) family for sketching sets, plays an important role in the big data era. It is widely used to efficiently estimate similarities between data in bag-of-words representation, and it has been extended to handle multisets and real-value weighted sets. Improved Consistent Weighted Sampling (ICWS) has been recognized as the state of the art for real-value weighted Min-Hash. However, the algorithmic implementation of ICWS is flawed because it violates the uniformity of the Min-Hash scheme. In this paper, we propose a Canonical Consistent Weighted Sampling (CCWS) algorithm, which not only retains the same theoretical complexity as ICWS but also strictly complies with the definition of Min-Hash. The experimental results demonstrate that the proposed CCWS algorithm runs faster than the state-of-the-art methods while achieving similar classification performance on a number of real-world text data sets.
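For readers unfamiliar with the baseline, the following is a minimal sketch of weighted Min-Hash using the ICWS sampling step as it is commonly described in the literature. It is not the CCWS algorithm proposed in the paper, whose details are not given in this abstract. The function names, the dictionary representation of a weighted set, and the per-element seeding scheme are assumptions made for illustration only. The fraction of matching samples between two sketches approximates the generalized Jaccard similarity (sum of element-wise minima divided by sum of element-wise maxima).

```python
import math
import random


def generalized_jaccard(s, t):
    """Generalized Jaccard similarity: sum of minima over sum of maxima."""
    keys = set(s) | set(t)
    num = sum(min(s.get(k, 0.0), t.get(k, 0.0)) for k in keys)
    den = sum(max(s.get(k, 0.0), t.get(k, 0.0)) for k in keys)
    return num / den if den > 0 else 0.0


def icws_sketch(weights, num_hashes=256):
    """Return num_hashes samples (k, t) for a dict of positive real weights.

    Illustrative baseline only (ICWS as commonly described), not CCWS.
    """
    sketch = []
    for i in range(num_hashes):
        best = None
        for k, w in weights.items():
            if w <= 0:
                continue  # elements with zero weight are never sampled
            # Assumption: seed by (hash index, element) so that every
            # weighted set sees the same random draws for the same element.
            rng = random.Random(f"{i}:{k}")
            r = rng.gammavariate(2.0, 1.0)
            c = rng.gammavariate(2.0, 1.0)
            beta = rng.random()
            t = math.floor(math.log(w) / r + beta)
            y = math.exp(r * (t - beta))
            a = c / (y * math.exp(r))
            if best is None or a < best[0]:
                best = (a, k, t)
        sketch.append((best[1], best[2]) if best else None)
    return sketch


def estimate_similarity(sk1, sk2):
    """Fraction of positions where both sketches hold the same (k, t) sample."""
    matches = sum(1 for x, y in zip(sk1, sk2) if x is not None and x == y)
    return matches / len(sk1)


if __name__ == "__main__":
    s = {"a": 2.0, "b": 1.0, "c": 0.5}
    t = {"a": 1.0, "b": 1.0, "d": 3.0}
    print("exact   :", generalized_jaccard(s, t))
    print("estimate:", estimate_similarity(icws_sketch(s, 512), icws_sketch(t, 512)))
```

Increasing `num_hashes` reduces the variance of the estimate at the cost of a proportionally longer sketch; the string seeding used here is chosen for simplicity and is slow compared with the hashing a practical implementation would use.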