Consistent weighted sampling made more practical

Publication Type:
Conference Proceeding
Citation:
26th International World Wide Web Conference, WWW 2017, 2017, pp. 1035 - 1043
Issue Date:
2017-01-01
Full metadata record
© 2017 International World Wide Web Conference Committee (IW3C2) Min-Hash, which is widely used for efficiently estimating similarities of bag-of-words represented data, plays an increasingly important role in the era of big data. It has been extended to deal with real-value weighted sets – Improved Consistent Weighted Sampling (ICWS) is considered as the state-of-the-art for this problem. In this paper, we propose a Practical CWS (PCWS) algorithm. We first transform the original form of ICWS into an equivalent expression, based on which we find some interesting properties that inspire us to make the ICWS algorithm simpler and more efficient in both space and time complexities. PCWS is not only mathematically equivalent to ICWS and preserves the same theoretical properties, but also saves 20% memory footprint and substantial computational cost compared to ICWS. The experimental results on a number of real-world text data sets demonstrate that PCWS obtains the same (even better) classification and retrieval performance as ICWS with 1/5 ∼ 1/3 reduced empirical runtime.
Please use this identifier to cite or link to this item: