Consistent Weighted Sampling Made More Practical

Min-Hash, which is widely used for efficiently estimating similarities of bag-of-words represented data, plays an increasingly important role in the era of big data. It has been extended to deal with real-value weighted sets, and Improved Consistent Weighted Sampling (ICWS) is considered the state of the art for this problem. In this paper, we propose a Practical CWS (PCWS) algorithm. We first transform the original form of ICWS into an equivalent expression, based on which we find some interesting properties that inspire us to make the ICWS algorithm simpler and more efficient in both space and time complexities. PCWS is not only mathematically equivalent to ICWS and preserves the same theoretical properties, but also saves 20% of the memory footprint and substantial computational cost compared to ICWS. The experimental results on a number of real-world text data sets demonstrate that PCWS obtains the same (or even better) classification and retrieval performance as ICWS while reducing empirical runtime by 1/5 to 1/3.


INTRODUCTION
Nowadays, data are growing explosively on the Web. In 2008, Google processed 20PB of data every day [23]; in 2012, Google received more than 2 million search queries per minute, and by 2014 this number had more than doubled [9,27]. Social networking services also have to face the data explosion challenge. More than 500TB of data need to be processed in Facebook every day [28] and 7 million check-in records are produced in Foursquare every day [30]. Big data have been driving data mining research in both academia and industry [7,22]. A fundamental research problem that underpins many high-level applications is to efficiently compute similarities (or distances) of data. Based on the data similarity, one can further conduct information retrieval, classification, and many other data mining tasks. However, standard similarity computation cannot cope with big data due to its "3V" nature (volume, velocity and variety). For example, in text mining, it is intractable to enumerate the complete feature set (e.g., over $10^8$ elements in the case of 5-grams in the original data [22]). Therefore, it is urgent to develop efficient yet accurate similarity estimation algorithms.
A typical solution to the aforementioned problem is to approximate data similarities using a family of Locality Sensitive Hashing (LSH) techniques [12]. By adopting a collection of hash functions to map similar objects to the same hash code with higher probability than dissimilar ones, LSH is able to approximate certain similarity (or distance) measures. One can thus efficiently and, in many cases, unbiasedly calculate the similarity (or distances) between objects. Many LSH schemes have been successively proposed, e.g., Min-Hash for estimating the Jaccard similarity [1], Sim-Hash for estimating angle-based distance [3,20], and LSH with p-stable distribution for estimating lp distance [6]. In particular, Min-Hash has been widely used for approximating the similarity of documents which are usually represented as sets or bags-of-words. Recently, some variations of Min-Hash have further improved its efficiency. For example, b-bit Min-Hash [16] and odd sketches [21] remarkably improve the storage efficiency by storing only b bits of each hash value; while one-permutation Min-Hash [17,25] employs only one permutation to reduce the computational workload.
Although the set similarity can be efficiently estimated based on the standard Min-Hash scheme and its variations, the target data are restricted to binary sets. However, in most real-world scenarios, weighted sets are more common. For example, a tf-idf value is assigned to each word to represent its importance in a collection of documents. As Min-Hash treats all the elements in a set equally for computing the Jaccard similarity, it cannot handle weighted sets properly. To address the limitation, weighted Min-Hash algorithms have been explored to approximate the generalized Jaccard similarity [11], which is used for measuring the similarity of weighted sets. Roughly speaking, existing works on weighted Min-Hash algorithms can be classified into quantization-based and sampling-based approaches.
Quantization-based methods quantize each weighted element into a number of distinct and equal-sized subelements. The resulting subelements are treated equally just like independent elements in the universal set. One can simply apply the standard Min-Hash scheme to the collection of subelements. Although in [8] an improved integer-value weighted Min-Hash algorithm is proposed to avoid computing the hash value for each individual subelement, the algorithm is still inefficient for dealing with real-value weighted sets because each weight must be transformed into integer weight by multiplying a large constant, which dramatically expands the universal set with numerous quantized subelements.
To avoid computing hash values for all the subelements, researchers have resorted to sampling-based methods. In [24], a uniform sampling algorithm is proposed, which strictly complies with the definition of the generalized Jaccard similarity. However, this algorithm requires knowing the upper bound of each element in the universal set in advance, which makes it impractical in real-world applications. In [5], a sampling method is derived through the distribution of the minimum of a set of random variables for integer weighted sets, which results in a biased estimator. Later, Consistent Weighted Sampling (CWS) [19,10] and Improved CWS (ICWS) [13] are proposed to remarkably improve the efficiency of weighted Min-Hash by introducing the notion of "active index" [8] into real-value weighted elements (the "active indices" on a weighted element are independently sampled as a sequence of subelements whose hash values monotonically decrease [29]). Thus far, ICWS [13], as an efficient and unbiased estimator of the generalized Jaccard similarity, is recognized as the state-of-the-art. Recently, [15] approximates ICWS by simply discarding one component of the Min-Hash values.
The mystery of ICWS [13] is that it implicitly constructs an exponential distribution for each real-value weighted element using only two "active indices". The minimum of the collection of exponential variables of all the elements yields one of the two components of the real-value weighted Min-Hash code (the other component can be easily obtained), while complying with the uniformity of the Min-Hash scheme. This enables ICWS to produce a Min-Hash code for a real-value weight with computational complexity that is independent of the number of quantized subelements.
In this paper, we aim to further improve the efficiency of ICWS. By transforming the original form of ICWS into a different but mathematically equivalent expression, we find some interesting properties that inspire us to make ICWS more practical. Thus, we propose the Practical CWS (PCWS) algorithm, which is simpler and more efficient in both space and time complexities compared to ICWS. Furthermore, we theoretically prove the uniformity and consistency of the PCWS algorithm for real-value weighted Min-Hash. We conduct extensive empirical tests on a number of real-world text data sets to compare the proposed PCWS algorithm and the state-of-the-art methods for classification and retrieval. In summary, our contributions are three-fold:
1. We transform the original form of ICWS [13] into a simpler and easy-to-understand expression, uncovering the working mechanism of ICWS.
2. We propose the PCWS algorithm, which is mathematically equivalent to ICWS and preserves the same theoretical properties as ICWS.
3. PCWS has the same (even better) classification and retrieval performance as ICWS while saving 20% memory footprint and 1/5 ∼ 1/3 empirical runtime, which makes it more practical.
The remainder of the paper is organized as follows: Section 2 introduces the definitions of (generalized) Jaccard similarity, weighted Min-Hash, and ICWS [13]. Then, in Section 3 we revisit ICWS and present its equivalent version, based on which we propose our PCWS algorithm and give its theoretical analysis. The experimental results are presented in Section 4 and the paper is concluded in Section 5.

PRELIMINARIES
In this section, we first give some notations which will be used throughout the paper. Next we will introduce the Min-Hash scheme and present the state-of-the-art method, ICWS [13], for weighted Min-Hash.
Given a universal set $U = (U_1, U_2, \cdots, U_n)$ and its subset $S \subseteq U$, if for any element $S_k \in S$, $S_k = 1$ or $S_k = 0$, then we call $S$ a binary set; if for any element $S_k \in S$, $S_k \ge 0$, then we call $S$ a weighted set. A typical example of a universal set is the dictionary of a collection of documents, where each element corresponds to a word.
For hashing a binary set $S$, a Min-Hash scheme assigns a hash value to each element $S_k$, $h: k \mapsto v_k$. By contrast, for hashing a weighted set, there is a different form of hash function: $h: (k, y_k) \mapsto v_{k,y}$, where $y_k \in [0, S_k]$. A random permutation (or sampling) process returns the first (or uniformly selected) $k$ from a binary set or $(k, y_k)$ from a weighted set. If the set is sampled $D$ times, we obtain a fingerprint with $D$ hash values. Min-Hash [1] is an approximate algorithm for computing the Jaccard similarity of two sets. It is proved that the probability that two sets, $S$ and $T$, generate the same Min-Hash value (hash collision) is exactly equal to the Jaccard similarity of the two sets:
$$\Pr\big[\min(h(S)) = \min(h(T))\big] = J(S, T) = \frac{|S \cap T|}{|S \cup T|}.$$
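The collision property of Min-Hash can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: the helper names (`hash_value`, `minhash_fingerprint`, `estimate_jaccard`) are hypothetical, and a seeded pseudo-random generator stands in for each random hash function.

```python
import random

def hash_value(item, d, seed=0):
    """Pseudo-random value in [0, 1) for `item` under the d-th hash function."""
    # Assumption: a string-seeded generator plays the role of a random hash.
    return random.Random(f"{seed}:{d}:{item}").random()

def minhash_fingerprint(items, num_hashes, seed=0):
    """For each of the D hash functions, keep the element with the minimum hash value."""
    return [min(items, key=lambda k: hash_value(k, d, seed))
            for d in range(num_hashes)]

def estimate_jaccard(s, t, num_hashes=400, seed=0):
    """Fraction of hash functions on which the two sets collide."""
    fp_s = minhash_fingerprint(s, num_hashes, seed)
    fp_t = minhash_fingerprint(t, num_hashes, seed)
    collisions = sum(a == b for a, b in zip(fp_s, fp_t))
    return collisions / num_hashes

if __name__ == "__main__":
    s, t = set(range(60)), set(range(30, 90))
    # The exact Jaccard similarity is 30 / 90; the estimate should be close to it.
    print(estimate_jaccard(s, t), len(s & t) / len(s | t))
```

Both sets must be hashed with the same hash functions (here, the same `seed`), otherwise collisions carry no information about the overlap.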

The Min-Hash Scheme
The Jaccard similarity is simple and effective in many applications, especially for document analysis based on the bag-of-words representations [26]. From the above Min-Hash scheme we can see that all the elements in $U$ are considered equally because all the elements can be mapped to the minimum hash value with equal probability. To sample a weighted set based on the standard Min-Hash scheme, the weights, which indicate the different importance of each element, are simply replaced with 1 or 0, which in turn leads to serious information loss.
In most real-world scenarios, weighted sets are more commonly seen than binary sets. For example, a document is usually represented as a tf-idf set. In order to reasonably compute the similarity of two weighted sets, the generalized Jaccard similarity was introduced in [11]. Considering two weighted sets, $S$ and $T$, the generalized Jaccard similarity is defined as
$$\mathrm{generalizedJ}(S, T) = \frac{\sum_k \min(S_k, T_k)}{\sum_k \max(S_k, T_k)}.$$
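The generalized Jaccard similarity of [11], i.e., the ratio of the summed element-wise minima to the summed element-wise maxima of the weights, can be computed directly for small examples. The sketch below assumes weighted sets are stored as dictionaries from element to non-negative weight; the function name is hypothetical.

```python
def generalized_jaccard(s, t):
    """Generalized Jaccard similarity of two weighted sets.

    s, t: dicts mapping element -> non-negative weight (missing means 0).
    """
    keys = set(s) | set(t)
    numerator = sum(min(s.get(k, 0.0), t.get(k, 0.0)) for k in keys)
    denominator = sum(max(s.get(k, 0.0), t.get(k, 0.0)) for k in keys)
    # Two empty sets have no defined similarity; 0.0 is returned by convention here.
    return numerator / denominator if denominator > 0 else 0.0

if __name__ == "__main__":
    s = {"a": 2.0, "b": 1.0}
    t = {"a": 1.0, "c": 3.0}
    # min-sum = 1 + 0 + 0, max-sum = 2 + 1 + 3
    print(generalized_jaccard(s, t))
```

With 0/1 weights the function reduces to the ordinary Jaccard similarity, which is why the weighted Min-Hash schemes discussed next are regarded as generalizations of Min-Hash.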

Improved CWS
Based on the generalized Jaccard similarity, some weighted Min-Hash algorithms have been proposed [8,19,13,24]. To the best of our knowledge, Improved Consistent Weighted Sampling (ICWS) [13] is remarkable in both theory and practice, and is considered the state-of-the-art method for weighted Min-Hash [15]. It follows the CWS scheme [19], in which a subelement $(k, y_k)$ with $0 \le y_k \le S_k$ is sampled from a weighted set $S$ under two conditions:
• Uniformity: The subelement $(k, y_k)$ should be uniformly sampled from $\bigcup_k (\{k\} \times [0, S_k])$, i.e., the probability of selecting the $k$-th element is proportional to $S_k$, and $y_k$ is uniformly distributed on $[0, S_k]$.
• Consistency: Given two non-empty weighted sets, $S$ and $T$, if $\forall k$, $T_k \le S_k$, and a subelement $(k, y_k)$ is selected from $S$ and satisfies $y_k \le T_k$, then $(k, y_k)$ will also be selected from $T$.
CWS has the following property:
$$\Pr[\mathrm{CWS}(S) = \mathrm{CWS}(T)] = \mathrm{generalizedJ}(S, T).$$
In the following we briefly review how ICWS is derived to meet the two conditions of CWS, uniformity and consistency. First, we note that the exponential distribution has an important and interesting property about the distribution of the minimum of exponential random variables: let $X_1, \ldots, X_n$ be independent exponentially distributed random variables with parameters $\lambda_1, \ldots, \lambda_n$, respectively; then we have
$$\Pr(X_k = \min\{X_1, \ldots, X_n\}) = \frac{\lambda_k}{\lambda_1 + \cdots + \lambda_n}. \tag{1}$$
Therefore, one can employ this property to implement uniformity for CWS: if each hash value $a_k$ of the $k$-th element is drawn from an exponential distribution parameterized with its corresponding weight, i.e., $a_k \sim \mathrm{Exp}(S_k)$, the minimum hash value $a_k$ will be sampled in proportion to $S_k$. In addition, note that $k$ and $y_k$ are mutually independent, which means that $a_k$ and $y_k$ are mutually independent as well. Formally, the condition of uniformity in terms of $(y_k, a_k)$ can be expressed as
$$\mathrm{pdf}(y_k, a_k) = \frac{1}{S_k}\left(S_k e^{-S_k a_k}\right), \tag{2}$$
where $y_k \sim \mathrm{Uniform}(0, S_k)$ and $a_k \sim \mathrm{Exp}(S_k)$. In order to make $y_k$ uniformly distributed in $[0, S_k]$, ICWS employs the following equation
$$\ln y_k = \ln S_k - r_k b_k, \tag{3}$$
where $r_k \sim \mathrm{Gamma}(2, 1)$ and $b_k \sim \mathrm{Uniform}(0, 1)$. Eq. (3) is used to prove the uniformity in [13]. However, in the algorithmic implementation of ICWS, the above equation is replaced with the following equation
$$\ln y_k = r_k\left(\left\lfloor \frac{\ln S_k}{r_k} + \beta_k \right\rfloor - \beta_k\right), \tag{4}$$
where $\beta_k \sim \mathrm{Uniform}(0, 1)$. It is indicated in [13] that via Eq. (4), $\ln y_k$ is sampled from the same uniform distribution in $[\ln S_k - r_k, \ln S_k]$ as that sampled through Eq. (3). The floor function and the uniform random variable $\beta_k$ in Eq. (4) ensure that a fixed $y_k$ is sampled in an interval of length $r_k$. Obviously, Eq. (4) gives rise to consistency because small changes in $S_k$ cannot affect the value of $y_k$.
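The property of the minimum of independent exponential variables, on which the uniformity argument rests, can be checked empirically. The following is a minimal Monte Carlo sketch; the function name, trial count and seed are illustrative choices rather than anything prescribed by the paper.

```python
import random

def argmin_frequencies(rates, trials=100_000, seed=0):
    """Empirical frequency with which each X_k ~ Exp(rates[k]) is the minimum."""
    rng = random.Random(seed)
    counts = [0] * len(rates)
    for _ in range(trials):
        draws = [rng.expovariate(lam) for lam in rates]
        counts[draws.index(min(draws))] += 1
    return [c / trials for c in counts]

if __name__ == "__main__":
    rates = [1.0, 2.0, 3.0]
    expected = [lam / sum(rates) for lam in rates]
    # The empirical frequencies should approach rate_k / sum of rates.
    print(argmin_frequencies(rates), expected)
```

Reading the rates as element weights $S_k$, the element that receives the smallest exponential draw is selected with probability proportional to its weight, which is the uniformity requirement on $k$.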
In order to sample $k$ in proportion to its corresponding weight $S_k$, in addition to an "active index", $y_k$, ICWS [13] introduces a second "active index", $z_k \in [S_k, +\infty)$, and builds the relationship among $y_k$, $z_k$ and $r_k$:
$$r_k = \ln z_k - \ln y_k. \tag{5}$$
Furthermore, by using the two "active indices", ICWS implicitly constructs an exponential distribution parameterized with $S_k$, i.e., $a_k \sim \mathrm{Exp}(S_k)$:
$$a_k = \frac{c_k}{z_k}, \tag{6}$$
where $c_k \sim \mathrm{Gamma}(2, 1)$. Essentially, ICWS carries out the sampling process as follows. It first samples $y_k$ using Eq. (7), which is derived from Eq. (4):
$$y_k = \exp\left(r_k\left(\left\lfloor \frac{\ln S_k}{r_k} + \beta_k \right\rfloor - \beta_k\right)\right). \tag{7}$$
Then the sampled $y_k$, as an independent variable, is fed into Eq. (8), which is derived from Eqs. (5) and (6),
$$a_k = \frac{c_k}{y_k \exp(r_k)}, \tag{8}$$
and outputs a hash value conforming to the exponential distribution parameterized with the corresponding weight $S_k$.
In ICWS [13], a Min-Hash code $(y_k, a_k)$ for a weighted element $S_k$ is calculated using the hash functions, Eqs. (7) and (8), respectively. The hash functions already look simple, yet they remain mysterious. Why does Eq. (8) work? Can we further improve the efficiency of ICWS?

PRACTICAL CWS
In this section we propose a more practical algorithm for consistent weighted sampling. First, we revisit ICWS [13] and transform its original form Eq. (8) for sampling a k into an equivalent version. Based on this equivalent form, we find some interesting properties which inspire us to make the ICWS algorithm simpler and more efficient in both space and time complexities. We also demonstrate that the proposed PCWS algorithm still complies with the uniformity and consistency of CWS [19].

ICWS Revisited
Recall that in ICWS [13] one needs to sample two Gamma variables, $r_k \sim \mathrm{Gamma}(2, 1)$ and $c_k \sim \mathrm{Gamma}(2, 1)$, in Eqs. (7) and (8) to directly generate $y_k$ and $a_k$. In fact, the simplest programming implementation to generate a random variable $x$ from $\mathrm{Gamma}(2, 1)$ is to first sample two uniform random variables, $u_1, u_2 \sim \mathrm{Uniform}(0, 1)$, and then apply the transformation $x = -\ln(u_1 u_2)$. Thus if we represent the two independent Gamma variables $r_k$ and $c_k$ in terms of uniform random variables, $r_k = -\ln(u_{k1} u_{k2})$ and $c_k = -\ln(v_{k1} v_{k2})$, we can obtain an equivalent version of Eqs. (7) and (8):
$$y_k = \exp\left(-\ln(u_{k1} u_{k2})\left(\left\lfloor \frac{\ln S_k}{-\ln(u_{k1} u_{k2})} + \beta_k \right\rfloor - \beta_k\right)\right), \tag{9}$$
$$a_k = \frac{-\ln(v_{k1} v_{k2})}{y_k\, u_{k1}^{-1} u_{k2}^{-1}}, \tag{10}$$
where $u_{k1}, u_{k2}, \beta_k, v_{k1}, v_{k2} \sim \mathrm{Uniform}(0, 1)$. Now, the ICWS algorithm in [13] can be presented equivalently in Algorithm 1 for producing $D$ Min-Hash codes. Algorithm 1 is the standard implementation of ICWS. If we further slightly rearrange Eq. (10) as
$$a_k = \frac{-\ln(v_{k1} v_{k2})\, u_{k2}}{y_k\, u_{k1}^{-1}}, \tag{11}$$
we can find two interesting properties: (1) the denominator, $y_k u_{k1}^{-1}$, is essentially an unbiased estimator of the weight $S_k$; and (2) the numerator, $-\ln(v_{k1} v_{k2})\, u_{k2}$, actually follows a standard exponential distribution.
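The ICWS procedure expressed through uniform random variables can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the function name `icws_hash`, the dictionary representation of a weighted set, and the string-seeded generator used to give each pair (d, k) its own globally shared random variables are all hypothetical choices.

```python
import math
import random

def icws_hash(weights, d, seed=0):
    """One ICWS Min-Hash code (k*, y_k*) for the d-th sample.

    weights: dict mapping element -> weight; only positive weights are used,
    and at least one positive weight is assumed.
    """
    best = None
    for k, s_k in weights.items():
        if s_k <= 0:
            continue
        # Same (seed, d, k) -> same random variables in every set.
        rng = random.Random(f"{seed}:{d}:{k}")
        # 1 - random() lies in (0, 1], which keeps the logarithms finite.
        u1, u2, beta, v1, v2 = (1.0 - rng.random() for _ in range(5))
        r = -math.log(u1 * u2)                    # Gamma(2, 1) from two uniforms
        c = -math.log(v1 * v2)                    # second Gamma(2, 1) variable
        t = math.floor(math.log(s_k) / r + beta)  # quantization level
        y = math.exp(r * (t - beta))              # first "active index"
        a = c / (y * math.exp(r))                 # hash value c / z with z = y * exp(r)
        if best is None or a < best[0]:
            best = (a, k, y)
    return best[1], best[2]

if __name__ == "__main__":
    s = {"a": 2.0, "b": 1.0}
    t = {"a": 1.0, "c": 3.0}
    D = 2000
    hits = sum(icws_hash(s, d) == icws_hash(t, d) for d in range(D))
    # The collision rate estimates the generalized Jaccard similarity (1/6 here).
    print(hits / D)
```

Five uniform variables are consumed per element and per sample, which is the count that the complexity discussion below refers to.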
The first property can be easily verified based on the uniformity $y_k = u_{k1} S_k$, where $u_{k1} \sim \mathrm{Uniform}(0, 1)$, because the derivation of the ICWS algorithm [13] is based on $y_k = u_{k1} S_k$. However, in its algorithm, ICWS [13] samples $y_k$ through the floor-based expression of Eq. (4), so $y_k u_{k1}^{-1}$ is not exactly equivalent to $S_k$ but its unbiased estimator:
$$\hat{S}_k = y_k u_{k1}^{-1}, \qquad \mathbb{E}(\hat{S}_k) = S_k. \tag{12}$$
To see the second property, let
$$m_k = -\ln(v_{k1} v_{k2})\, u_{k2}. \tag{13}$$
Substituting Eqs. (12) and (13) into Eq. (11) we obtain $a_k = m_k / \hat{S}_k$. Through the Jacobian transformation, we have $\mathrm{pdf}(m_k) = e^{-m_k}$, i.e., $m_k \sim \mathrm{Exp}(1)$. Thus far, we have found that the mystery of ICWS is to implicitly employ an exponential distribution parameterized with $\hat{S}_k$ to sample $a_k$, that is,
$$a_k \sim \mathrm{Exp}(\hat{S}_k).$$
In this case, the probability of the $k$-th element being sampled is equal to the probability of the $k$-th element being assigned the minimum exponential random variable $a_k$, where the probability is in proportion to its weight: $\Pr(a_k = \min\{a_1, \cdots, a_n\}) = \hat{S}_k / \sum_k \hat{S}_k$.
In addition to understanding the underlying mechanism of ICWS, one may ask how this insight can be used to improve the ICWS algorithm. In the following, we make use of the uncovered property Eq. (13) to reduce both the time and space complexities of ICWS. Recall that in Eq. (11), the numerator employs three independent uniform random variables to produce a sample in the form of $-\ln(v_1 v_2)\, u_2$, which is proved to follow a standard exponential distribution, $m_k \sim \mathrm{Exp}(1)$ in Eq. (13). Obviously, it is costly in terms of both time and space to produce $-\ln(v_1 v_2)\, u_2$. Since $-\ln x_k \sim \mathrm{Exp}(1)$ for $x_k \sim \mathrm{Uniform}(0, 1)$, we can adopt $-\ln x_k$ instead of $-\ln(v_1 v_2)\, u_2$ to achieve the same goal.
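The claim that the product of a Gamma(2, 1) variable and an independent Uniform(0, 1) variable is standard exponential can also be verified from the tail probability, without computing a Jacobian. The derivation below is a short sketch; the symbols $g = -\ln(v_1 v_2)$, $u = u_2$, $s$, $t$ and $w$ are introduced here only for illustration.

```latex
\begin{align*}
\Pr(g > s) &= (1 + s)\,e^{-s}, \qquad s \ge 0, \quad g \sim \mathrm{Gamma}(2, 1),\\
\Pr(g u > t) &= \int_0^1 \Pr\!\left(g > \tfrac{t}{u}\right) du
             = \int_0^1 \left(1 + \tfrac{t}{u}\right) e^{-t/u}\, du \\
            &= t \int_t^{\infty} \frac{(1 + w)\,e^{-w}}{w^{2}}\, dw
               \qquad \text{(substituting } w = t/u\text{)} \\
            &= t \left[ -\frac{e^{-w}}{w} \right]_{t}^{\infty} = e^{-t}, \qquad t > 0.
\end{align*}
```

The last step uses $\frac{d}{dw}\left(-e^{-w}/w\right) = (1 + w)\,e^{-w}/w^{2}$. Since $\Pr(g u > t) = e^{-t}$ is the tail of $\mathrm{Exp}(1)$, a single draw $-\ln x$ with $x \sim \mathrm{Uniform}(0, 1)$ has the same distribution as the three-variable numerator.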
To this end, we can simply replace Eq. (11) with
$$a_k = \frac{-\ln x_k}{y_k\, u_{k1}^{-1}}$$
and obtain the proposed PCWS algorithm (see Algorithm 2). The only two differences between ICWS and PCWS are the following:
• ICWS has to generate one more uniform random variable than PCWS for each element $k$ (Line 5).
• ICWS has a relatively more complicated expression than PCWS to calculate $a_k$ (Line 12).
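The PCWS sampling step described above can be sketched in the same style. This is an illustrative sketch rather than the paper's Algorithm 2: the function name `pcws_hash`, the dictionary representation and the string-seeded generator are assumptions, and only four uniform variables are drawn per element.

```python
import math
import random

def pcws_hash(weights, d, seed=0):
    """One PCWS Min-Hash code (k*, y_k*) for the d-th sample.

    weights: dict mapping element -> weight; only positive weights are used,
    and at least one positive weight is assumed.
    """
    best = None
    for k, s_k in weights.items():
        if s_k <= 0:
            continue
        # Same (seed, d, k) -> same random variables in every set.
        rng = random.Random(f"{seed}:{d}:{k}")
        # Four uniforms in (0, 1] instead of the five needed by ICWS.
        u1, u2, beta, x = (1.0 - rng.random() for _ in range(4))
        r = -math.log(u1 * u2)                    # Gamma(2, 1) from two uniforms
        t = math.floor(math.log(s_k) / r + beta)  # quantization level
        y = math.exp(r * (t - beta))              # "active index" y_k
        a = -math.log(x) * u1 / y                 # -ln(x) / (y / u1)
        if best is None or a < best[0]:
            best = (a, k, y)
    return best[1], best[2]

def pcws_fingerprint(weights, num_hashes, seed=0):
    """Fingerprint of D = num_hashes codes for one weighted set."""
    return [pcws_hash(weights, d, seed) for d in range(num_hashes)]

if __name__ == "__main__":
    doc = {"hash": 0.7, "weighted": 1.9, "sampling": 0.4}
    print(pcws_fingerprint(doc, 4))
```

Compared with the ICWS hash value, the only changes are the single uniform `x` replacing `v1`, `v2`, and the shorter expression for `a`.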
Complexity: In a programming implementation, ICWS requires sampling five global uniform random variables for each element $k$ (i.e., $u_{k1}$, $u_{k2}$, $\beta_k$, $v_{k1}$, $v_{k2}$), while PCWS only requires sampling four (i.e., $u_{k1}$, $u_{k2}$, $\beta_k$, $x_k$). From Algorithm 1 and Algorithm 2 it is easy to see that the space complexity of PCWS is $O(4nD)$ while the space complexity of ICWS is $O(5nD)$, where $n$ denotes the size of the universal set and $D$ the number of samples (number of Min-Hashes). Although the uniform random variables can be sampled off-line, it is worth noting that these variables have to be cached in memory during the hashing process. In text mining applications, the size of the universal set (number of features) can easily reach $10^7$; if we adopt $10^3$ samples, an additional memory footprint of $10^{10}$ floats has to be allocated.

Analysis
In this subsection, we prove that our PCWS algorithm generates $(y_k, a_k)$ which indeed satisfy the uniformity and consistency of the CWS scheme [19].

Uniformity
We drop the element index $k$ for conciseness. Following ICWS [13], the random variable $\ln y = r\left(\left\lfloor \frac{\ln S}{r} + \beta \right\rfloor - \beta\right)$, where $r = -\ln(u_1 u_2) \sim \mathrm{Gamma}(2, 1)$ and $\beta \sim \mathrm{Uniform}(0, 1)$, shares the same distribution as $\ln y = \ln S - r b$, where $b \sim \mathrm{Uniform}(0, 1)$. In both cases, $\ln y$ is uniformly sampled from $[\ln S - r, \ln S]$. Since our PCWS algorithm still employs two "active indices", $y$ and $z$, as ICWS does, we have $r = \ln z - \ln y$ for the sake of the proof of uniformity. The distribution $\mathrm{pdf}(y, z, a)$, defined for $y \le S$, $z \ge S$ and $a > 0$, can be obtained by transforming the distributions of the random variables as
$$\mathrm{pdf}(y, z, a) = \mathrm{pdf}(r, b, x)\left|\det \frac{\partial(r, b, x)}{\partial(y, z, a)}\right|,$$
where $r = \ln z - \ln y$, $b = \frac{\ln S - \ln y}{\ln z - \ln y}$, and $x = \exp(-a y u_1^{-1})$. Recall that $r$, $b$, $x$ are mutually independent and have the following probability density functions: $\mathrm{pdf}(r) = r e^{-r}$ and $\mathrm{pdf}(b) = \mathrm{pdf}(x) = 1$. By computing the Jacobian determinant, we obtain
$$\mathrm{pdf}(y, z, a) = \frac{1}{z^2}\,(y u_1^{-1})\, e^{-(y u_1^{-1}) a}.$$
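Marginalizing out $z$ over $[S, +\infty)$, where $\int_S^{+\infty} z^{-2}\, dz = 1/S$, gives $\mathrm{pdf}(y, a) = \frac{1}{S}\,(y u_1^{-1})\, e^{-(y u_1^{-1}) a}$. This $\mathrm{pdf}(y, a)$ is still conditioned on $u_1$. We can take an expectation over the distribution of $y u_1^{-1}$ and, according to Eq. (12), obtain $\mathbb{E}(y u_1^{-1}) = S$. Therefore, we have
$$\mathrm{pdf}(y, a) = \frac{1}{S}\left(S e^{-S a}\right) = \mathrm{pdf}(y)\,\mathrm{pdf}(a).$$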
It is easy to see that $y$ is uniformly distributed in $[0, S]$ and $a$ complies with an exponential distribution parameterized with $S$, that is, $a \sim \mathrm{Exp}(S)$; meanwhile, $y$ and $a$ are independent. For all the weights $\{S_1, \ldots, S_n\}$ in the weighted set $S$, there exists a set of exponential distributions parameterized with the corresponding weights. According to Eq. (1), $a_{k^*}$ is the minimum hash value with a probability in proportion to $S_{k^*}$, $\Pr(a_{k^*} = \min_k a_k) = S_{k^*} / \sum_k S_k$. Therefore, $(k^*, y_{k^*})$ is uniformly sampled from $\bigcup_k (\{k\} \times [0, S_k])$.

Consistency
Following ICWS [13], we demonstrate that, for two non-empty weighted sets $S$ and $T$, if $\forall k$, $T_k \le S_k$, and a subelement $(k^*, y_{k^*})$ is sampled from $S$ and satisfies $y_{k^*} \le T_{k^*}$, then $(k^*, y_{k^*})$ will be sampled from $T$.
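Considering an element $k^*$, let $r_{k^*} = -\ln(u_{k^*1} u_{k^*2})$ and $t^S_{k^*} = \left\lfloor \frac{\ln S_{k^*}}{-\ln(u_{k^*1} u_{k^*2})} + \beta_{k^*} \right\rfloor$, so that $\ln y^S_{k^*} = r_{k^*}(t^S_{k^*} - \beta_{k^*})$. By the definition of the floor function, $\ln y^S_{k^*} \le \ln S_{k^*} < \ln y^S_{k^*} + r_{k^*}$. Since $y^S_{k^*} \le T_{k^*} \le S_{k^*}$, $\ln T_{k^*}$ lies in the same interval, and thus
$$t^T_{k^*} = \left\lfloor \frac{\ln T_{k^*}}{-\ln(u_{k^*1} u_{k^*2})} + \beta_{k^*} \right\rfloor = t^S_{k^*},$$
which indicates $y^S_{k^*} = y_{k^*} = y^T_{k^*}$. Thus $y^S_{k^*}$ and $y^T_{k^*}$ will be sampled from the $k^*$-th elements of $S$ and $T$, respectively.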
On the other hand, we note that, for any $k$, $a_k$ is essentially a monotonically non-increasing function of $S_k$:
$$a_k = \frac{-\ln x_k}{u_{k1}^{-1} \exp\left(r_k\left(\left\lfloor \frac{\ln S_k}{r_k} + \beta_k \right\rfloor - \beta_k\right)\right)},$$
where $r_k = -\ln(u_{k1} u_{k2})$. Therefore, $\forall k$, $a^T_k \ge a^S_k$ due to $T_k \le S_k$, while $a^T_{k^*} = a^S_{k^*} = \min_k a^S_k$ because of $y^S_{k^*} = y^T_{k^*}$. As a result, $a^T_{k^*} \le a^S_k \le a^T_k$ and in turn $\arg\min_k a^T_k = \arg\min_k a^S_k = k^*$, which demonstrates that $(k^*, y_{k^*})$ is sampled from $S$ and $T$ simultaneously. Thus consistency holds.
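The role of the floor function in the consistency argument can be seen on concrete numbers. The sketch below fixes one pair of random variables (the values of `r` and `beta` are arbitrary illustrative choices, and `sample_y` is a hypothetical helper) and shows that every weight between the sampled $y$ and the original weight yields the same $y$.

```python
import math

def sample_y(weight, r, beta):
    """Floor-based "active index": y = exp(r * (floor(ln(weight) / r + beta) - beta))."""
    t = math.floor(math.log(weight) / r + beta)
    return math.exp(r * (t - beta))

if __name__ == "__main__":
    r, beta = 1.3, 0.4          # fixed random variables shared by both sets
    y_s = sample_y(5.0, r, beta)  # weight S_k = 5.0
    y_t = sample_y(3.0, r, beta)  # smaller weight T_k = 3.0, still above y_s
    y_low = sample_y(2.0, r, beta)  # weight below y_s: the consistency premise fails
    print(y_s, y_t, y_low)
    print(y_s == y_t, y_s == y_low)
```

Here the weights 5.0 and 3.0 fall into the same quantization interval, so they share one $y$, while the weight 2.0 lies below that $y$ and is mapped to the previous interval, a case the consistency condition does not cover.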
In summary, our PCWS algorithm satisfies the two properties, uniformity and consistency, of the CWS scheme. Thus, PCWS is not only an equivalent (but more efficient) algorithm to ICWS but also holds the same theoretical properties as ICWS.

EXPERIMENTAL RESULTS
In this section, we report the performance of our PCWS algorithm and its competitors on five real-world text data sets. In Subsection 4.2, we investigate the effectiveness and efficiency of the compared methods for classification. In Subsection 4.3, we investigate the effectiveness and efficiency of the compared methods for information retrieval.

Experimental Preliminaries
[Figure: results on Realsim; x-axis: Length of Fingerprints; legend: [Gollapudi et.al., 2006], [Chum et.al., 2008], ICWS, [Li, 2015], Min-Hash, [Haeupler et.al., 2014]]
We compare our PCWS algorithm with six state-of-the-art methods: (1) Min-Hash: The standard Min-Hash scheme is applied by simply treating weighted sets as binary sets; (2) [Gollapudi et.al., 2006] [8]: It transforms weighted sets into binary sets by thresholding real-value weights with random samples and then applies the standard Min-Hash scheme (another algorithm is introduced in the same paper, which is however extremely inefficient for real-value weights and thus not reported); (3) [Chum et.al., 2008] [5]: It approximates the generalized Jaccard similarity with a bias for real-value weighted sets even though the derivation is based on integer weighted sets; (4) ICWS [13]: It is introduced in Section 3.1, and is currently the state-of-the-art for weighted Min-Hash in terms of both effectiveness and efficiency; (5) [Li, 2015] [15]: It approximates ICWS by simply discarding one of the two components of ICWS, namely $y_k$ in Eq. (7); (6) [Haeupler et.al., 2014] [10]: It approximates the generalized Jaccard similarity by rounding the real-value weights with probability and then quantizing the integer weights.
All the compared algorithms are implemented in Matlab. For [Haeupler et.al., 2014], each weight is scaled up by a factor of 100. We first apply all the algorithms to generate the fingerprints of the data. Suppose that each algorithm generates $x_S$ and $x_T$, which are the fingerprints of length $D$ for the two real-value weighted sets, $S$ and $T$, respectively; the similarity between $S$ and $T$ is then estimated as
$$\mathrm{Sim}_{S,T} = \frac{1}{D}\sum_{d=1}^{D} \mathbf{1}(x_{S,d} = x_{T,d}),$$
where $\mathbf{1}(\cdot)$ equals 1 if the condition holds and 0 otherwise. All the random variables are globally generated at random. That is, in one sampling process, the same elements in different sets use the same set of random variables. All the experiments are conducted on a node of a Linux Cluster with 8 × 3.1 GHz Intel Xeon CPU (64 bit) and 1TB RAM.
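The similarity estimate from two fingerprints, i.e., the fraction of positions at which the codes collide, can be written in a few lines. In this sketch the function name is hypothetical and each code is assumed to be any comparable value, for example a pair $(k, y_k)$.

```python
def fingerprint_similarity(x_s, x_t):
    """Fraction of the D positions at which two fingerprints hold the same code."""
    if len(x_s) != len(x_t) or not x_s:
        raise ValueError("fingerprints must be non-empty and of equal length")
    return sum(1 for a, b in zip(x_s, x_t) if a == b) / len(x_s)

if __name__ == "__main__":
    x_s = [(1, 0.5), (2, 0.3), (1, 0.7), (4, 0.1)]
    x_t = [(1, 0.5), (3, 0.3), (1, 0.7), (4, 0.2)]
    # Positions 0 and 2 collide; positions 1 and 3 differ in k or in y.
    print(fingerprint_similarity(x_s, x_t))
```

A position counts as a collision only if both components of the code agree, so two sets that pick the same element with different $y_k$ do not collide.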

Results on Classification
We investigate classification performance of the compared methods using LIBSVM [2] with 10-fold cross-validation on three binary classification benchmarks:
1. Real-sim: The data set is a collection of UseNet articles from four discussion groups about simulated auto racing, simulated aviation, real autos and real aviation, respectively. The formatted document data set, with 72,309 samples and 20,958 features, has been arranged into two classes: real and simulated. Since the real class and the simulated class of the original data are unbalanced, data preprocessing is performed by randomly selecting 10,000 real samples and 10,000 simulated samples as positive and negative instances, respectively.
2. Rcv1: The data set is a large collection of newswire stories drawn from online databases. The formatted data set has 20,242 training samples with 47,236 features. The data set has been categorized into two classes: positive instances contain CCAT and ECAT while negative ones contain GCAT and MCAT on the website. Similarly, we randomly select 10,000 positive instances and 10,000 negative ones to compose a balanced data set.
3. Kdd: This is a large educational data set from the KDD Cup 2010 competition. The formatted data set has 20.2 million features. We also randomly select 10,000 positive instances and 10,000 negative ones to form a balanced data set for classification.
We repeat each experiment 5 times and compute the mean and the standard deviation of the results.
Discussions on Real-sim: The subplots in the first row of Figure 1 show the comparison results on Real-sim. One can see that our PCWS algorithm achieves almost the same accuracy as ICWS, [Li, 2015] and [Chum et.al., 2008]. The comparison between PCWS and ICWS demonstrates that our PCWS algorithm indeed satisfies the CWS scheme. The reason that [Chum et.al., 2008] also performs similarly to [Li, 2015] may be the one discussed in [15], that is, one component of ICWS, $y_k$, contributes little to approximating the generalized Jaccard similarity for most data sets. In terms of runtime, our PCWS algorithm performs similarly to ICWS and [Li, 2015] when D ranges from 2 to 128; beyond that, the gap between PCWS and ICWS clearly widens. In particular, PCWS takes around 3/4 of the ICWS runtime and nearly 3/5 of the [Li, 2015] runtime when D = 1024.
Discussions on Rcv1: The subplots in the second row of Figure 1 show the comparison results on Rcv1. Our PCWS algorithm preserves almost the same accuracy as ICWS. Furthermore, the two CWS algorithms clearly perform better than the standard Min-Hash scheme and the other algorithms that give biased estimates of the generalized Jaccard similarity. In terms of runtime, our PCWS algorithm stays at the same level as ICWS and [Li, 2015] with D varying from 2 to 64, and is remarkably superior to ICWS and [Li, 2015] when D varies from 128 to 1024. Again, the runtime of PCWS is reduced by about 1/3 compared to ICWS and [Li, 2015] when D = 1024.
Discussions on Kdd: In order to evaluate the classification ability on data with a large number of features (i.e., size of the universal set), we test the compared methods on the Kdd data set. The comparison results are reported in the subplots in the third row of Figure 1 (in this experiment, each algorithm is given a cutoff time of 100 seconds). Our PCWS algorithm still preserves the same accuracy as ICWS. In terms of runtime, PCWS runs much more efficiently than ICWS and [Li, 2015] on this data set with a large number of features. The runtime gain of our PCWS algorithm is visible from the very beginning; as the length of fingerprints D increases, the gap becomes more significant: 1/3 faster than ICWS and [Li, 2015] when D approaches 1024.

Results on Top-K Retrieval
In this experiment, we carry out top-K retrieval for K ∈ {1, 20, 50, 100, 500, 1000}. We adopt Precision@K and Mean Average Precision (MAP)@K to measure the performance in terms of accuracy because precision is relatively more important than recall in large-scale retrieval; furthermore, MAP contains information about the relative order of the retrieved samples, which can reflect the retrieval quality more accurately. To this end, we select two large-scale public data sets:
1. Webspam: It is a web text data set provided by a large-scale learning challenge. The data set has 350,000 instances and 16,609,143 features. We randomly select 1,000 samples from the original data set as query examples and the rest as the database.
Discussions on Webspam: Figure 3 reports the comparison results on the Webspam data set, which has more than 16 million features. We observe that our PCWS algorithm generally outperforms all the competitors under all configurations. Again, as the length of fingerprints D increases, the performance gain of PCWS compared to the other algorithms becomes clearer. PCWS outperforms ICWS and [Li, 2015] by about 5% when D = 1024; this improvement over its counterparts might be due to the positive effect of using one less random variable. PCWS performs much better than Min-Hash and [Gollapudi et.al., 2006], and it is superior to the two algorithms by around 25% and 40%, respectively, when D = 1024. PCWS also shows better performance than [Chum et.al., 2008] and [Haeupler et.al., 2014] in most cases. In the left subplot of Figure 2 for comparison of runtime, PCWS clearly runs faster than ICWS and [Li, 2015] as the length of fingerprints D increases. For example, PCWS performs approximately 30% faster than the two algorithms when D = 1024.
Discussions on Url: Figure 4 reports the comparison results on the Url data set, which has more than 2 million instances and over 3 million features (in this experiment, each algorithm is given a cutoff time of 40,000 seconds). Generally, our PCWS algorithm performs similarly to the other algorithms, and even does better than ICWS with D ranging from 2 to 8. In the right subplot of Figure 2 for comparison of runtime, PCWS clearly outperforms ICWS and [Li, 2015]; in particular, PCWS runs nearly 30% faster than ICWS and around 1.25 times as fast as [Li, 2015] when D = 1024.
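The two retrieval measures used in this subsection, Precision@K and MAP@K, can be sketched as follows. The paper does not spell out which variant of average precision it uses; this sketch assumes the common convention of normalizing by the number of relevant items found within the top K, and all function names are hypothetical.

```python
def precision_at_k(relevance, k):
    """Share of relevant items among the top k; relevance is a ranked list of 0/1 flags."""
    return sum(relevance[:k]) / k

def average_precision_at_k(relevance, k):
    """Mean of Precision@i over the ranks i <= k that hold a relevant item."""
    hits, total = 0, 0.0
    for i, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            total += hits / i
    # Assumption: normalize by the relevant items retrieved in the top k.
    return total / hits if hits else 0.0

def mean_average_precision_at_k(rankings, k):
    """MAP@K: average of AP@K over all queries."""
    return sum(average_precision_at_k(r, k) for r in rankings) / len(rankings)

if __name__ == "__main__":
    ranked = [1, 0, 1, 1, 0]  # relevance flags of one query's top-5 results
    print(precision_at_k(ranked, 5))
    print(average_precision_at_k(ranked, 5))
    print(mean_average_precision_at_k([ranked, [0, 0, 0, 0, 0]], 5))
```

Unlike Precision@K, the average precision drops when relevant items appear late in the ranking, which is the order sensitivity that motivates reporting MAP.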

Discussion on Space Efficiency
Finally, we analyze the advantage of our PCWS algorithm on space efficiency. Recall that our PCWS algorithm is able to save O(nD) of memory footprint (see Section 3.2), where n is the number of features and D is the length of fingerprints, because PCWS generates one less uniform random variable than ICWS does. This advantage in space complexity may not be remarkable when the size of the universal set n is small; however, when n is large in most cases of real-world data sets, the saved memory footprint can be substantial. For example, Kdd, Webspam, and Url have 20.2 million, 16.6 million and 3.2 million features, respectively, and our PCWS algorithm can enjoy 150 GB, 130 GB and 24 GB lower memory footprints in the case of D = 1024, compared to ICWS without accuracy degradation. If the number of features becomes even larger or more samples of Min-Hashes are used, the advantage of PCWS on space efficiency will be more remarkable. Obviously, our PCWS algorithm significantly improves ICWS in terms of both time and space efficiency, which indicates that PCWS is more practical.
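The memory figures quoted above can be reproduced with simple arithmetic: one cached random variable per feature and per sample is saved. The sketch below assumes that each variable is stored as an 8-byte double (the default numeric type in Matlab) and reports sizes in units of $2^{30}$ bytes; both choices, like the function name, are assumptions made for illustration.

```python
def saved_memory_gib(num_features, num_hashes, bytes_per_value=8):
    """Memory of one n-by-D array of cached random variables, in GiB."""
    return num_features * num_hashes * bytes_per_value / 2**30

if __name__ == "__main__":
    # Feature counts as rounded in the text; D = 1024 samples.
    for name, n in [("Kdd", 20.2e6), ("Webspam", 16.6e6), ("Url", 3.2e6)]:
        print(name, round(saved_memory_gib(n, 1024), 1))
    # One of the five arrays kept by ICWS is dropped, i.e. 1/5 = 20% of the cache.
    print(1 / 5)
```

Under these assumptions the results land close to the 150 GB, 130 GB and 24 GB stated in the text, and the one-in-five ratio matches the 20% saving in memory footprint.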

CONCLUSION AND FUTURE WORK
In this paper, we propose Practical Consistent Weighted Sampling (PCWS) to further improve the efficiency of Improved CWS [13], which is considered the state-of-the-art method for real-value weighted Min-Hash. The proposed PCWS algorithm is mathematically equivalent to ICWS and preserves the same theoretical properties as ICWS, but with reduced theoretical complexity in both time and space. We conduct extensive empirical tests of our PCWS algorithm and a number of state-of-the-art methods on five real-world text data sets for classification and information retrieval. The experimental results show that PCWS is able to achieve the same (or even better) performance as ICWS with empirical runtime reduced by 1/5 ∼ 1/3 and memory footprint reduced by 20%. In the case of a large number of features, PCWS can save hundreds of GB of memory footprint, which makes it more practical in dealing with real-world data sets in the era of big data.
Existing similarity-preserving hashing techniques can only deal with nested binary sets [14] and tree-structured categorical data [4]. In our future work, it will be interesting to extend CWS schemes to hash nested weighted sets, which not only encode the importance of features but also preserve the multi-level exchangeability [4] of features.