ProbMinHash -- A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity


Authors: Otmar Ertl

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XX, NO. X, XXXX 2020

Abstract—The probability Jaccard similarity was recently proposed as a natural generalization of the Jaccard similarity to measure the proximity of sets whose elements are associated with relative frequencies or probabilities. In combination with a hash algorithm that maps those weighted sets to compact signatures which allow fast estimation of pairwise similarities, it constitutes a valuable method for big data applications such as near-duplicate detection, nearest neighbor search, or clustering. This paper introduces a class of one-pass locality-sensitive hash algorithms that are orders of magnitude faster than the original approach. The performance gain is achieved by calculating signature components not independently, but collectively. Four different algorithms are proposed based on this idea. Two of them are statistically equivalent to the original approach and can be used as drop-in replacements. The other two may even improve the estimation error by introducing statistical dependence between signature components. Moreover, the presented techniques can be specialized for the conventional Jaccard similarity, resulting in highly efficient algorithms that outperform traditional minwise hashing and that are able to compete with the state of the art.

Index Terms—Minwise hashing, Jaccard similarity, locality-sensitive hashing, streaming algorithms, probabilistic data structures

1 INTRODUCTION

The calculation of pairwise object similarities is an important task for clustering, near-duplicate detection, or nearest neighbor search. Big data applications require sophisticated algorithms to overcome time and space constraints.
A widely used technique to reduce costs for pairwise similarity computations is minwise hashing (MinHash) [1]. It allows the calculation of signatures for individual objects that can be represented as sets of features. Using only the corresponding signatures, the Jaccard similarity can be estimated from the number of equal signature components. The Jaccard similarity $J$ of two sets $A$ and $B$ is given by

$$J = \frac{|A \cap B|}{|A \cup B|}.$$

Minwise hashing maps a set $S$ to an $m$-dimensional signature vector $z(S) = (z_1(S), z_2(S), \ldots, z_m(S))$ with statistically independent components that are defined as

$$z_k(S) := \operatorname*{arg\,min}_{d \in S} h_k(d) \quad (1)$$

where $h_k$ are independent hash functions with identically distributed output. If hash collisions of $h_k$ are very unlikely and can be ignored, the probability that the same signature components of two different sets $A$ and $B$ have the identical value is equal to the Jaccard similarity

$$\Pr(z_k(A) = z_k(B)) = J. \quad (2)$$

This property allows unbiased estimation of $J$ using the estimator

$$\hat{J}(z(A), z(B)) = \frac{1}{m} \sum_{k=1}^{m} \mathbf{1}(z_k(A) = z_k(B)) \quad (3)$$

where $\mathbf{1}$ denotes the indicator function. The variance of this estimator is

$$\operatorname{Var}(\hat{J}(z(A), z(B))) = \frac{J(1-J)}{m}, \quad (4)$$

because the number of equal components in $z(A)$ and $z(B)$ is binomially distributed with success probability $J$, if the signature components are independent.

• O. Ertl is affiliated with Dynatrace Research, Linz, Austria. E-mail: otmar.ertl@dynatrace.com or otmar.ertl@gmail.com. Manuscript received XXXX XX, 2020; revised XXXX XX, XXXX. The final version is available at http://dx.doi.org/10.1109/TKDE.2020.3021176.

1.1 Incorporating Weights

Describing objects as feature sets is not always appropriate, because the features are sometimes associated with weights. For example, text documents can be represented as bags of words weighted according to their term frequency–inverse document frequency [2].
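To make (1)–(3) concrete before weights enter the picture, here is a minimal Python sketch of plain MinHash. Simulating each hash function $h_k$ by a PRNG seeded per $(k, d)$ is an illustrative choice of ours, not an implementation prescribed by the paper:

```python
import random

def minhash_signature(s, m):
    """MinHash signature per (1): for each component k, pick the element
    of s that minimizes a simulated hash function h_k."""
    sig = []
    for k in range(m):
        # h_k(d) simulated by a PRNG seeded with hash((k, d)); hashes of
        # int tuples are deterministic across runs.
        sig.append(min(s, key=lambda d: random.Random(hash((k, d))).random()))
    return sig

def estimate_jaccard(za, zb):
    """Estimator (3): fraction of equal signature components."""
    return sum(a == b for a, b in zip(za, zb)) / len(za)

A = set(range(100))
B = set(range(50, 150))   # true Jaccard similarity: 50/150 = 1/3
j_hat = estimate_jaccard(minhash_signature(A, 1024), minhash_signature(B, 1024))
```

With $m = 1024$ components, the estimate should deviate from the true value $1/3$ by roughly $\sqrt{J(1-J)/m} \approx 0.015$ according to (4).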
For the mathematical representation of weighted sets we use weight functions which map elements to their associated nonnegative weights. Weight functions can be treated as mappings from the universe of all possible elements $D$, if elements not belonging to the corresponding weighted set are considered to have a weight equal to 0. Multiple approaches have been proposed to generalize the Jaccard similarity to weighted sets. The weighted Jaccard similarity $J_W$ is one possibility to generalize the Jaccard similarity $J$ and is defined as

$$J_W = \frac{\sum_{d \in D} \min(w_A(d), w_B(d))}{\sum_{d \in D} \max(w_A(d), w_B(d))}. \quad (5)$$

Here $w_A$ and $w_B$ are the weight functions for weighted sets $A$ and $B$, respectively. $J$ and $J_W$ are identical for a set whose elements have a weight of 1 while all other elements of the universe $D$ have a weight of 0. Various hash algorithms have been developed to calculate signatures that allow the estimation of $J_W$ using estimator (3) [3], [4], [5].

Sometimes the scale of feature weights is not important. This is for example the case if the weights represent relative frequencies or probabilities. Multiplying all weights of a set by the same factor should not change the similarities to other sets. An obvious solution to achieve this scale-invariance is the normalization of weights before calculating $J_W$ [6], [7].

Algorithm 1: P-MinHash.
    Input: w
    Output: z_1, z_2, ..., z_m
    (q_1, q_2, ..., q_m) ← (∞, ∞, ..., ∞)
    forall d ∈ D such that w(d) > 0 do
        w_inv ← 1/w(d)
        R ← new PRNG with seed d
        for k ← 1 to m do
            h ← w_inv · R[Exp(1)]
            if h < q_k then
                q_k ← h
                z_k ← d
This leads to the normalized weighted Jaccard similarity

$$J_N = \frac{\sum_{d \in D} \min\!\left(\frac{w_A(d)}{\sum_{d' \in D} w_A(d')}, \frac{w_B(d)}{\sum_{d' \in D} w_B(d')}\right)}{\sum_{d \in D} \max\!\left(\frac{w_A(d)}{\sum_{d' \in D} w_A(d')}, \frac{w_B(d)}{\sum_{d' \in D} w_B(d')}\right)}.$$

In this way the same locality-sensitive hash algorithms can be used as for $J_W$. Recently, another scale-invariant metric was proposed, called the probability Jaccard similarity

$$J_P = \sum_{d \in D} \frac{1}{\sum_{d' \in D} \max\!\left(\frac{w_A(d')}{w_A(d)}, \frac{w_B(d')}{w_B(d)}\right)}.$$

Since $J_P$ is Pareto optimal [8], which is a property not shared by $J_N$, it is a more natural extension of $J$ to discrete probability distributions. The probability Jaccard similarity $J_P$ was analyzed in detail and presented together with an appropriate locality-sensitive hash algorithm in [8]. The algorithm uses the hash functions

$$h_k(d) \sim \operatorname{Exp}(\lambda w(d)) \sim \frac{1}{\lambda w(d)} \operatorname{Exp}(1), \quad (6)$$

which yield hash values distributed exponentially with rate proportional to the weight $w(d)$ of the given element $d$. The proportionality constant $\lambda$ is a free parameter and has no influence on the signature defined by (1). It was shown [8] that the resulting signature satisfies $\Pr(z_k(A) = z_k(B)) = J_P$ and therefore allows unbiased estimation of $J_P$ from the proportion of equal components using estimator (3). If the hash values $h_k(d)$ with $k \in \{1, 2, \ldots, m\}$ are statistically independent, the signature components will be independent as well and the variance of the estimator will be $J_P(1-J_P)/m$, analogous to (4). Like $J_W$, $J_P$ corresponds to $J$ in the case of binary weights $w(d) \in \{0, 1\}$ and can therefore be regarded as a generalization of the Jaccard similarity $J$. The P-MinHash signature is equivalent to the MinHash signature (1) in this case, because the hash functions $h_k$ defined by (6) are identically distributed, if $w(d)$ is always equal to 1.
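For reference, the similarity measures above can be evaluated by brute force when the weight functions are given as dictionaries (the function names are ours; this is a direct transcription of the definitions, not a hashing algorithm):

```python
def jaccard_w(wa, wb):
    """Weighted Jaccard similarity J_W as defined in (5)."""
    universe = set(wa) | set(wb)
    num = sum(min(wa.get(d, 0.0), wb.get(d, 0.0)) for d in universe)
    den = sum(max(wa.get(d, 0.0), wb.get(d, 0.0)) for d in universe)
    return num / den

def jaccard_p(wa, wb):
    """Probability Jaccard similarity J_P. Elements with zero weight in
    either set contribute nothing, because the inner maximum diverges."""
    universe = set(wa) | set(wb)
    total = 0.0
    for d in universe:
        if wa.get(d, 0.0) > 0.0 and wb.get(d, 0.0) > 0.0:
            s = sum(max(wa.get(e, 0.0) / wa[d], wb.get(e, 0.0) / wb[d])
                    for e in universe)
            total += 1.0 / s
    return total

wa = {"a": 1.0, "b": 1.0, "c": 1.0}
wb = {"b": 1.0, "c": 1.0, "d": 1.0}
```

With binary weights both measures coincide with the plain Jaccard similarity (here $2/4 = 0.5$), and scaling all weights of one set leaves $J_P$ unchanged, illustrating its scale invariance.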
The signature definition (1) together with hash functions (6) can be straightforwardly translated into an algorithm called P-MinHash [8], shown as Algorithm 1. Instead of using $m$ independent hash functions, we use a pseudorandom number generator (PRNG) $R$ seeded with a hash value of $d$ to generate independent and exponentially distributed values. Since the rate parameter of the exponentially distributed random values in (6) is a free parameter, $\lambda = 1$ is used for simplicity. $R[\operatorname{Exp}(\lambda)]$ denotes the generation of an exponentially distributed random value with rate parameter $\lambda$ using random bits taken from $R$. Since floating-point divisions are more expensive than multiplications, it makes sense to precalculate the reciprocal weight $w_{\mathrm{inv}}$ as done in Algorithm 1.

1.2 Related Work

Interestingly, hash algorithms with collision probabilities equal to $J_P$ had already been unintentionally presented before $J_P$ was actually discovered and thoroughly analyzed in [8]. In [7] a data structure called HistoSketch was proposed to calculate signatures for $J_N$. The derivation started from 0-bit consistent weighted sampling (0-bit CWS) [6], which was proposed as a simplification of improved consistent weighted sampling (ICWS) [4]. While the collision probability for ICWS equals $J_W$, it is actually not known for 0-bit CWS and may be far off from $J_W$ [3]. For example, consider two weighted sets, both consisting of the same single element. 0-bit CWS will always have a collision probability of 1 regardless of the actual weights, which contradicts (5). Therefore, choosing this algorithm with undefined behavior as starting point for the derivation of a new algorithm is questionable. Nevertheless, after some simplifications and thanks to a nonequivalent transformation that eliminated the scale dependence, the final HistoSketch algorithm had a collision probability equal to $J_P$ instead of the originally desired $J_N$.
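Looking back at Algorithm 1, the P-MinHash procedure translates almost line by line into Python; random.Random seeded with a hash of the element plays the role of the PRNG $R$, and expovariate(1.0) generates $R[\operatorname{Exp}(1)]$ (both implementation choices of this sketch):

```python
import random

def p_minhash(w, m):
    """P-MinHash (Algorithm 1): O(nm) signature for J_P from a dict w
    mapping elements to positive weights."""
    q = [float("inf")] * m
    z = [None] * m
    for d, weight in w.items():
        if weight <= 0.0:
            continue
        w_inv = 1.0 / weight              # precalculated reciprocal weight
        r = random.Random(hash(d))        # PRNG R seeded with a hash of d
        for k in range(m):
            h = w_inv * r.expovariate(1.0)
            if h < q[k]:
                q[k] = h
                z[k] = d
    return z
```

Because the signature depends on the weights only through the products $w_{\mathrm{inv}} \cdot \operatorname{Exp}(1)$, multiplying all weights by a common factor rescales every hash value uniformly and leaves the signature unchanged, reflecting the scale invariance of $J_P$.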
Without awareness of the correct collision probability, the same authors simplified the HistoSketch algorithm [9] and finally obtained an algorithm that was equivalent to P-MinHash [8]. Another but similar attempt to derive a simplified algorithm from 0-bit CWS is given in [10]. However, the use of a slightly different nonequivalent transformation also led to a slightly different hash algorithm. Even though the algorithm is scale-invariant, the hash collision probability is neither equal to $J_N$ nor equal to $J_P$.

The straightforward implementation of signatures based on (1) leads to time complexities of $O(nm)$, where $m$ denotes the signature size and $n$ is the set size or, in the weighted case, the number of elements with nonzero weight $n = |\{d : w(d) > 0\}|$. A lot of effort has been spent to break this $O(nm)$ barrier. By calculating all $m$ signature components in a more collective fashion, much better time complexities of kind $O(n + f(m))$ are possible, where $f$ is some function only dependent on $m$. In case of the conventional Jaccard similarity $J$, One Permutation Hashing (OPH) [11], [12], [13], [14], Fast Similarity Sketching (FSS) [15], and the SuperMinHash algorithm [16] are representatives of such algorithms. They have in common that signature components are not statistically independent. Because of that, the latter two algorithms even have the property that estimation errors are significantly reduced, if $n$ is not much larger than $m$. As an example, the SuperMinHash algorithm [16] defines the hash functions $h_k$ as

$$h_k(d) := u_k(d) + \pi_k(d).$$

Here $u_k$ is uniformly distributed over $[0, 1)$ and $\pi_k(d)$ are the elements of a random permutation of the values $\{0, 1, 2, \ldots, m-1\}$. The corresponding signature given by (1) satisfies (2), and the variance of the estimator (3) is

$$\operatorname{Var}(\hat{J}) = \frac{J(1-J)}{m}\,\alpha(m, u) \quad (7)$$

where $u := |A \cup B|$ denotes the union cardinality. $\alpha(m, u)$ is given by

$$\alpha(m, u) := 1 - \frac{\sum_{l=1}^{m-1} l^u \left((l+1)^u + (l-1)^u - 2l^u\right)}{(m-1)^{u-1}\, m^u\, (u-1)} \quad (8)$$

and is shown in Fig. 1 for different values of $m$.

[Fig. 1. The function α(m, u) over u := |A ∪ B| for different values of m ∈ {16, 64, 256, 1024, 4096}.]

For $u < m$ the function value tends to be in the range of 0.5, and therefore, the variance (7) is significantly smaller than that of the original MinHash algorithm given by (4). The first algorithm for the conventional Jaccard similarity $J$ that has overcome the $O(nm)$ barrier with provably statistically independent signature components was presented in [3] as a special case of the BagMinHash algorithm. More generally, BagMinHash is the first algorithm that has overcome the $O(nm)$ barrier for $J_W$. Only for special cases with a beforehand known universe and upper bounds for the weights was another fast approach presented before in [5].

A time complexity proportional to $nm$ would not be a big problem, if both $n$ and $m$ were small. However, real-world problems often have large feature sizes $n$. Moreover, it is common that the signature size $m$ is in the hundreds or even thousands [4], [6], [14], [17], [18], [19]. In particular, indexing techniques like locality-sensitive hashing [20], [21], [22], which enable sublinear nearest neighbor lookups, require many signature components to increase sensitivity and specificity.

Even larger signature sizes are needed, if $b$-bit minwise hashing is used [23]. This technique reduces each signature component to only a few bits. The loss of information must be compensated by increasing the number of components in order to achieve the same estimation error. Nevertheless, this approach can significantly reduce the total space requirements of signatures, especially if one is mainly interested in high similarities.
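One way to realize such a $b$-bit reduction is sketched below; hashlib.blake2b is our stand-in for an unspecified hash function, and the component index $i$ is mixed into the hash input so that collisions are not correlated across components:

```python
import hashlib

def to_b_bit_signature(z, b):
    """Reduce each signature component to b bits by hashing the pair
    (component value, component index) and keeping the least
    significant b bits."""
    reduced = []
    for i, zi in enumerate(z):
        digest = hashlib.blake2b(repr((zi, i)).encode(), digest_size=8).digest()
        reduced.append(int.from_bytes(digest, "big") & ((1 << b) - 1))
    return reduced

sig = ["foo", "bar", "foo", "baz"]
```

Equal components at the same index always map to equal $b$-bit values, while unequal components still collide with probability about $2^{-b}$, which biases estimator (3) and must be corrected for.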
Any signature can be easily reduced to a $b$-bit signature using Algorithm 2, which transforms each component by taking $b$ bits from a hash value calculated from the component itself and its index.

Algorithm 2: Reduction to a b-bit signature.
    Input: z_1, z_2, ..., z_m
    Output: ẑ_1, ẑ_2, ..., ẑ_m ∈ {0, 1, 2, ..., 2^b − 1}
    for i ← 1, 2, ..., m do
        ẑ_i ← least significant b bits of hash value calculated from pair (z_i, i)

Since $b$-bit integer values of different elements will collide with high probability, estimator (3) will have a bias that must be accounted for. Moreover, to avoid correlated collisions of different elements over multiple signature components, it is crucial that the involved hash value computation also includes the component index, as done by Algorithm 2.

The cosine similarity is a widely used alternative to the Jaccard similarity metrics for weighted sets, for which well-established locality-sensitive hash algorithms also exist [24], [25]. Even though they can be efficiently implemented, they also do not overcome the $O(nm)$ time complexity. A hash algorithm designed for a certain metric is not necessarily the best option to be used in the context of locality-sensitive hashing [26]. For example, $b$-bit minwise hashing outperforms SimHash even for the cosine similarity in case of binary weights [27]. Fast hash algorithms with time complexities below $O(nm)$ could therefore also be valuable for other metrics.

1.3 Applications

The probability Jaccard similarity $J_P$ is a metric that was discovered only recently [8] and is therefore not yet very widespread. However, since HistoSketch [7], [9] finally turned out to be a hash algorithm for $J_P$, it has already been successfully used to calculate graph embeddings [28] or to create sketches of k-mer spectra for microbiome analytics [19].
A faster hash algorithm with a time complexity below $O(nm)$ could directly improve the performance of those applications, making them even more attractive. At the same time, a fast algorithm would make $J_P$ more appealing for many more applications for which other metrics such as the Jaccard or cosine similarity were previously preferred because of their well-known hash algorithms.

1.4 Our Contributions

Motivated by recently developed algorithms like SuperMinHash [16] for the conventional Jaccard similarity $J$ or BagMinHash [3] for the weighted Jaccard similarity $J_W$, which both achieved superior performance by calculating signature components in a collective fashion instead of calculating them independently, we applied the same principle to design new minwise-hashing algorithms for the probability Jaccard similarity $J_P$ [8]. These algorithms are the first to undercut the $O(nm)$ runtime complexity of the state-of-the-art P-MinHash algorithm.

We present four different ways to generate exponentially distributed hash values satisfying (6), which can then be directly translated into four new one-pass algorithms called ProbMinHash1, ProbMinHash2, ProbMinHash3, and ProbMinHash4, respectively. All of them are, due to the much better time complexity, orders of magnitude faster than the original P-MinHash algorithm, with the exception of very small input sizes $n$. The first two are statistically equivalent to the original approach. The latter two are even able to reduce the estimation error due to the statistical dependence of individual signature components, as our experimental results will show. Similar to the SuperMinHash algorithm, the variance of estimator (3) is decreased by up to a factor of two for input sizes $n$ smaller than the signature size $m$.
We also present a performance optimization for ProbMinHash1 and ProbMinHash3 leading to corresponding equivalent algorithms ProbMinHash1a and ProbMinHash3a, respectively. The interleaved processing of input elements by using an additional buffer can significantly improve the performance for medium input sizes $n$.

We investigated specializations of all our algorithms for the conventional Jaccard similarity $J$, where we only have binary weights $w(d) \in \{0, 1\}$. In this case, ProbMinHash1, ProbMinHash1a, and ProbMinHash2 are statistically equivalent to the original minwise hashing approach and can be used, in contrast to the state of the art, as drop-in replacements. A very interesting algorithm is also the specialization of ProbMinHash3a, which results in a very fast algorithm for $J$ with a time complexity of $O(n + m \log m)$ and a space complexity of $O(m \log m)$.

We conducted rigorous experiments using synthetic data to investigate the runtime behavior as well as the estimation error for different input and signature sizes. The results are in accordance with the theoretical considerations. The source code required to reproduce the results presented in this paper has been published on GitHub at https://github.com/oertl/probminhash.

2 METHODOLOGY

The computation of signatures for $J_P$ as defined by (1) and (6) involves $nm$ hash values. $m$ hash values, one for each signature component, must be calculated per element. The elements with the smallest hash values finally define the signature. Therefore, while processing elements, the minimum hash values seen so far must be kept for each signature component. In Algorithm 1 the array $(q_1, q_2, \ldots, q_m)$ is used for that purpose. The maximum of all those minimum hash values $q_{\max} := \max_k q_k$ defines a limit, which decreases with the number of processed elements, and beyond which hash values of any further elements will not have an impact on the final signature.
Thus, if we were able to track this limit efficiently over time, and at the same time, if we were able to generate the hash values of an element in ascending order, processing of that element could be stopped as soon as its hash values exceed the current value of $q_{\max}$.

To produce the hash values $h_k(d)$ of a given element $d$ in ascending order we propose to pick them from a positive monotonically increasing random sequence $X(d) = (x_1(d), x_2(d), \ldots)$ of points with random labels from $\{1, 2, \ldots, m\}$. Let $l_i(d)$ denote the label of point $x_i(d)$ and $L(d) = (l_1(d), l_2(d), \ldots)$ be the corresponding sequence of labels. We define $h_k(d)$ as the first point of sequence $X(d)$ with a label equal to $k$:

$$h_k(d) := x_{\min\{i : l_i(d) = k\}}(d) = \min_{i : l_i(d) = k} x_i(d). \quad (9)$$

The sequences $X(d)$ and $L(d)$ are generated using a PRNG that is initialized with $d$ as seed to have independence between different elements. If the sequences are chosen such that the hash values $h_k(d)$ satisfy (6) for all $k \in \{1, 2, \ldots, m\}$, they can be used in (1) to obtain signatures for $J_P$. As shown later, this requires, among other things, that the point density in $X(d)$ depends on the weight $w(d)$.

Algorithm 3 shows the signature calculation using this technique. Elements of sequences $X(d)$ and $L(d)$ are lazily generated using the PRNG $R$, which is initialized using $d$ as seed. As soon as points $x_i(d)$ are greater than or equal to $q_{\max}$, processing of element $d$ can be stopped. For the very first element $d$, when $q_{\max}$ is still infinite, this stop condition is satisfied as soon as all possible labels have appeared at least once in $L(d)$. For further elements, $q_{\max}$ has already become smaller and the stop condition will likely be satisfied much earlier.
If $n \gg m$, $q_{\max}$ will get very small such that the while-loop can be entirely skipped for most elements, and a huge speedup factor in the order of $m$ can be expected compared to Algorithm 1. Fig. 2 demonstrates Algorithm 3 for an example with $m = 4$ and a set of 4 weighted elements.

2.1 Sequence Generation

If the hash values $h_k(d)$ satisfy (6), they must be identically distributed. This implies that labels are uniformly distributed and therefore $\Pr(l_i(d) = k) = \frac{1}{m}$ for all $k \in \{1, 2, \ldots, m\}$. This can be achieved by sampling the labels $l_i(d)$ either from the multiset $I_r := \{1^r 2^r \ldots m^r\}$ without replacement or from $\{1, 2, \ldots, m\}$ with replacement. In the following, sampling with replacement is treated as a special case of sampling without replacement where $r \to \infty$.

Given $r$, which defines the multiset $I_r$ used for generating $L(d)$, we need to find an appropriate monotonically increasing random sequence $X(d)$ such that (9) satisfies (6). We propose two different methods, which we refer to as uncorrelated and correlated generation of $X(d)$, respectively.

The uncorrelated approach uses a well-known property of exponential spacings described in [29], which allows sampling of $m$ independent random values from an exponential distribution with rate parameter equal to 1 in ascending order by using the recursion $x_i(d) \sim x_{i-1}(d) + \frac{1}{m-i+1} \operatorname{Exp}(1)$ with $x_0(d) := 0$. Hence, the $i$-th smallest random value is simply obtained by adding a value drawn from an exponential distribution with rate $(m-i+1)$ to the $(i-1)$-th smallest random value. We apply this to generate the sequence $X(d)$ by sampling $mr$ independent and exponentially distributed random values with rate parameter $\lambda w(d)/r$ in ascending order. The corresponding recursion formula is obtained by replacing $m$ by $mr$ and scaling with the factor $r/(\lambda w(d))$:

$$x_i(d) \sim x_{i-1}(d) + \frac{1}{\lambda w(d)} \cdot \frac{r}{mr - i + 1} \operatorname{Exp}(1). \quad (10)$$

Since the label sequence $L(d)$ is produced by sampling without replacement from $I_r$, there will always be exactly $r$ points with the same label. The minimum of those $r$ points, which are exponentially distributed with rate $\lambda w(d)/r$, is exponentially distributed with rate $\lambda w(d)$, because the minimum of independent exponentially distributed random values is also exponentially distributed and its rate is equal to the sum of the rates of all contributing random values [30]. As a consequence, $h_k(d)$ as defined in (9) is exponentially distributed with rate $\lambda w(d)$ and satisfies (6) as required.

[Fig. 2. Illustration of Algorithm 3 for m = 4 and a set of 4 weighted elements {d_1, d_2, d_3, d_4}. For each element d a random sequence of points X(d) and a random sequence of labels L(d) are generated until the stop limit q_max is reached or exceeded. The density of points is proportional to the corresponding weight w(d). If a generated point x_i(d) with label l_i(d) is smaller than q_{l_i(d)}, its value gets replaced. At the same time the corresponding z_{l_i(d)} is updated with d. Since q_max decreases with the number of processed elements, the stop condition is satisfied very quickly for later elements, often already after the first generated point, as for d_4 in this example. The resulting signature is z = (d_3, d_1, d_1, d_2).]

Algorithm 3: Basic structure of ProbMinHash algorithms.
    Input: w
    Output: z_1, z_2, ..., z_m
    (q_1, q_2, ..., q_m) ← (∞, ∞, ..., ∞)
    forall d ∈ D such that w(d) > 0 do
        w_inv ← 1/w(d)
        R ← new PRNG with seed d
        h ← generate first element of X(d) using R
        i ← 1
        while h < q_max do
            k ← generate i-th element of L(d) using R
            if h < q_k then
                q_k ← h
                z_k ← d
                q_max ← max_k q_k
                if h ≥ q_max then break
            i ← i + 1
            h ← generate i-th element of X(d) using R
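The exponential-spacings recursion used by the uncorrelated approach can be checked empirically: the recursion emits values in ascending order, and pooling all generated values reproduces the Exp(1) distribution (sample mean close to 1):

```python
import random

def sorted_exponentials(m, rng):
    """Generate m independent Exp(1) samples in ascending order via the
    spacings recursion x_i = x_{i-1} + Exp(1) / (m - i + 1)."""
    xs, x = [], 0.0
    for i in range(1, m + 1):
        x += rng.expovariate(1.0) / (m - i + 1)
        xs.append(x)
    return xs

rng = random.Random(42)
batches = [sorted_exponentials(8, rng) for _ in range(2000)]
pooled = [x for xs in batches for x in xs]
mean = sum(pooled) / len(pooled)   # should be close to E[Exp(1)] = 1
```

Scaling each increment by $r/(\lambda w(d)(mr - i + 1))$ as in (10) turns this recursion into the generator of $X(d)$ for an element with weight $w(d)$.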
By construction, the hash values of an element and therefore also the resulting signature components (1) are mutually independent, and the variance of estimator (3) will be the same as for P-MinHash, given by $J_P(1-J_P)/m$.

For the correlated approach, we consider the probability $p_i := \Pr(\min(\{j : l_j(d) = k\}) = i)$ that $h_k(d)$ as defined by (9) is given by the $i$-th point $x_i(d)$. This corresponds to the probability that the $i$-th label $l_i(d)$ is the first label with value $k$. Since the labels are drawn without replacement from $I_r$, the probability that $l_j(d) = k$ is $\frac{r}{mr-j+1}$, given that all $(j-1)$ previous labels are different from $k$. Therefore, the probability that the first $(i-1)$ labels are different from $k$ and the $i$-th label is equal to $k$ is given by

$$p_i = \prod_{j=1}^{i-1} \left(1 - \frac{r}{mr - j + 1}\right) \cdot \frac{r}{mr - i + 1}. \quad (11)$$

$p_i$ is zero for $i > mr - r + 1$, because at the latest after sampling $mr - r + 1$ labels without replacement from $I_r$, each label has been sampled at least once. By nature, we have $\sum_{i=1}^{mr-r+1} p_i = 1$. $h_k(d)$ is given by the $i$-th point $x_i(d)$ with probability $p_i$. Therefore, if $x_i(d)$ is sampled from the two-sided truncated exponential distribution

$$x_i(d) \sim \operatorname{Exp}(\lambda w(d); a_{i-1}, a_i) \sim a_{i-1} + \operatorname{Exp}(\lambda w(d); 0, a_i - a_{i-1}) \sim a_{i-1} + (a_i - a_{i-1}) \operatorname{Exp}(\lambda w(d)(a_i - a_{i-1}); 0, 1) \quad (12)$$

with rate $\lambda w(d)$ and support $[a_{i-1}, a_i)$, where the boundaries $a_i$ are chosen such that $\Pr(a_{i-1} \leq x < a_i) = p_i$ with $a_0 := 0$, then $h_k(d)$ will be exponentially distributed with rate $\lambda w(d)$ by construction. Since the points $x_i(d)$ are sampled from disjoint intervals, $h_k(d)$ will not be independent for different $k$. As a consequence, the variance of estimator (3) will differ from (4), in contrast to uncorrelated generation of $X(d)$. If $n$ is in the range of $m$ or smaller, the frequency of elements in the signature is usually not very balanced.
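As a quick numerical sanity check of (11) (a small helper of our own, not part of the paper's algorithms): the probabilities $p_i$ must sum to one, vanish for $i > mr - r + 1$, and reduce to the uniform value $1/m$ for $r = 1$:

```python
def first_occurrence_probs(m, r):
    """p_i from (11): probability that position i holds the first
    occurrence of a fixed label k when drawing without replacement from
    the multiset I_r containing each of m labels r times."""
    probs = []
    for i in range(1, m * r - r + 2):      # p_i = 0 beyond mr - r + 1
        p = r / (m * r - i + 1)
        for j in range(1, i):
            p *= 1.0 - r / (m * r - j + 1)
        probs.append(p)
    return probs
```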
Correlated generation results in more evenly distributed points, because there will always be exactly one single point in each interval $[a_{i-1}, a_i)$. This leads to a more balanced frequency distribution of elements in the signature in case the input size $n$ is in the order of $m$ or smaller. The similarity estimates using (3) will be more accurate, and the variance will be lower than for P-MinHash or the uncorrelated approach, respectively, as confirmed by our experimental results presented later.

Although the choice of $r$ is arbitrary, we consider only two cases to be practical. The first is $r = 1$, because it minimizes the number of points that need to be generated. The second case is $r \to \infty$. Even though this means that there is no worst-case upper bound for the number of required points, it offers the advantage of sampling with replacement for $L(d)$. Sampling with replacement is less expensive, because it does not require an algorithm like Fisher–Yates shuffling, which must incorporate the sampling history [31]. The two cases $r = 1$ and $r \to \infty$, combined either with uncorrelated or correlated generation of $X(d)$, result in four different algorithms, which will be described and analyzed in more detail in Sections 2.3 to 2.6. But first we have to explain the last missing part of our approach, namely how $q_{\max}$ is efficiently tracked in Algorithm 3.

2.2 Stop Limit Update

The naive calculation of $q_{\max}$ by iterating over the array $(q_1, q_2, \ldots, q_m)$ every time one of its values has changed takes $O(m)$ time and would make our approach very slow. Therefore we need a data structure that is able to update $q_{\max}$ in sublinear time if the value of some $q_k$ is replaced by a smaller value. A max-heap data structure cannot be used, because it only gives access to the maximum and does not allow fast updates of other elements. The solution is a binary tree constructed over the array $(q_1, q_2, \ldots, q_m)$. The parent nodes are stored in the same array $q$ by expanding it to the size $2m - 1$. The value of a parent node is defined to be the maximum of the values of both child nodes. By definition, the value of the root node will be $q_{\max}$. If some value $q_k$ is replaced by some smaller hash value $h$, Algorithm 4 can be used to update the tree including $q_{\max}$.

Algorithm 4: Maintenance of stop limit q_max := max(q_1, q_2, ..., q_m). q_{m+1}, ..., q_{2m−1} are the parent nodes of a binary tree spanned over q_1, q_2, ..., q_m. q_{m+⌈k/2⌉} is the parent of q_k. If a leaf node q_k with k ∈ {1, 2, ..., m} is replaced by a smaller value h < q_k, the following procedure updates the root node q_{2m−1}, which corresponds to q_max.
    Input: h, k
    while h < q_k do
        q_k ← h
        i ← m + ⌈k/2⌉                  // index of parent
        if i ≥ 2m then break
        j ← ((k − 1) XOR 1) + 1        // index of sibling, bitwise XOR operation
        if q_j ≥ q_i then break
        if h < q_j then h ← q_j
        k ← i

Starting from the modified leaf node, the algorithm makes a bottom-up tree traversal until no further change is necessary. It is a slightly modified but equivalent version of the algorithm presented in [3]. Clearly, the worst-case time complexity is $O(\log m)$. However, for ProbMinHash, where the values of the labels $l_i(d)$ are equally frequent, the expected time complexity is $O(1)$, which can be explained as follows: Assume $l_i(d) = k$ and hence $q_k$ has been chosen to be potentially updated by $x_i(d)$. Obviously, the probability that the value of $q_k$ is actually changed is at most 1. The probability that the parent of $q_k$ is modified is at most $\frac{1}{2}$, because it is equally likely that the parent value, which is the maximum of the values of its two children, is given by the sibling of $q_k$.
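Algorithm 4 maps directly to a flat array. In this Python sketch (helper names are ours), q has size 2m with slot 0 unused, so that the paper's 1-based indices carry over verbatim and q[2m−1] holds q_max; m is assumed to be a power of two for the simple tree construction:

```python
import math

def build_max_tree(leaves):
    """Build the array (q_1, ..., q_{2m-1}) over the given m leaves;
    q[0] is unused padding, each parent stores the max of its children."""
    m = len(leaves)
    q = [math.nan] + list(leaves) + [0.0] * (m - 1)
    for k in range(1, 2 * m - 1, 2):
        q[m + (k + 1) // 2] = max(q[k], q[k + 1])
    return q

def update_max_tree(q, m, k, h):
    """Algorithm 4: replace leaf q[k] by a smaller value h and repair
    the maxima bottom-up; expected O(1), worst case O(log m)."""
    while h < q[k]:
        q[k] = h
        i = m + (k + 1) // 2        # index of parent: m + ceil(k/2)
        if i >= 2 * m:              # the root was just updated
            break
        j = ((k - 1) ^ 1) + 1       # index of sibling (bitwise XOR)
        if q[j] >= q[i]:            # parent value determined by sibling
            break
        if h < q[j]:
            h = q[j]
        k = i

q = build_max_tree([5.0, 3.0, 8.0, 2.0])
update_max_tree(q, 4, 3, 1.0)       # replace q_3 = 8.0 by 1.0
```

After the update, the root q[7] again equals the maximum of the four leaves (now 5.0), without a full scan of the array.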
Decrementing the value of $q_k$ has no impact on the parent in this case. Continuing this argumentation shows that the expected number of node updates is bounded by the geometric series $1 + \frac{1}{2} + \frac{1}{4} + \ldots = 2$ and is therefore constant.

2.3 ProbMinHash1

Sampling with replacement ($r \to \infty$) for $L(d)$ together with uncorrelated sampling for $X(d)$ are the ingredients for the ProbMinHash1 algorithm. $r \to \infty$ and the choice $\lambda := \frac{1}{m}$ for the free parameter $\lambda$ simplify the recursion (10) to

$$x_i(d) \sim x_{i-1}(d) + \frac{1}{w(d)} \operatorname{Exp}(1).$$

The specialization of Algorithm 3 for this case is shown as Algorithm 5.

Algorithm 5: ProbMinHash1.
    Input: w
    Output: z_1, z_2, ..., z_m
    (q_1, q_2, ..., q_m) ← (∞, ∞, ..., ∞)
    forall d ∈ D such that w(d) > 0 do
        w_inv ← 1/w(d)
        R ← new PRNG with seed d
        h ← w_inv · R[Exp(1)]
        while h < q_max do
            k ← R[Uniform({1, 2, ..., m})]
            if h < q_k then
                q_k ← h
                z_k ← d
                update q_max using Algorithm 4
                if h ≥ q_max then break
            h ← h + w_inv · R[Exp(1)]

To analyze the runtime behavior we consider the number of points that need to be generated for each element $d_j$ to satisfy the stop condition $h \geq q_{\max}$. For the first element $d_1$, the stop condition is fulfilled as soon as all possible labels have appeared in $L(d_1)$. As this corresponds to the coupon collector's problem [32], [33], this takes $mH_m$ points on average, where $H_m := 1 + \frac{1}{2} + \ldots + \frac{1}{m}$ denotes the $m$-th harmonic number. Processing of the second element $d_2$ already takes less time, because $q_{\max}$ has decreased and is no longer infinite. Consider the point sequence obtained by combining and sorting $X(d_1)$ and $X(d_2)$, together with the corresponding joint label sequence given by $L(d_1)$ and $L(d_2)$. Again, the stop condition is satisfied as soon as all possible labels have shown up in the combined label sequence. As before, the expected sequence length is $mH_m$.
However, since the density of points that come from X(d_1) and X(d_2) is proportional to w(d_1) and w(d_2), respectively, we expect that the proportion of points originating from d_2 is w(d_2)/(w(d_1) + w(d_2)). Therefore, the expected number of generated points for d_2 is mH_m · w(d_2)/(w(d_1) + w(d_2)). Continuing in this way leads to mH_m · w(d_j)/Σ_{i=1}^{j} w(d_i) for d_j. If we assume that elements are not processed in any particular order with respect to their weights, the expectation of w(d_j)/Σ_{i=1}^{j} w(d_i) is equal to 1/j. Summation over all n input elements yields the expected total number of generated points, which is mH_m H_n. Using H_m = O(log m) and incorporating the fixed costs O(n) associated with the n processed elements, the amortized overall time complexity is O(n + m(log m)(log n)) ≤ O(n + m log² m). The inequality means that there exists a constant C such that n + m(log m)(log n) ≤ C(n + m log² m) holds for all m ≥ 1 and n ≥ 1. In particular, we have been able to prove this inequality for C = 3/2.

Theoretically, if all elements are processed in ascending order with respect to their weights, so that the sequence w(d_j)/Σ_{i=1}^{j} w(d_i) is rather constant than decreasing like 1/j, the time complexity would become O(nm log m) in the worst case. However, if the data is expected to be ordered, it is unlikely that it will have to be processed in streaming mode, since most likely a sorting step has been performed beforehand. Therefore, to avoid the worst-case behavior, the elements could either be processed in reverse order so that the weights decrease, or, if the data fits in memory, the elements could be processed in random order using Fisher-Yates shuffling [31], which is an O(n) operation.

2.4 ProbMinHash2

Our second algorithm uses sampling without replacement (r = 1) for L(d) and uncorrelated sampling for X(d).
Unfortunately, sampling without replacement is more expensive than sampling with replacement. It is usually done using Fisher-Yates shuffling, which requires an array g = (g_1, g_2, ..., g_m) of size m with initial values g_i = i [31]. Due to the stop condition, ProbMinHash only needs the labels of a few points for most input elements. Since the O(m) allocation and initialization costs would lead to an O(nm) algorithm, we propose Algorithm 6, a variant of Fisher-Yates shuffling. It reuses the array g to amortize allocation costs. Furthermore, it applies lazy initialization by using a permutation counter c and an additional array v that indicates already initialized components of g for the current permutation. v_i = c means that g_i has already been initialized. Otherwise, g_i is considered to be equal to its initial value i.

Algorithm 6: Lazy generation of random permutation elements based on Fisher-Yates shuffling.
  def InitPermutationGenerator(m)
    allocate array (g_1, g_2, ..., g_m)
    (v_1, v_2, ..., v_m) ← (0, 0, ..., 0)
    c ← 0
  def ResetPermutationGenerator()
    i ← 0
    c ← c + 1
  def GenerateNextPermutationElement(R)
    i ← i + 1
    j ← i + R[Uniform({0, 1, ..., m − i})]
    k ← g_j if v_j = c, else j
    g_j ← g_i if v_i = c, else i
    v_j ← c
    return k

Algorithm 7: ProbMinHash2.
  Input: w
  Output: z_1, z_2, ..., z_m
  (q_1, q_2, ..., q_m) ← (∞, ∞, ..., ∞)
  InitPermutationGenerator(m)
  forall d ∈ D such that w(d) > 0 do
    w_inv ← 1/w(d)
    R ← new PRNG with seed d
    ResetPermutationGenerator()
    i ← 1
    h ← w_inv · R[Exp(1)]
    while h < q_max do
      k ← GenerateNextPermutationElement(R)
      if h < q_k then             // always satisfied, if i = m
        q_k ← h
        z_k ← d
        update q_max using Algorithm 4
        if h ≥ q_max then break   // always satisfied, if i = m
      i ← i + 1
      h ← h + w_inv · β_i · R[Exp(1)]   // β_i := m/(m − i + 1)
To start a new permutation for the next input element it is sufficient to increment the counter, which is just an O(1) operation.

The free parameter in (10) is again chosen as λ := 1/m. Together with r = 1, the recursion simplifies to x_i(d) ∼ x_{i−1}(d) + (1/w(d)) · (m/(m − i + 1)) · Exp(1), and the very first and most frequently calculated point is simply given by x_1(d) ∼ (1/w(d)) Exp(1). The factors β_i := m/(m − i + 1) can be precomputed to avoid the costly floating-point divisions. ProbMinHash2 as a whole is shown as Algorithm 7. Sampling the labels from I_1 = {1, 2, ..., m} without replacement guarantees that the stop condition will always be satisfied after generating m points. The last point will always be greater than or equal to q_max. Therefore, the worst-case time complexity is bounded by O(nm), which equals the time complexity of P-MinHash. In contrast, there is no such worst-case upper bound for ProbMinHash1. It is obvious that the expected time complexity of ProbMinHash2 is bounded by that of ProbMinHash1, because ProbMinHash2 generates at most one point per label and thus fewer points, which according to (9) have no influence on h_k(d).

2.5 ProbMinHash3

The third algorithm combines sampling with replacement (r → ∞) for label generation with the correlated sampling approach for X(d). According to (11), p_i = (1/m)(1 − 1/m)^{i−1} as r → ∞.

Algorithm 8: ProbMinHash3, requires m ≥ 2.
  Input: w
  Output: z_1, z_2, ..., z_m
  (q_1, q_2, ..., q_m) ← (∞, ∞, ..., ∞)
  forall d ∈ D such that w(d) > 0 do
    w_inv ← 1/w(d)
    R ← new PRNG with seed d
    h ← w_inv · R[Exp(λ; 0, 1)]   // λ := log(1 + 1/(m − 1))
    i ← 1
    while h < q_max do
      k ← R[Uniform({1, 2, ..., m})]
      if h < q_k then
        q_k ← h
        z_k ← d
        update q_max using Algorithm 4
      i ← i + 1
      h ← w_inv · (i − 1)
      if h ≥ q_max then break
      h ← h + w_inv · R[Exp(λ; 0, 1)]
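The geometric form of p_i can be checked empirically: when labels are drawn uniformly with replacement, a fixed label first appears at draw i with probability (1/m)(1 − 1/m)^{i−1}. A quick Monte-Carlo sketch with hypothetical parameters (m = 16, a fixed seed, and a stdlib PRNG, none of which come from the paper):

```python
import random

m = 16
trials = 100_000
rng = random.Random(7)

# Distribution of the draw index at which label 0 first appears.
counts = [0] * 64
for _ in range(trials):
    i = 1
    while rng.randrange(m) != 0:   # draw labels until label 0 shows up
        i += 1
    if i <= len(counts):
        counts[i - 1] += 1
```

The empirical frequencies should match the formula to within Monte-Carlo noise.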
The interval boundaries a_i that satisfy Pr(a_{i−1} ≤ x < a_i) = p_i, if x is exponentially distributed with rate λw(d), are given by a_i = (i/(λw(d))) · log(1 + 1/(m − 1)). Choosing λ := log(1 + 1/(m − 1)), which requires m ≥ 2, yields a_i = i/w(d), and the points can be generated according to (12) using x_i(d) ∼ (i − 1)/w(d) + (1/w(d)) Exp(λ; 0, 1). ProbMinHash3 is shown as Algorithm 8. In contrast to ProbMinHash1 and ProbMinHash2, the stop condition within the inner loop makes use of the fact that the i-th point x_i(d) is sampled from [a_{i−1}, a_i) = [(i − 1)/w(d), i/w(d)). Therefore, if the lower bound of this interval is already equal to or greater than q_max, the stop condition will be satisfied in any case, and the generation of x_i(d) can be omitted. The time complexity is similar to that of ProbMinHash1. The same argumentation can be used to derive the upper bounds O(n + m(log m)(log n)) ≤ O(n + m log² m).

2.6 ProbMinHash4

The fourth variant combines sampling without replacement (r = 1) for label generation with correlated sampling for X(d). In this case (11) reduces to p_i = 1/m, and the interval boundaries are given by a_i = (1/(λw(d))) · log(1 + i/(m − i)). Setting again λ := log(1 + 1/(m − 1)) simplifies (12) for the first and most frequently computed point to x_1(d) ∼ (1/w(d)) Exp(λ; 0, 1). More generally, we obtain x_i(d) ∼ (1/w(d)) (γ_{i−1} + (γ_i − γ_{i−1}) Exp(λ_i; 0, 1)) with λ_i := log(1 + 1/(m − i)) and γ_i := log(1 + i/(m − i))/λ_1 for i < m, and x_m(d) ∼ (1/w(d)) (γ_{m−1} + δ Exp(1)) with δ := 1/λ_1. ProbMinHash4 is shown as Algorithm 9. Similar to ProbMinHash2, Algorithm 6 is used for sampling without replacement, and the number of generated points is limited by m, resulting in a worst-case time complexity of O(nm).

Algorithm 9: ProbMinHash4, requires m ≥ 2.
  Input: w
  Output: z_1, z_2, ..., z_m
  (q_1, q_2, ..., q_m) ← (∞, ∞, ..., ∞)
  InitPermutationGenerator(m)
  forall d ∈ D such that w(d) > 0 do
    w_inv ← 1/w(d)
    R ← new PRNG with seed d
    ResetPermutationGenerator()
    h ← w_inv · R[Exp(λ_1; 0, 1)]   // λ_i := log(1 + 1/(m − i))
    i ← 1
    while h < q_max do
      k ← GenerateNextPermutationElement(R)
      if h < q_k then
        q_k ← h
        z_k ← d
        update q_max using Algorithm 4
      i ← i + 1
      if w_inv · γ_{i−1} ≥ q_max then break   // γ_i := log(1 + i/(m − i))/λ_1
      if i < m then
        h ← w_inv · (γ_{i−1} + (γ_i − γ_{i−1}) · R[Exp(λ_i; 0, 1)])
      else
        h ← w_inv · (γ_{m−1} + δ · R[Exp(1)])   // δ := 1/λ_1
        if h < q_max then                        // h < q_max and i = m imply h < q_k
          k ← GenerateNextPermutationElement(R)
          q_k ← h
          z_k ← d
          update q_max using Algorithm 4
        break

Like ProbMinHash3, the first stop condition within the while loop makes use of the fact that the i-th point x_i(d) originates from [a_{i−1}, a_i). Since ProbMinHash4 avoids the generation of points with the same labels, its expected time complexity is bounded by that of ProbMinHash3.

2.7 Interleaved Element Processing

The presented algorithms process elements sequentially. Processing the first elements is much more time-consuming than processing later elements, when q_max has already decreased and the stop condition is satisfied earlier. In the best case, if q_max is already small enough, processing of an element can be terminated immediately when checking the stop condition for the first time. In any case, the first point x_1(d) of each element d must be generated. Therefore it makes sense to do what needs to be done first and to calculate only the first points of all elements in a first pass. Elements that do not satisfy the stop condition right after the first generated point need to be stored in a buffer, because further points may contribute to the signature.
In the next pass we take the elements from the buffer, generate and process the second smallest points x_2(d), and, if the stop condition is still not fulfilled, reinsert the elements into the buffer. This procedure is repeated until the buffer is finally empty.

The performance gain will be most significant for intermediate input sizes n, because we do not expect any speedup for the two extreme cases n = 1 and n → ∞. When processing just a single input element, interleaving is not possible and the same number of random points must be generated. For n → ∞, q_max will be small enough for all but the first elements anyway, so that the stop condition is almost always satisfied by their first points. The relative performance gain is negligible in this case.

The space requirements are given by the maximum number of elements that are simultaneously stored in the buffer. Obviously, this maximum is reached right after the first pass. The actual memory costs are composed of the elements and the corresponding states required for further point generation. This includes the PRNG states and, in the case of sampling without replacement, the states of the shuffling algorithm. It is not feasible to store an array of length m for each element in the buffer as needed by shuffling algorithms. Therefore, a more complex data structure, like a hash table that only stores the initialized components of that array, would be needed. We did a couple of experiments using hash tables, but found that the additional hash table lookups outweigh the performance gain due to interleaved processing. Therefore, we focused on ProbMinHash1 and ProbMinHash3, which are both based on sampling with replacement and do not rely on the sampling history. ProbMinHash1a and ProbMinHash3a are the corresponding logically equivalent algorithms using interleaved processing and are shown as Algorithm 10 and Algorithm 11, respectively.
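The logical equivalence of sequential and interleaved processing can be made concrete: because each element's point and label sequence depends only on its own seeded PRNG, and a skipped point h ≥ q_max can never fall below any q_k, both processing orders yield identical signatures. The following Python sketch places a simplified ProbMinHash1 (Algorithm 5) next to an interleaved variant in the spirit of ProbMinHash1a (Algorithm 10); it recomputes q_max naively instead of using Algorithm 4 and uses the stdlib PRNG as a stand-in for Wyrand, and the function names are ours.

```python
import math
import random

def pmh1_sequential(weights, m):
    """ProbMinHash1: process every element to completion, one after another."""
    q = [math.inf] * m
    z = [None] * m
    q_max = math.inf
    for d, w in weights.items():
        w_inv = 1.0 / w
        rng = random.Random(d)            # PRNG seeded with the element
        h = w_inv * rng.expovariate(1.0)  # first point x_1(d)
        while h < q_max:
            k = rng.randrange(m)          # label of the current point
            if h < q[k]:
                q[k] = h
                z[k] = d
                q_max = max(q)            # naive stand-in for Algorithm 4
                if h >= q_max:
                    break
            h += w_inv * rng.expovariate(1.0)
    return z

def pmh1_interleaved(weights, m):
    """Interleaved variant: i-th points of all surviving elements per pass."""
    q = [math.inf] * m
    z = [None] * m
    q_max = math.inf
    buf = []
    for d, w in weights.items():          # first pass: first points only
        w_inv = 1.0 / w
        rng = random.Random(d)
        h = w_inv * rng.expovariate(1.0)
        if h >= q_max:
            continue
        k = rng.randrange(m)
        if h < q[k]:
            q[k] = h
            z[k] = d
            q_max = max(q)
            if h >= q_max:
                continue
        buf.append([d, w_inv, rng, h])
    while buf:                            # further passes over the buffer
        nxt = []
        for entry in buf:
            d, w_inv, rng, h = entry
            if h >= q_max:
                continue                  # dropped without consuming the PRNG
            h += w_inv * rng.expovariate(1.0)
            if h >= q_max:
                continue
            k = rng.randrange(m)
            if h < q[k]:
                q[k] = h
                z[k] = d
                q_max = max(q)
                if h >= q_max:
                    continue
            entry[3] = h
            nxt.append(entry)
        buf = nxt
    return z
```

Both functions consume each element's PRNG stream in the same order (increment, then label), so the generated point prefixes coincide and the final signatures are identical.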
To analyze the space complexity we assume that the elements d_j are not processed in a sorted order with regard to their weights. As a consequence, the first points x_1(d_j) of different elements d_j can be considered to be identically distributed. The expected maximum buffer size is given by the sum of the individual probabilities that an element is added to the buffer during the first pass. For ProbMinHash1a an element d_j is added to the buffer if its first point x_1(d_j) is smaller than q_max. Since it takes mH_m elements on average, according to the coupon collector's problem [32], [33], until the labels of their first points l_1(d_j) cover all possible label values {1, 2, ..., m}, q_max is roughly given by the mH_m-th smallest point seen so far. The probability that the first point x_1(d_j) of the j-th element is among the mH_m smallest points is min(1, mH_m/j). As this corresponds to the probability that x_1(d_j) is smaller than q_max and that d_j is inserted into the buffer, summation of these probabilities over all n elements finally yields the expected buffer size Σ_{j=1}^{n} min(1, mH_m/j) = O(m(log m)(log n)). The expected time complexity of ProbMinHash1a is O(n + m(log m)(log n)), because the space complexity and the number of processed elements n are obvious lower bounds, while the time complexity of ProbMinHash1 is an obvious upper bound.

For ProbMinHash3a we first consider the unweighted case with all weights equal to 1. If we are in the i-th pass, when all points are sampled from [i − 1, i), the stop condition is always satisfied and elements are not (re)inserted into the buffer as soon as q_max ≤ i. On average, this will be the case after generating the first mH_m = O(m log m) points, because q_max is approximately given by the mH_m-th smallest point, as before.
The processing time of the remaining elements, which then immediately fulfill the stop condition, is naturally limited by O(n), resulting in an overall time complexity of O(n + m log m). The expected maximum number of elements in the buffer after the first pass is given by min(n, mH_m) ≤ O(m log m). It is remarkable that the space complexity does not depend on the input size n, in contrast to ProbMinHash1a.

In the general case, the space complexity of ProbMinHash3a depends on the distribution of weights F_w(y) = Pr(w ≤ y). Based on our experimental results and the following theoretical considerations, we assume that there is an upper bound independent of n as long as F_w has a power tail with index η > 1, which means that 1 − F_w(y) ∼ y^{−η} as y → ∞, and that the mean is finite. The first points x_1 of elements with weight w are distributed as x_1 | w ∼ (1/w) Exp(λ_1; 0, 1), as mentioned in Section 2.6.

Algorithm 10: ProbMinHash1a.
  Input: w
  Output: z_1, z_2, ..., z_m
  (q_1, q_2, ..., q_m) ← (∞, ∞, ..., ∞)
  b ← empty dynamic array    // initialize buffer
  s ← 0                      // number of elements in buffer
  forall d ∈ D such that w(d) > 0 do
    w_inv ← 1/w(d)
    R ← new PRNG with seed d
    h ← w_inv · R[Exp(1)]
    if h ≥ q_max then continue
    k ← R[Uniform({1, 2, ..., m})]
    if h < q_k then
      q_k ← h
      z_k ← d
      update q_max using Algorithm 4
      if h ≥ q_max then continue
    s ← s + 1
    b_s ← (d, w_inv, R, h)
  while s > 0 do
    t ← 0
    for j ← 1, 2, ..., s do
      (d, w_inv, R, h) ← b_j
      if h ≥ q_max then continue
      h ← h + w_inv · R[Exp(1)]
      if h ≥ q_max then continue
      k ← R[Uniform({1, 2, ..., m})]
      if h < q_k then
        q_k ← h
        z_k ← d
        update q_max using Algorithm 4
        if h ≥ q_max then continue
      t ← t + 1
      b_t ← (d, w_inv, R, h)
    s ← t
It can be shown that the power tail of F_w yields an asymptotic lower bound for the distribution function of the first points, F_{x_1}(y) = Pr(x_1 ≤ y) ≥ y · C · E(w) as y → 0, where E(w) denotes the average weight and C is some constant. q_max is again roughly given by the mH_m-th smallest point. Therefore, after inserting the first j elements, q_max is approximated by the (mH_m/j)-quantile q_max ≈ F_{x_1}^{−1}(mH_m/j) ≤ mH_m/(j · C · E(w)) as j → ∞. An element is added to the buffer only if w_inv < q_max. The corresponding probability is Pr(1/w < q_max) = 1 − F_w(1/q_max) ≤ 1 − F_w(j · C · E(w)/(mH_m)) ∝ j^{−η} as j → ∞. Since the hyperharmonic series Σ_{j=1}^{∞} j^{−η} converges for η > 1, the sum of the probabilities that elements are added to the buffer has an upper bound independent of n.

Algorithm 11: ProbMinHash3a, requires m ≥ 2.
  Input: w
  Output: z_1, z_2, ..., z_m
  (q_1, q_2, ..., q_m) ← (∞, ∞, ..., ∞)
  b ← empty dynamic array    // initialize buffer
  s ← 0                      // number of elements in buffer
  forall d ∈ D such that w(d) > 0 do
    w_inv ← 1/w(d)
    R ← new PRNG with seed d
    h ← w_inv · R[Exp(λ; 0, 1)]   // λ := log(1 + 1/(m − 1))
    if h ≥ q_max then continue
    k ← R[Uniform({1, 2, ..., m})]
    if h < q_k then
      q_k ← h
      z_k ← d
      update q_max using Algorithm 4
    if w_inv ≥ q_max then continue
    s ← s + 1
    b_s ← (d, w_inv, R)
  i ← 2
  while s > 0 do
    t ← 0
    for j ← 1, 2, ..., s do
      (d, w_inv, R) ← b_j
      h ← w_inv · (i − 1)
      if h ≥ q_max then continue
      h ← h + w_inv · R[Exp(λ; 0, 1)]
      if h ≥ q_max then continue
      k ← R[Uniform({1, 2, ..., m})]
      if h < q_k then
        q_k ← h
        z_k ← d
        update q_max using Algorithm 4
      if w_inv · i ≥ q_max then continue
      t ← t + 1
      b_t ← (d, w_inv, R)
    s ← t
    i ← i + 1

2.8 Truncated Exponential Sampling

ProbMinHash3 and ProbMinHash4 require the generation of many random values that follow a truncated exponential distribution Exp(λ; 0, 1). The straightforward approach, inverse transform sampling, requires the evaluation of a logarithm. We have not found any other method in the literature that, like the ziggurat method [34] for the exponential distribution, avoids expensive function calls. Therefore, we propose the following method based on rejection sampling.

Fig. 3 shows the function ρ(x) = e^{−λx}, which is proportional to the probability density function of Exp(λ; 0, 1). Clearly, if we sample points (x, y) uniformly from the region below ρ with x ∈ [0, 1) and y ≤ ρ(x), x will be distributed as Exp(λ; 0, 1). The region is split into A_1 = [0, 1) × [0, e^{−λ}) and the remaining part A_2. In order to achieve uniformity, a point is sampled from A_1 with probability |A_1|/|A_1 ∪ A_2|; otherwise, a point is taken from A_2. The area sizes are given by |A_1| = e^{−λ} and |A_1 ∪ A_2| = (1 − e^{−λ})/λ. To realize the corresponding Bernoulli trial we generate a uniformly distributed random value x ∈ [0, c_1) with c_1 := |A_1 ∪ A_2|/|A_1| = (e^λ − 1)/λ. Since Pr(x < 1) = |A_1|/|A_1 ∪ A_2|, we sample a point from A_1 if x < 1. In this case x is uniformly distributed over [0, 1) and can therefore be reused as the x-coordinate of the point we want to generate. Because A_1 is an axis-aligned rectangle, all x-coordinates are equally likely, so generating the y-coordinate can be omitted and x can be returned directly as result.

Rejection sampling is used to get a point from A_2 in the other case, when x ≥ 1. For that we first introduce a different y-scale defined by the transformation ỹ = (y − e^{−λ})/(1 − e^{−λ}), as shown in the right part of Fig. 3. Instead of sampling from A_2 directly, we sample points from the triangle A_3 ∪ A_4 ∪ A_6 and reject those which are not below ρ. To do so we sample from the rectangle A_3 ∪ A_4 ∪ A_5 = [0, 1) × [0, 0.5), which has the same area as the triangle, because the areas of A_5 and A_6 are equal. In case a sampled point belongs to A_5, it is mapped to A_6 by reflection at the point (0.5, 0.5). Next we need to test whether the point is below ρ. In order to avoid the expensive exponential function evaluation, we first check whether it is below the tangent at 0 or the tangent at 1, given by ỹ = 1 − xλ/(1 − e^{−λ}) and ỹ = (1 − x)λ/(e^λ − 1), respectively. Since ρ is convex, the point can be accepted in both cases. A test against ρ is only needed in the remaining case. If the point is finally accepted, its x-coordinate is returned as result.

Fig. 3. Sampling from a truncated exponential distribution Exp(λ; 0, 1). [Left panel: ρ: y = e^{−λx} over x ∈ [0, 1), split into A_1 = [0, 1) × [0, e^{−λ}) and A_2. Right panel: the rescaled coordinate ỹ with the regions A_3, A_4, A_5, A_6, the curve ρ: ỹ = (e^{λ(1−x)} − 1)/(e^λ − 1), the tangent at 0, ỹ = 1 − xλ/(1 − e^{−λ}), and the tangent at 1, ỹ = (1 − x)λ/(e^λ − 1); A_5 is reflected to A_6 at (0.5, 0.5), and c_2 marks the right boundary of A_3.]

Algorithm 12: Fast generation of random values from Exp(λ; 0, 1).
  Input: λ, R
  Output: x
  x ← c_1 · R[Uniform([0, 1))]    // c_1 := (e^λ − 1)/λ
  if x < 1 then return x          // sample from A_1
  loop                            // otherwise, start rejection sampling from A_2
    x ← R[Uniform([0, 1))]        // sample x from A_3 ∪ A_4 ∪ A_5
    if x < c_2 then return x      // point is in A_3, c_2 := log(2/(1 + e^{−λ}))/λ
    ỹ ← 0.5 · R[Uniform([0, 1))]  // sample ỹ from A_3 ∪ A_4 ∪ A_5
    if ỹ > 1 − x then             // point is in A_5
      (x, ỹ) ← (1 − x, 1 − ỹ)     // map from A_5 to A_6 by reflection at (0.5, 0.5)
    if x ≤ c_3 · (1 − ỹ) then return x               // below tangent at 0, c_3 := (1 − e^{−λ})/λ
    if ỹ · c_1 ≤ 1 − x then return x                 // below tangent at 1
    if ỹ · c_1 · λ ≤ exp(λ · (1 − x)) − 1 then return x   // below ρ

An additional performance optimization can be introduced when sampling from the rectangle A_3 ∪ A_4 ∪ A_5. We first sample its x-coordinate from [0, 1).
If it is less than c_2 := log(2/(1 + e^{−λ}))/λ, the point belongs to A_3, which is the part of the rectangle entirely below ρ. Therefore, a point in A_3 can be immediately accepted without considering its y-coordinate. Algorithm 12 summarizes the whole procedure. It is especially efficient for small λ, because then the first if-condition x < 1 is satisfied with high probability, which allows an early termination. For ProbMinHash3 we need random values from Exp(λ; 0, 1) with rate λ = log(1 + 1/(m − 1)). Therefore, Pr(x < 1) = λ/(e^λ − 1) = (m − 1) log(1 + 1/(m − 1)) ∼ 1 − 1/(2(m − 1)) as m → ∞. Since m is typically in the hundreds or even greater, drawing a value from Exp(λ; 0, 1) is almost as cheap as generating a uniform random value. For ProbMinHash4, the rate parameter increases with the number of points from λ_1 = log(1 + 1/(m − 1)), which is the same as for ProbMinHash3, to λ_{m−1} = log 2 for the second last point x_{m−1}(d). Even in this worst case the first if-condition is still satisfied with a probability of Pr(x < 1) = log 2 ≈ 69.3%.

2.9 Conventional Jaccard Similarity

Since the probability Jaccard similarity J_P is a generalization of the conventional Jaccard similarity J, all ProbMinHash algorithms can also be used to calculate signatures for J by allowing only binary weights w(d) ∈ {0, 1}. In this section we investigate whether the corresponding specializations lead to novel or already known locality-sensitive hash algorithms for the conventional Jaccard similarity J. For ProbMinHash1 and ProbMinHash2 the restriction to binary weights means that there is no need to compute w_inv, because it is always 1. As a consequence, the floating-point multiplications with w_inv can be saved. In addition, ProbMinHash1a benefits from a smaller memory footprint, because w_inv does not need to be stored in the buffer.
Since ProbMinHash1, ProbMinHash1a, and ProbMinHash2 are all based on uncorrelated point generation as discussed in Section 2.1, they are statistically equivalent to the original MinHash algorithm, in contrast to other state-of-the-art algorithms like OPH [11], SuperMinHash [16], or FSS [15]. Therefore they can be used as faster drop-in replacements without changing any statistical properties. This is especially important if signatures are used in the context of locality-sensitive hashing [20], [21], [22], where signature components are usually considered to be independent.

ProbMinHash3 and ProbMinHash4, which use correlated point generation, benefit even more from the restriction to binary weights. The interval boundaries a_i are the same for all elements in this case. This means that the i-th points of different elements are identically distributed. Since only the relative order of points plays a role according to (9), we can use any other continuous distribution. For the sake of simplicity, and also for performance reasons, we take the uniform distribution. This means we replace R[Exp(λ; 0, 1)] by R[Uniform([0, 1))] in ProbMinHash3 and ProbMinHash3a. In ProbMinHash4 we set γ_i = i and substitute both R[Exp(λ_i; 0, 1)] and δ · R[Exp(1)] by R[Uniform([0, 1))].

Although the ProbMinHash algorithms are new for the general case, some specialized variants for the conventional Jaccard similarity J relate to already known algorithms. ProbMinHash1 corresponds to the BagMinHash1 algorithm [3] if both are fully specialized for the unweighted case. ProbMinHash1a has some similarities to BagMinHash2, which uses a min-heap for buffering elements that are still able to contribute to the signature while processing them in ascending order of their last points. In contrast, ProbMinHash1a calculates the i-th smallest points of all relevant elements in the i-th pass, regardless of the preceding points.
In the unweighted case, the for-loop of ProbMinHash3a in Algorithm 11 is similar to OPH without densification [11] and to the first iteration of FSS [15], respectively. Thus, ProbMinHash3a is nearly equivalent if its while-loop does not contribute to the signature, which is typically the case for large input sizes n ≫ m. However, ProbMinHash3a additionally tracks q_max, allowing earlier termination. In this way the generation of the first label l_1(d), and also the access to the corresponding signature component, can often be avoided, which results in a small performance advantage. In contrast, the other algorithms must always randomly select and access at least one signature component for each element. The while-loop in Algorithm 11 can be regarded as a new alternative densification scheme for OPH [12], [13], [14]. At the expense of a buffer of size O(m log m), the estimation error for J is significantly reduced for small input sizes, as our experimental results will show.

ProbMinHash4 for the unweighted case corresponds to the SuperMinHash algorithm [16]. The main difference is again the stop condition. SuperMinHash only tracks a histogram of points, which only allows discrete stop limits. ProbMinHash4 keeps track of q_max, which is slightly more expensive. However, it pays off for larger input sizes, because the stop limit can go below 1. As a consequence, the label generation can be avoided in many cases, leading again to a slightly better performance. As ProbMinHash4 is logically equivalent to the SuperMinHash algorithm, it also shares the theoretically proven better variance of the Jaccard similarity estimator given by (7).

3 EXPERIMENTS

All algorithms were implemented using C++.
The corresponding source code and the Python scripts to reproduce the results and figures shown in the following are available on GitHub at https://github.com/oertl/probminhash.

The PRNG used by the presented algorithms is particularly important for good performance, since it is called in the innermost loops. For our theoretical considerations we have assumed an ideal random number generator. Therefore, the output of the chosen PRNG should be indistinguishable from that; poor randomness would lead to estimation errors significantly different from the theoretical predictions. We used the Wyrand algorithm (version 4), which was recently developed and published on GitHub [35]. It is very fast, has a state of 64 bits, and passes a series of statistical quality tests. Seeded with a 64-bit hash value of the input element, it produces a sequence of 64-bit pseudo-random integers. We consume random bits very economically. For example, to generate a double-precision floating-point number from [0, 1), 53 random bits, corresponding to the significand precision, are sufficient. Only when all 64 bits are consumed is the next batch of 64 bits generated.

We would like to point out that the use of a weak PRNG could potentially cause our algorithms to run endlessly. On the one hand, random values are partly generated using rejection sampling techniques, for which the number of trials follows a geometric distribution; on the other hand, ProbMinHash1 and ProbMinHash3, which are based on sampling with replacement, require that all possible label values appear at least once to satisfy the stop condition for the first element. We have never experienced an endless iteration with our implementation so far. However, if this is a concern, the number of iterations of the corresponding loops should be explicitly limited.
The limits should be chosen such that, assuming an ideal random number generator, they are reached only with a negligible probability. For example, the probability that any label does not appear among the first μm log m points is less than m^{1−μ} [33]. All algorithms, including P-MinHash, were implemented with the same methods of random value generation to allow a fair comparison. The ziggurat method [34], as implemented in the Boost C++ libraries [36], was used to generate exponentially distributed random values. We applied Algorithm 12 for truncated exponential distributions. Uniform random integers were produced as described in [37].

All our experiments were performed with synthetic data. The reason for this is that realistic data sets usually do not contain enough different pairs of sets that have exactly the same predefined similarity. However, this is fundamental for verifying the theoretically predicted distribution of estimation errors. Our experience is that testing locality-sensitive hash functions with realistic data sets is not very reliable. For example, if the algorithms presented in [7], [9], [10] had been checked with synthetic data using our test setup, it would have been easy to find that they are not suitable for the claimed metrics. More such examples can be found in [3]. Another advantage of synthetic data is that it facilitates the reproduction of our results. Although the test data consists of sets of randomly generated 64-bit integers, it is also representative of real-world situations. Realistic elements can always be reduced by hashing to 64-bit integers first, and with a good hash function they do not differ from the random numbers of our synthetic data sets.

For comparison, all experiments were also performed with P-MinHash [8], which is the state of the art for the probability Jaccard similarity J_P.
Other algorithms for weighted sets, like ICWS [4] or BagMinHash [3], were not considered, because they serve to calculate signatures for the different Jaccard similarities J_W or J_N, as mentioned in Section 1. Since these algorithms are also more complex, they tend to be much slower than the algorithms for J_P anyway. For our examples with binary weights w(d) ∈ {0, 1}, where J_P corresponds to the conventional Jaccard similarity J, we also compared our ProbMinHash algorithms to MinHash, OPH with optimal densification [13], and SuperMinHash [16]. The latter two represent the state of the art of one-pass algorithms for the conventional Jaccard similarity J.

3.1 Verification

To verify our proposed algorithms we applied them to 12 different examples. Each example is characterized by a multiset W of weight value pairs. Each pair represents the weights of some random element in two different sets, respectively. Given W, arbitrarily many pairs of weighted sets A and B can be generated whose weight functions satisfy W = ∪_{d: w_A(d)>0 ∨ w_B(d)>0} {(w_A(d), w_B(d))}. This is done by drawing a random element for each weight value pair in W, which is then added to sets A and B together with the corresponding weights, respectively. The resulting pairs of weighted sets will always have the same probability Jaccard similarity J_P, as it is uniquely defined by W. 10 000 different pairs of such weighted random sets have been generated for each example. After computing the corresponding signatures with sizes m ∈ {1, 2, 4, 8, ..., 2^14}, the similarity was estimated using (3), and the empirical mean squared error (MSE) with respect to the true J_P was calculated.
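The fact that J_P is uniquely defined by W is easy to reproduce with the closed form of the probability Jaccard similarity, J_P(A, B) = Σ_{d: w_A(d)>0 ∧ w_B(d)>0} 1/Σ_{d'} max(w_A(d')/w_A(d), w_B(d')/w_B(d)). A Python sketch (our own helper, with weighted sets represented as dicts) checked against two of the example values used in the experiments:

```python
def prob_jaccard(wa, wb):
    """Exact probability Jaccard similarity of two weighted sets given as
    dicts mapping elements to positive weights (absent means weight 0)."""
    keys = set(wa) | set(wb)
    total = 0.0
    for d in keys:
        if wa.get(d, 0) == 0 or wb.get(d, 0) == 0:
            continue   # elements missing on one side contribute nothing
        s = sum(max(wa.get(e, 0) / wa[d], wb.get(e, 0) / wb[d]) for e in keys)
        total += 1.0 / s
    return total
```

Only the multiset of weight pairs matters, not the identity of the drawn elements, so every pair of sets generated from the same W yields the same value.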
Since the expected empirical MSE for P-MinHash and for the statistically equivalent algorithms ProbMinHash1 and ProbMinHash2 is equal to J_P(1 − J_P)/m, we considered the corresponding relative empirical MSE, which was finally plotted over the signature size m for all 12 examples in Fig. 4. The regions covering the middle 99.99% of the expected z-scores are also shown to indicate realistic deviations from the expected empirical MSE. The z-score is obtained from the empirical MSE by normalization. The expectation and the variance of the empirical MSE are given by J_P(1 − J_P)/m and (J_P^2 (1 − J_P)^2 / (m^2 c)) (2 − 6/m) + J_P(1 − J_P)/(m^3 c), respectively [38], where c = 10 000 is the sample size.

Fig. 4. The empirical mean squared error (MSE) relative to J_P(1 − J_P)/m over the signature size m for different multisets of weight pairs W. Each data point is calculated from 10 000 pairs of randomly generated sets with weight functions satisfying W = ⋃_{d : w_A(d) > 0 ∨ w_B(d) > 0} {(w_A(d), w_B(d))}. The gray band covers the middle 99.99% of all z-scores, if the number of equal signature components is binomially distributed with success probability J_P. [The 12 panels compare P-MinHash and ProbMinHash1–4 (plus MinHash, SuperMinHash, OPH, and the theoretical curves α(m, n) for the binary-weight examples) for: J_P = 0.350, W = {(3,20), (30,7)}; J_P = 0.620, W = {(0,2), (3,4), (6,3), (2,4)}; J_P = 0.377, W = {(4,2)^15, (1,4)^10, (12,0)^5}; J_P = 0.333, W = {(0,1), (1,0), (1,1)}; J_P = 0.804, W = {(1,3)^300, (2,1)^500, (5,4)^700}; J_P = 0.082, W = ⋃_{u=0}^{300} {(0.96^u, 1.01^u)}; J_P = 0.852, W = ⋃_{u=0}^{1000} {(1.001^u, 1.002^u)}; J_P = 0.800, W = {(0,1)^30, (1,0)^10, (1,1)^160}; J_P = 0.418, W = ⋃_{u=0}^{1000} {(u, 1000 − u)}; J_P = 0.048, W = ⋃_{u=0}^{500} {(0.98^u, 1.01^u)}; J_P = 0.701, W = ⋃_{u=0}^{1000} {(0.999^u, 1.001^u)}; J_P = 0.600, W = {(0,1)^300, (1,0)^500, (1,1)^1200}.]

The empirical MSEs for P-MinHash, ProbMinHash1, and ProbMinHash2 agree with the theoretical prediction, as the observed relative error is actually close to 1 and within the expected variation range. However, for ProbMinHash3 and ProbMinHash4 we observed that the error is significantly smaller, especially if the signature size m exceeds the input size n. As both algorithms are not defined for m = 1, the corresponding points are missing in Fig. 4. Depending on the example, the correlated generation of points is able to reduce the empirical MSE by up to a factor of two.
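The normalization just described can be made concrete. The following sketch, with a hypothetical helper name, computes the empirical MSE of a batch of similarity estimates relative to its expectation J_P(1 − J_P)/m, together with the z-score derived from the variance stated above:

```python
import math

def relative_mse_and_z(estimates, jp, m):
    """Given c similarity estimates for set pairs with true similarity jp,
    computed from signatures of size m, return the empirical MSE relative
    to its expectation jp*(1-jp)/m and the corresponding z-score, using
    the expectation and variance of the empirical MSE stated in the text
    (cf. [38])."""
    c = len(estimates)
    mse = sum((e - jp) ** 2 for e in estimates) / c
    expectation = jp * (1 - jp) / m
    variance = (jp ** 2 * (1 - jp) ** 2 * (2 - 6 / m) / (m ** 2 * c)
                + jp * (1 - jp) / (m ** 3 * c))
    return mse / expectation, (mse - expectation) / math.sqrt(variance)
```

A relative MSE close to 1 and a z-score within the plotted 99.99% band then indicate agreement with the binomial model.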
We also observed that the empirical MSE of ProbMinHash4 is slightly smaller than that of ProbMinHash3 for tiny sets, as can be seen for the examples with W = {(3, 20), (30, 7)} and W = {(0, 1), (1, 0), (1, 1)}.

The last column of Fig. 4 shows the results for three examples with binary weights, for which we applied the unweighted variants of the ProbMinHash algorithms as described in Section 2.9. Furthermore, we also considered MinHash, OPH with optimal densification [13], and SuperMinHash [16]. Due to the independent signature components, the relative error of MinHash is, as expected, equal to 1. The theoretical relative MSE of SuperMinHash is given by (8) and is also shown in Fig. 4. Since ProbMinHash4 corresponds to SuperMinHash in the unweighted case, the empirical MSEs of both were in perfect agreement with the theoretical prediction. The relative error of OPH is larger than that of SuperMinHash and may even be much larger than that of MinHash for n small compared to m. ProbMinHash1a and ProbMinHash3a were also covered by our experiments. However, their logical equivalence to ProbMinHash1 and ProbMinHash3, respectively, led to identical results, which were therefore omitted from Fig. 4.

3.2 Performance

The performance of the presented algorithms was analyzed on a Dell Precision 5530 notebook with an Intel Core i9-8950HK processor and 32 GB of memory. The average calculation time for a single signature was measured for signature sizes 256, 1024, and 4096, and different assumed weight distributions. Fig. 5 shows the results as log-log plots over the input size n from 1 to 10^6. For each data point we first generated 100 randomly weighted sets and kept them in main memory to avoid distortion of the measurement. Only then was the time for calculating the corresponding signatures taken. All ProbMinHash algorithms are significantly faster than the linearly scaling P-MinHash algorithm for large n, with break-even points between 10 and 100.
The maximum speedup factor is in the order of m, because only the first point needs to be calculated for most elements, compared to the m hash values per element in the case of P-MinHash. The maximum speedup of ProbMinHash3, ProbMinHash3a, and ProbMinHash4 tends to be slightly higher, because the random value generation is less costly for the truncated than for the regular exponential distribution. P-MinHash is faster for very small n. In the extreme case n = 1, all ProbMinHash variants which sample with replacement have a time complexity of O(m log m) and are therefore significantly slower. ProbMinHash2 and ProbMinHash4 do slightly better, because they sample without replacement and, like P-MinHash, have a complexity of O(m). Nevertheless, they are still about 3 times slower than P-MinHash.

Fig. 5. The average calculation time for different signature sizes m and for different assumed distributions of w(d) over the input size n. [The panels cover m ∈ {256, 1024, 4096} and w(d) ∼ Exp(1), Pareto(1, 2), Pareto(1, 0.5), and w(d) = 1, comparing ProbMinHash1, ProbMinHash1a, ProbMinHash2, ProbMinHash3, ProbMinHash3a, ProbMinHash4, and P-MinHash, plus MinHash, SuperMinHash, and OPH in the unweighted case.]

The last column in Fig. 5 presents the results of examples with binary weights, for which we also measured the performance of MinHash, OPH with optimal densification [13], and SuperMinHash [16]. P-MinHash is about an order of magnitude slower than MinHash, because the random value generation is more expensive for the exponential distribution than for the uniform distribution. Comparison of the ProbMinHash algorithms with MinHash gives break-even points between n = 50 and n = 1000. The specialized variants of ProbMinHash3, ProbMinHash3a, and ProbMinHash4 are faster than SuperMinHash for large n, which comes from the more adaptive stop limit as discussed in Section 2.9. OPH has by far the best performance for input sizes n that are in the order of m, but is quite slow for small n.
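The measurement procedure, generating all inputs up front and only afterwards timing the signature calculation, can be sketched as a small harness. Here `compute_signature` is a placeholder for any of the discussed algorithms and is an assumption of this sketch, not an interface from the paper:

```python
import time

def average_signature_time(compute_signature, weighted_sets):
    """Average time to compute one signature. All inputs must already be
    generated and resident in memory, so that input generation does not
    distort the measurement, mirroring the procedure described above."""
    start = time.perf_counter()
    for ws in weighted_sets:
        compute_signature(ws)
    return (time.perf_counter() - start) / len(weighted_sets)
```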
Interleaved processing as described in Section 2.7 is not always faster for intermediate input sizes n, as observed in the third column of Fig. 5 when comparing ProbMinHash1a and ProbMinHash3a with ProbMinHash1 and ProbMinHash3, respectively. This probably depends on whether the expected weight is finite, as a significant speedup was observed for the Pareto(1, 2) distribution with a mean of 2 but not for the Pareto(1, 0.5) distribution with an infinite mean. Since an infinite average weight is not common in practice, we almost always expect an improvement with interleaved processing.

We also analyzed the additional space requirements of ProbMinHash1a and ProbMinHash3a. Fig. 6 shows the average and the middle 99% of all observed buffer sizes when calculating the signatures of 10 000 randomly generated weighted sets of size n. The results for ProbMinHash1a perfectly match the theoretical considerations in Section 2.7. Also for ProbMinHash3a our prediction is correct that the average buffer size reaches a plateau for weight distributions with a power tail with an index less than 1. In the unweighted case shown in the last column of Fig. 6 we could even confirm the predicted level of this plateau.

3.3 Practical Considerations

The optimal choice among the presented algorithms depends on the concrete application. The primary question to be answered is whether signature components should be independent or not. For example, if P-MinHash or MinHash should be replaced without changing the previous statistical behavior, or if it is desirable that signature components are statistically independent, as is the case for locality-sensitive hashing, we recommend using either ProbMinHash1a or ProbMinHash2. ProbMinHash1 is less relevant, because it is slower than ProbMinHash2 for very small input sizes n and does not have a worst-case runtime complexity of O(nm).
ProbMinHash1a, which is logically equivalent to ProbMinHash1, might be more attractive as it is faster for medium n, provided one is willing to sacrifice more memory. On the other hand, if the signature is used for similarity estimation, we recommend ProbMinHash3a or ProbMinHash4. Both are based on correlated generation and therefore allow more accurate similarity estimation for n not much greater than the signature size m. Our experiments have shown that ProbMinHash4 gives a better error reduction and is also faster for small n than ProbMinHash3, so that the latter is rather the second choice. However, if speed is most important, the logically equivalent ProbMinHash3a algorithm is recommended. It is the fastest variant for most input sizes n at the expense of additional memory.

Fig. 6. The average maximum buffer size as a function of the number of input elements n for ProbMinHash1a and ProbMinHash3a. The shown bands cover the middle 99% of all observed values. [The panels cover m ∈ {256, 1024, 4096} and w(d) ∼ Exp(1), Pareto(1, 2), Pareto(1, 0.5), and w(d) = 1, and also show the theoretical curves Σ_{i=1}^{n} min(1, m H_m / i) and min(n, m H_m).]

Since there are more options in the case of the conventional Jaccard similarity J and no algorithm completely outperforms all others, the optimal choice depends on the expected distribution of n and also on the requirements regarding the estimation error for input sizes n in the range of m or smaller. However, MinHash, ProbMinHash1a, or ProbMinHash2 should be used when independence of signature components is important.

4 Outlook

Some other metrics might also benefit from the presented ideas. For example, there is the Lempel-Ziv Jaccard distance, which is a generic similarity measure defined on the set of binary subsequences resulting during Lempel-Ziv compression [18], [39]. Our fast algorithms could make it feasible to incorporate weights as proposed in [40]. Another example is OrderMinHash, which was recently proposed as a locality-sensitive hash algorithm for the edit similarity between sequences [41].
The signature calculation requires the smallest l hash values for each component instead of just the smallest as with MinHash. We already implemented statistically equivalent algorithms called FastOrderMinHash1, FastOrderMinHash1a, and FastOrderMinHash2 based on ProbMinHash1, ProbMinHash1a, and ProbMinHash2, respectively. Our first results, shown in Fig. 7, look very promising, as the calculation time was reduced by an order of magnitude for longer sequences.

Fig. 7. The average calculation time of signatures for the edit similarity. [Two panels with m = 1024 and l = 2 or l = 5, comparing OrderMinHash with FastOrderMinHash1, FastOrderMinHash1a, and FastOrderMinHash2 over the sequence length n.]

Apart from that, there are some open theoretical questions. We could not yet prove mathematically that the variance of ProbMinHash3 and ProbMinHash4 is never worse than that of P-MinHash. The only exception is the unweighted case of ProbMinHash4, which corresponds to the SuperMinHash algorithm and for which the variance is given by (7). Furthermore, we assumed for our complexity analysis that elements are unordered with respect to their weights. It would be interesting to see how much an ordering affects the performance of the ProbMinHash algorithms. It is clear that it is advantageous to process elements in descending order with respect to their weights. The opposite is the case with ascending order.

5 Conclusion

We have introduced new locality-sensitive hash algorithms for the probability Jaccard similarity and, as a by-product, also for the conventional Jaccard similarity. These algorithms largely show significantly better performance than other state-of-the-art methods. In addition, the calculated signatures partly allow a more precise similarity estimation. They can therefore improve the performance and accuracy of existing applications and also open up new applications. We therefore expect that our algorithms will soon be used in practice.

References

[1] A. Z. Broder, "On the resemblance and containment of documents," in Proceedings of Compression and Complexity of Sequences, 1997, pp. 21–29.
[2] J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets. Cambridge University Press, 2014.
[3] O. Ertl, "BagMinHash – Minwise hashing algorithm for weighted sets," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2018, pp. 1368–1377.
[4] S. Ioffe, "Improved consistent sampling, weighted minhash and L1 sketching," in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM), 2010, pp. 246–255.
[5] A. Shrivastava, "Simple and efficient weighted minwise hashing," in Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS), 2016, pp. 1498–1506.
[6] P. Li, "0-bit consistent weighted sampling," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2015, pp. 665–674.
[7] D. Yang, B. Li, L. Rettig, and P. Cudré-Mauroux, "HistoSketch: Fast similarity-preserving sketching of streaming histograms with concept drift," in Proceedings of the 17th IEEE International Conference on Data Mining (ICDM), 2017, pp. 545–554.
[8] R. Moulton and Y. Jiang, "Maximally consistent sampling and the Jaccard index of probability distributions," in Proceedings of the 6th ICDM Workshop on High Dimensional Data Mining (HDM), 2018, pp. 347–356.
[9] D. Yang, B. Li, L. Rettig, and P. Cudré-Mauroux, "D2HistoSketch: Discriminative and dynamic similarity-preserving sketching of streaming histograms," IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 10, pp. 1898–1911, 2019.
[10] E. Raff, J. Sylvester, and C. Nicholas, "Engineering a simplified 0-bit consistent weighted sampling," in Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM), 2018, pp. 1203–1212.
[11] P. Li, A. Owen, and C. Zhang, "One permutation hashing," in Proceedings of the 26th Conference on Neural Information Processing Systems (NIPS), 2012, pp. 3113–3121.
[12] T. Mai, A. Rao, M. Kapilevich, R. A. Rossi, Y. Abbasi-Yadkori, and R. Sinha, "On densification for minwise hashing," in Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence (UAI), 2019.
[13] A. Shrivastava, "Optimal densification for fast and accurate minwise hashing," in Proceedings of the 34th International Conference on Machine Learning (ICML), 2017, pp. 3154–3163.
[14] A. Shrivastava and P. Li, "Improved densification of one permutation hashing," in Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI), 2014, pp. 732–741.
[15] S. Dahlgaard, M. B. T. Knudsen, and M. Thorup, "Fast similarity sketching," in Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2017, pp. 663–671.
[16] O. Ertl, "SuperMinHash – A new minwise hashing algorithm for Jaccard similarity estimation," arXiv:1706.05698 [cs.DS], 2017.
[17] N. Nissim, O. Lahav, A. Cohen, Y. Elovici, and L. Rokach, "Volatile memory analysis using the MinHash method for efficient and secured detection of malware in private cloud," Computers & Security, vol. 87, p. 101590, 2019.
[18] E. Raff, J. Aurelio, and C. Nicholas, "PyLZJD: An easy to use tool for machine learning," in Proceedings of the 18th Python in Science Conference, 2019, pp. 101–106.
[19] W. P. M. Rowe, A. P. Carrieri, C. Alcon-Giner, S. Caim, A. Shaw, K. Sim, J. S. Kroll, L. J. Hall, E. O. Pyzer-Knapp, and M. D. Winn, "Streaming histogram sketching for rapid microbiome analytics," Microbiome, vol. 7, no. 1, p. 40, 2019.
[20] M. Bawa, T. Condie, and P. Ganesan, "LSH forest: Self-tuning indexes for similarity search," in Proceedings of the 14th International Conference on World Wide Web (WWW), 2005, pp. 651–660.
[21] P. Indyk and R. Motwani, "Approximate nearest neighbors: Towards removing the curse of dimensionality," in Proceedings of the 30th Annual ACM Symposium on Theory of Computing (STOC), 1998, pp. 604–613.
[22] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, "Multi-probe LSH: Efficient indexing for high-dimensional similarity search," in Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB), 2007, pp. 950–961.
[23] P. Li and A. C. König, "b-bit minwise hashing," in Proceedings of the 19th International Conference on World Wide Web (WWW), 2010, pp. 671–680.
[24] M. S. Charikar, "Similarity estimation techniques from rounding algorithms," in Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC), 2002, pp. 380–388.
[25] A. Andoni, P. Indyk, T. Laarhoven, and I. Razenshteyn, "Practical and optimal LSH for angular distance," in Proceedings of the 28th Conference on Neural Information Processing Systems (NIPS), 2015, pp. 1225–1233.
[26] T. Christiani and R. Pagh, "Set similarity search beyond MinHash," in Proceedings of the 49th Annual ACM Symposium on Theory of Computing (STOC), 2017, pp. 1094–1107.
[27] A. Shrivastava and P. Li, "In defense of minhash over simhash," in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), 2014, pp. 886–894.
[28] D. Yang, P. Rosso, B. Li, and P. Cudré-Mauroux, "NodeSketch: Highly-efficient graph embeddings via recursive sketching," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2019, pp. 1162–1172.
[29] L. Devroye, Non-Uniform Random Variate Generation. Springer, 1986.
[30] S. M. Ross, Introduction to Probability Models, 12th ed. Academic Press, 2019.
[31] R. A. Fisher and F. Yates, Statistical Tables for Biological, Agricultural and Medical Research. Oliver & Boyd, London, 1938.
[32] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 3rd ed. MIT Press, 2013.
[33] M. Mitzenmacher and E. Upfal, Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis, 2nd ed. Cambridge University Press, 2017.
[34] G. Marsaglia and W. W. Tsang, "The ziggurat method for generating random variables," Journal of Statistical Software, vol. 5, no. 8, pp. 1–7, 2000.
[35] W. Yi. (2019) wyhash. [Online]. Available: https://github.com/wangyi-fudan/wyhash
[36] Boost C++ Libraries. [Online]. Available: http://www.boost.org
[37] D. Lemire, "Fast random integer generation in an interval," ACM Transactions on Modeling and Computer Simulation, vol. 29, no. 1, pp. 3:1–3:12, 2019.
[38] Mathematics Stack Exchange. [Online]. Available: https://math.stackexchange.com/questions/3598867
[39] E. Raff and C. Nicholas, "An alternative to NCD for large sequences, Lempel-Ziv Jaccard distance," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2017, pp. 1007–1015.
[40] E. Raff and C. Nicholas, "Malware classification and class imbalance via stochastic hashed LZJD," in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (AISec), 2017, pp. 111–120.
[41] G. Marçais, D. DeBlasio, P. Pandey, and C. Kingsford, "Locality-sensitive hashing for the edit distance," Bioinformatics, vol. 35, no. 14, pp. i127–i135, 2019.

Otmar Ertl is a Senior Software Mathematician at Dynatrace Research. He received the Ph.D. degree in technical sciences from the Vienna University of Technology in 2010. His current research interests include probabilistic algorithms and data structures.
