Finding the True Frequent Itemsets
Matteo Riondato† and Fabio Vandin‡

Tuesday 11th September, 2018

Abstract

Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It requires to identify all itemsets appearing in at least a fraction θ of a transactional dataset D. Often though, the ultimate goal of mining D is not an analysis of the dataset per se, but the understanding of the underlying process that generated it. Specifically, in many applications D is a collection of samples obtained from an unknown probability distribution π on transactions, and by extracting the FIs in D one attempts to infer itemsets that are frequently (i.e., with probability at least θ) generated by π, which we call the True Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the generative process, the set of FIs is only a rough approximation of the set of TFIs, as it often contains a huge number of false positives, i.e., spurious itemsets that are not among the TFIs. In this work we design and analyze an algorithm to identify a threshold θ̂ such that the collection of itemsets with frequency at least θ̂ in D contains only TFIs with probability at least 1 − δ, for some user-specified δ. Our method uses results from statistical learning theory involving the (empirical) VC-dimension of the problem at hand. This allows us to identify almost all the TFIs without including any false positive. We also experimentally compare our method with the direct mining of D at frequency θ and with techniques based on widely-used standard bounds (i.e., the Chernoff bounds) of the binomial distribution, and show that our algorithm outperforms these methods and achieves even better results than what is guaranteed by the theoretical analysis.

Keywords: Frequent itemsets, VC-dimension, False positives, Distribution-free methods, Frequency threshold identification.
1 Introduction

The extraction of association rules is one of the fundamental primitives in data mining and knowledge discovery from large databases [1]. In its most general definition, the problem can be reduced to identifying frequent sets of items, or frequent itemsets, appearing in at least a fraction θ of all transactions in a dataset, where θ is provided in input by the user. Frequent itemsets and association rules are not only of interest for classic data mining applications (e.g., market basket analysis), but are also useful for further data analysis and mining tasks, including clustering, classification, and indexing [11, 12].

In most applications, the set of frequent itemsets is not interesting per se. Instead, the mining results are used to infer properties of the underlying process that generated the dataset. Consider for example the following scenario: a researcher would like to identify frequent associations (i.e., itemsets) between preferences among Facebook users. To this end, she sets up an online survey which is filled out by a small fraction of Facebook users (some users may even take the survey multiple times). Using this information, the researcher wants to infer the associations (itemsets) that are frequent for the entire Facebook population. In fact, the whole Facebook population and the online survey define the underlying process that generated the dataset observed by the researcher. In this work we are interested in answering the following question: how can we use the latter (the observed dataset) to identify itemsets that are frequent in the former (the whole population)? This is a very natural question, as is the underlying assumption that the observed dataset is

∗ Work supported in part by NSF grant IIS-1247581. This is an extended version of the work that appeared as [21].
† Department of Computer Science, Brown University, Providence, RI, USA. matteo@cs.brown.edu. Contact author.
‡ Department of Computer Science, Brown University, Providence, RI, USA and Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark. vandinfa@imada.sdu.dk.

representative of the generating process. For example, in market basket analysis, the observed purchases of customers are used to infer the future purchase habits of all customers, while assuming that the purchase behavior that generated the dataset is representative of the one that will be followed in the future. A natural and general model to describe these concepts is to assume that the transactions in the dataset D are independent identically distributed (i.i.d.) samples from an unknown probability distribution π defined on all possible transactions built on a set of items. Since π is fixed, each itemset A has a fixed probability t_π(A) to appear in a transaction sampled from π. We call t_π(A) the true frequency of A (w.r.t. π). The true frequency corresponds to the fraction of transactions that contain the itemset A among an infinite set of transactions. The real goal of the mining process is then to identify itemsets that have true frequency t_π at least θ, i.e., the True Frequent Itemsets (TFIs). In the market basket analysis example, D contains the observed purchases of customers, the unknown distribution π describes the purchase behavior of the customers as a whole, and we want to analyze D to find the itemsets that have probability (i.e., true frequency) at least θ to be bought by a customer. Since D represents only a finite sample from π, the set F of frequent itemsets of D w.r.t.
θ only provides an approximation of the True Frequent Itemsets: due to the stochastic nature of the generative process, F may contain a number of false positives, i.e., itemsets that appear among the frequent itemsets of D but whose true frequency is smaller than θ. At the same time, some itemsets with true frequency greater than θ may have a frequency in D that is smaller than θ (false negatives), and therefore not be in F. This implies that one can not aim at identifying all and only the itemsets having true frequency at least θ. Even worse, from the data analyst's point of view, there is no guarantee or bound on the number of false positives reported in F.

Consider the following scenario as an example. Let A and B be two (disjoint) sets of pairs of items. The set A contains 1,000 disjoint pairs, while B contains 10,000 disjoint pairs. Let π be such that, for any pair (a, a′) ∈ A, we have t_π((a, a′)) = 0.1, and for any pair (b, b′) ∈ B, we have t_π((b, b′)) = 0.09. Let D be a dataset of 10,000 transactions sampled from π. We are interested in finding pairs of items that have true frequency at least θ = 0.095. If we extract the pairs of items with frequency at least θ in D, it is easy to see that in expectation 50 of the 1,000 pairs from A will have frequency in D below 0.095, and in expectation 400 pairs from B will have frequency in D above 0.095. Therefore, the set of pairs that have frequency at least θ in D does not contain some of the pairs that have true frequency at least θ (false negatives), but includes a huge number of pairs that have true frequency smaller than θ (false positives). In general, one would like to avoid false positives and at the same time find as many TFIs as possible. These are somewhat contrasting goals, and care must be taken to achieve a good balance between them.
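The expected counts in this example follow directly from the binomial distribution of the observed supports. The sketch below (pure Python, illustrative only) recomputes them by summing the binomial pmf in log-space:

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summing the pmf in log-space."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log(1 - p))
        total += math.exp(log_pmf)
    return total

n = 10_000
k = round(n * 0.095)  # a pair is reported iff it appears in >= 950 transactions

# Pairs in A (true frequency 0.10): chance a pair falls below the threshold.
p_fn = binom_cdf(k - 1, n, 0.10)
# Pairs in B (true frequency 0.09): chance a pair reaches the threshold.
p_fp = 1.0 - binom_cdf(k - 1, n, 0.09)

print(round(1_000 * p_fn), round(10_000 * p_fp))  # roughly 50 and 400, as in the text
```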
A naïve but overly conservative method to avoid false positives involves the use of Chernoff and union bounds [19]. The frequency f_D(A) of an itemset A in D is a random variable with binomial distribution B(|D|, t_π(A)). It is possible to use standard methods like the Chernoff and the union bounds to bound the deviation of the frequencies in the dataset of all itemsets from their expectations. These tools can be used to compute a value θ̂ such that the probability that a non-true-frequent itemset B has frequency greater or equal to θ̂ is at most δ, for some δ ∈ (0, 1). This method has the following serious drawback: in order to achieve such a guarantee, it is necessary to bound the deviation of the frequencies of all itemsets possibly appearing in the dataset [15]. This means that, if the transactions are built on a set of n items, the union bound must be taken over all 2^n − 1 potential itemsets, even if some or most of them may appear with very low frequency or not at all in samples from π. As a consequence, the chosen value of θ̂ is extremely conservative: despite being sufficient to avoid the inclusion of false positives in mining results, the collection of itemsets with frequency at least θ̂ in D, although consisting (probabilistically) only of TFIs, contains only a very small portion of them, due to the overly conservative choice of θ̂. (The results of our experimental evaluation in Sect. 6 clearly show the limitations of this method.) More refined algorithms are therefore needed to achieve the correct balance between the contrasting goals of avoiding false positives and finding as many TFIs as possible.

1.1 Our contributions. The contributions of this work are the following:

• We formally define the problem of mining the True Frequent Itemsets w.r.t.
a minimum threshold θ, and we develop and analyze an algorithm to identify a value θ̂ such that, with probability at least 1 − δ, all itemsets with frequency at least θ̂ in the dataset have true frequency at least θ. Our method is completely distribution-free, i.e., it does not make any assumption about the unknown generative distribution π. By contrast, existing methods to assess the significance of frequent patterns after their extraction require a well specified, limited generative model to characterize the significance of a pattern. When additional information about the distribution π is available, it can be incorporated in our method to obtain even higher accuracy.

• We analyze our algorithm using results from statistical learning theory and optimization. We define a range set associated to a collection of itemsets and give an upper bound to its (empirical) VC-dimension and a procedure to compute this bound, showing an interesting connection with the Set-Union Knapsack Problem (SUKP) [8]. To the best of our knowledge, ours is the first work to apply these techniques to the field of TFIs, and in general the first application of the sample complexity bound based on empirical VC-dimension to the field of data mining.

• We implemented our algorithm and assessed its performances on simulated datasets with properties (number of items, itemset frequency distribution, etc.) similar to real datasets. We computed the fraction of TFIs contained in the set of frequent itemsets in D w.r.t. θ̂, and the number of false positives, if any. The results show that the algorithm is even more accurate than the theory guarantees, since no false positive is reported in any of the many experiments we performed, and moreover allows the extraction of almost all TFIs.
We also compared the set of itemsets computed by our method to those obtained with the "Chernoff and union bounds" method presented in the introduction, and found that our algorithm vastly outperforms it.

Outline. In Sect. 2 we review relevant previous contributions. Sections 3 and 4 contain preliminaries to formally define the problem and key concepts that we will use throughout the work. Our proposed algorithm is described and analyzed in Sect. 5. We present the methodology and results of our experimental evaluation in Sect. 6. Conclusions and future work can be found in Sect. 7.

2 Previous work

While the problem of identifying the TFIs has received scant attention in the literature, a number of approaches have been proposed to filter the FIs of spurious patterns, i.e., patterns that are not actually interesting according to some interestingness measure. We refer the reader to [11, Sect. 3] and [5] for surveys on different measures. We remark, as noted by Liu et al. [16], that the use of the minimum support threshold θ, reflecting the level of domain significance, is complementary to the use of interestingness measures, and that "statistical significance measures and domain significance measures should be used together to filter uninteresting rules from different perspectives". The algorithm we present can be seen as a method to filter out patterns that are not interesting according to the measure represented by the true frequency.

A number of works explored the idea of using statistical properties of the patterns in order to assess their interestingness. While this is not the focus of our work, some of the techniques and models proposed are relevant to our framework. Most of these works are focused on association rules, but some results can be applied to itemsets.
In these works, the notion of interestingness is related to the deviation between the observed frequency of a pattern in the dataset and its expected support in a random dataset generated according to a well-defined probability distribution that can incorporate prior belief and that can be updated during the mining process to ensure that the most "surprising" patterns are extracted. In many previous works, the probability distribution was defined by a simple independence model: an item belongs to a transaction independently from other items [4, 6, 10, 15, 18, 22]. In contrast, our work does not impose any restriction on the probability distribution generating the dataset, with the result that our method is as general as possible.

Kirsch et al. [15] developed a multi-hypothesis testing procedure to identify the best support threshold such that the number of itemsets with at least such support deviates significantly from its expectation in a random dataset of the same size and with the same frequency distribution for the individual items. In our work, the minimum threshold θ is an input parameter fixed by the user, and we identify a threshold θ̂ ≥ θ to guarantee that the collection of FIs w.r.t. θ̂ does not contain any false discovery.

Gionis et al. [6] present a method to create random datasets that can act as samples from a distribution satisfying an assumed generative model. The main idea is to swap items in a given dataset while keeping the length of the transactions and the sum over the columns constant. This method is only applicable if one can actually derive a procedure to perform the swapping in such a way that the generated datasets are indeed random samples from the assumed distribution. For the problem we are interested in, no such procedure is available.
Considering the same generative model, Hanhijärvi [13] presents a direct adjustment method to bound the probability of false discoveries by taking into consideration the actual number of hypotheses to be tested. Webb [24] proposes the use of established statistical techniques to control the probability of false discoveries. In one of these methods (called holdout), the available data are split into two parts: one is used for pattern discovery, while the second is used to verify the significance of the discovered patterns, testing one statistical hypothesis at a time. A new method (layered critical values) to choose the critical values when using a direct adjustment technique to control the probability of false discoveries is presented by Webb [25] and works by exploiting the itemset lattice. The method we present instead identifies a threshold frequency such that all the itemsets with frequency above the threshold are TFIs. There is no need to test each itemset separately and no need to split the dataset.

Liu et al. [16] conduct an experimental evaluation of direct corrections, holdout data, and random permutation methods to control the false positives. They test the methods on a very specific problem (association rules for binary classification). In contrast with the methods presented in the works above, ours does not employ an explicit direct correction depending on the number of patterns considered, as is done in traditional multiple hypothesis testing settings. It instead uses the entire available data to obtain more accurate results, without the need to re-sample it to generate random datasets or to split the dataset in two parts, being therefore more efficient computationally.

3 Preliminaries

In this section we introduce the definitions, lemmas, and tools that we will use throughout the work, providing the details that are needed in later sections.
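To make the conservativeness of the "Chernoff and union bounds" baseline from Sect. 1 concrete, the sketch below computes its threshold θ̂ using Hoeffding's inequality (one standard Chernoff-style bound for the binomial) with a union bound over all 2^n − 1 itemsets. The function name and this specific form of the bound are our illustrative choices, not the exact computation of [19]:

```python
import math

def chernoff_union_threshold(theta, num_transactions, num_items, delta):
    """theta_hat such that, with probability >= 1 - delta, no itemset with
    true frequency < theta has frequency >= theta_hat in the dataset.
    Hoeffding: P(f_D(A) - t_pi(A) >= eps) <= exp(-2 * |D| * eps^2),
    union bound over all 2^n - 1 potential itemsets."""
    log_num_itemsets = num_items * math.log(2.0)  # log(2^n) >= log(2^n - 1)
    eps = math.sqrt((log_num_itemsets + math.log(1.0 / delta))
                    / (2.0 * num_transactions))
    return theta + eps

# Even with a million transactions, 1,000 items push theta_hat far above theta:
theta_hat = chernoff_union_threshold(0.05, 10**6, 1000, 0.1)
print(theta_hat)  # about 0.0687
```

Every itemset with true frequency between θ and θ̂ is likely lost, which is exactly the "very small portion of the TFIs" behavior described in the introduction.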
3.1 Itemsets mining. Given a ground set I of items, let π be a probability distribution on 2^I. A transaction τ ⊆ I is a single sample drawn from π. The length |τ| of a transaction τ is the number of items in τ. A dataset D is a bag of n transactions D = {τ_1, . . . , τ_n : τ_i ⊆ I}, i.e., of n independent identically distributed (i.i.d.) samples from π. We call a subset of I an itemset. For any itemset A, let T(A) = {τ ⊆ I : A ⊆ τ} be the support set of A. We define the true frequency t_π(A) of A with respect to π as the probability that a transaction sampled from π contains A:

$t_\pi(A) = \sum_{\tau \in T(A)} \pi(\tau)$.

Analogously, given an (observed) dataset D, let T_D(A) denote the set of transactions in D containing A. The frequency of A in D is the fraction of transactions in D that contain A: f_D(A) = |T_D(A)| / |D|. It is easy to see that f_D(A) is the empirical average (and an unbiased estimator) for t_π(A): E[f_D(A)] = t_π(A).

Traditionally, the interest has been on extracting the set of Frequent Itemsets (FIs) from D with respect to a minimum frequency threshold θ ∈ (0, 1] [1], that is, the set

FI(D, I, θ) = {A ⊆ I : f_D(A) ≥ θ}.

In most applications the final goal of data mining is to gain a better understanding of the process generating the data, i.e., of the distribution π, through the true frequencies t_π, which are unknown and only approximately reflected in the dataset D. Therefore, we are interested in finding the itemsets with true frequency t_π at least θ for some θ ∈ (0, 1]. We call these itemsets the True Frequent Itemsets (TFIs) and denote their set as

TFI(π, I, θ) = {A ⊆ I : t_π(A) ≥ θ}.
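For concreteness, a minimal and deliberately naive sketch of computing FI(D, I, θ) by exhaustive enumeration of every subset of every transaction; real miners (e.g., Apriori-style algorithms) prune this exponential search space instead:

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(dataset, theta):
    """FI(D, I, theta) by enumerating all subsets of each transaction --
    exponential in transaction length, for tiny illustrative examples only."""
    n = len(dataset)
    counts = Counter()
    for tau in dataset:
        for k in range(1, len(tau) + 1):
            for itemset in combinations(sorted(tau), k):
                counts[itemset] += 1
    # Keep the itemsets whose frequency f_D(A) = count / n reaches theta.
    return {A: c / n for A, c in counts.items() if c / n >= theta}

D = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
fis = frequent_itemsets(D, 0.5)
print(sorted(fis))  # [('a',), ('a', 'b'), ('b',), ('b', 'c'), ('c',)]
```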
If one is only given a finite number of random samples (the dataset D) from π, as is usually the case, one can not aim at finding the exact set TFI(π, I, θ): no assumption can be made on the set-inclusion relationship between TFI(π, I, θ) and FI(D, I, θ), because an itemset A ∈ TFI(π, I, θ) may not appear in FI(D, I, θ), and vice versa. One can instead try to approximate the set of TFIs. This is what we are interested in in this work.

Goal. Given a user-specified parameter δ ∈ (0, 1), we aim at providing a threshold θ̂ ≥ θ such that C = FI(D, I, θ̂) well approximates TFI(π, I, θ), in the sense that:

1. With probability at least 1 − δ, C does not contain any false positive: Pr(∃A ∈ C : t_π(A) < θ) < δ.

2. C contains as many TFIs as possible.

The method we present does not make any assumption about π. It uses information from D, and guarantees a small probability of false positives while achieving a high success rate.

3.2 Vapnik-Chervonenkis dimension. The Vapnik-Chervonenkis (VC) dimension of a collection of subsets of a domain is a measure of the complexity or expressiveness of such a collection [23]. We outline here some basic definitions and results and refer the reader to the works of Alon and Spencer [2, Sect. 14.4] and Boucheron et al. [3, Sect. 3] for an introduction to VC-dimension and a survey of recent developments.

Let D be a domain and R be a collection of subsets from D. We call R a range set on D. Given B ⊆ D, the projection of R on B is the set P_R(B) = {B ∩ A : A ∈ R}. We say that the set B is shattered by R if P_R(B) = 2^B.

Definition 1. Given a set B ⊆ D, the empirical Vapnik-Chervonenkis (VC) dimension of R on B, denoted as EVC(R, B), is the cardinality of the largest subset of B that is shattered by R. The VC-dimension of R is defined as VC(R) = EVC(R, D).
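These definitions can be checked mechanically on toy instances. The brute-force sketch below (exponential in the domain size, for illustration only) computes projections, tests shattering, and derives the empirical VC-dimension of a range set built from itemset support sets:

```python
from itertools import combinations

def projection(ranges, B):
    """P_R(B): the distinct traces that the ranges leave on the set B."""
    return {frozenset(B & R) for R in ranges}

def is_shattered(ranges, B):
    """B is shattered iff its projection equals the power set of B."""
    return len(projection(ranges, B)) == 2 ** len(B)

def evc(ranges, domain):
    """Empirical VC dimension: size of the largest subset of the domain
    that is shattered. Brute force over all subsets -- toy instances only."""
    for size in range(len(domain), 0, -1):
        if any(is_shattered(ranges, set(B)) for B in combinations(domain, size)):
            return size
    return 0

# Toy dataset of three transactions and the support sets of {a}, {b}, {a,b}.
D = [frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})]
R = [{t for t in D if {"a"} <= t},        # T({a})
     {t for t in D if {"b"} <= t},        # T({b})
     {t for t in D if {"a", "b"} <= t}]   # T({a,b})
print(evc(R, D))  # -> 1: no pair of transactions is shattered by these ranges
```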
The main application of (empirical) VC-dimension in statistics and learning theory is in computing the number of samples needed to approximate the probabilities associated to the ranges through their empirical averages. Formally, let X_1^k = (X_1, . . . , X_k) be a collection of independent identically distributed random variables taking values in D, sampled according to some distribution ν on the elements of D. For a set A ⊆ D, let ν(A) be the probability that a sample from ν belongs to the set A, and let

$\nu_{X_1^k}(A) = \frac{1}{k} \sum_{j=1}^{k} \mathbb{1}_A(X_j)$,

where 1_A is the indicator function for the set A. The function ν_{X_1^k}(A) is the empirical average of ν(A) on X_1^k.

Definition 2. Let R be a range set on D and ν be a probability distribution on D. For ε ∈ (0, 1), an ε-approximation to (R, ν) is a bag S of elements of D such that

$\sup_{A \in R} |\nu(A) - \nu_S(A)| \le \varepsilon$.

An ε-approximation can be constructed by sampling points of the domain according to the distribution ν, provided an upper bound to the VC-dimension of R or to its empirical VC-dimension is known:

Theorem 1 (Thm. 2.12 [14]). Let R be a range set on D with VC(R) ≤ d, and let ν be a distribution on D. Given δ ∈ (0, 1) and a positive integer ℓ, let

$\varepsilon = \sqrt{\frac{c}{\ell}\left(d + \log\frac{1}{\delta}\right)}$   (1)

where c is a universal positive constant. Then, a bag of ℓ elements of D sampled independently according to ν is an ε-approximation to (R, ν) with probability at least 1 − δ.

Löffler and Phillips [17] estimated experimentally that the constant c is at most 0.5.

Theorem 2 (Sect. 3 [3]). Let R be a range set on D, and let ν be a distribution on D. Let X_1^ℓ = (X_1, . . . , X_ℓ) be a collection of elements from D sampled independently according to ν. Let d be an integer such that EVC(R, X_1^ℓ) ≤ d. Given δ ∈ (0, 1), let

$\varepsilon = 2\sqrt{\frac{2d\log(\ell+1)}{\ell}} + \sqrt{\frac{2\log\frac{2}{\delta}}{\ell}}$   (2)

Then, X_1^ℓ is an ε-approximation for (R, ν) with probability at least 1 − δ.

4 The range set of a collection of itemsets

In this section we define the concept of a range set associated to a collection of itemsets and show how to bound the VC-dimension and the empirical VC-dimension of this range set. We use these definitions and results to develop our algorithm in later sections.

Definition 3. Given a collection C of itemsets built on a ground set I, the range set R(C) associated to C is a range set on 2^I containing the support sets of the itemsets in C: R(C) = {T(A) : A ∈ C}.

Theorem 3. Let C be a collection of itemsets and let D be a dataset. Let d be the maximum integer for which there are at least d transactions τ_1, . . . , τ_d ∈ D such that the set {τ_1, . . . , τ_d} is an antichain, and each τ_i, 1 ≤ i ≤ d, contains at least 2^{d−1} itemsets from C. Then EVC(R(C), D) ≤ d.

Proof. The antichain requirement guarantees that the set of transactions considered in the computation of d could indeed theoretically be shattered. Assume that a subset F of D contains two transactions τ′ and τ′′ such that τ′ ⊆ τ′′. Any itemset from C appearing in τ′ would also appear in τ′′, so there would not be any itemset A ∈ C such that τ′′ ∈ T(A) ∩ F but τ′ ∉ T(A) ∩ F, which would imply that F can not be shattered. Hence sets that are not antichains should not be considered. This has the net effect of potentially resulting in a lower d, i.e., in a stricter upper bound to EVC(R(C), D).

Let now ℓ > d and consider a set L of ℓ transactions from D that is an antichain. Assume that L is shattered by R(C). Let τ be a transaction in L. The transaction τ belongs to 2^{ℓ−1} subsets of L. Let K ⊆ L be one of these subsets. Since L is shattered, there exists an itemset A ∈ C such that T(A) ∩ L = K.
From this and the fact that τ ∈ K, we have that τ ∈ T(A) or, equivalently, that A ⊆ τ. Given that all the subsets K ⊆ L containing τ are different, then also all the T(A)'s such that T(A) ∩ L = K should be different, which in turn implies that all the itemsets A should be different and that they should all appear in τ. There are 2^{ℓ−1} subsets K of L containing τ, therefore τ must contain at least 2^{ℓ−1} itemsets from C, and this holds for all ℓ transactions in L. This is a contradiction because ℓ > d and d is the maximum integer for which there are at least d transactions containing at least 2^{d−1} itemsets from C. Hence L cannot be shattered and the thesis follows.

4.1 Computing the VC-dimension. The naïve computation of d according to the definition in Thm. 3 requires to scan the transactions one by one, compute the number of itemsets from C appearing in each transaction, and make sure to consider only sets of transactions constituting antichains. Given the very large number of transactions in typical datasets and the fact that the number of itemsets in a transaction is exponential in its length, this method would be computationally too expensive. An upper bound to d (and therefore to EVC(R(C), D)) can be computed by solving a Set-Union Knapsack Problem (SUKP) [8] associated to C.

Definition 4 ([8]). Let U = {a_1, . . . , a_ℓ} be a set of elements and let S = {A_1, . . . , A_k} be a set of subsets of U, i.e., A_i ⊆ U for 1 ≤ i ≤ k. Each subset A_i, 1 ≤ i ≤ k, has an associated non-negative profit ρ(A_i) ∈ R^+, and each element a_j, 1 ≤ j ≤ ℓ, has an associated non-negative weight w(a_j) ∈ R^+. Given a subset S′ ⊆ S, we define the profit of S′ as P(S′) = Σ_{A_i ∈ S′} ρ(A_i). Let U_{S′} = ∪_{A_i ∈ S′} A_i. We define the weight of S′ as W(S′) = Σ_{a_j ∈ U_{S′}} w(a_j).
Given a non-negative parameter c that we call capacity, the Set-Union Knapsack Problem (SUKP) requires to find the set S* ⊆ S which maximizes P(S′) over all sets S′ such that W(S′) ≤ c.

In our case, U is the set of items that appear in the itemsets of C, S = C, the profits and the weights are all unitary, and the capacity constraint is an integer ℓ. We call this optimization problem the SUKP associated to C with capacity ℓ. It is easy to see that the optimal profit of this SUKP is the maximum number of itemsets from C that a transaction of length ℓ can contain. In order to show how to use this fact to compute an upper bound to EVC(R(C), D), we need to define some additional terminology. Let ℓ_1, . . . , ℓ_w be the sequence of the transaction lengths of D, i.e., for each value ℓ for which there is at least a transaction in D of length ℓ, there is one (and only one) index i, 1 ≤ i ≤ w, such that ℓ_i = ℓ. Assume that the ℓ_i's are labelled in sorted decreasing order: ℓ_1 > ℓ_2 > · · · > ℓ_w. Let now L_i, 1 ≤ i ≤ w, be the maximum number of transactions in D that have length at least ℓ_i and such that for no two τ′, τ′′ of them we have either τ′ ⊆ τ′′ or τ′′ ⊆ τ′. The sequence (ℓ_i)_1^w and a sequence (L*_i)_1^w of upper bounds to (L_i)_1^w can be computed efficiently with a scan of the dataset. Let now q_i be the optimal profit of the SUKP associated to C with capacity ℓ_i, and let b_i = ⌊log_2 q_i⌋ + 1. The following lemma uses these sequences to show how to obtain an upper bound to the empirical VC-dimension of C on D.

Lemma 1. Let j be the minimum integer for which b_j ≤ L_j. Then EVC(R(C), D) ≤ b_j.

Proof. If b_j ≤ L_j, then there are at least b_j transactions which can contain 2^{b_j − 1} itemsets from C, and this is the maximum b_i for which it happens, because the sequence b_1, b_2, . . . , b_w is sorted in decreasing order, given that the sequence q_1, q_2, . . . , q_w is. Then b_j satisfies the conditions of Thm. 3. Hence EVC(R(C), D) ≤ b_j.

Corollary 1. Let q be the optimal profit of the SUKP associated to C with capacity equal to ℓ = |{a ∈ I : ∃A ∈ C s.t. a ∈ A}| (ℓ is the number of items such that there is at least one itemset in C containing them). Let b = ⌊log_2 q⌋ + 1. Then VC(R(C)) ≤ b.

Solving the SUKP optimally is NP-hard in the general case, although there are known restrictions for which it can be solved in polynomial time using dynamic programming [8]. For our case, it is actually not necessary to compute the optimal solution to the SUKP: any upper bound solution for which we can prove that there is no power of two between that solution and the optimal solution would result in the same upper bound to the (empirical) VC-dimension, while substantially speeding up the computation. This property can be specified in currently available optimization problem solvers (e.g., CPLEX), which can then compute the bound to the (empirical) VC-dimension very fast even for very large instances with thousands of items and hundreds of thousands of itemsets in C, making this approach practical.

The range set associated to 2^I is particularly interesting for us. It is possible to compute bounds to VC(R(2^I)) and EVC(R(2^I), D) without having to solve a SUKP.

Theorem 4 ([20]). Let D be a dataset built on a ground set I. The d-index d(D) of D is the maximum integer d such that D contains at least d transactions of length at least d that form an antichain. We have EVC(R(2^I), D) ≤ d(D).

Corollary 2. VC(R(2^I)) ≤ |I| − 1.

Riondato and Upfal [20] presented an efficient algorithm to compute an upper bound to the d-index of a dataset with a single linear scan of the dataset D. The upper bound presented in Thm. 4 is tight: there are datasets for which EVC(R(2^I), D) = d(D) [20]. This implies that the upper bound presented in Corol. 2 is also tight.

5 Finding the True Frequent Itemsets

In this section we present an algorithm to identify a threshold θ̂ such that, with probability at least 1 − δ for some user-specified parameter δ ∈ (0, 1), all itemsets with frequency at least θ̂ in D are True Frequent Itemsets with respect to a fixed minimum true frequency threshold θ ∈ (0, 1]. The threshold θ̂ can be used to find a collection C = FI(D, I, θ̂) of itemsets such that Pr(∃A ∈ C s.t. t_π(A) < θ) < δ.

The intuition behind the method is the following. Let B be the negative border of TFI(π, I, θ), that is, the set of itemsets not in TFI(π, I, θ) but such that all their proper subsets are in TFI(π, I, θ). If we can find an ε such that D is an ε-approximation to (R(B), π), then we have that any itemset A ∈ B has a frequency f_D(A) in D less than θ̂ = θ + ε, given that it must be t_π(A) < θ. By the antimonotonicity property of the frequency, the same holds for all itemsets that are supersets of those in B. Hence, the only itemsets that can have frequency in D greater or equal to θ̂ = θ + ε are those with true frequency at least θ. In the following paragraphs we show how to compute ε.

Let δ_1 and δ_2 be such that (1 − δ_1)(1 − δ_2) ≥ 1 − δ. Let R(2^I) be the range set of all itemsets. We use Corol. 2 (resp. Thm. 4) to compute an upper bound d′ to VC(R(2^I)) (resp. d′′ to EVC(R(2^I), D)). Then we can use d′ in Thm. 1 (resp. d′′ in Thm. 2) to compute an ε′_1 (resp. an ε′′_1) such that D is, with probability at least 1 − δ_1, an ε′_1-approximation (resp. ε′′_1-approximation) to (R(2^I), π).

Fact 1. Let ε_1 = min{ε′_1, ε′′_1}. With probability at least 1 − δ_1, D is an ε_1-approximation to (R(2^I), π).
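The quantities ε′_1 and ε′′_1, and the minimum taken in Fact 1, can be computed directly from Eqs. (1) and (2); a sketch with illustrative numbers, using c = 0.5 as estimated by Löffler and Phillips [17] (function names and the sample values are ours):

```python
import math

def eps_vc(d, ell, delta, c=0.5):
    """Eq. (1), Thm. 1: eps from an upper bound d to VC(R), sample size ell."""
    return math.sqrt(c * (d + math.log(1.0 / delta)) / ell)

def eps_evc(d, ell, delta):
    """Eq. (2), Thm. 2: eps from an upper bound d to the empirical VC-dim."""
    return (2.0 * math.sqrt(2.0 * d * math.log(ell + 1) / ell)
            + math.sqrt(2.0 * math.log(2.0 / delta) / ell))

# Fact 1 with illustrative bounds d' = d'' = 50, |D| = 10^6, delta_1 = 0.05:
ell, delta1 = 10**6, 0.05
eps1 = min(eps_vc(50, ell, delta1), eps_evc(50, ell, delta1))
print(eps1)  # about 0.005
```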
We want to find an upper bound to the (empirical) VC-dimension of R(B). To this end, we use the fact that the negative border of a collection of itemsets is a maximal antichain on 2^I. Let now W be the negative border of C_1 = FI(D, I, θ − ε_1), G = {A ⊆ I : θ − ε_1 ≤ f_D(A) < θ + ε_1}, and F = G ∪ W.

Lemma 2. Let Y be the set of maximal antichains in F. If D is an ε_1-approximation to (R(2^I), π), then
1. max_{A ∈ Y} EVC(R(A), D) ≥ EVC(R(B), D), and
2. max_{A ∈ Y} VC(R(A)) ≥ VC(R(B)).

Proof. Given that D is an ε_1-approximation to (R(2^I), π), we have TFI(π, I, θ) ⊆ G ∪ C_1. From this and the definitions of the negative border and of F, we have that B ⊆ F. Since B is a maximal antichain, then B ∈ Y. Hence the thesis.

In order to compute upper bounds to VC(R(B)) and EVC(R(B), D), we can solve slightly modified SUKPs associated to F, with the additional constraint that the optimal solution, which is a collection of itemsets, must be a maximal antichain. Lemma 1 still holds even for the solutions of these modified SUKPs. Using these bounds in Thms. 1 and 2, we compute an ε_2 such that, with probability at least 1 − δ_2, D is an ε_2-approximation to (R(B), π). Let θ̂ = θ + ε_2.

Theorem 5. With probability at least 1 − δ, FI(D, I, θ̂) contains no false positives:

Pr(FI(D, I, θ̂) ⊆ TFI(π, I, θ)) ≥ 1 − δ.

Dataset    Freq. θ   TFIs     Times FPs   Times FNs
accidents  0.2       889883   100%        100%
BMS-POS    0.005     4240     100%        100%
chess      0.6       254944   100%        100%
connect    0.85      142127   100%        100%
kosarak    0.015     189      45%         55%
pumsb*     0.45      1913     5%          80%
retail     0.0075    277      10%         20%

Table 1: Fractions of times that FI(D, I, θ) contained false positives and missed TFIs (false negatives) over 20 datasets from the same ground truth.

Proof.
Consider the two events E_1 = "D is an ε_1-approximation for (R(2^I), π)" and E_2 = "D is an ε_2-approximation for (R(B), π)". From the above discussion and the definition of δ_1 and δ_2, it follows that the event E = E_1 ∩ E_2 occurs with probability at least 1 − δ. Suppose from now on that E indeed occurs. Since E_1 occurs, Lemma 2 holds, and the bounds we compute by solving the modified SUKP problems are indeed bounds to VC(R(B)) and EVC(R(B), D). Since E_2 also occurs, for any A ∈ B we have |t_π(A) − f_D(A)| ≤ ε_2; but given that t_π(A) < θ because the elements of B are not TFIs, we have f_D(A) < θ + ε_2. Because of the antimonotonicity property of the frequency and the definition of B, this holds for any itemset that is not in TFI(π, I, θ). Hence, the only itemsets that can have a frequency in D of at least θ̂ = θ + ε_2 are the TFIs, so FI(D, I, θ̂) ⊆ TFI(π, I, θ), which concludes our proof.

Exploiting additional knowledge about π. Our algorithm is completely distribution-free, i.e., it does not require any assumption about the unknown distribution π. On the other hand, when information about π is available, our method can exploit it to achieve better performances in terms of running time, practicality, and accuracy. For example, in most applications π will not generate any transaction longer than some upper bound ℓ ≪ |I|, and this is known. Consider for example an online marketplace like Amazon: it is extremely unlikely (if not humanly impossible) that a single customer buys one of each available product. Indeed, given the hundreds of thousands of items on sale, it is safe to assume that all transactions will contain at most ℓ items, for ℓ ≪ |I|. Other times, as in an online survey, it is the nature of the process that limits the number of items in a transaction, in this case the number of questions.
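A minimal sketch of how a known transaction-length limit tightens the distribution-free bound of Corol. 2: if π is known never to generate transactions with more than ℓ items, then VC(R(2^I)) ≤ ℓ − 1 can be used in place of |I| − 1. The function name is illustrative.

```python
def vc_dimension_upper_bound(num_items, max_txn_len=None):
    """Upper bound on VC(R(2^I)).

    Corol. 2 gives |I| - 1 in the fully distribution-free setting.
    If it is known that pi never generates transactions with more
    than max_txn_len items, the bound improves to max_txn_len - 1.
    """
    if max_txn_len is not None and max_txn_len < num_items:
        return max_txn_len - 1
    return num_items - 1
```

A smaller VC-dimension bound feeds into Thm. 1 as a smaller d, and hence yields a smaller ε_1 for the same sample size and confidence.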
A different kind of information about the generative process may consist in knowing that some combinations of items can never occur, because they are "forbidden" in some wide sense. Other examples are possible. All these pieces of information can be used to compute better (i.e., stricter) upper bounds to the VC-dimension VC(R(2^I)). For example, if we know that π will never generate transactions with more than ℓ items, we can safely say that VC(R(2^I)) ≤ ℓ − 1, a much stricter bound than the |I| − 1 from Corol. 2. This may result in a smaller ε_1, a smaller ε_2, and a smaller θ̂, which allows the output collection to contain more TFIs. In the experimental evaluation, we show the positive impact that including additional information may have on the performances of our algorithm.

6 Experimental evaluation

We conducted an extensive evaluation to assess the performances of the proposed algorithm. In particular, we used it to compute values θ̂ for a number of frequencies θ on different datasets, and compared the collection of FIs w.r.t. θ̂ with the collection of TFIs, measuring the number of false positives and the fraction of TFIs that were found.

6.1 Implementation. We implemented the algorithm in Python 3.3. To mine the FIs, we used the C implementation by Grahne and Zhu [9]. Our solver of choice for the SUKPs was IBM ILOG CPLEX Optimization Studio 12.3. We ran the experiments on a number of machines with x86-64 processors running GNU/Linux 3.2.0.

6.2 Datasets generation. We evaluated the algorithm using pseudo-artificial datasets generated by taking the datasets from the FIMI'04 repository¹ as the ground truth for the true frequencies t_π of the itemsets. We considered the following datasets: accidents, BMS-POS, chess, connect, kosarak, pumsb*, and retail.
These datasets differ in size, number of items, and, more importantly for our case, distribution of the frequencies of the itemsets [7]. We created a dataset by sampling 20 million transactions uniformly at random from a FIMI repository dataset. In this way the true frequency of an itemset is its frequency in the original FIMI dataset. Given that our method to find the TFIs is distribution-free, this is a valid procedure to establish a ground truth. We used these enlarged datasets in our experiments, and use the original names of the datasets in the FIMI repository to annotate the results for the datasets we generated.

6.3 False positives and false negatives in FI(D, I, θ). In the first set of experiments we evaluated the performances, in terms of inclusion of false positives and false negatives in the output, of mining the dataset at frequency θ. Table 1 reports the fraction of times (over 20 datasets from the same ground truth) that the set FI(D, I, θ) contained false positives (FP) and was missing TFIs (false negatives, FN). In most cases, especially when there are many TFIs, the inclusion of false positives when mining at frequency θ should be expected. This highlights the need for methods like the one presented in this work, as there is no guarantee that FI(D, I, θ) contains only TFIs. On the other hand, the fact that some TFIs have frequency in the dataset smaller than θ (false negatives) points out that one cannot aim to extract all and only the TFIs by using a fixed-threshold approach (like the one we present).

6.4 Control of the false positives (Precision). In this set of experiments we evaluated how well the threshold θ̂ computed by our algorithm avoids the inclusion of false positives in FI(D, I, θ̂). To this end, we used a wide range of values for the minimum true frequency threshold θ (see Table 2) and fixed δ = 0.1.
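The enlarged-dataset construction of Sect. 6.2 can be sketched as follows, assuming sampling with replacement (the paper's own generation script is not shown, and the function name is illustrative). Sampling transactions uniformly at random makes the true frequency t_π of any itemset equal its frequency in the source dataset, which can then serve as the ground truth.

```python
import random

def enlarge_dataset(original, size=20_000_000, seed=None):
    """Build a pseudo-artificial dataset by drawing `size` transactions
    uniformly at random, with replacement, from `original`.

    Under this generative process pi, the true frequency of any itemset
    equals its frequency in `original` (cf. Sect. 6.2)."""
    rng = random.Random(seed)
    return [original[rng.randrange(len(original))] for _ in range(size)]
```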
We repeated each experiment on 20 different enlarged datasets generated from the same original FIMI dataset. In all the hundreds of runs of our algorithm, FI(D, I, θ̂) never contained any false positive, i.e., it always contained only TFIs. In other words, the precision of the output was 1.0 in all our experiments. Not only can our method give a frequency threshold to extract only TFIs, it is also more conservative, in terms of including false positives, than what the theoretical analysis guarantees.

6.5 Inclusion of TFIs (Recall). In addition to avoiding false positives in the results, one wants to include as many TFIs as possible in the output collection. To this end, we assessed what fraction of the total number of TFIs is reported in FI(D, I, θ̂). Since there were no false positives, this corresponds to evaluating the recall of the output collection. We fixed δ = 0.1 and considered different values for the minimum true frequency threshold θ (see Table 2). For each frequency threshold, we repeated the experiment on 20 different datasets sampled from the same original FIMI dataset, and found very small variance in the results. We compared the fraction of TFIs that our algorithm included in the output with that included by the "Chernoff and Union bounds" (CU) method we presented in the Introduction. We compared two variants of each algorithm: one ("vanilla") which makes no assumption on the generative distribution π, and another ("additional info") which assumes that the process will not generate any transaction longer than twice the longest transaction found in the original FIMI dataset.

¹ http://fimi.ua.ac.be/data/

                               Reported TFIs (Average Fraction)
                               "Vanilla" (no info)      Additional Info
Dataset    Freq. θ   TFIs      CU Method  This Work     CU Method  This Work
accidents  0.8       149       0.838      0.981         0.853      0.981
           0.7       529       0.925      0.985         0.935      0.985
           0.6       2074      0.967      0.992         0.973      0.992
           0.5       8057      0.946      0.991         0.955      0.991
           0.45      16123     0.948      0.992         0.955      0.992
           0.4       32528     0.949      0.991         0.957      0.992
           0.3       149545    —          —             0.957      0.989
           0.2       889883    —          —             0.957      0.987
BMS-POS    0.05      59        0.845      0.938         0.851      0.938
           0.03      134       0.879      0.992         0.895      0.992
           0.02      308       0.847      0.956         0.876      0.956
           0.01      1099      0.813      0.868         0.833      0.872
           0.0075    1896      —          —             0.826      0.854
           0.005     4240      —          —             0.762      0.775
chess      0.8       8227      0.964      0.991         0.964      0.991
           0.775     13264     0.957      0.990         0.957      0.990
           0.75      20993     0.957      0.983         0.957      0.983
           0.65      111239    —          —             0.972      0.991
           0.6       254944    —          —             0.970      0.989
connect    0.95      2201      0.802      0.951         0.802      0.951
           0.925     9015      0.881      0.975         0.881      0.975
           0.9       27127     0.893      0.978         0.893      0.978
           0.875     65959     —          —             0.899      0.974
           0.85      142127    —          —             0.918      0.974
kosarak    0.04      42        0.738      0.939         0.809      0.939
           0.035     50        0.720      0.980         0.780      0.980
           0.025     82        —          —             0.682      0.963
           0.02      121       —          —             0.650      0.975
           0.015     189       —          —             0.641      0.933
pumsb*     0.55      305       0.791      0.926         0.859      0.926
           0.5       679       0.929      0.998         0.957      0.998
           0.49      804       0.858      0.984         0.907      0.984
           0.475     1050      —          —             0.942      0.996
           0.45      1913      —          —             0.861      0.976
retail     0.03      32        0.625      1.00          0.906      1.00
           0.025     38        0.842      0.973         0.972      0.973
           0.0225    46        0.739      0.934         0.869      0.935
           0.02      55        —          —             0.882      0.945
           0.01      159       —          —             0.902      0.931
           0.0075    277       —          —             0.811      0.843

Table 2: Recall. Average fraction (over 20 runs) of reported TFIs in the output of an algorithm using Chernoff and Union bounds and of the one presented in this work. For each algorithm we present two versions, one (Vanilla) which uses no information about the generative process, and one (Add. Info) in which we assume the knowledge that the process will not generate any transaction longer than twice the size of the longest transaction in the original FIMI dataset. The highest reported fraction in each row is the one achieved by this work. Dashes denote settings for which the vanilla variant could not be run (θ − ε_1 ≤ 0).
Both algorithms can be easily modified to include this information. In Table 2 we report the average fraction of TFIs contained in FI(D, I, θ̂). We can see that the number of TFIs found by our algorithm is always very high: only a minimal fraction (often less than 3%) of the TFIs do not appear in the output. This is explained by the fact that the value ε_2 computed by our method (see Sect. 5) is always smaller than 10^{-4}. Moreover, our solution uniformly outperforms the CU method, often by a huge margin, since our algorithm does not have to take into account all possible itemsets when computing θ̂. Only partial results are reported for the "vanilla" variant because of the very high number of items in the considered datasets: the mining of the dataset is performed at frequency threshold θ − ε_1, and if there are many items, the value of ε_1 becomes very high because the bound to the VC-dimension of R(2^I) is |I| − 1; as a consequence, we have θ − ε_1 ≤ 0. We stress, though, that assuming no knowledge about the distribution π is not realistic: usually additional information, especially regarding the length of the transactions, is available, and it can and should be used. The use of additional information gives flexibility to our method and improves its practicality. Moreover, in some cases, it allows finding an even larger fraction of the TFIs.

7 Conclusions

The usefulness of frequent itemset mining is often hindered by spurious discoveries, or false positives, in the results. In this work we developed an algorithm to compute a frequency threshold θ̂ such that the collection of FIs at frequency θ̂ is a good approximation of the collection of True Frequent Itemsets. The threshold is such that the probability of reporting any false positive is bounded by a user-specified quantity.
We used tools from statistical learning theory and from optimization to develop and analyze the algorithm. The experimental evaluation shows that the proposed method can indeed be used to control the presence of false positives while, at the same time, extracting a very large fraction of the TFIs from huge datasets.

There are a number of directions for further research. Among these, we find particularly interesting and challenging the extension of our method to other definitions of statistical significance for patterns. Also interesting is the derivation of better lower bounds to the VC-dimension of the range set of a collection of itemsets. Moreover, while this work focuses on itemset mining, we believe that it can be extended and generalized to other settings of multiple hypothesis testing, giving another alternative to existing approaches for controlling the probability of false discoveries.

References

[1] Agrawal, R., Imieliński, T., Swami, A.: Mining association rules between sets of items in large databases. SIGMOD Rec. 22, 207–216 (1993)
[2] Alon, N., Spencer, J.H.: The Probabilistic Method. Wiley, third edn. (2008)
[3] Boucheron, S., Bousquet, O., Lugosi, G.: Theory of classification: A survey of some recent advances. ESAIM: Probab. and Statistics 9, 323–375 (2005)
[4] DuMouchel, W., Pregibon, D.: Empirical Bayes screening for multi-item associations. KDD'01 (2001)
[5] Geng, L., Hamilton, H.J.: Interestingness measures for data mining: A survey. ACM Comp. Surv. 38(3) (2006)
[6] Gionis, A., Mannila, H., Mielikäinen, T., Tsaparas, P.: Assessing data mining results via swap randomization. ACM Trans. on Knowl. Disc. from Data 1(3) (2007)
[7] Goethals, B., Zaki, M.J.: Advances in frequent itemset mining implementations: report on FIMI'03. SIGKDD Explor. Newsl.
6(1), 109–117 (2004)
[8] Goldschmidt, O., Nehme, D., Yu, G.: Note: On the set-union knapsack problem. Naval Research Logistics 41(6), 833–842 (1994)
[9] Grahne, G., Zhu, J.: Efficiently using prefix-trees in mining frequent itemsets. FIMI'03 (2003)
[10] Hämäläinen, W.: StatApriori: an efficient algorithm for searching statistically significant association rules. Knowl. and Inf. Sys. 23(3), 373–399 (2010)
[11] Han, J., Cheng, H., Xin, D., Yan, X.: Frequent pattern mining: current status and future directions. Data Min. and Knowl. Disc. 15, 55–86 (2007)
[12] Han, J., Kamber, M., Pei, J.: Data mining: concepts and techniques. Morgan Kaufmann (2006)
[13] Hanhijärvi, S.: Multiple hypothesis testing in pattern discovery. DS'11 (2011)
[14] Har-Peled, S., Sharir, M.: Relative (p, ε)-approximations in geometry. Discr. & Comput. Geom. 45(3), 462–496 (2011)
[15] Kirsch, A., Mitzenmacher, M., Pietracaprina, A., Pucci, G., Upfal, E., Vandin, F.: An efficient rigorous approach for identifying statistically significant frequent itemsets. J. ACM 59(3), 12:1–12:22 (2012)
[16] Liu, G., Zhang, H., Wong, L.: Controlling false positives in association rule mining. VLDB'11 (2011)
[17] Löffler, M., Phillips, J.M.: Shape fitting on point sets with probability distributions. ESA'09 (2009)
[18] Megiddo, N., Srikant, R.: Discovering predictive association rules. KDD'98 (1998)
[19] Mitzenmacher, M., Upfal, E.: Probability and Computing. Cambridge University Press, second edn. (2005)
[20] Riondato, M., Upfal, E.: Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees. To appear in ACM Trans. on Knowl. Disc. from Data (2014)
[21] Riondato, M., Vandin, F.: Finding the True Frequent Itemsets. SDM'14 (2014)
[22] Silverstein, C., Brin, S., Motwani, R.: Beyond market baskets: Generalizing association rules to dependence rules. Data Min. and Knowl.
Disc. 2(1), 39–68 (1998)
[23] Vapnik, V.N., Chervonenkis, A.J.: On the uniform convergence of relative frequencies of events to their probabilities. Th. Prob. Appl. 16(2), 264–280 (1971)
[24] Webb, G.I.: Discovering significant patterns. Mach. Learn. 68(1), 1–33 (2007)
[25] Webb, G.I.: Layered critical values: a powerful direct-adjustment approach to discovering significant patterns. Mach. Learn. 71, 307–323 (2008)