A Note On Computing Set Overlap Classes

Pierre Charbit¹, Michel Habib¹, Vincent Limouzy¹, Fabien de Montgolfier¹, Mathieu Raffinot⋆¹, Michaël Rao²

¹ LIAFA, Univ. Paris Diderot - Paris 7, 75205 Paris Cedex 13, France.
² LIRMM, 161 rue Ada, 34392 Montpellier, France.

Abstract. Let $V$ be a finite set of $n$ elements and $\mathcal{F} = \{X_1, X_2, \ldots, X_m\}$ a family of $m$ subsets of $V$. Two sets $X_i$ and $X_j$ of $\mathcal{F}$ overlap if $X_i \cap X_j \neq \emptyset$, $X_j \setminus X_i \neq \emptyset$, and $X_i \setminus X_j \neq \emptyset$. Two sets $X, Y \in \mathcal{F}$ are in the same overlap class if there is a series $X = X_1, X_2, \ldots, X_k = Y$ of sets of $\mathcal{F}$ in which each pair $X_i, X_{i+1}$ overlaps. In this note, we focus on efficiently identifying all overlap classes in $O(n + \sum_{i=1}^{m} |X_i|)$ time. We thus revisit the clever algorithm of Dahlhaus [2], of which we give a clear presentation and which we simplify to make it practical and implementable in its real worst-case complexity. A useful variant of Dahlhaus's approach is also explained.

1 Introduction

Let $V$ be a finite set of $n = |V|$ elements and $\mathcal{F} = \{X_1, X_2, \ldots, X_m\}$ a family of $m$ subsets of $V$. Two sets $X_i$ and $X_j$ of $\mathcal{F}$ overlap if $X_i \cap X_j \neq \emptyset$, $X_i \setminus X_j \neq \emptyset$, and $X_j \setminus X_i \neq \emptyset$. We denote by $|\mathcal{F}|$ the sum of the sizes of all $X_i \in \mathcal{F}$. We define the overlap graph $OG(\mathcal{F}, E)$ as the graph with all $X_i$ as vertices and $E = \{(i, j) \mid 1 \le i, j \le m,\ X_i \text{ overlaps } X_j\}$. A connected component of this graph is called an overlap class.

In this note we focus on efficiently identifying all overlap classes of $OG(\mathcal{F}, E)$. This problem is a classical one in topics related to graph clustering, but it also appears frequently in graph problems related to graph decomposition [2] or PQ-tree manipulation [3]. An efficient $O(n + |\mathcal{F}|)$ time algorithm has already been presented by Dahlhaus in [2]. The algorithm is very clever but uses an off-line Lowest Common Ancestor (LCA) algorithm as a subroutine.
From a theoretical point of view, off-line LCA queries have been proved to be solvable in constant time (after a linear time preprocessing) in a RAM model (accepting an additional constant time specific register operation) but also recently in a pointer machine model [1]. However, in practice, it is very difficult to implement these LCA algorithms in their real linear complexity. Another difficulty with Dahlhaus's algorithm is that its original presentation is difficult to follow. These two points motivated this note. Dahlhaus's algorithm is really clever and deserves a clear presentation, all the more so as we show how to replace LCA queries by set partitioning, which makes Dahlhaus's algorithm easily implementable in practice in its real complexity. We also provide source code, freely available in [4]. Finally, we explain how to simply modify Dahlhaus's approach to efficiently compute a spanning tree of each connected component of the overlap graph. This simplifies a graph construction in [3].

⋆ Corresponding author. E-mail: raffinot@liafa.jussieu.fr

2 Dahlhaus's algorithm

The overlap graph $OG(\mathcal{F}, E)$ might have $\Theta(m^2)$ edges, which can be quadratic in $|\mathcal{F}|$. For instance, if $\mathcal{F} = \{\{x_1, x_2\}, \{x_1, x_3\}, \ldots, \{x_1, x_{m+1}\}\}$, then $|\mathcal{F}| = 2m$ while $|E| = m(m-1)/2 = \Theta(m^2)$. The approach of Dahlhaus is quite surprising: instead of computing a subgraph of the overlap graph, Dahlhaus considers a second graph $D(\mathcal{F}, L)$ on the same vertex set but with different edges. This graph nevertheless has a strong property: its connected components are the same as those of $OG(\mathcal{F}, E)$, although in the general case $D(\mathcal{F}, L)$ is not a subgraph of $OG(\mathcal{F}, E)$.

Let $LF$ be the list of all $X \in \mathcal{F}$ sorted in decreasing size order. The ordering of sets of equal size is arbitrarily fixed.
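The quadratic blow-up mentioned at the start of this section is easy to observe with a naive baseline. The following Python sketch is not Dahlhaus's algorithm: it builds every edge of the overlap graph explicitly and groups the sets with a union-find structure. The function names (`overlaps`, `overlap_edges`, `overlap_classes`) and the integer encoding of the elements are illustrative assumptions, not taken from [4].

```python
from itertools import combinations


def overlaps(a, b):
    """True when a and b intersect and neither one contains the other."""
    return bool(a & b) and bool(a - b) and bool(b - a)


def overlap_edges(family):
    """All edges of the overlap graph; quadratic in the number of sets."""
    return [(i, j) for i, j in combinations(range(len(family)), 2)
            if overlaps(family[i], family[j])]


def overlap_classes(family):
    """Connected components of the overlap graph, as sorted lists of indices."""
    parent = list(range(len(family)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in overlap_edges(family):
        parent[find(i)] = find(j)
    classes = {}
    for i in range(len(family)):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())


# The star family of the text: m sets of size 2 that all share element 0.
m = 6
star = [frozenset({0, k}) for k in range(1, m + 1)]
print(len(overlap_edges(star)), overlap_classes(star))  # 15 edges, one class
```

Such a baseline is mainly useful as a test oracle for a linear-time implementation on small inputs.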
Given $X \in \mathcal{F}$, we denote by $\mathrm{Max}(X)$ the largest $Y \in \mathcal{F}$ taken in $LF$ order such that $|Y| \ge |X|$ and $Y$ overlaps $X$ (that is, the first such set met in $LF$). Note that $\mathrm{Max}(X)$ might be undefined for some sets of $\mathcal{F}$. In this latter case, in order to simplify the presentation of some technical points, we write $\mathrm{Max}(X) = \emptyset$. Dahlhaus's algorithm is based on the following observation:

Lemma 1 ([2]). Let $X \in \mathcal{F}$ such that $\mathrm{Max}(X) \neq \emptyset$. Then for all $Y \in \mathcal{F}$ such that $Y \cap X \neq \emptyset$ and $|X| \le |Y| \le |\mathrm{Max}(X)|$, $Y$ overlaps $X$ or $\mathrm{Max}(X)$.

Proof. If $Y$ does not overlap $X$, then, as $|X| \le |Y|$ and $Y \cap X \neq \emptyset$, $X \subseteq Y$. Thus $Y \cap \mathrm{Max}(X) \neq \emptyset$. Then, if $Y$ does not overlap $\mathrm{Max}(X)$, one of the two sets contains the other. The inclusion $Y \subseteq \mathrm{Max}(X)$ with $Y \neq \mathrm{Max}(X)$ is impossible, since it would give $X \subseteq \mathrm{Max}(X)$; hence $\mathrm{Max}(X) \subseteq Y$. But in this case, as $|Y| \le |\mathrm{Max}(X)|$, $Y = \mathrm{Max}(X)$ and $Y$ overlaps $X$. Therefore $Y$ overlaps $X$ or $\mathrm{Max}(X)$. □

Let us assume that we have already computed all $\mathrm{Max}(X)$. For each $v \in V$ we compute the list $SL(v)$ of all sets $X \in \mathcal{F}$ to which $v$ belongs. This list is sorted in increasing order of the sizes of the sets. Computing and sorting all lists for all $v \in V$ can be done in $O(|\mathcal{F}|)$ time using a global bucket sort.

Dahlhaus's graph $D(\mathcal{F}, L)$ is built on those lists. Let $X$ be a set containing $v$ such that $\mathrm{Max}(X) \neq \emptyset$. Then for all consecutive pairs $Y\,W$ after $X$ in $SL(v)$ ($X$ included, i.e. $Y$ can be $X$ itself) such that $|W| \le |\mathrm{Max}(X)|$, create an edge $(Y, W)$ in the graph $D$.

Lemma 2 ([2]). The two graphs $D(\mathcal{F}, L)$ and $OG(\mathcal{F}, E)$ have the same connected components.

Proof. ($\Rightarrow$) Let $Y, W \in \mathcal{F}$ such that $(Y, W) \in L$. By construction there exists $v$ such that $Y$ and $W$ are consecutive in $SL(v)$, and there exists $X$ that appears before $Y\,W$ in $SL(v)$ (possibly $X = Y$) such that $\mathrm{Max}(X) \neq \emptyset$ and $|X| \le |Y| \le |W| \le |\mathrm{Max}(X)|$. By Lemma 1, each of $Y$ and $W$ overlaps either $X$ or $\mathrm{Max}(X)$ (unless $Y = X$). As $X$ and $\mathrm{Max}(X)$ overlap, the sets $X$, $Y$, $W$, and $\mathrm{Max}(X)$ belong to the same overlap class of $OG(\mathcal{F}, E)$.
By extension, the vertices of any connected path in $D(\mathcal{F}, L)$ belong to the same overlap class of $OG(\mathcal{F}, E)$.

($\Leftarrow$) Let $A, B \in \mathcal{F}$ be two overlapping sets, i.e. $(A, B) \in E$. Let $v \in A \cap B$. Assume w.l.o.g. that $|A| \le |B|$. Then $\mathrm{Max}(A) \neq \emptyset$ and $|\mathrm{Max}(A)| \ge |B|$. Therefore, in $SL(v)$, there exists a series of consecutive pairs $Y\,W$ from $A$ to $B$ that are linked in $D(\mathcal{F}, L)$. In consequence, $A$ and $B$ are connected in $D(\mathcal{F}, L)$. □

Notice that the order of equally sized sets in the $SL$ lists has no importance for the construction of a Dahlhaus's graph. Figure 1 shows an example of an overlap graph and a Dahlhaus's graph.

Fig. 1. Global example: (A) input family of 11 sets; (B) overlap graph; (C) $SL$ lists; (D) Dahlhaus's graph. In (C), intervals defined by $\mathrm{Max}(X)$ are overlined. Notice that Dahlhaus's graph is not a subgraph of the overlap graph.

Lemma 3 ([2]). Given all $\mathrm{Max}(X)$, $X \in \mathcal{F}$, the graph $D(\mathcal{F}, L)$ can be built in $O(|\mathcal{F}|)$ time and its number of edges is less than or equal to $|\mathcal{F}|$.

Proof. To build the graph $D(\mathcal{F}, L)$ from the $SL$ lists, it suffices to go through each $SL$ list from the smallest set to the largest and remember at each step the largest value $|\mathrm{Max}(X)|$ already seen. If the size of the current set is smaller than or equal to this value, an edge is created between the last two sets considered. Let us now consider the number of edges of $D(\mathcal{F}, L)$. As at most one edge is created for each set in a list $SL$, at most $|\mathcal{F}|$ edges are created after processing all lists.
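The scan described in this proof can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of [4]: the values $|\mathrm{Max}(X)|$ are obtained here by a quadratic brute-force stand-in (`max_sizes_bruteforce`), the sorting uses Python's built-in sort rather than a bucket sort, and all function names and the sample family are hypothetical.

```python
from itertools import combinations


def overlaps(a, b):
    return bool(a & b) and bool(a - b) and bool(b - a)


def overlap_graph_edges(family):
    """Reference edge set of the overlap graph (quadratic)."""
    return [(i, j) for i, j in combinations(range(len(family)), 2)
            if overlaps(family[i], family[j])]


def components(n_vertices, edges):
    """Connected components as sorted lists of vertex indices."""
    parent = list(range(n_vertices))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in edges:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n_vertices):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def max_sizes_bruteforce(family):
    """|Max(X)| for every X, or 0 when Max(X) is undefined (stand-in only)."""
    return [max((len(y) for y in family
                 if len(y) >= len(x) and overlaps(x, y)), default=0)
            for x in family]


def dahlhaus_edges(family, max_size):
    """Edges of D(F, L): one pass over each SL(v), smallest set first."""
    sl = {}
    for i in sorted(range(len(family)), key=lambda i: len(family[i])):
        for v in family[i]:
            sl.setdefault(v, []).append(i)
    edges = []
    for lst in sl.values():
        bound = 0  # largest |Max(X)| seen so far in this list
        for y, w in zip(lst, lst[1:]):
            bound = max(bound, max_size[y])
            if len(family[w]) <= bound:
                edges.append((y, w))
    return edges


def same_classes(family):
    """Check Lemma 2 on one family: D and OG have the same components."""
    d = dahlhaus_edges(family, max_sizes_bruteforce(family))
    og = overlap_graph_edges(family)
    return components(len(family), d) == components(len(family), og)


family = [frozenset(s) for s in
          ({1, 2}, {2, 3}, {1, 2, 3, 4}, {5, 6}, {6, 7}, {8})]
print(components(len(family),
                 dahlhaus_edges(family, max_sizes_bruteforce(family))))
```

Only the sizes $|\mathrm{Max}(X)|$ are needed by the scan, which is why the stand-in returns sizes rather than sets.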
□

Identifying the overlap classes of $OG(\mathcal{F}, E)$ can therefore be done by a simple Depth First Search on $D(\mathcal{F}, L)$ in $O(n + |\mathcal{F}|)$ time. It remains however to explain how to efficiently compute all $\mathrm{Max}(X)$.

3 Computing all Max(X)

Let $LF$ be the list of all $X \in \mathcal{F}$ sorted in decreasing size order. The order of sets of equal size is not important. We consider a boolean matrix $BM$ of size $m \times |V|$ such that each row represents a set $X \in \mathcal{F}$ in the order of $LF$, and each column an element $v \in V$. The value $BM[i, j]$ is 1 if and only if $v_j \in X_i$.

The first step of Dahlhaus's algorithm is to sort the columns of $BM$ in lexicographical order, although [2] gives no detail on how to do it efficiently in $O(|\mathcal{F}|)$ time. We postpone all explanations concerning this step to Section 3.2 and we consider below that all columns of $BM$ are lexicographically sorted. Figure 2 shows the $BM$ matrix for the set family of Figure 1.

Fig. 2. Example continued: $BM$ matrix whose rows are sorted by decreasing sizes of $X \in \mathcal{F}$ and whose columns are sorted in lexicographic order.

For each $X \in \mathcal{F}$ we denote by $\mathrm{left}(X)$ (resp. $\mathrm{right}(X)$) the number of the column of $BM$ containing the leftmost (resp. rightmost) 1 in the row of $X$.

Lemma 4. Let $X, Y \in \mathcal{F}$ such that $Y$ overlaps $X$ and let $r_Y$ be the row of $Y$ in $BM$. Then there exists a row $t$ higher than or equal to $r_Y$ such that $BM[t, \mathrm{left}(X)] = 0$ and $BM[t, \mathrm{right}(X)] = 1$.

Proof. As $Y$ overlaps $X$, $|X| \ge 2$. Let $r_X$ be the row corresponding to $X$ in $BM$.
Since $Y$ overlaps $X$, there exist two indices $1 \le i < j \le |V|$ and a row $r$ such that $BM[r_X, i] = BM[r_X, j] = 1$ and such that one of the values $BM[r, i]$ and $BM[r, j]$ is 1 and the other is 0 (the row $r_Y$ is one such row). We consider the highest $r$ that satisfies these conditions. In a first step, if $BM[r, i] = 1$ and $BM[r, j] = 0$, then, as $i < j$ and as all columns have been sorted in increasing lexicographical order, there must exist a row $r'$ higher than $r$ such that $BM[r', i] = 0$ and $BM[r', j] = 1$. We thus consider now w.l.o.g. that $BM[r, i] = 0$ and $BM[r, j] = 1$.

Among all pairs of indices $i$ and $j$ such that $BM[r_X, i] = BM[r_X, j] = 1$ and such that there exists $r$ with $BM[r, i] = 0$ and $BM[r, j] = 1$, let us consider one pair $i'$ and $j'$, $1 \le i' < j' \le |V|$, that is associated with the highest such $r$, which we denote $t$. We now prove that $BM[t, \mathrm{left}(X)] = 0$ and $BM[t, \mathrm{right}(X)] = 1$. If $BM[t, \mathrm{left}(X)] = 1$, then $i' > \mathrm{left}(X)$ and, as $BM[t, i'] = 0$ and the columns are sorted in lexicographical order, there should exist a higher row $r'$ such that $BM[r', \mathrm{left}(X)] = 0$ and $BM[r', i'] = 1$, which contradicts the fact that $t$ is the highest such row. Thus $BM[t, \mathrm{left}(X)] = 0$. Symmetrically, the same argument proves that $BM[t, \mathrm{right}(X)] = 1$. □

Lemma 5. Let $X \in \mathcal{F}$. Then $\mathrm{Max}(X) \neq \emptyset$ if and only if there exists a row $t$ in $BM$ such that $BM[t, \mathrm{left}(X)] = 0$ and $BM[t, \mathrm{right}(X)] = 1$, corresponding to a set $Y \in \mathcal{F}$ verifying $|Y| \ge |X|$.

Proof. ($\Leftarrow$) If a set $Y$ corresponds to a row $t$ in $BM$ such that $BM[t, \mathrm{left}(X)] = 0$ and $BM[t, \mathrm{right}(X)] = 1$, and $|Y| \ge |X|$, then $Y$ obviously overlaps $X$. As $|Y| \ge |X|$, $\mathrm{Max}(X) \neq \emptyset$. ($\Rightarrow$) Let us assume that $\mathrm{Max}(X) \neq \emptyset$ and let $r_M$ be its row in $BM$. Then, by Lemma 4, there exists a row $t$ in $BM$ such that $BM[t, \mathrm{left}(X)] = 0$ and $BM[t, \mathrm{right}(X)] = 1$ and such that $t$ is higher than or equal to $r_M$.
As $\mathrm{Max}(X)$ verifies $|\mathrm{Max}(X)| \ge |X|$ and $t$ is not lower than $r_M$, the set $Y$ corresponding to $t$ is also such that $|Y| \ge |X|$. □

Lemma 6 ([2]). Let $X \in \mathcal{F}$ such that $\mathrm{Max}(X) \neq \emptyset$. Then $\mathrm{Max}(X)$ corresponds to the highest row $t$ in $BM$ such that $BM[t, \mathrm{left}(X)] = 0$ and $BM[t, \mathrm{right}(X)] = 1$. [Notice that this row might be lower than the row corresponding to $X$. This is the case for $X_8$ and $X_{10}$ in our example, since $\mathrm{Max}(X_{10}) = X_8$ but also $\mathrm{Max}(X_8) = X_{10}$.]

Proof. Let us assume that $\mathrm{Max}(X) \neq \emptyset$ and let $r_M$ be its row in $BM$. Then, by Lemma 4, there exists a row $t$ in $BM$ such that $BM[t, \mathrm{left}(X)] = 0$ and $BM[t, \mathrm{right}(X)] = 1$ and such that $t$ is higher than or equal to $r_M$. However, as such a row $t$ corresponds to a set overlapping $X$ and as $\mathrm{Max}(X)$ is the largest of those sets in $LF$ order, $t = r_M$. □

For example, in Figure 2, $\mathrm{Max}(X_1) = X_9$ since $\mathrm{left}(X_1) = 1$, $\mathrm{right}(X_1) = 6$ and $X_9$ (row 2) corresponds to the highest row with 0 in the first column and 1 in the 6th.

Dahlhaus's approach for computing all $\mathrm{Max}(X)$ is to identify, for each row $r$ corresponding to $X$, the highest row $t$ such that $BM[t, \mathrm{left}(X)] = 0$ and $BM[t, \mathrm{right}(X)] = 1$. To do it efficiently, Dahlhaus reduces the problem to LCA computations. We explain this reduction in Section 3.1. We then present another approach using class partitions in Section 3.2. This new approach is much simpler to implement than the LCA algorithm in its real linear worst-case complexity. Moreover, it allows an easy computation of the lexicographical order of the columns.

3.1 Computing all Max(X) using LCA

Let us consider the intermediate columns between all pairs of consecutive columns in $BM$. In each intermediate column, for each row, we place a point • wherever the two neighbouring entries of the row form the pattern 01 or 10. This is shown in Figure 3 (left). We link the highest point of each intermediate column, if it exists, into a Dahlhaus's tree ($DT$) in the following way:

1. the root of the tree is the highest point. There can be only one root, and there must be one root if one of the sets $X \in \mathcal{F}$ differs from $V$. We assume this below;
2. we recurse the following process: each new point $np$ in the tree (root included) splits the submatrix into two subparts according to the intermediate column it is placed in; the left (resp. right) child of $np$ is the highest point in the left (resp. right) part, if it exists. Note that the lexicographical order of the columns of $BM$ ensures that there can be at most one highest point in each part;
3. when a subpart does not contain any new point, a leaf per $BM$ column in this subpart is created and attached as a child to the point that created the subpart. If this point is placed to the left (resp. right) of this column, the child is a right (resp. left) child. Each leaf is numbered with the number of the corresponding column in $BM$.

An instance of such a tree is given in Figure 3 (right).

Fig. 3. Example continued: Dahlhaus's tree built over a $BM$ matrix.

Proposition 1 ([2]). Let $X \in \mathcal{F}$. Let $Y \in \mathcal{F}$ be the set corresponding to the row of $LCA(\mathrm{left}(X), \mathrm{right}(X))$ in $BM$. If $|Y| \ge |X|$, then $Y = \mathrm{Max}(X)$. Otherwise $\mathrm{Max}(X) = \emptyset$.

Proof.
Let $r$ be the number of the row of $LCA(\mathrm{left}(X), \mathrm{right}(X))$ in $BM$ and let $l$ be the position of the column in $BM$ that is just before the point representing $LCA(\mathrm{left}(X), \mathrm{right}(X))$.

First, $BM[r, l] = 0$ and $BM[r, l+1] = 1$. Suppose on the contrary that $BM[r, l] = 1$ and $BM[r, l+1] = 0$. As all columns of $BM$ are sorted in lexicographical order, there must exist a higher row $r'$ such that $BM[r', l] = 0$ and $BM[r', l+1] = 1$, and thus a point in the intermediate column between $l$ and $l+1$ higher than that in row $r$, which contradicts the construction of $DT$.

We now prove that $BM[r, \mathrm{left}(X)] = 0$ and $BM[r, \mathrm{right}(X)] = 1$. Suppose on the contrary that $BM[r, \mathrm{left}(X)] = 1$. Then, again, as the columns of $BM$ are sorted in lexicographical order, there must exist a higher row $r'$ such that $BM[r', \mathrm{left}(X)] = 0$ and $BM[r', l] = 1$. This again contradicts the construction of $DT$. A similar argument holds for the right side.

We then prove that $r$ is the highest row with this property. Assume on the contrary that there exists a higher row $r'$ such that $BM[r', \mathrm{left}(X)] = 0$ and $BM[r', \mathrm{right}(X)] = 1$. Then there would have been a split 01 somewhere in this row that would have separated $\mathrm{left}(X)$ and $\mathrm{right}(X)$. This implies that there would have been a node in $DT$ in a row higher than or equal to $r'$ that would have split $\mathrm{left}(X)$ and $\mathrm{right}(X)$, which contradicts the fact that $r$ is the number of the row of $LCA(\mathrm{left}(X), \mathrm{right}(X))$.

If $|Y| \ge |X|$, by Lemma 5 $\mathrm{Max}(X) \neq \emptyset$ and, by Lemma 6, the set $Y$ that corresponds to $r$ is such that $Y = \mathrm{Max}(X)$. If $|Y| < |X|$, since no row $r'$ higher than $r$ can verify $BM[r', \mathrm{left}(X)] = 0$ and $BM[r', \mathrm{right}(X)] = 1$, by Lemma 5 $\mathrm{Max}(X) = \emptyset$. □

For example, $X_9$ corresponds to the row of $LCA(1, 2) = LCA(\mathrm{left}(X_{11}), \mathrm{right}(X_{11}))$. As $|X_9| \ge |X_{11}|$, $X_9 = \mathrm{Max}(X_{11})$.
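The characterisation of $\mathrm{Max}(X)$ through the sorted matrix (Lemmas 5 and 6, which Proposition 1 answers with one LCA query) can be checked on small inputs with the following Python sketch. It is deliberately naive and quadratic: columns are sorted with the built-in sort and the highest row is found by a linear scan instead of an LCA query. The function names, the assumption that all sets are nonempty, and the three-set sample family are illustrative choices, not part of the original algorithm.

```python
from itertools import combinations


def overlaps(a, b):
    return bool(a & b) and bool(a - b) and bool(b - a)


def lf_order(family):
    """Indices of the sets by decreasing size (ties broken by input order)."""
    return sorted(range(len(family)), key=lambda i: -len(family[i]))


def max_by_definition(family):
    """Max(X) for each X as an index into family, or None if undefined."""
    lf = lf_order(family)
    return [next((i for i in lf if len(family[i]) >= len(x)
                  and overlaps(family[i], x)), None)
            for x in family]


def max_by_matrix(family, universe):
    """Same result, read off the lexicographically sorted columns of BM."""
    lf = lf_order(family)
    # Column of v = its bits read from the top row (largest set) downwards.
    cols = sorted(universe,
                  key=lambda v: tuple(int(v in family[i]) for i in lf))
    pos = {v: k for k, v in enumerate(cols)}
    result = []
    for x in family:
        left = min(x, key=pos.get)    # element in column left(X)
        right = max(x, key=pos.get)   # element in column right(X)
        # Highest row with 0 in column left(X) and 1 in column right(X).
        t = next((i for i in lf
                  if left not in family[i] and right in family[i]), None)
        if t is not None and len(family[t]) < len(x):
            t = None                  # size test of Proposition 1
        result.append(t)
    return result


fam = [frozenset({1, 2, 3}), frozenset({3, 4}), frozenset({2, 3})]
print(max_by_matrix(fam, [1, 2, 3, 4]) == max_by_definition(fam))
```

Comparing the two functions on exhaustively generated small families is a cheap way to validate a faster implementation.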
3.2 Computing all Max(X) using set partitioning

We present below an alternative approach that avoids LCA queries. Moreover, the lexicographical column order appears as a by-product.

We manipulate sorted partitions of $V$ that we refine by each $X \in \mathcal{F}$ taken in $LF$ order, that is, in decreasing order of their sizes. The initial partition is the whole set $V$ and is denoted $P_V$. For clarity, a set in a partition is called a part. In each partition the order of the parts is important, but the order of the elements inside a part is not.

Let $C = \{v_1, \ldots, v_k\}$ be a part in a partition. Refining $C$ by $X \in \mathcal{F}$ consists in extracting all $v_i \in X$ from $C$ and creating a new part $C''$ with all those $v_i$. The remaining $v_i \notin X$ of $C$ form a new part $C'$, and $C$ is replaced in the current partition by $C'\,C''$. If $C$ contains only elements of $X$, or none, $C$ remains unchanged in the partition. Refining a partition $P$ by a set $X \in \mathcal{F}$ consists in successively refining all parts of $P$. We denote this refinement by $P|_X$.

For example (continued), if $P = \{a\}\{i, j, k, l\}\{b\}\{c, d\}\{e, f, g, h\}$ and $X = X_4 = \{d, e\}$, then $P|_X = \{a\}\{i, j, k, l\}\{b\}\{c\}\{d\}\{f, g, h\}\{e\}$.

Our approach requires 3 steps:

1. refine $P_V$ by all $X \in \mathcal{F}$ taken in $LF$ order;
2. then compute for each $X \in \mathcal{F}$ the values of $\mathrm{left}(X)$ and $\mathrm{right}(X)$ and sort all $X \in \mathcal{F}$ in a special order with regard to these values;
3. finally refine $P_V$ again by all $X \in \mathcal{F}$ taken in $LF$ order, but using the information computed in step 2 to compute all $\mathrm{Max}(X)$.

We detail each step below.

Step 1 - Refining $P_V$. Let us consider the final partition we obtain after refining $P_V$ by each $X \in \mathcal{F}$ taken in $LF$ order. We denote this partition by $P_f$.

Lemma 7. The elements of $P_f$ are sorted according to the lexicographical order of the columns of $BM$.

Proof.
Refining a partition by a set amounts to lexicographically sorting the columns on one row of $BM$: only the elements with a 1 in that row are moved, and the global order already defined by the parts of the partition is kept. Thus refining partitions from $P_V$ in $LF$ order consists in lexicographically ordering $BM$ from the top row to the bottom. □

For example (continued), on the data in Figure 1, $P_f = \{a\}\{i\}\{l\}\{j\}\{k\}\{b\}\{c\}\{d\}\{h\}\{f, g\}\{e\}$. Note that equal columns of $BM$ are in the same part of $P_f$, on which we fix an arbitrary order.

Step 2 - Computing all left(X) and right(X) values. We then compute all $\mathrm{left}(X)$ and $\mathrm{right}(X)$ values on $P_f$. This can be done easily in $O(|\mathcal{F}| + n)$ time by scanning each $X \in \mathcal{F}$ and keeping the minimum and maximum positions of its elements in $P_f$.

We also compute a data structure $AM$ that, for each position $1 \le i \le |V|$ of $P_f$, gives a list of all $X \in \mathcal{F}$ such that $i = \mathrm{right}(X)$. All those lists are sorted in increasing order of $\mathrm{left}(X)$. The structure also allows an element $X \in \mathcal{F}$ to be removed from the list $AM[\mathrm{right}(X)]$ in $O(1)$ time. This can be ensured for instance by implementing each list as a doubly linked list, and the whole structure can easily be built in $O(n + m)$ time using bucket sorting.

Step 3 - Refining $P_V$ again and identifying all Max(X). The main idea is the following. Assume that at a step of the refinement process in $LF$ order we refine a part $C = \{v_1, \ldots, v_k\}$ of a partition $P$ by $Y \in \mathcal{F}$ and that this results in two nonempty parts $C'\,C''$.

Lemma 8. Let $X \in \mathcal{F}$ such that $|X| \le |Y|$, $\mathrm{left}(X) \in C'$ and $\mathrm{right}(X) \in C''$. Then $Y = \mathrm{Max}(X)$. [Note that if $|X| = |Y|$ then $X$ could be before $Y$ in $LF$ order.]

Proof. Let $r$ be the row corresponding to $Y$ in $BM$. As $\mathrm{left}(X) \in C'$ and $\mathrm{right}(X) \in C''$, $BM[r, \mathrm{left}(X)] = 0$ and $BM[r, \mathrm{right}(X)] = 1$, and $Y$ obviously overlaps $X$. As $|X| \le |Y|$, $\mathrm{Max}(X) \neq \emptyset$.
Moreover, the row $r$ is the highest such that $BM[r, \mathrm{left}(X)] = 0$ and $BM[r, \mathrm{right}(X)] = 1$, since otherwise $\mathrm{left}(X)$ and $\mathrm{right}(X)$ would have been split by a set preceding $Y$ in the $LF$ order. Thus, by Lemma 6, $Y = \mathrm{Max}(X)$. □

The last phase of the algorithm thus consists in refining $P_V$ again by all $Y \in \mathcal{F}$ taken in $LF$ order. We first initialize all values $\mathrm{Max}(X)$ to $\emptyset$. Each time a new split $C'\,C''$ appears (say between positions $l$ and $l+1$), for all positions $v \in C''$ the lists $AM[v]$ are inspected in the following way: let $X$ be the top of one of those lists; while $\mathrm{left}(X) \le l$, $X$ is popped off the list and $\mathrm{Max}(X) \leftarrow Y$. After having refined with $Y$, if no set $Y'$ with $|Y'| = |Y|$ remains to be processed in $LF$ order, all sets of the same size as $Y$ are removed from the $AM$ structure.

Lemma 9. Our algorithm correctly computes in 3 steps all $\mathrm{Max}(X)$, $X \in \mathcal{F}$.

Proof. In step 1 the lexicographical order of the columns of $BM$ is computed as a partition $P_f$ (Lemma 7). In step 2 all values $\mathrm{left}(X)$ and $\mathrm{right}(X)$, $X \in \mathcal{F}$, are computed and the $AM$ structure is built. In step 3, the correctness of the computation relies on the following observation: for each new partition $P$ created after a refinement, all sets $X$ remaining in $AM$ are such that $\mathrm{left}(X)$ and $\mathrm{right}(X)$ belong to the same part of $P$. This is obviously true since otherwise they would have been split by a previous refinement and removed from $AM$. As a consequence, after a split of a part $C$ into $C'\,C''$ by a set $Y$, testing whether $\mathrm{left}(X) \in C'$ and $\mathrm{right}(X) \in C''$ for all sets in $AM$ is equivalent to testing whether $\mathrm{right}(X) \in C''$ and $\mathrm{left}(X) \le l$, where $l$ is the left position in $P$ of the split between $C'$ and $C''$.
Moreover, as each set taken in $LF$ order and used for a possible refinement is removed from $AM$ after all the sets of the same size have been processed, when a set $Y$ splits a part $C$ into $C'\,C''$, all sets $X$ in $AM$ are such that $|X| \le |Y|$. We thus fulfill all requirements of Lemma 8 and $Y = \mathrm{Max}(X)$. Thus, if a value $\mathrm{Max}(X)$ is assigned by our algorithm, it is the right one.

Now, suppose that a set $X$ admits a set $Y$ as $\mathrm{Max}(X)$. It is guaranteed that at a certain step of the algorithm $Y$ is assigned to $\mathrm{Max}(X)$: by definition $|X| \le |Y|$, which implies that $X$ is still in $AM$ when $Y$ is processed, and by Lemma 6 the element in column $\mathrm{left}(X)$ is not in $Y$ while the element in column $\mathrm{right}(X)$ is in $Y$. The set $Y$ has thus split a part $C$ of a partition into $C'\,C''$ such that $\mathrm{right}(X) > l$ and $\mathrm{left}(X) \le l$, where $l$ is the left position in $P$ of the split between $C'$ and $C''$. □

It remains to explain how a partition refinement can be efficiently implemented. We exploit the fact that the order of the elements inside each part of a partition has no importance to obtain a simple implementation: a partition is represented as a table of size $n$ in which each cell contains (a) an element of $V$ and (b) a pointer to the part of the partition in which it is contained. A part is represented by the pair of its bounds in this table. Figure 4 shows such an implementation.

Refining a partition $P$ by a set $Y$ can be done in $O(|Y|)$ time in the following way. Let $[i, j]$ be the bounds of a part $C$ such that $C \not\subset Y$ (easily testable). Let $k$ be the number of elements of $Y$ that belong to the subtable $[i, j]$, $1 \le k \le j - i$. We swap elements in the subtable $[i, j]$ to place all $k$ elements belonging to $Y$ at the end of this subtable. We then adjust the bounds of $C$ to $[i, j - k]$ and create a new part $[j - k + 1, j]$ to which the $k$ elements of $Y$ now point.

Theorem 1.
The identification of all $\mathrm{Max}(X)$, $X \in \mathcal{F}$, using partition refinement can be done in $\Theta(n + |\mathcal{F}|)$ time.

Fig. 4. Example continued: implementation of $P = \{a\}\{i, j, k, l\}\{b\}\{c, d\}\{e, f, g, h\}$. The table holds $a, i, j, k, l, b, c, d, e, f, g, h$ in cells 1 to 12, and the parts are the bound pairs $(1,1)$, $(2,5)$, $(6,6)$, $(7,8)$, $(9,12)$.

Proof. By Lemma 9 the algorithm is correct. Steps 1 and 2 take $\Theta(|\mathcal{F}| + n)$ time. In step 3, the fact that all lists in $AM$ are sorted in increasing order of $\mathrm{left}()$ values ensures that when a set $Y$ splits a part $C$ into $C'\,C''$, identifying and popping off all sets $X$ such that $\mathrm{left}(X) \in C'$ and $\mathrm{right}(X) \in C''$ can be done in $\Theta(|C''| + K + 1)$ time, where $K$ is the number of such sets. Removing a set from $AM$ takes $O(1)$ time, thus the total time spent managing $AM$ is $\Theta(|\mathcal{F}| + n)$. □

The whole algorithm has been implemented in its real worst-case time complexity and is freely available in [4].

4 Computing a subgraph of the overlap graph

In some applications, as in [3], it is useful to get a spanning tree of each overlap class of $OG(\mathcal{F}, E)$. The approach of [3] is to first compute Dahlhaus's graph and then compute spanning trees of the connected components of the overlap graph using a quite complex add-on. We thus explain in this section how to simply modify Dahlhaus's approach to compute a subgraph of the overlap graph instead of $D(\mathcal{F}, L)$. The size of the subgraph is linear, but it has the same connected components as the overlap graph, and it is thus easy to compute spanning trees of the overlap graph from it. The idea of the modification is the following.

Lemma 10. Let $X, Y \in \mathcal{F}$ such that $X \cap Y \neq \emptyset$, $\mathrm{Max}(X) \neq \emptyset$ and $|X| \le |Y| \le |\mathrm{Max}(X)|$. Let $r_Y$ be the row of $Y$ in $BM$. If $BM[r_Y, \mathrm{left}(X)] = 0$, $Y$ overlaps $X$. Otherwise, (a) if $BM[r_Y, \mathrm{right}(X)] = 0$, then $Y$ overlaps $X$, and (b) if $BM[r_Y, \mathrm{right}(X)] = 1$, then $Y$ overlaps $\mathrm{Max}(X)$.

Proof.
Let $r_X$ be the row of $X$ in $BM$, and $r$ that of $\mathrm{Max}(X)$. If $BM[r_Y, \mathrm{left}(X)] = 0$, then, as $BM[r_X, \mathrm{left}(X)] = 1$, $X \cap Y \neq \emptyset$ and $|X| \le |Y|$, $Y$ overlaps $X$. Assume now that $BM[r_Y, \mathrm{left}(X)] = 1$. Case (a): if $BM[r_Y, \mathrm{right}(X)] = 0$, then, as $BM[r_X, \mathrm{right}(X)] = 1$, the same arguments as above show that $Y$ overlaps $X$. Case (b): if $BM[r_Y, \mathrm{right}(X)] = 1$, then, as by Lemma 6 $BM[r, \mathrm{right}(X)] = 1$ and $BM[r, \mathrm{left}(X)] = 0$, and as $|Y| \le |\mathrm{Max}(X)|$, $Y$ overlaps $\mathrm{Max}(X)$. □

We modify the construction of Dahlhaus's graph in the following way. We still consider intervals $X \ldots Y\,W \ldots$ on the $SL(v)$ lists such that $\mathrm{Max}(X) \neq \emptyset$ and $|W| \le |\mathrm{Max}(X)|$, but instead of creating a chain $X - \ldots - Y - W - \ldots$ in $D(\mathcal{F}, L)$, we create an edge $(X, \mathrm{Max}(X))$ (if it does not already exist) and a list of quintuples $(\mathrm{left}(X), \mathrm{right}(X), X, Y, \mathrm{Max}(X))$, $(\mathrm{left}(X), \mathrm{right}(X), X, W, \mathrm{Max}(X))$, ... for all the elements of the interval distinct from $X$ and $\mathrm{Max}(X)$.

Fig. 5. Global example (continued): (A) input family of 11 sets; (B) $LQ_1$ and $LQ_2$ lists, in which $\mathrm{right}(X)$ and $\mathrm{left}(X)$ have been replaced by $P_f[\mathrm{right}(X)]$ and $P_f[\mathrm{left}(X)]$; (C) $SL$ lists; (D) the resulting subgraph of the overlap graph.

All quintuples for all intervals are placed in the same list $LQ_1$. Note that if an element belongs to 2 intervals, a unique quintuple is formed with the rightmost interval.
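The case analysis of Lemma 10, which decides the edge created for each quintuple, can be sketched in Python as follows. This is a naive illustration under stated assumptions: the column order and $\mathrm{Max}(X)$ are recomputed by brute force, the sets are assumed nonempty, and the names `lemma10_partner` and `check_lemma10` are hypothetical. The bucket-sorted lists $LQ_1$ and $LQ_2$ that make these tests run in linear time are not modelled here.

```python
from itertools import combinations


def overlaps(a, b):
    return bool(a & b) and bool(a - b) and bool(b - a)


def column_positions(family, lf, universe):
    """Position of each element in the lexicographically sorted columns."""
    cols = sorted(universe,
                  key=lambda v: tuple(int(v in family[i]) for i in lf))
    return {v: k for k, v in enumerate(cols)}


def max_by_definition(family, lf, x):
    """Index of Max(family[x]), or None if it is undefined."""
    return next((i for i in lf if len(family[i]) >= len(family[x])
                 and overlaps(family[i], family[x])), None)


def lemma10_partner(family, pos, x, y, max_x):
    """Index of the set (x or max_x) that family[y] overlaps by Lemma 10."""
    left = min(family[x], key=pos.get)    # element in column left(X)
    right = max(family[x], key=pos.get)   # element in column right(X)
    if left not in family[y]:             # BM[r_Y, left(X)] = 0
        return x
    if right not in family[y]:            # case (a)
        return x
    return max_x                          # case (b)


def check_lemma10(family, universe):
    """Verify the lemma for every admissible pair (X, Y) of the family."""
    lf = sorted(range(len(family)), key=lambda i: -len(family[i]))
    pos = column_positions(family, lf, universe)
    for x in range(len(family)):
        max_x = max_by_definition(family, lf, x)
        if max_x is None:
            continue
        for y in range(len(family)):
            admissible = (y != x and family[x] & family[y] and
                          len(family[x]) <= len(family[y])
                          <= len(family[max_x]))
            if admissible:
                partner = lemma10_partner(family, pos, x, y, max_x)
                if not overlaps(family[y], family[partner]):
                    return False
    return True


fam = [frozenset({1, 2, 3}), frozenset({3, 4}), frozenset({2, 3})]
print(check_lemma10(fam, [1, 2, 3, 4]))
```

Each call to `lemma10_partner` yields an edge that is guaranteed to belong to the overlap graph, which is what makes the modified construction a subgraph.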
To apply Lemma 10, it suffices for each quintuple $(\mathrm{left}(X), \mathrm{right}(X), X, Y, \mathrm{Max}(X))$ to test whether $Y$ belongs to $SL(P_f[\mathrm{left}(X)])$, where $P_f[i]$ denotes the element at position $i$ of $P_f$. If not, we then create an edge $(X, Y)$. Otherwise, we test whether $Y$ belongs to $SL(P_f[\mathrm{right}(X)])$. If not, we also create an edge $(X, Y)$. However, if it does, we create an edge $(Y, \mathrm{Max}(X))$.

For complexity reasons we need to perform those tests in a single pass for all quintuples in $LQ_1$. We do it in two phases. In the first phase we search for all $Y$ in $SL(P_f[\mathrm{left}(X)])$. If $Y$ does not belong to $SL(P_f[\mathrm{left}(X)])$, we create the edge $(X, Y)$; otherwise we add the quintuple $(\mathrm{left}(X), \mathrm{right}(X), X, Y, \mathrm{Max}(X))$ to a second list $LQ_2$. In the second phase, if $LQ_2$ is not empty, for all $(\mathrm{left}(X), \mathrm{right}(X), X, Y, \mathrm{Max}(X))$ in $LQ_2$ we search for $Y$ in $SL(P_f[\mathrm{right}(X)])$.

We assume below that all $SL(v)$ lists are sorted according to the $LF$ order instead of being simply sorted by increasing sizes. To efficiently compare $LQ_1$ with all $SL(v)$ lists it suffices to sort the list $LQ_1$ according to $\mathrm{left}(X)$ and then sort all quintuples with the same $\mathrm{left}(X)$ value in the $LF$ order of $Y$. This can be done in $O(n + |\mathcal{F}|)$ time using bucket sorting. The comparison of $LQ_1$ and the tables $SL()$ can then be done in $O(n + |\mathcal{F}|)$ time by comparing simultaneously $|V|$ sorted lists. The same approach holds for $LQ_2$. We thus have:

Theorem 2. A subgraph of the overlap graph of $\mathcal{F}$ having the same connected components can be computed in $O(n + |\mathcal{F}|)$ time.

Proof. Lemma 10 ensures that the new graph is a subgraph of the overlap graph. To prove that they have the same connected components, it thus suffices to prove that if two sets $A$ and $B$ overlap, there exists a path connecting $A$ and $B$ in the subgraph.
The following observation is the basis of the proof: let $X \ldots Y \ldots Z$ be sorted by increasing size on the same $SL(v)$ and such that $|Y| \le |\mathrm{Max}(X)|$, $|\mathrm{Max}(X)| \le |\mathrm{Max}(Y)|$ and $|Z| \le |\mathrm{Max}(Y)|$. Then there exists a path between all sets $X$, $Y$, $Z$ in the new subgraph, since by construction $X$ and $\mathrm{Max}(X)$ are connected, $Y$ is connected to $X$ or $\mathrm{Max}(X)$, $Y$ is connected to $\mathrm{Max}(Y)$ and finally $Z$ is connected to $Y$ or $\mathrm{Max}(Y)$.

Now let $v \in A \cap B$. Assume w.l.o.g. that $|A| \le |B|$. Then $\mathrm{Max}(A) \neq \emptyset$ and $|\mathrm{Max}(A)| \ge |B|$. Therefore, in $SL(v)$, there exists a series (potentially empty) of $k$ sets $A \ldots Y_1 \ldots Y_2 \ldots Y_k \ldots B$ such that $|B| \le |\mathrm{Max}(Y_k)|$, $|Y_k| \le |\mathrm{Max}(Y_{k-1})|$, and $|Y_1| \le |\mathrm{Max}(A)|$. By induction on the series, using the previous observation, there exists a path from $A$ to $B$ in the subgraph.

The subgraph can obviously be built in $O(n + |\mathcal{F}|)$ time since all steps can be done in this time. □

An example (continued) of the resulting subgraph is shown in Figure 5.

References

1. A. L. Buchsbaum, H. Kaplan, A. Rogers, and J. R. Westbrook. Linear-time pointer-machine algorithms for least common ancestors, MST verification, and dominators. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing (STOC), pages 279–288. ACM Press, 1998.
2. E. Dahlhaus. Parallel algorithms for hierarchical clustering and applications to split decomposition and parity graph recognition. J. Algorithms, 36(2):205–240, 2000.
3. R. M. McConnell. A certifying algorithm for the consecutive-ones property. In SODA, pages 768–777, 2004.
4. M. Rao. Set overlap classes computation, source code. 2007. Freely available at http://www.liafa.jussieu.fr/~raffinot/overlap.html.
