On building minimal automaton for subset matching queries

Kimmo Fredriksson
School of Computing, University of Eastern Finland, P.O. Box 1627, 70211 Kuopio, Finland
kimmo.fredriksson@uef.fi

Abstract

We address the problem of building an index for a set $D$ of $n$ strings, where each string location is a subset of some finite integer alphabet of size $\sigma$, so that we can efficiently answer whether a given simple query string $p$ (where each string location is a single symbol) occurs in the set. That is, we need to efficiently find a string $d \in D$ such that $p[i] \in d[i]$ for every $i$. We show how to build such an index in $O(n^{\log_{\sigma/\Delta}(\sigma)} \log(n))$ average time, where $\Delta$ is the average size of the subsets. Our methods have applications e.g. in computational biology (haplotype inference) and music information retrieval.

Keywords: algorithms; approximate string matching; subset matching; finite-state automaton minimization

1 Introduction

Let $\Sigma = \{0, \ldots, \sigma - 1\}$ be an ordered integer alphabet. We are given a set $D = \{d_0, \ldots, d_{n-1}\}$ of strings, called a dictionary. Each location $j$ of the string $d_i$ is a subset of $\Sigma$, i.e. $d_i[j] \subseteq \Sigma$ for every $0 \le i \le n - 1$ and $0 \le j \le |d_i| - 1$. A string $p$ is called simple if each of its locations is a single symbol from $\Sigma$, i.e. $p[j] \in \Sigma$. The simple query string $p$ matches the dictionary string $d_i \in D$ iff $|p| = |d_i|$ and $p[j] \in d_i[j]$ for $0 \le j \le |p| - 1$. We consider the following two problems:

Problem 1 Decide if $p$ matches any string in $D$.

Problem 2 Retrieve the set $L = \{j_1, \ldots, j_r\}$ such that $p$ matches $d_{j_i}$ for all $j_i \in L$.

In particular, we set out to efficiently build a small index for $D$ such that both problems can be solved in $O(|p|)$ time.
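
To make the matching relation concrete, the two problems can be stated as a brute-force check in Python. This is only a reference sketch, not the index developed in this paper: it scans the whole dictionary for every query instead of answering in $O(|p|)$ time. The representation (a dictionary string as a list of Python sets) and the names `matches` and `matching_ids` are assumptions made for illustration.

```python
from typing import List, Sequence, Set

SubsetString = List[Set[int]]  # one subset of the alphabet per string location


def matches(p: Sequence[int], d: SubsetString) -> bool:
    """True iff |p| = |d| and p[j] is in d[j] for every location j."""
    return len(p) == len(d) and all(c in s for c, s in zip(p, d))


def matching_ids(p: Sequence[int], D: List[SubsetString]) -> List[int]:
    """Problem 2 by brute force: indices of all dictionary strings matched by p."""
    return [i for i, d in enumerate(D) if matches(p, d)]


def occurs(p: Sequence[int], D: List[SubsetString]) -> bool:
    """Problem 1 by brute force: does p match any string in D?"""
    return any(matches(p, d) for d in D)


# Toy dictionary over the alphabet {0, 1, 2, 3} (hypothetical data).
D = [
    [{0}, {1, 2}, {3}],
    [{0, 1}, {2}, {3}],
    [{2}, {2}],
]
```

With this toy dictionary, the query `[0, 2, 3]` matches both of the first two strings, while `[2, 2]` matches only the third.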
Efficient solutions of these problems have applications in computational biology, in matching DNA ($\sigma = 4$) or protein ($\sigma = 20$) strings, or in haplotype inference ($\sigma = 2$) [9, 10]. Note that if $|d_i[j]|$ is either 1 or $\sigma$ for all $i, j$, then we have a special case called wild-card matching [3]. Another special case is $\delta$-matching (see e.g. [2]), where we have $d_i[j] = \{c_{i,j} - \delta, \ldots, c_{i,j} + \delta\}$, where $c_{i,j} \in \Sigma$ and $\delta < \sigma$. These variants have applications in indexing natural language words and in music information retrieval.

1.1 Related work

Assume that the longest string in $D$ has length $m$ and that for every $d_i \in D$ there are at most $k$ locations where $|d_i[j]| > 1$. The immediate trivial solution to our problem would then be as follows. First generate all the simple strings of length $m$ that match a string in $D$. Call the set of these strings $D'$. The size of $D'$ is upper bounded by $O(n\sigma^k)$. The problem is now transformed to exact matching, so we can insert all strings in $D'$ to some data structure that can answer whether a given simple query string matches a string in the data structure exactly. One such data structure is a path compressed trie [7] (cf. Sec. 2). This can be naïvely built in $O(m|D'|) = O(mn\sigma^k)$ time and space. The queries can be answered in $O(|p|)$ time.

This is also the approach in [10]. They give two non-trivial algorithms to construct the (path compressed) trie faster, namely in $O(nm + \sigma^k n \log(\min\{n, m\}))$ and $O(nm + \sigma^k n + \sigma^{k/2} n \log(\min\{n, m\}))$ time, yielding query times of $O(|p|)$ and $O(|p| \log\log(\sigma) + \min\{|p|, \log(\sigma^k n)\} \log\log(\sigma^k n))$, respectively (the latter method in fact uses two tries). The techniques in [3] can be adapted [10] to solve the problem with $O(nm \log(nm) + n \log^k(n/k!))$ preprocessing time, and $O(m + \log^k(n) \log\log(n))$ query time.
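
The trivial solution described at the start of Sec. 1.1 can be sketched as follows. The sketch enumerates $D'$ explicitly with `itertools.product`, which makes the $O(n\sigma^k)$ blow-up easy to observe on small inputs. A built-in Python set stands in for the path compressed trie; that substitution, the function name `expand`, and the toy data are assumptions of this illustration.

```python
from itertools import product
from typing import List, Set, Tuple


def expand(D: List[List[Set[int]]]) -> Set[Tuple[int, ...]]:
    """Return D', the set of all simple strings that match some string in D."""
    d_prime: Set[Tuple[int, ...]] = set()
    for d in D:
        # product(*d) enumerates every way of picking one symbol per location.
        d_prime.update(product(*d))
    return d_prime


# Toy dictionary over the alphabet {0, 1, 2, 3} (hypothetical data).
D = [[{0}, {1, 2}, {3}], [{0, 1}, {2}, {3}], [{2}, {2}]]
D_prime = expand(D)  # exact-match queries are now plain set lookups
```

Here `(0, 2, 3)` is generated by both of the first two dictionary strings but stored only once, so $|D'| = 4$. A single string whose 10 locations all equal $\{0, 1\}$ already expands to $2^{10}$ simple strings, which is the growth the rest of the paper avoids.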
1.2 Our contributions

Inspired by [10], we also take the approach of computing the trie for $D'$ as a starting point. However, instead of a trie, we directly build a pseudo-minimal (cf. Sec. 2.2) deterministic finite-state automaton (DFA) corresponding to the set $D'$; i.e. our method does not explicitly generate the set $D'$. The resulting automaton can be used to solve Problems 1 and 2 in $O(|p|)$ time. This automaton can be easily and efficiently minimized (again, cf. Sec. 2.2), so that Problem 1 can still be solved in $O(|p|)$ time. We also propose a form of path compression that can further save space and speed up the construction. We show that our construction works in $O(n^{\log_{\sigma/\Delta}(\sigma)} \log(n))$ average time, where $\Delta = \mathrm{avg}\,|d_i[j]|$. As shown experimentally, our algorithm can be orders of magnitude faster in construction time than the related naïve approach of first building a trie for $D'$ and then converting it to the minimal DFA, or directly building the minimal DFA from $D'$. The pseudo-minimal automaton is more efficient to construct than the true minimal automaton, and is in practice only slightly larger.

2 The algorithm

Let us define a DFA as $M(Q, \Sigma, \delta, q, F)$, where $Q$ is the set of states, $q$ is the initial state, $F \subseteq Q$ is the set of accepting states and $\delta : Q \times \Sigma \to Q$ is the transition function. For convenience we also define $\delta^*(q, \varepsilon) = q$ for the empty string $\varepsilon$, and $\delta^*(q, aw) = \delta^*(\delta(q, a), w)$ for a symbol $a \in \Sigma$ and a string $w \in \Sigma^*$.

2.1 Prelude

Traditionally a trie [7] is described as a rooted tree storing a set of (simple) strings. Each node has at most $\sigma$ children, and the (directed) edges are labeled by the symbols in $\Sigma$. In a path compressed trie the unary paths are compacted to single edges, labeled by strings consisting of the concatenation of the symbols in the original path.
In both cases, a path from the root to any node $u$ spells out a prefix of a subset of the strings stored in the trie, and that subset is stored in the subtree rooted at $u$. The trie can be seen as a DFA in an obvious way: the root node corresponds to the state $q$, and the labeled edges correspond to $\delta$. We extend the DFA so that to each node $u \in F$ we attach a list $L$, storing the corresponding string identifiers. More formally, we define

$$j_i \in L(u) \Leftrightarrow u \text{ matches } d_{j_i} \in D, \tag{1}$$

where, used as a string, $u$ denotes the string spelled by the path from $q$ to $u$, i.e. $u = (w \mid \delta^*(q, w) = u)$. Thus by generating all the strings $D'$ that match a string in $D$ and building a DFA for $D'$, Problems 1 and 2 can be solved in $O(|p|)$ time.

One of the problems of this approach is that $|D'|$ can be large. A way to alleviate this is to minimize the DFA. There exists a large number of algorithms for this task [4]. Some of these can build the automaton incrementally, inserting one string at a time while maintaining the automaton in minimal state (e.g. [6]). This can still be unnecessarily slow. Moreover, the result does not allow a proper mapping between the states and the lists $L$. E.g. if all the strings in $D$ are of equal length, the resulting minimal DFA would have only one accepting state. However, this automaton can still be used to solve Problem 1. Another solution is to construct a pseudo-minimal DFA [11, 5], which still allows mapping states or transitions to strings. We take a similar approach, although our definition of pseudo-minimal is somewhat different.

2.2 Pseudo-minimal DFA

We now present an algorithm that directly (i.e. our algorithm never deletes a state) constructs a pseudo-minimal DFA from $D$, without using a trie-like DFA as an intermediate step, or explicitly generating the set $D'$.
Nevertheless, we first describe a particular (direct) way to build a trie-DFA, then define a certain equivalence relation for the trie states, and show how we can during the construction avoid creating new states by identifying an equivalent state already present.

The algorithm can proceed in either a depth-first or a breadth-first manner, with minor differences. We describe and give pseudocode for the breadth-first variant: the construction begins by inserting the starting state (root node) into a queue of states; at each stage a state is dequeued and its children are computed and enqueued. The algorithm terminates when the queue becomes empty. As described above, each state $u$ will have an associated list $L(u)$ ($L(u) = \emptyset$ if $u \notin F$). We will denote the partially computed list as $L'(u)$ ($L'(u) \ne \emptyset$). The following invariants are maintained:

(a) when all the children (if any) of $u$ are enqueued, the state $u$ is fully computed and Eq. (1) is satisfied (post-condition);
(b) when a state $u$ is enqueued, the list $L'(u)$ satisfies Eq. (2) below (pre-condition):

$$j_i \in L'(u) \Leftrightarrow u \text{ matches } d_{j_i}[0 \ldots |u| - 1],\ d_{j_i} \in D. \tag{2}$$

I.e. $j_i \in L'(u)$ iff $u$ matches a prefix of $d_{j_i}$ (note that $|u| = depth(u)$, if the paths are not compressed). Thus the algorithm initializes

$$L'(q) = \{0, \ldots, n - 1\} \tag{3}$$

and enqueues $q$. At each iteration, one state $u$ is dequeued, its “children” are initialized according to the pre-condition and enqueued, and the post-condition for $u$ is computed. Given the list $L'(u)$, for every $c \in \Sigma$ we define the list of the corresponding child $v$ as

$$L'(v) = \{ j_i \mid j_i \in L'(u) \text{ and } c \in d_{j_i}[|u|] \}. \tag{4}$$

If $|L'(v)| > 0$, then a transition $\delta(u, c) = v$ is added, and $v$ is enqueued. Note that $j_i$ is put into $|d_{j_i}[|u|]|$ lists. The list $L(u)$ is then computed as

$$L(u) = \{ j_i \mid j_i \in L'(u) \text{ and } |u| = |d_{j_i}| \}. \tag{5}$$

That is, we keep only the strings that end in $u$, and $u$ becomes an accepting state iff $|L(u)| > 0$. All the $\sigma$ lists $L'(v)$ and the list $L(u)$ can be computed with a single pass over the list $L'(u)$. Alg. 1 gives the pseudocode. This is repeated until the queue becomes empty.

Alg. 1 Partition($D$, $L'$, $depth$).
    L ← ∅
    for c ← 0 to σ − 1 do P[c] ← ∅
    for i ← 0 to |L′| − 1 do
        k ← L′[i]
        if |d_k| ≤ depth then
            L ← L ∪ {k}
        else
            for ∀c ∈ d_k[depth] do P[c] ← P[c] ∪ {k}
    return (L, P)

Note that this computes exactly the same trie as one would get by first generating $D'$ and then inserting the strings one at a time. However, our bulk-insertion method is more easily improved. We define the following relation between the states $u$ and $v$:

$$u \equiv_p v : L'(u) = L'(v) \text{ and } |u| = |v|, \tag{6}$$

which is clearly reflexive, symmetric, and transitive, i.e. an equivalence relation. The following is easy to notice:

$$u \equiv_p v \Rightarrow \mathcal{L}(u) = \mathcal{L}(v), \tag{7}$$

where the language of $u$ is

$$\mathcal{L}(u) = \{ w \in \Sigma^* \mid \delta^*(u, w) \in F \}. \tag{8}$$

Hence we will partition the states into equivalence classes, so that in the final DFA all states belong to a different class. Note that this does not result in a minimal DFA; i.e. we have that $\mathcal{L}(u) = \mathcal{L}(v) \not\Rightarrow u \equiv_p v$, while the implication would be required for a true minimal automaton. Note that by the definition we can still properly associate states with the lists $L'$ and $L$. So we can call the result a pseudo-minimal DFA as in [11, 5], although our definition should not be confused with the definition given in these papers.

We need to maintain sets of pairs $(L', u)$, where $L'$ is a key that is used to insert and search the state $u$, a representative of its equivalence class.
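
The single pass of Alg. 1 over $L'(u)$ can be written in Python roughly as below. This is a sketch under assumptions, not the author's C code: dictionary strings are lists of Python sets, $P$ is a dict holding only the non-empty lists rather than an array of $\sigma$ lists, and the names `partition`, `ending` and `children` are chosen here for illustration.

```python
from typing import Dict, List, Sequence, Set, Tuple


def partition(D: Sequence[Sequence[Set[int]]],
              l_prime: Sequence[int],
              depth: int) -> Tuple[List[int], Dict[int, List[int]]]:
    """One pass over L'(u): return (L(u), P), where P[c] is L'(v) of Eq. (4)."""
    ending: List[int] = []               # L(u): strings that end at this depth
    children: Dict[int, List[int]] = {}  # P[c]; only non-empty lists are stored
    for k in l_prime:
        if len(D[k]) <= depth:
            ending.append(k)
        else:
            # String k is copied into one list per symbol of its subset.
            for c in D[k][depth]:
                children.setdefault(c, []).append(k)
    return ending, children


# Toy dictionary over the alphabet {0, 1, 2, 3} (hypothetical data).
D = [[{0}, {1, 2}, {3}], [{0, 1}, {2}, {3}], [{2}, {2}]]
```

Because each `children[c]` is filled in the order in which `l_prime` is scanned, two children that receive the same set of identifiers also receive identical lists, so the lists can be compared directly when testing the equivalence of Eq. (6).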
The algorithm is now immediate: whenever we have computed a list $L'(v)$, we search whether it is present in a set $S(depth(v))$; if so, $v$ can be replaced by the corresponding node $u$. In this case $v$ is not enqueued, as an equivalent state $u$ is in the queue already. If $L'(v)$ is not present, we insert $(L'(v), v)$ to $S(depth(v))$ and enqueue $v$. Alg. 2 gives the complete pseudocode, keeping the automaton in its pseudo-minimal state throughout the construction. (All children of $u$ have the same depth, so indexing $S$ by $depth(u)$, as Alg. 2 does, is equivalent to indexing it by $depth(v)$.)

Alg. 2 BuildDFA($D$).
     1  q ← NewState()
     2  L′(q) ← {0, ..., |D| − 1}
     3  Enqueue(q)
     4  while not QueueEmpty() do
     5      u ← Dequeue()
     6      (L(u), P) ← Partition(D, L′(u), depth(u))
     7      if |L(u)| > 0 then F ← F ∪ {u}
     8      for c ← 0 to σ − 1 do
     9          if |P[c]| = 0 then continue
    10          v ← Search(S[depth(u)], P[c])
    11          if v = null then
    12              v ← NewState()
    13              L′(v) ← P[c]
    14              Insert(S[depth(u)], (L′(v), v))
    15              Enqueue(v)
    16          δ(u, c) ← v
    17  return q

2.3 Using subsets for unary paths

For a moment consider a plain trie with path compression. In this case the trie has $\Theta(|D'|)$ nodes (states), independent of the pattern lengths (without path compression, this is multiplied by $O(m)$). While this may save space in many cases, this is not always so. Consider e.g. the unrealistically pathological case where $D$ contains only one string of length $m$, namely $\Sigma^m$. This means that all $\sigma^m$ possible strings are present in $D'$, and no path compression can take place, as there simply are no unary paths (the minimal and pseudo-minimal DFAs would both have only $m + 1$ states).

We propose a slightly different, but much more effective, path compression. Consider now a string in $D$, and in particular that the string positions can be any subsets of $\Sigma$ (not necessarily just single symbols). Assume that $d_i[depth(u)] = d_j[depth(u)]$ for some $u$ and $\forall i, j \in L'(u)$. This means that there is no need to branch, since all the subsets are the same, and no symbol in $\Sigma$ can differentiate between any $d_i, d_j$. Hence we could add a transition from $u$ to (some) $v$ using the subset $d_i[depth(u)]$ as a label. This does not pose any problems, as (when used in recognition) we can still test in $O(1)$ time if $p[depth(u)] \in d_i[depth(u)]$. (Note that our pseudo-minimization algorithm effectively already handles this, i.e. under the above condition, $\delta(u, c) = v$ for $\forall c \in d_i[depth(u)]$.) More generally, given a node $u$, if

$$\forall i, j \in L'(u) : d_i[k] = d_j[k] \text{ for } depth(u) \le k < h, \tag{9}$$

then $d_i[depth(u) \ldots h - 1]$ can be used as a string label in a compressed unary path.

The easiest way to utilize this is to use it only for unary paths to the leaves, when $|L'(u)| = 1$. This is effectively achieved simply by replacing line 15 in Alg. 2 by “if $|L'(v)| > 1$ then Enqueue($v$)”. It would be relatively easy to use the path compression in any unary path, but as shown in Sec. 3 this simple method can already give huge savings in both time and space.

2.4 Analysis

Let us now consider the running time of Alg. 2, with (our) path compression on leaves. We assume that the subsets $d_i[j]$ have average size $\Delta$, and that they are randomly, uniformly and independently generated. At first we assume that there is a non-zero probability that two random subsets do not intersect (e.g. $\Delta \le \sigma/2$).

The partition of $L'(u)$ can be implemented to take $O(|L'(u)|\Delta)$ time. Each of the $\sigma$ resulting new sets has average size $O(|L'(u)|\Delta/\sigma)$, as for a random $c \in \Sigma$ the probability that $c \in d_i[j]$ is $\Delta/\sigma$. These sets are searched from $S$, and possibly inserted (if not found). The size of $S$ is upper bounded by $O(|Q|)$, the number of states in the resulting automaton. Hence insert/search can be implemented in $O(\log(|Q|) + |L'(u)|\Delta/\sigma)$ worst case time with a number of radix-tree techniques, see e.g. [12, 1]. Therefore the total time per node is $O(\sigma(\log(|Q|) + |L'(u)|\Delta/\sigma) + |L'(u)|\Delta)$, i.e. $O(\sigma \log(|Q|) + |L'(u)|\Delta)$, which is $O(\log(|Q|) + |L'(u)|)$, assuming $\sigma = O(1)$.

For a moment assume that we are building a plain trie, without path compression. Recall that by definition the length of the list $L'(root)$ is exactly $n$. As described above, the length of each of the $\sigma$ lists for the children of node $u$ is $O(|L'(u)|\Delta/\sigma)$. (In the “worst case” there is only one “new” set, being exactly the same as its parent, $L'(u)$; but in this case the corresponding node would not branch, so the complexity would only improve.) Hence the lengths of the lists $L'(u)$ decrease exponentially when the depth of $u$ (i.e. $|u|$) increases, as $|L'(u)| = O((\Delta/\sigma)^{|u|} n)$, and $|L'(u)| = O(1)$ when $\alpha = |u| \ge \log_{\sigma/\Delta}(n)$. The total number of states up to this depth is $|Q| = \sum_{i=0}^{\alpha} \sigma^i = O(\sigma^{\alpha}) = O(n^{\log_{\sigma/\Delta}(\sigma)})$, that is, all the states have all the $\sigma$ possible branches up to depth $\alpha$. As there are $\sigma^i$ nodes at depth $i$, the total length of all the lists at depth $i$ is on average $O((\Delta/\sigma)^i n \sigma^i) = O(\Delta^i n)$. Thus the total length of all the lists up to depth $\alpha$ is $\ell = n \sum_{i=0}^{\alpha} \Delta^i = O(n\Delta^{\alpha}) = O(n^{\log_{\sigma/\Delta}(\Delta) + 1}) = O(n^{\log_{\sigma/\Delta}(\sigma)})$.

Assume now (pessimistically) that path compression and pseudo-minimization take place only after depth $\alpha$. After this depth, the lists have length $k = O(1)$ (and will continue to shrink until $k = 1$). There are only $\binom{n}{k} = O(n^k/k!)$ different lists of length $k$, but at the same time there are $O(n^{\log_{\sigma/\Delta}(\sigma)})$ states (with associated lists), so by the pigeonhole principle many of the states must be equivalent, and are combined into a single state. However, due to path compression, the process terminates for any state having $k = 1$. Hence the number of states per level starts to decrease exponentially after depth $\alpha$. (Note that without combining the equivalent states or the path compression, after depth $\alpha$ the number of states would continue to increase exponentially, resulting in a full trie.) That is, the total number of states is bounded by two geometric series, both having the largest term at depth $\alpha$, where the automaton is at its “widest”, i.e. the total number of states is asymptotically upper bounded by $O(n^{\log_{\sigma/\Delta}(\sigma)})$. Summing up, the total time is on average

$$O(|Q| \log(|Q|) + \ell) = O(n^{\log_{\sigma/\Delta}(\sigma)} \log(n)), \tag{10}$$

again assuming $\sigma = O(1)$.

So far we have assumed that there is a non-zero probability that two random subsets do not intersect. Consider now the (rather uninteresting) case where the subset sizes are always $\Delta > \sigma/2$ (not just on average). At first, the process goes as before, the number of states increasing exponentially and the list lengths $|L'(u)|$ decreasing exponentially. However, assume now, for simplicity, that $L'(u) = \{i, j\}$ for some state $u$. Due to $\Delta > \sigma/2$, the subsets $d_i[h]$ and $d_j[h]$ must intersect (where $h = |u|$). Thus the alphabet $\Sigma$ is effectively partitioned into four disjoint sets: $A = d_i[h] \setminus d_j[h]$; $B = d_j[h] \setminus d_i[h]$; $C = d_i[h] \cap d_j[h]$; $D = \Sigma \setminus (d_j[h] \cup d_i[h])$. Group $D$ does not generate any branches for $u$. Symbols from $A$, $B$ and $C$ generate branches, but these are combined (group-wise) by the minimization, resulting in at most one new state per group, call it $v$. For $A$ (similarly for $B$), $|L'(v)| = 1$, and due to path compression, $v$ will have no descendants. The interesting case is $C$.
Note that $C$ cannot be empty, so $L'(v) = L'(u)$, and hence the process repeats for $v$. In other words, the process does not terminate until $h = |d_i|$. The situation is similar when $|L'(u)| > 2$. Note that after depth $\alpha$ we have $|L'(u)| = O(\Delta)$ in any case, and because of the pseudo-minimization, there can be only $\binom{n}{\Delta} = O(n^{\Delta}/\Delta!)$ different states with lists of length $O(\Delta)$. Thus in general the “breadth” of the automaton will stay approximately the same after depth $\alpha$, and the total time is upper bounded by $O((n^{\log_{\sigma/\Delta}(\sigma)} + \min\{n^{\log_{\sigma/\Delta}(\sigma)}, n^{\Delta}\}\, m) \log(n))$, where $m$ is the length of the strings in $D$.

Finally, as the number of subsets of $n$ items is at most $2^n$, the trivial upper bound for the worst case size of our data structure is $O(m 2^n)$. This should be contrasted with the $O(\sigma^m)$ bound of [10].

3 Experiments and final remarks

We have implemented the algorithms in C, and ran the experiments on a 3.0GHz Intel Core2 with 2GB RAM, 4MB L2 cache, running GNU/Linux 2.6.23. The implemented algorithms are: pseudo-minimal DFA (PM DFA), as in Alg. 2; minimal DFA (M DFA); PM DFA with path compression on leaves (PM DFA PC), as detailed in Sec. 2.3; plain trie; and trie with path compression on leaves (Trie PC), as in PM DFA PC. Some results for the tries are not included, as they could not fit into the available RAM. M DFA was computed from PM DFA, as computing it from $D'$ or the corresponding trie would have been totally intractable in most cases. We implemented the set $S$ in Alg. 2 with Patricia tries [12].

We have not implemented the methods in [10], but we show that the lower bound ($|D'|$) for the size of their data structure can be several orders of magnitude larger than our empirical sizes. In fact, we can build reasonably small data structures for problem instances that are completely intractable with their methods.
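
For readers who wish to experiment with small instances, the breadth-first construction of Alg. 2 can be sketched in Python as below. This is not the C implementation used in the experiments. A hash map keyed by (depth, $L'$) stands in for the Patricia tries, the path compression of Sec. 2.3 is omitted, and the names `State`, `build_dfa`, `query` and `count_language` are assumptions of the sketch. The helper `count_language` counts the accepted strings, i.e. $|D'|$, by counting paths to accepting states without enumerating them.

```python
from collections import deque
from typing import Dict, List, Sequence, Set, Tuple


class State:
    def __init__(self, depth: int, l_prime: Tuple[int, ...]):
        self.depth = depth
        self.l_prime = l_prime             # L'(u), the key of Eq. (6)
        self.ids: List[int] = []           # L(u); non-empty iff u is accepting
        self.delta: Dict[int, "State"] = {}


def build_dfa(D: Sequence[Sequence[Set[int]]]) -> Tuple[State, int]:
    """Return the initial state and the number of states created."""
    root = State(0, tuple(range(len(D))))
    seen: Dict[Tuple[int, Tuple[int, ...]], State] = {}  # plays the role of S
    queue = deque([root])
    n_states = 1
    while queue:
        u = queue.popleft()
        parts: Dict[int, List[int]] = {}
        for k in u.l_prime:                 # the single pass of Alg. 1
            if len(D[k]) <= u.depth:
                u.ids.append(k)
            else:
                for c in D[k][u.depth]:
                    parts.setdefault(c, []).append(k)
        for c, lst in parts.items():
            key = (u.depth + 1, tuple(lst))
            v = seen.get(key)
            if v is None:                   # no equivalent state exists yet
                v = seen[key] = State(u.depth + 1, key[1])
                queue.append(v)
                n_states += 1
            u.delta[c] = v
    return root, n_states


def query(root: State, p: Sequence[int]) -> List[int]:
    """Problem 2: identifiers of the dictionary strings matched by p."""
    u = root
    for c in p:
        u = u.delta.get(c)
        if u is None:
            return []
    return u.ids


def count_language(root: State) -> int:
    """Number of accepted strings, computed per state with memoization."""
    memo: Dict[int, int] = {}

    def count(u: State) -> int:
        if id(u) not in memo:
            memo[id(u)] = (1 if u.ids else 0) + sum(
                count(v) for v in u.delta.values())
        return memo[id(u)]

    return count(root)
```

As a small check of the claim in Sec. 2.3, building the automaton for the single dictionary string $\Sigma^m$ with $\sigma = 2$ and $m = 20$ creates $m + 1 = 21$ states in this sketch, while `count_language` reports $2^{20}$ accepted strings.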
Table 1 gives the results for some randomly generated instances. We used parameters $(m, n, \sigma, (\Delta_l, \Delta_h), f)$, where $m$ is the length of the strings (all $n$ of equal length); $(\Delta_l, \Delta_h), f$ denotes that with probability $f$ any string location contains a randomly selected subset of $\Sigma$, where the size of the subset is randomly selected between $\Delta_l \ldots \Delta_h$; otherwise (with probability $1 - f$) the string location is a single random symbol from $\Sigma$. We report the number of states generated by the different methods, as well as the time in seconds, for some illustrative cases. As shown, the number of states generated is significantly smaller than $|D'|$ in all cases, sometimes the difference being many orders of magnitude. PM DFA is usually only slightly larger than the true minimal DFA, while PM DFA with path compression is usually smaller than M DFA. In some rare cases using path compression with a plain trie is very competitive.

Table 1: Experimental results for data generated using parameters $(m, |D|, \sigma, (\Delta_l, \Delta_h), f)$. Times are given in seconds. $|Q|$ is the number of generated states, and $|D'|$ is the number of different strings matching a string in $D$.

    Method        time (s)    |Q|

    (32, 10000, 2, (2,2), 0.2), |D'| = 3,418,449
    PM DFA        0.828       476,365
    M DFA         1.20        385,255
    PM DFA PC     0.820       379,948
    Trie          6.71        18,767,894
    Trie PC       0.326       948,493

    (32, 10000, 4, (2,4), 0.3), |D'| = 40,755,624,312
    PM DFA        1.42        680,906
    M DFA         2.39        635,795
    PM DFA PC     1.40        499,212
    Trie PC       1.19        4,203,673

    (16, 10000, 20, (2,6), 0.75), |D'| = 1,830,872,526,457
    PM DFA        6.50        1,335,251
    M DFA         18.2        1,320,126
    PM DFA PC     6.21        1,276,985
    Trie PC       6.29        22,431,630

    (16, 100000, 32, (2,32), 0.01), |D'| = 1,033,039
    PM DFA        6.28        1,331,241
    M DFA         12.0        964,847
    PM DFA PC     0.486       149,998
    Trie          15.5        6,981,214
    Trie PC       0.236       235,565

    (16, 1000, 32, (32,32), 0.25), |D'| ≈ 1,190 × 10^15
    PM DFA        0.954       118,474
    M DFA         7.29        115,797
    PM DFA PC     0.907       110,340

Fig. 1 shows the exponential increase (depth $\lesssim \alpha$) and decrease (depth $\gtrsim \alpha$) of the number of states as a function of the depth in the automaton / trie, and illustrates the behaviour when all subset sizes are $> \sigma/2$.

[Figure 1: three plots of the number of states (vertical axis) against depth (horizontal axis) for PM DFA, PM DFA PC, Trie and Trie PC. Panels: (32, 10000, 4, (2,4), 0.3); (16, 10000, 20, (2,6), 0.75); (32, 20, 8, (7,7), 1.0).]

Figure 1: The total number of states generated for each depth of the automaton / trie during the construction. Above: $\alpha = \log_{\sigma/\Delta}(n) \approx 10.04$; middle: $\alpha \approx 5.07$. Below: subset sizes always $> \sigma/2$.

Finally, we note that our methods have applications in on-line dictionary string matching, e.g. in $\delta$-matching and $(\delta, \gamma)$-matching. It turns out that we can solve both problems in $O(|T| \log_{\sigma/\delta}(nm)/m)$ average time, which is optimal for $\delta$-matching [8], for a dictionary of $n$ patterns of length $m$, and a text of length $|T|$. We leave the details for future work.

References

[1] G. H. Badr and B. J. Oommen. Self-adjusting of ternary search tries using conditional rotations and randomized heuristics. Comput. J., 48(2):200–219, 2005.
[2] E. Cambouropoulos, M. Crochemore, C. S. Iliopoulos, L. Mouchard, and Y. J. Pinzon. Algorithms for computing approximate repetitions in musical sequences. Int. J. Comput. Math., 79(11):1135–1148, 2002.
[3] R. Cole, L.-A. Gottlieb, and M. Lewenstein. Dictionary matching and indexing with errors and don't cares. In Proceedings of STOC'04, pages 91–100, New York, NY, USA, 2004. ACM.
[4] J. Daciuk. Comparison of construction algorithms for minimal, acyclic, deterministic, finite-state automata from sets of strings. In Proceedings of CIAA'02, LNCS 2608, pages 255–261. Springer, 2002.
[5] J. Daciuk, D. Maurel, and A. Savary. Incremental and semi-incremental construction of pseudo-minimal automata. In Proceedings of CIAA'05, LNCS 3845, pages 341–342. Springer, 2005.
[6] J. Daciuk, S. Mihov, B. W. Watson, and R. Watson. Incremental construction of minimal acyclic finite state automata. Computational Linguistics, 26(1):3–16, 2000.
[7] E. Fredkin. Trie memory. Communications of the ACM, 3(9):490–499, 1960.
[8] K. Fredriksson, V. Mäkinen, and G. Navarro. Flexible music retrieval in sublinear time. International Journal of Foundations of Computer Science (IJFCS), 17(6):1345–1364, 2006.
[9] D. Gusfield. Haplotype inference by pure parsimony. In Proceedings of CPM'03, LNCS 2676, pages 144–155. Springer, 2003.
[10] G. Landau, D. Tsur, and O. Weimann. Indexing a dictionary for subset matching queries. In Proceedings of SPIRE'07, volume 4726 of LNCS, pages 195–204. Springer, 2007.
[11] D. Maurel. Pseudo-minimal transducer. Theoretical Computer Science, 231(1):129–139, 2000.
[12] D. R. Morrison. Patricia: practical algorithm to retrieve information coded in alphanumeric. J. ACM, 15(4):514–534, 1968.
