3-Way Composition of Weighted Finite-State Transducers
Composition of weighted transducers is a fundamental algorithm used in many applications, including for computing complex edit-distances between automata, or string kernels in machine learning, or to combine different components of a speech recogniti…
Authors: Cyril Allauzen, Mehryar Mohri
Cyril Allauzen (1, ⋆) and Mehryar Mohri (1, 2)
1 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012.
2 Google Research, 76 Ninth Avenue, New York, NY 10011.

Abstract. Composition of weighted transducers is a fundamental algorithm used in many applications, including for computing complex edit-distances between automata, or string kernels in machine learning, or to combine different components of a speech recognition, speech synthesis, or information extraction system. We present a generalization of the composition of weighted transducers, 3-way composition, which is dramatically faster in practice than the standard composition algorithm when combining more than two transducers. The worst-case complexity of our algorithm for composing three transducers $T_1$, $T_2$, and $T_3$ resulting in $T$, is $O(|T|_Q \min(d(T_1) d(T_3), d(T_2)) + |T|_E)$, where $|\cdot|_Q$ denotes the number of states, $|\cdot|_E$ the number of transitions, and $d(\cdot)$ the maximum out-degree. As in regular composition, the use of perfect hashing requires a pre-processing step with linear-time expected complexity in the size of the input transducers. In many cases, this approach significantly improves on the complexity of standard composition. Our algorithm also leads to a dramatically faster composition in practice. Furthermore, standard composition can be obtained as a special case of our algorithm. We report the results of several experiments demonstrating this improvement. These theoretical and empirical improvements significantly enhance performance in the applications already mentioned.

1 Introduction

Weighted finite-state transducers are widely used in text, speech, and image processing applications and other related areas such as information extraction [8, 10, 12, 11, 4].
They are finite automata in which each transition is augmented with an output label and some weight, in addition to the familiar (input) label [14, 5, 7]. The weights may represent probabilities, log-likelihoods, or they may be some other costs used to rank alternatives. They are, more generally, elements of a semiring [7].

Weighted transducers are used to represent models derived from large data sets using various statistical learning techniques such as pronunciation dictionaries, statistical grammars, string kernels, or complex edit-distance models [11, 6, 2, 3]. These models can be combined to create complex systems such as a speech recognition or information extraction system using a fundamental transducer algorithm, composition of weighted transducers [12, 11]. Weighted composition is a generalization of the composition algorithm for unweighted finite-state transducers, which consists of matching the output label of the transitions of one transducer with the input label of the transitions of another transducer. The weighted case is however more complex and requires the introduction of an $\epsilon$-filter to avoid the creation of redundant $\epsilon$-paths and preserve the correct path multiplicity [12, 11]. The result is a new weighted transducer representing the relational composition of the two transducers.

(⋆ This author's current address is: Google Research, 76 Ninth Avenue, New York, NY 10011.)

Composition is widely used in computational biology, text and speech, and machine learning applications. In many of these applications, the transducers used are quite large; they may have as many as several hundred million states or transitions. A critical problem is thus to devise efficient algorithms for combining them.
This paper presents a generalization of the composition of weighted transducers, 3-way composition, that is dramatically faster than the standard composition algorithm when combining more than two transducers. The complexity of composing three transducers $T_1$, $T_2$, and $T_3$ with the standard composition algorithm is $O(|T_1||T_2||T_3|)$ [12, 11]. Using perfect hashing, the worst-case complexity of computing $T = (T_1 \circ T_2) \circ T_3$ using standard composition is

    $O(|T|_Q \min(d(T_3), d(T_1 \circ T_2)) + |T|_E + |T_1 \circ T_2|_Q \min(d(T_1), d(T_2)) + |T_1 \circ T_2|_E)$,    (1)

which may be prohibitive in some cases even when the resulting transducer $T$ is not large but the intermediate transducer $T_1 \circ T_2$ is. Instead, the worst-case complexity of our algorithm is

    $O(|T|_Q \min(d(T_1) d(T_3), d(T_2)) + |T|_E)$.    (2)

In both cases, the use of perfect hashing requires a pre-processing step with linear-time expected complexity in the size of the input transducers.

Our algorithm also leads to a dramatically faster computation of the result of composition in practice. We report the results of several experiments demonstrating this improvement. These theoretical and empirical improvements significantly enhance performance in a series of applications: string kernel-based algorithms in machine learning, the computation of complex edit-distances between automata, speech recognition and speech synthesis, and information extraction. Furthermore, as we shall see later, standard composition can be obtained as a special case of 3-way composition.

The main technical difficulty in the design of our algorithm is the definition of a filter to deal with a path multiplicity problem that arises in the presence of the empty string $\epsilon$ in the composition of three transducers. This problem, which we shall describe in detail, leads to a word combinatorial problem [13].
We will present two solutions for this problem: one requiring two $\epsilon$-filters and a generalization of the $\epsilon$-filters used for standard composition [12, 11]; and another direct and symmetric solution where a single filter is needed. Remarkably, this 3-way filter can be encoded as a finite automaton and painlessly integrated in our 3-way composition.

The remainder of the paper is structured as follows. Some preliminary definitions and terminology are introduced in the next section (Section 2). Section 3 describes our 3-way algorithm in the $\epsilon$-free case. The word combinatorial problem of $\epsilon$-path multiplicity and our solutions are presented in detail in Section 4. Section 5 reports the results of experiments using the 3-way algorithm and compares them with the standard composition.

2 Preliminaries

This section gives the standard definition and specifies the notation used for weighted transducers.

Finite-state transducers are finite automata in which each transition is augmented with an output label in addition to the familiar input label [1, 5]. Output labels are concatenated along a path to form an output sequence and similarly with input labels. Weighted transducers are finite-state transducers in which each transition carries some weight in addition to the input and output labels [14, 7]. The weights are elements of a semiring, that is a ring that may lack negation [7]. Some familiar semirings are the tropical semiring $(\mathbb{R}_+ \cup \{\infty\}, \min, +, \infty, 0)$ related to classical shortest-paths algorithms, and the probability semiring $(\mathbb{R}, +, \cdot, 0, 1)$. A semiring is idempotent if for all $a \in \mathbb{K}$, $a \oplus a = a$. It is commutative when $\otimes$ is commutative. We will assume in this paper that the semiring used is commutative, which is a necessary condition for composition to be an efficient algorithm [10]. The following gives a formal definition of weighted transducers.

Definition 1.
A weighted finite-state transducer $T$ over $(\mathbb{K}, \oplus, \cdot, 0, 1)$ is an 8-tuple $T = (\Sigma, \Delta, Q, I, F, E, \lambda, \rho)$ where $\Sigma$ is the finite input alphabet of the transducer, $\Delta$ is the finite output alphabet, $Q$ is a finite set of states, $I \subseteq Q$ the set of initial states, $F \subseteq Q$ the set of final states, $E \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times (\Delta \cup \{\epsilon\}) \times \mathbb{K} \times Q$ a finite set of transitions, $\lambda : I \to \mathbb{K}$ the initial weight function, and $\rho : F \to \mathbb{K}$ the final weight function mapping $F$ to $\mathbb{K}$.

The weight of a path $\pi$ is obtained by multiplying the weights of its constituent transitions using the multiplication rule of the semiring and is denoted by $w[\pi]$. The weight of a pair of input and output strings $(x, y)$ is obtained by $\oplus$-summing the weights of the paths labeled with $(x, y)$ from an initial state to a final state. For a path $\pi$, we denote by $p[\pi]$ its origin state and by $n[\pi]$ its destination state. We also denote by $P(I, x, y, F)$ the set of paths from the initial states $I$ to the final states $F$ labeled with input string $x$ and output string $y$. A transducer $T$ is regulated if the output weight associated by $T$ to any pair of strings $(x, y)$:

    $T(x, y) = \bigoplus_{\pi \in P(I, x, y, F)} \lambda(p[\pi]) \cdot w[\pi] \cdot \rho[n[\pi]]$    (3)

is well-defined and in $\mathbb{K}$. $T(x, y) = 0$ when $P(I, x, y, F) = \emptyset$. If for all $q \in Q$, $\bigoplus_{\pi \in P(q, \epsilon, \epsilon, q)} w[\pi] \in \mathbb{K}$, then $T$ is regulated. In particular, when $T$ does not admit any $\epsilon$-cycle, it is regulated. The weighted transducers we will be considering in this paper will be regulated. Figure 1(a) shows an example.

The composition of two weighted transducers $T_1$ and $T_2$ with matching input and output alphabets $\Sigma$ is a weighted transducer denoted by $T_1 \circ T_2$ when the sum:

    $(T_1 \circ T_2)(x, y) = \bigoplus_{z \in \Sigma^*} T_1(x, z) \otimes T_2(z, y)$    (4)

is well-defined and in $\mathbb{K}$ for all $x, y \in \Sigma^*$ [14, 7].

Weighted automata can be defined as weighted transducers $A$ with identical input and output labels, for any transition.
Thus, only pairs of the form $(x, x)$ can have a non-zero weight by $A$, which is why the weight associated by $A$ to $(x, x)$ is abusively denoted by $A(x)$ and identified with the weight associated by $A$ to $x$. Similarly, in the graph representation of weighted automata, the output (or input) label is omitted.

Fig. 1. (a) Example of a weighted transducer $T$. (b) Example of a weighted automaton $A$. $[[T]](aab, bba) = [[A]](aab) = .1 \times .2 \times .6 \times .8 + .2 \times .4 \times .5 \times .8$. A bold circle indicates an initial state and a double-circle a final state. The final weight $\rho[q]$ of a final state $q$ is indicated after the slash symbol representing $q$.

3 Epsilon-Free Composition

3.1 Standard Composition

Let us start with a brief description of the standard composition algorithm for weighted transducers [12, 11]. States in the composition $T_1 \circ T_2$ of two weighted transducers $T_1$ and $T_2$ are identified with pairs of a state of $T_1$ and a state of $T_2$. Leaving aside transitions with $\epsilon$ inputs or outputs, the following rule specifies how to compute a transition of $T_1 \circ T_2$ from appropriate transitions of $T_1$ and $T_2$:

    $(q_1, a, b, w_1, q_2)$ and $(q'_1, b, c, w_2, q'_2) \Longrightarrow ((q_1, q'_1), a, c, w_1 \otimes w_2, (q_2, q'_2))$.    (5)

Figure 2 illustrates the algorithm. In the worst case, all transitions of $T_1$ leaving a state $q_1$ match all those of $T_2$ leaving state $q'_1$, thus the space and time complexity of composition is quadratic: $O(|T_1||T_2|)$. However, using perfect hashing on the input transducer with the highest out-degree leads to a worst-case complexity of $O(|T_1 \circ T_2|_Q \min(d(T_1), d(T_2)) + |T_1 \circ T_2|_E)$.
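As an illustration of rule (5), the following Python sketch composes two $\epsilon$-free weighted transducers. The dictionary-based representation (`initial`, `final`, `arcs`), the function name `compose`, and the default `times` (ordinary multiplication, as in the probability semiring) are assumptions made for this example and are not taken from the paper; a Python dict merely stands in for the perfect hashing mentioned above.

```python
from collections import deque


def compose(t1, t2, times=lambda a, b: a * b):
    """Epsilon-free composition of two weighted transducers, following rule (5).

    Hypothetical representation: each transducer is a dict with
      'initial': {state: weight}, 'final': {state: weight},
      'arcs': {state: [(ilabel, olabel, weight, next_state), ...]}.
    """
    # Index the arcs of t2 by (state, input label) so that matching is a lookup.
    index = {}
    for q, arcs in t2['arcs'].items():
        for (i, o, w, n) in arcs:
            index.setdefault((q, i), []).append((o, w, n))

    initial = {(p, q): times(wp, wq)
               for p, wp in t1['initial'].items()
               for q, wq in t2['initial'].items()}
    final, arcs = {}, {}
    queue, seen = deque(initial), set(initial)
    while queue:
        state = queue.popleft()
        p, q = state
        if p in t1['final'] and q in t2['final']:
            final[state] = times(t1['final'][p], t2['final'][q])
        for (a, b, w1, p2) in t1['arcs'].get(p, []):
            # Output label b of the t1 arc must equal the input label in t2.
            for (c, w2, q2) in index.get((q, b), []):
                nxt = (p2, q2)
                arcs.setdefault(state, []).append((a, c, times(w1, w2), nxt))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return {'initial': initial, 'final': final, 'arcs': arcs}
```

Passing `times=lambda a, b: a + b` would give the multiplication of the tropical semiring instead; only reachable state pairs are ever created.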
The pre-processing step required for hashing the transitions of the transducer with the highest out-degree has an expected complexity in $O(|T_1|_E)$ if $d(T_1) > d(T_2)$ and $O(|T_2|_E)$ otherwise.

The main problem with the standard composition algorithm is the following. Assume that one wishes to compute $T_1 \circ T_2 \circ T_3$, say for example by proceeding left to right. Thus, first $T_1$ and $T_2$ are composed to compute $T_1 \circ T_2$ and then the result is composed with $T_3$. The worst-case complexity of that computation is:

    $O(|T_1 \circ T_2 \circ T_3|_Q \min(d(T_1 \circ T_2), d(T_3)) + |T_1 \circ T_2 \circ T_3|_E + |T_1 \circ T_2|_Q \min(d(T_1), d(T_2)) + |T_1 \circ T_2|_E)$.    (6)

Fig. 2. Example of transducer composition. (a) Weighted transducer $T_1$ and (b) weighted transducer $T_2$ over the probability semiring $(\mathbb{R}, +, \cdot, 0, 1)$. (c) Result of the composition of $T_1$ and $T_2$.

But, in many cases, computing $T_1 \circ T_2$ creates a very large number of transitions that may never match any transition of $T_3$. For example, $T_2$ may represent a complex edit-distance transducer, allowing all possible insertions, deletions, substitutions and perhaps other operations such as transpositions or more complex edits in $T_1$, all with different costs. Even when $T_1$ is a simple non-deterministic finite automaton with $\epsilon$-transitions, which is often the case in the applications already mentioned, $T_1 \circ T_2$ will then have a very large number of paths, most of which will not match those of the non-deterministic automaton $T_3$.
In other applications in speech recognition, or for the computation of kernels in machine learning, the central transducer $T_2$ could be far more complex and the set of transitions or paths of $T_1 \circ T_2$ not matching those of $T_3$ could be even larger.

3.2 3-Way Composition

The key idea behind our algorithm is precisely to avoid creating these unnecessary transitions by directly constructing $T_1 \circ T_2 \circ T_3$, which we refer to as a 3-way composition. Thus, our algorithm does not include the intermediate step of creating $T_1 \circ T_2$ or $T_2 \circ T_3$. To do so, we can proceed following a lateral or sideways strategy: for each transition $e_1$ in $T_1$ and $e_3$ in $T_3$, we search for matching transitions in $T_2$.

The pseudocode of the algorithm in the $\epsilon$-free case is given below. The algorithm computes $T$, the result of the composition $T_1 \circ T_2 \circ T_3$. It uses a queue $S$ containing the set of triples of states yet to be examined. The queue discipline of $S$ can be arbitrarily chosen and does not affect the termination of the algorithm. Using a FIFO or LIFO discipline, the queue operations can be performed in constant time. We can pre-process the transducer $T_2$ in expected linear time $O(|T_2|_E)$ by using perfect hashing so that the transitions $G$ (line 13) can be found in worst-case linear time $O(|G|)$. Thus, the worst-case running time complexity of the 3-way composition algorithm is in $O(|T|_Q d(T_1) d(T_3) + |T|_E)$, where $T$ is the transducer returned by the algorithm.

Alternatively, depending on the size of the three transducers, it may be advantageous to direct the 3-way composition from the center, i.e., ask for each transition $e_2$ in $T_2$ if there are matching transitions $e_1$ in $T_1$ and $e_3$ in $T_3$. We refer to this as the central strategy for our 3-way composition algorithm.
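The lateral strategy can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the dictionary representation of a transducer, the name `compose3`, and the default `times` (ordinary multiplication) are hypothetical, and a Python dict keyed by (state, input label, output label) plays the role of the perfect hash used to retrieve the set $G$.

```python
from collections import deque


def compose3(t1, t2, t3, times=lambda a, b: a * b):
    """Epsilon-free 3-way composition with the lateral strategy.

    Hypothetical representation: each transducer is a dict with
      'initial': {state: weight}, 'final': {state: weight},
      'arcs': {state: [(ilabel, olabel, weight, next_state), ...]}.
    """
    # Index T2 arcs by (state, input, output): a lookup returns the set G.
    index = {}
    for q, arcs in t2['arcs'].items():
        for (i, o, w, n) in arcs:
            index.setdefault((q, i, o), []).append((w, n))

    initial = {(q1, q2, q3): times(times(w1, w2), w3)
               for q1, w1 in t1['initial'].items()
               for q2, w2 in t2['initial'].items()
               for q3, w3 in t3['initial'].items()}
    final, arcs = {}, {}
    queue, seen = deque(initial), set(initial)
    while queue:
        state = queue.popleft()
        q1, q2, q3 = state
        if q1 in t1['final'] and q2 in t2['final'] and q3 in t3['final']:
            final[state] = times(times(t1['final'][q1], t2['final'][q2]),
                                 t3['final'][q3])
        # Lateral strategy: enumerate pairs (e1, e3), then look up e2 in T2.
        for (a, b, w1, n1) in t1['arcs'].get(q1, []):
            for (c, d, w3, n3) in t3['arcs'].get(q3, []):
                for (w2, n2) in index.get((q2, b, c), []):
                    nxt = (n1, n2, n3)
                    weight = times(times(w1, w2), w3)
                    arcs.setdefault(state, []).append((a, d, weight, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return {'initial': initial, 'final': final, 'arcs': arcs}
```

No intermediate transducer is built: a transition of $T_2$ is only touched when some pair $(e_1, e_3)$ asks for its labels.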
With the central strategy, pre-processing the transducers $T_1$ and $T_3$ and creating hash tables for the transitions leaving each state (the expected complexity of this pre-processing being $O(|T_1|_E + |T_3|_E)$) leads to a worst-case running time complexity of $O(|T|_Q d(T_2) + |T|_E)$. The lateral and central strategies can be combined by using, at a state $(q_1, q_2, q_3)$, the lateral strategy if $|E[q_1]| \cdot |E[q_3]| \le |E[q_2]|$ and the central strategy otherwise. The algorithm leads to a natural lazy or on-demand implementation in which the transitions of the resulting transducer $T$ are generated only as needed by other operations on $T$. The standard composition coincides with the 3-way algorithm when using the central strategy with either $T_1$ or $T_2$ equal to the identity transducer.

    3-WAY-COMPOSITION(T1, T2, T3)
     1  Q ← I1 × I2 × I3
     2  S ← I1 × I2 × I3
     3  while S ≠ ∅ do
     4      (q1, q2, q3) ← HEAD(S)
     5      DEQUEUE(S)
     6      if (q1, q2, q3) ∈ I1 × I2 × I3 then
     7          I ← I ∪ {(q1, q2, q3)}
     8          λ(q1, q2, q3) ← λ1(q1) ⊗ λ2(q2) ⊗ λ3(q3)
     9      if (q1, q2, q3) ∈ F1 × F2 × F3 then
    10          F ← F ∪ {(q1, q2, q3)}
    11          ρ(q1, q2, q3) ← ρ1(q1) ⊗ ρ2(q2) ⊗ ρ3(q3)
    12      for each (e1, e3) ∈ E[q1] × E[q3] do
    13          G ← {e ∈ E[q2] : i[e] = o[e1] ∧ o[e] = i[e3]}
    14          for each e2 ∈ G do
    15              if (n[e1], n[e2], n[e3]) ∉ Q then
    16                  Q ← Q ∪ {(n[e1], n[e2], n[e3])}
    17                  ENQUEUE(S, (n[e1], n[e2], n[e3]))
    18              E ← E ∪ {((q1, q2, q3), i[e1], o[e3], w[e1] ⊗ w[e2] ⊗ w[e3], (n[e1], n[e2], n[e3]))}
    19  return T

4 Epsilon Filtering

The algorithm described thus far cannot be readily used in most cases found in practice. In general, a transducer $T_1$ may have transitions with output label $\epsilon$ and $T_2$ transitions with input $\epsilon$.
A straightforward generalization of the $\epsilon$-free case would generate redundant $\epsilon$-paths and, in the case of non-idempotent semirings, would lead to an incorrect result, even just for composing two transducers. The weight of two matching $\epsilon$-paths of the original transducers would be counted as many times as the number of redundant $\epsilon$-paths generated in the result, instead of once. Thus, a crucial component of our algorithm consists of coping with this problem.

Figure 3(a) illustrates the problem just mentioned in the simpler case of two transducers. To match $\epsilon$-paths leaving $q_1$ and those leaving $q_2$, a generalization of the $\epsilon$-free composition can make the following moves: (1) first move forward on a transition of $q_1$ with output $\epsilon$, or even a path with output $\epsilon$, and stay at the same state $q_2$ in $T_2$, with the hope of later finding a transition whose output label is some label $a \neq \epsilon$ matching a transition of $q_2$ with the same input label; (2) proceed similarly by following a transition or path leaving $q_2$ with input label $\epsilon$ while staying at the same state $q_1$ in $T_1$; or, (3) match a transition of $q_1$ with output label $\epsilon$ with a transition of $q_2$ with input label $\epsilon$.

Fig. 3. (a) Redundant $\epsilon$-paths. A straightforward generalization of the $\epsilon$-free case could generate all the paths from $(0, 0)$ to $(2, 2)$ for example, even when composing just two simple transducers. (b) Filter transducer $M$ allowing a unique $\epsilon$-path.
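To make the overcounting concrete, the short Python sketch below counts the $\epsilon$-paths between two states of the grid of Figure 3(a) when all three moves are allowed without any filter. The function name and the encoding of the three moves as grid steps (advance in $T_1$ only, advance in $T_2$ only, or advance in both) are assumptions of this illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_eps_paths(r, s):
    """Number of unfiltered epsilon-paths from grid state (0, 0) to (r, s).

    Allowed steps: (1, 0) move in T1 only, (0, 1) move in T2 only,
    (1, 1) diagonal move matching an epsilon of T1 with one of T2.
    """
    if r == 0 and s == 0:
        return 1
    total = 0
    if r > 0:
        total += count_eps_paths(r - 1, s)      # last step moved in T1 only
    if s > 0:
        total += count_eps_paths(r, s - 1)      # last step moved in T2 only
    if r > 0 and s > 0:
        total += count_eps_paths(r - 1, s - 1)  # last step was diagonal
    return total


print(count_eps_paths(2, 2))  # prints 13
```

Without a filter, the pair of matching $\epsilon$-paths leading from $(0, 0)$ to $(2, 2)$ would thus contribute its weight 13 times in a non-idempotent semiring such as the probability semiring.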
Let us rename existing output $\epsilon$-labels of $T_1$ as $\epsilon_2$, and existing input $\epsilon$-labels of $T_2$ as $\epsilon_1$, and let us augment $T_1$ with a self-loop labeled with $\epsilon_1$ at all states and similarly, augment $T_2$ with a self-loop labeled with $\epsilon_2$ at all states, as illustrated by Figures 5(a) and (c). These self-loops correspond to staying at the same state in that machine while consuming an $\epsilon$-label of the other transducer. The three moves just described now correspond to the matches (1) $(\epsilon_2 : \epsilon_2)$, (2) $(\epsilon_1 : \epsilon_1)$, and (3) $(\epsilon_2 : \epsilon_1)$. The grid of Figure 3(a) shows all the possible $\epsilon$-paths between composition states. We will denote by $\tilde{T}_1$ and $\tilde{T}_2$ the transducers obtained after application of these changes.

For the result of composition to be correct, between any two of these states, all but one path must be disallowed. There are many possible ways of selecting that path. One natural way is to select the shortest path with the diagonal transitions ($\epsilon$-matching transitions) taken first. Figure 3(a) illustrates in boldface the path just described from state $(0, 0)$ to state $(1, 2)$. Remarkably, this filtering mechanism itself can be encoded as a finite-state transducer such as the transducer $M$ of Figure 3(b). We write $(p, q) \preceq (r, s)$ to indicate that $(r, s)$ can be reached from $(p, q)$ in the grid.

Proposition 1. Let $M$ be the transducer of Figure 3(b). $M$ allows a unique path between any two states $(p, q)$ and $(r, s)$, with $(p, q) \preceq (r, s)$.

Proof. Let $a$ denote $(\epsilon_1 : \epsilon_1)$, $b$ denote $(\epsilon_2 : \epsilon_2)$, $c$ denote $(\epsilon_2 : \epsilon_1)$, and let $x$ stand for any $(x : x)$, with $x \in \Sigma$. The following sequences must be disallowed by a shortest-path filter with matching transitions first: $ab$, $ba$, $ac$, $bc$. This is because, from any state, instead of the moves $ab$ or $ba$, the matching or diagonal transition $c$ can be taken.
Similarly, instead of $ac$ or $bc$, $ca$ and $cb$ can be taken for an earlier match. Conversely, it is clear from the grid or an immediate recursion that a filter disallowing these sequences accepts a unique path between two connected states of the grid.

Let $L$ be the set of sequences over $\sigma = \{a, b, c, x\}$ that contain one of the disallowed sequences just mentioned as a substring, that is $L = \sigma^* (ab + ba + ac + bc) \sigma^*$. Then the complement $\overline{L}$ represents exactly the set of paths allowed by that filter and is thus a regular language. Let $A$ be an automaton representing $L$ (Figure 4(a)). An automaton representing $\overline{L}$ can be constructed from $A$ by determinization and complementation (Figures 4(a)-(c)). The resulting automaton $C$ is equivalent to the transducer $M$ after removal of the state 3, which does not admit a path to a final state. □

Fig. 4. (a) Finite automaton $A$ representing the set of disallowed sequences. (b) Automaton $B$, result of the determinization of $A$. Subsets are indicated at each state. (c) Automaton $C$ obtained from $B$ by complementation; state 3 is not coaccessible.

Thus, to compose two transducers $T_1$ and $T_2$ with $\epsilon$-transitions, it suffices to compute $\tilde{T}_1 \circ M \circ \tilde{T}_2$, using the rules of composition in the $\epsilon$-free case.

The problem of avoiding the creation of redundant $\epsilon$-paths is more complex in 3-way composition since the $\epsilon$-transitions of all three transducers must be taken into account. We describe two solutions for this problem, one based on two filters, another based on a single filter.

4.1 2-way $\epsilon$-Filters.

One way to deal with this problem is to use the 2-way filter $M$, by first dealing with matching $\epsilon$-paths in $U = (T_1 \circ T_2)$, and then $U \circ T_3$.
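The 2-way filter $M$ referred to here is small enough to be written down as a transition table. The Python sketch below does so using the symbols $a$, $b$, $c$, $x$ of the proof of Proposition 1, and checks by brute force that exactly one $\epsilon$-move sequence is accepted between $(0, 0)$ and each reachable grid state. The table is read off the forbidden sequences $ab$, $ba$, $ac$, $bc$; the names `M`, `STEP`, `accepts`, and `allowed_paths`, as well as the mapping of moves to grid steps, are assumptions of this illustration.

```python
from itertools import product

# Filter M over the move symbols of the proof:
# a = (eps1:eps1), b = (eps2:eps2), c = (eps2:eps1), x = (x:x).
# A missing entry means the move is disallowed in that filter state.
M = {
    0: {'a': 1, 'b': 2, 'c': 0, 'x': 0},
    1: {'a': 1, 'x': 0},   # after a: neither b nor c may follow
    2: {'b': 2, 'x': 0},   # after b: neither a nor c may follow
}

# Grid displacement (states advanced in T1, states advanced in T2) per move.
STEP = {'a': (0, 1), 'b': (1, 0), 'c': (1, 1)}


def accepts(seq, table=M, start=0):
    """Return True if the filter can follow every move of seq."""
    state = start
    for sym in seq:
        if sym not in table[state]:
            return False
        state = table[state][sym]
    return True


def allowed_paths(r, s):
    """All epsilon-move sequences from (0, 0) to (r, s) accepted by M."""
    found = []
    for length in range(r + s + 1):
        for seq in product('abc', repeat=length):
            dr = sum(STEP[m][0] for m in seq)
            ds = sum(STEP[m][1] for m in seq)
            if (dr, ds) == (r, s) and accepts(seq):
                found.append(''.join(seq))
    return found


print(allowed_paths(1, 2))  # prints ['ca']
```

The single surviving sequence takes the diagonal move first, which is the boldface path of Figure 3(a) from $(0, 0)$ to $(1, 2)$.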
In 3-way composition, however, it is possible to remain at the same state of $T_1$ and the same state of $T_2$, and move on an $\epsilon$-transition of $T_3$, which previously was not an option. This corresponds to staying at the same state of $U$, while moving on a transition of $T_3$ with input $\epsilon$. To account for this move, we introduce a new symbol $\epsilon_0$ matching $\epsilon_1$ in $T_3$. But, we must also ensure the existence of a self-loop with output label $\epsilon_0$ at all states of $U$. To do so, we augment the filter $M$ with self-loops $(\epsilon_1 : \epsilon_0)$ and the transducer $T_2$ with self-loops $(\epsilon_0 : \epsilon_1)$ (see Figure 5(b)). Figure 5(d) shows the resulting filter transducer $M_1$. From Figures 5(a)-(c), it is clear that $\tilde{T}_1 \circ M_1 \circ \tilde{T}_2$ will have precisely a self-loop labeled with $(\epsilon_1 : \epsilon_1)$ at all states.

In the same way, we must allow for moving forward on a transition of $T_1$ with output $\epsilon$, that is consuming $\epsilon_2$, while remaining at the same states of $T_2$ and $T_3$. To do so, we introduce again a new symbol $\epsilon_0$, this time only relevant for matching $T_2$ with $T_3$, add self-loops $(\epsilon_2 : \epsilon_0)$ to $T_2$, and augment the filter $M$ by adding a transition labeled with $(\epsilon_0 : \epsilon_2)$ (resp. $(\epsilon_0 : \epsilon_1)$) wherever there used to be one labeled with $(\epsilon_2 : \epsilon_2)$ (resp. $(\epsilon_2 : \epsilon_1)$). Figure 5(e) shows the resulting filter transducer $M_2$. Thus, the composition $\tilde{T}_1 \circ M_1 \circ \tilde{T}_2 \circ M_2 \circ \tilde{T}_3$ ensures the uniqueness of matching $\epsilon$-paths.

Fig. 5. Marking of transducers and 2-way filters. (a) $\tilde{T}_1$. Self-loop labeled with $\epsilon_1$ added at all states of $T_1$, regular output $\epsilon$s renamed to $\epsilon_2$. (b) $\tilde{T}_2$. Self-loops with labels $(\epsilon_0 : \epsilon_1)$ and $(\epsilon_2 : \epsilon_0)$ added at all states of $T_2$. Input $\epsilon$s are replaced by $\epsilon_1$, output $\epsilon$s by $\epsilon_2$. (c) $\tilde{T}_3$. Self-loop labeled with $\epsilon_2$ added at all states of $T_3$, regular input $\epsilon$s renamed to $\epsilon_1$. (d) Left-to-right filter $M_1$. (e) Left-to-right filter $M_2$.

In practice, the modifications of the transducers $T_1$, $T_2$, and $T_3$ to generate $\tilde{T}_1$, $\tilde{T}_2$, and $\tilde{T}_3$, as well as the filters $M_1$ and $M_2$, can be directly simulated or encoded in the 3-way composition algorithm for greater efficiency. The states in $T$ become quintuples $(q_1, q_2, q_3, f_1, f_2)$ where $f_1$ and $f_2$ are states of the filters $M_1$ and $M_2$. The introduction of self-loops and marking of $\epsilon$s can be simulated (lines 12-13) and the filter states $f_1$ and $f_2$ taken into account to compute the set $G$ of the transition matches allowed (line 13).

Note that while 3-way composition is symmetric, the analysis of $\epsilon$-paths just presented is left-to-right and the filters $M_1$ and $M_2$ are not symmetric. In fact, we could similarly define right-to-left filters $M'_1$ and $M'_2$. The advantage of the filters presented in this section is however that they make it easy to turn an existing implementation of composition into 3-way composition. The filters needed for the 3-way case are also straightforward generalizations of the $\epsilon$-filter used in standard composition.

4.2 3-way $\epsilon$-Filter.

There exists however a direct and symmetric method for dealing with $\epsilon$-paths in 3-way composition. Remarkably, this can be done using a single filter automaton whose labels are 3-dimensional vectors. Figure 6 shows a filter $W$ that can be used for that purpose. Each transition is labeled with a triplet, the $i$th element of the triplet corresponding to the move on the $i$th transducer.
0 indicates staying at the same state or not moving, 1 that a move is made reading an $\epsilon$-transition, and $x$ a move along a matching transition with a non-empty symbol (i.e., non-$\epsilon$ output in $T_1$, non-$\epsilon$ input or output in $T_2$, and non-$\epsilon$ input in $T_3$).

Matching $\epsilon$-paths now correspond to a three-dimensional grid, which leads to a more complex word combinatorics problem. As in the two-dimensional case, $(p, q, r) \preceq (s, t, u)$ indicates that $(s, t, u)$ can be reached from $(p, q, r)$ in the grid. Several filters are possible; here we will again favor the matching of $\epsilon$-transitions (i.e., the diagonals on the grid).

Proposition 2. The filter automaton $W$ allows a unique path between any two states $(p, q, r)$ and $(s, t, u)$ of a three-dimensional grid, with $(p, q, r) \preceq (s, t, u)$.

Fig. 6. 3-way matching $\epsilon$-filter $W$.

Proof. Let $M$ and $X$ be the sets defined by $M = \{(m_1, m_2, m_3) : m_1, m_2, m_3 \in \{0, 1\}\}$ and $X = \{(x, x, m), (m, x, x) : m \in \{0, 1\}\}$. A sequence of moves corresponding to a matching $\epsilon$-path is thus an element of $(M \cup X)^*$. Two sequences $\pi_1$ and $\pi_2$ are equivalent if they consume the same sequence of transitions on each of the three transducers; for example, $(0, x, x)(1, 1, 0)$ is equivalent to $(1, x, x)(0, 1, 0)$. For each set of equivalent move sequences between two states $(p, q, r)$ and $(s, t, u)$, we must preserve a unique sequence representative of that set. We now define the unique corresponding representative $\bar{\pi}$ of each sequence $\pi \in (M \cup X)^*$.
In all cases, $\bar{\pi}$ will be the sequence where the 1-moves and the $x$-moves are taken as early as possible.

1. Assume that $\pi \in M^*$ and let $n_i$ be the number of occurrences of 1 as the $i$th element in the triplets defining $\pi$. By symmetry, we can assume without loss of generality that $n_1 \le n_2 \le n_3$. We define $\bar{\pi}$ as $(1, 1, 1)^{n_1} (0, 1, 1)^{n_2 - n_1} (0, 0, 1)^{n_3 - n_2}$, that is the sequence where the 1-moves are taken as early as possible.

2. Otherwise, $\pi$ can be decomposed as $\pi = \mu_1 \chi_1 \mu_2 \chi_2 \cdots \mu_k \chi_k \mu_{k+1}$ with $k \ge 1$, $\mu_i \in M^*$ and $\chi_i \in X$. $\bar{\pi}$ is then defined by induction on $k$. By symmetry, we can assume that $\chi_1 = (x, x, m)$ with $m \in \{0, 1\}$. Let $\pi'$ be such that $\pi = \mu_1 \chi_1 \pi'$, let $n_i$ be the number of times 1 appears as $i$th element in a triplet of $\mu_1$, and let $n'_3$ be the number of times 1 is found as third element in a triplet reading $\chi_1 \pi'$ from left to right before seeing an $x$.

(a) If $n_3 \le \max(n_1, n_2)$, let $n = \min(n'_3, \max(n_1, n_2) - n_3)$. We can then obtain $\chi'_1 \pi''$ by replacing the $n$ first 1's that appear in $\chi_1 \pi'$ as third element of a triplet by 0's. Let $\mu'_1 = \mu_1 (0, 0, 1)^n$. We then have that $\pi$ is equivalent to $\mu'_1 \chi'_1 \pi''$. By induction, we can compute $\overline{\pi''}$ and $\overline{\mu'_1}$ and define $\bar{\pi}$ as $\overline{\mu'_1} \chi'_1 \overline{\pi''}$.

(b) If $n_3 > \max(n_1, n_2)$, we define $n$ as $n_3 - \max(n_1, n_2)$ if $\chi_1 = (x, x, 1)$ and $n_3 - \max(n_1, n_2) - 1$ if $\chi_1 = (x, x, 0)$. Let $\mu'_1$ be $(1, 1, 1)^{n_1} (0, 1, 1)^{n_2 - n_1}$ if $n_1 < n_2$ and $(1, 1, 1)^{n_2} (1, 0, 1)^{n_1 - n_2}$ otherwise. We can then define $\bar{\pi}$ as $\mu'_1 (x, x, 1) (0, 0, 1)^n \pi'$.

A key property of $\bar{\pi}$ is that it can be characterized by a small set of forbidden sequences. Indeed, observe that the following rules apply:

1. in two consecutive triplets, for $i \in [1, 3]$, 0 in the $i$th machine of the first triplet cannot be followed by 1 in the second.
Indeed, as in the 2-way case, if we stay at a state, then we must remain at that state until a match with a non-empty symbol is made (this corresponds to cases 1 and 2(a) of the definition of $\bar{\pi}$).

2. two 0s in adjacent transducers ($T_1$ and $T_2$, or $T_2$ and $T_3$) cannot both become $x$s unless all components become $x$s. For example, the sequence $(0, 0, 1)(x, x, 1)$ is disallowed since instead $(x, x, 1)(0, 0, 1)$ with an earlier match can be followed. Similarly, the sequence $(0, 0, 1)(x, x, 0)$ is disallowed since instead the single and shorter move $(x, x, 1)$ can be taken (this corresponds to case 2(b) of the definition).

3. the triplet $(0, 0, 0)$ is always forbidden since it corresponds to remaining at the same state in all three transducers.

Conversely, we observe that with our definition of $\bar{\pi}$, these conditions are also sufficient. Thus, a filter can be obtained by taking the complement of an automaton accepting exactly the sequences of forbidden substrings just described. The resulting deterministic and minimal automaton is the filter $W$ shown in Figure 6. Observe that each state of $W$ has a transition labeled by $(x, x, x)$ going to the initial state 0; this corresponds to resetting the filter at the end of a matching $\epsilon$-path. □

The filter $W$ is used as follows. A triplet state $(q_1, q_2, q_3)$ in 3-way composition is augmented with a state $r$ of the filter automaton $W$, starting with state 0 of $W$. The transitions of the filter $W$ at each state $r$ determine the matches or moves allowed for that state $(q_1, q_2, q_3, r)$ of the composed machine.
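The following Python sketch illustrates the simplest part of this construction, case 1 of the definition of $\bar{\pi}$, where only $\epsilon$-moves occur ($\pi \in M^*$). It builds the representative in which the 1-moves are taken as early as possible and checks by brute force that, among all equivalent sequences, it is the only one satisfying rules 1 and 3. It does not implement the full filter $W$ or the $x$-moves; the function names are assumptions of this illustration.

```python
from itertools import product


def canonical_eps_moves(n1, n2, n3):
    """Representative of case 1: every transducer takes its epsilon-moves
    as early as possible, so component i is 1 for the first n_i triplets."""
    counts = (n1, n2, n3)
    return [tuple(1 if t < n else 0 for n in counts)
            for t in range(max(counts))]


def satisfies_rules(seq):
    """Rules 1 and 3, restricted to sequences of triplets over {0, 1}."""
    if any(m == (0, 0, 0) for m in seq):           # rule 3
        return False
    for prev, cur in zip(seq, seq[1:]):            # rule 1
        if any(p == 0 and c == 1 for p, c in zip(prev, cur)):
            return False
    return True


def representatives(n1, n2, n3):
    """All rule-abiding sequences consuming n_i epsilons on transducer i."""
    target = (n1, n2, n3)
    moves = [m for m in product((0, 1), repeat=3) if m != (0, 0, 0)]
    found = []
    for length in range(sum(target) + 1):
        for seq in product(moves, repeat=length):
            consumed = tuple(sum(m[i] for m in seq) for i in range(3))
            if consumed == target and satisfies_rules(seq):
                found.append(list(seq))
    return found


print(canonical_eps_moves(1, 2, 3))  # prints [(1, 1, 1), (0, 1, 1), (0, 0, 1)]
```

For $n_1 \le n_2 \le n_3$ the output has the form $(1, 1, 1)^{n_1} (0, 1, 1)^{n_2 - n_1} (0, 0, 1)^{n_3 - n_2}$ given in the proof, and `representatives` returns a single sequence for each triple of counts tried in the accompanying checks.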
5 Experiments

This section reports the results of experiments carried out in two different applications: the computation of a complex edit-distance between two automata, as motivated by applications in text and speech processing [9], and the computation of kernels between automata needed in spoken-dialog classification and other machine learning tasks.

Table 1. Comparison of 3-way composition with standard composition. The computation times are reported in seconds, the size of $T_2$ in number of transitions. These experiments were performed on a dual-core AMD Opteron 2.2GHz with 16GB of memory, using the same software library and basic infrastructure.

              n-gram kernel                               Edit distance
              ≤2     ≤3     ≤4     ≤5     ≤6     ≤7       standard   +transpositions
Standard      65.3   68.3   71.0   73.5   76.3   78.3     586.1      913.5
3-way         8.0    8.1    8.2    8.2    8.2    8.2      3.8        5.9
Size of T_2   70K    100K   130K   160K   190K   220K     25M        75M

In the edit-distance case, the standard transducer $T_2$ used was one based on all insertions, deletions, and substitutions with different costs [9]. A more realistic transducer $T_2$ was one augmented with all transpositions, e.g., $ab \to ba$, with different costs. In the kernel case, n-gram kernels with varying n-gram order were used [3]. Table 1 shows the results of these experiments. The finite automata $T_1$ and $T_3$ used were extracted from real text and speech processing tasks. The results show that 3-way composition is substantially faster than standard composition in all cases: roughly 8 to 10 times faster for the n-gram kernels and about 150 times faster for the edit-distance transducers.

6 Conclusion

We presented a general algorithm for the composition of weighted finite-state transducers. In many instances, 3-way composition benefits from a significantly better time and space complexity.
Our experiments with both complex edit-distance computations arising in a number of applications in text and speech processing, and with kernel computations, crucial to many machine learning algorithms applied to sequence prediction, show that our algorithm is also substantially faster than standard composition in practice. We expect 3-way composition to further improve efficiency in a variety of other areas and applications in which weighted composition of transducers is used.

Acknowledgments. The research of Cyril Allauzen and Mehryar Mohri was partially supported by the New York State Office of Science Technology and Academic Research (NYSTAR). This project was also sponsored in part by the Department of the Army Award Number W81XWH-04-1-0307. The U.S. Army Medical Research Acquisition Activity, 820 Chandler Street, Fort Detrick MD 21702-5014 is the awarding and administering acquisition office. The content of this material does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.

References

1. Jean Berstel. Transductions and Context-Free Languages. Teubner, 1979.
2. Stanley Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, 1998.
3. Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Rational Kernels: Theory and Algorithms. Journal of Machine Learning Research, 5:1035–1062, 2004.
4. Karel Culik II and Jarkko Kari. Digital Images and Formal Languages. In Grzegorz Rozenberg and Arto Salomaa, editors, Handbook of Formal Languages, volume 3, pages 599–616. Springer, 1997.
5. Samuel Eilenberg. Automata, Languages and Machines. Academic Press, 1974–76.
6. Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recogniser. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400–401, 1987.
7. Werner Kuich and Arto Salomaa. Semirings, Automata, Languages. Number 5 in EATCS Monographs on Theoretical Computer Science. Springer-Verlag, 1986.
8. Mehryar Mohri. Finite-State Transducers in Language and Speech Processing. Computational Linguistics, 23(2), 1997.
9. Mehryar Mohri. Edit-Distance of Weighted Automata: General Definitions and Algorithms. International Journal of Foundations of Computer Science, 14(6):957–982, 2003.
10. Mehryar Mohri. Statistical Natural Language Processing. In M. Lothaire, editor, Applied Combinatorics on Words. Cambridge University Press, 2005.
11. Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Weighted Automata in Text and Speech Processing. In Proceedings of the 12th biennial European Conference on Artificial Intelligence (ECAI-96). John Wiley and Sons, 1996.
12. Fernando Pereira and Michael Riley. Finite State Language Processing, chapter Speech Recognition by Composition of Weighted Finite Automata. The MIT Press, 1997.
13. Dominique Perrin. Words. In M. Lothaire, editor, Combinatorics on Words, Cambridge Mathematical Library. Cambridge University Press, 1997.
14. Arto Salomaa and Matti Soittola. Automata-Theoretic Aspects of Formal Power Series. Springer-Verlag, 1978.