Toric grammars: a new statistical approach to natural language modeling
Authors: Olivier Catoni, Thomas Mainguy
Abstract: We propose a new statistical model for computational linguistics. Rather than trying to estimate directly the probability distribution of a random sentence of the language, we define a Markov chain on finite sets of sentences with many finite recurrent communicating classes and define our language model as the invariant probability measures of the chain on each recurrent communicating class. This Markov chain, which we call a communication model, randomly recombines at each step the set of sentences forming its current state, using some grammar rules. When the grammar rules are fixed and known in advance instead of being estimated on the fly, we can prove supplementary mathematical properties. In particular, we can prove in this case that all states are recurrent states, so that the chain defines a partition of its state space into finite recurrent communicating classes. We show that our approach is a decisive departure from Markov models at the sentence level and discuss its relationships with Context Free Grammars. Although the toric grammars we use are closely related to Context Free Grammars, the way we generate the language from the grammar is qualitatively different. Our communication model has two purposes. On the one hand, it is used to define indirectly the probability distribution of a random sentence of the language. On the other hand, it can serve as a (crude) model of language transmission from one speaker to another speaker through the communication of a (large) set of sentences.

AMS 2000 subject classifications: Primary 62M09, 62P99, 68T50; secondary 91F20, 03B65, 91E40, 60J20.

Keywords and phrases: Probabilistic grammars, Context Free Grammars, Language model, Computational linguistics, Statistical learning, Finite state Markov chains.

1. Introduction to a new communication model

In the well known kernel approach to density estimation on a measurable space $\mathcal{X}$, the probability distribution $P$ of a random variable $X \in \mathcal{X}$ is estimated from a sample $(X_1, \dots, X_n)$ of $n$ independent copies of $X$ as $\frac{1}{n} \sum_{i=1}^{n} k(X_i, \mathrm{d}x)$, where $k$ is a suitable Markov kernel. This kernel estimate can be seen as a modification of the empirical measure $\overline{P} = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}$.

In the context of natural language modeling at the sentence level, $\mathcal{X}$ is the set of sentences, that is the set of sequences of words of finite length. Finding sensible kernel estimates or sensible parametric models in this context is a challenge. Therefore, we propose here another route, which we will describe as an alternative way of producing a modification of the empirical measure. The idea is to repeatedly recombine a set of sentences. Let us describe for this a general framework, concerned with an arbitrary countable state space $\mathcal{X}$.

Let
$$\mathcal{P}_n = \biggl\{ \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i},\ x_i \in \mathcal{X} \biggr\}$$
be the set of empirical measures of all possible samples of size $n$. Let us consider a parametric family $\{ q_\theta,\ \theta \in \Theta \}$ of Markov kernels on $\mathcal{P}_n$. Let us assume for simplicity that for any $P \in \mathcal{P}_n$, the reachable set
$$\biggl\{ Q \in \mathcal{P}_n \,:\, \sum_{t \in \mathbb{N}} q_\theta^t(P, Q) > 0 \biggr\}$$
is finite, where $q_\theta^t$ is $q_\theta$ composed $t$ times with itself, so that for instance $q_\theta^2(P, Q) = \sum_{P' \in \mathcal{P}_n} q_\theta(P, P')\, q_\theta(P', Q)$.
In this case we can define the Markov kernel
$$\widehat{q}_\theta(P, Q) = \lim_{k \to \infty} \frac{1}{k} \sum_{t=1}^{k} q_\theta^t(P, Q).$$
It is such that for any $P \in \mathcal{P}_n$, $\widehat{q}_\theta(P, \cdot)$ is an invariant measure of $q_\theta$. More generally, $q_\theta\, \widehat{q}_\theta = \widehat{q}_\theta\, q_\theta = \widehat{q}_\theta$. The distribution $\widehat{q}_\theta(P, \cdot) \in \mathcal{M}_+^1(\mathcal{P}_n)$ induces a marginal distribution $\widehat{Q}_{\theta, P}$ on $\mathcal{X}$ through the formula
$$\widehat{Q}_{\theta, P} = \sum_{Q \in \mathcal{P}_n} \widehat{q}_\theta(P, Q)\, Q. \qquad (1.1)$$

In this paper, we will be concerned with estimators of the form $\widehat{P} = \widehat{Q}_{\theta, \overline{P}}$, if $\theta$ is fixed in advance, or of the form $\widehat{Q}_{\widehat{\theta}, \overline{P}}$, if $\widehat{\theta}$ is an estimator of the parameter $\theta$ depending also on $\overline{P}$.

Another interpretation of our framework is to consider $q_\theta$ as a communication model. One speaker hears a set of sentences described by its empirical distribution $P \in \mathcal{P}_n$ (which means that he will not make use of the special order in which he has heard them). He uses those sentences to learn the corresponding language. Then he teaches another speaker what he has learnt by outputting another random set of sentences, distributed according to $q_\theta(P, \cdot)$. The language model (as opposed to the communication model $q_\theta$) is $\widehat{Q}_{\theta, P}$, the average sentence distribution along an infinite chain of communicating speakers.

If we start from a recurrent state $P$, and we assume that $\theta$ is known, we obtain a communication model where the target sentence distribution $\widehat{Q}_{\theta, P}$ can be learnt without error from the set of sentences output by any involved speaker. Indeed $\widehat{Q}_{\theta, P} = \widehat{Q}_{\theta, Q}$ for any $Q$ in the communicating class of $P$, which in this situation is also the reachable set from $P$. This error free estimation behaviour is desirable for a communication model. It tells us that the language can be transmitted from speaker to speaker without distortion, a desirable feature in the case of a large number of speakers.

The model may also account for weak stimulus learning, the fact that human beings learn language from a limited number of examples compared with the variety of new sentences they are able to formulate. Indeed, whereas the size of the support of $P \in \mathcal{P}_n$, the number of sentences heard by one speaker, is constant and equal to $n$, the support of the language model $\widehat{Q}_{\theta, P}$ may be much larger. We will actually give a toy example where the number of sentences in the language is exponential in $n$.

In the language transmission interpretation, we may evaluate the interest of the model by studying whether it can model a large family of sentence distributions. This richness will depend on the number of recurrent communicating classes of the communication Markov model $q_\theta$, since any invariant distribution $\widehat{q}_\theta(P, \cdot)$ is a convex combination of the unique invariant measures supported by each recurrent communicating class. The situation is even simpler in the case when all $P \in \mathcal{P}_n$ are recurrent states (a fact we will be able to prove in our particular model). In this case $\widehat{q}_\theta(P, \cdot)$ is the unique invariant measure supported by the recurrent communicating class to which $P$ belongs.

In this paper we will focus on the construction and mathematical properties of the communication model $q_\theta$.
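To make the construction concrete, here is a minimal numerical sketch of the Cesàro-averaged kernel $\widehat{q}_\theta$ and of the induced marginal of eq. (1.1). It is our own illustration, not part of the formal development: the four states are arbitrary stand-ins for elements of $\mathcal{P}_n$, and the kernel is a random stochastic matrix.

    import numpy as np

    # Toy sketch of the Cesaro-averaged kernel q_hat and of the marginal of
    # eq. (1.1).  The 4 states stand for elements of P_n and the kernel is an
    # arbitrary stochastic matrix: both are stand-ins, not the paper's model.

    rng = np.random.default_rng(0)
    m = 4
    q = rng.random((m, m))
    q /= q.sum(axis=1, keepdims=True)    # row-stochastic Markov kernel q_theta

    # q_hat(P, .) = lim_k (1/k) sum_{t=1..k} q^t(P, .), truncated at finite k
    k = 2000
    q_t = np.eye(m)
    q_hat = np.zeros((m, m))
    for _ in range(k):
        q_t = q_t @ q
        q_hat += q_t / k

    # invariance q_hat q = q_hat holds up to the O(1/k) truncation error
    assert np.allclose(q_hat @ q, q_hat, atol=1e-2)

    # each state P is an empirical measure on sentences; here each row of
    # `measures` gives the weights a state puts on a 3-sentence toy vocabulary
    measures = rng.random((m, 3))
    measures /= measures.sum(axis=1, keepdims=True)

    # marginal distribution Q_hat_{theta,P} = sum_Q q_hat(P, Q) Q   (eq. 1.1)
    print(q_hat[0] @ measures)

The truncation at a finite horizon $k$ replaces the limit in the definition; on a finite reachable set the average converges, which is what the invariance check verifies up to a small tolerance.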
We will also touch on the estimation problem stated in the opening of this introduction by providing some estimator $\widehat{Q}_{\widehat{\theta}, \overline{P}}$, where $\widehat{\theta}(\overline{P})$ is an estimator of the parameter computed on the observed sample. However, we will leave the mathematical properties of this estimator for further studies. We will be content with providing some promising preliminary experiments computed on a small sample and will share with the reader some qualitative explanations of its behaviour.

The parameter $\theta$ of our model will be a new kind of grammar, closely related to Context Free Grammars, but used to generate sentences in a different way.

2. Toric grammars

Now that we have explained our general framework based on a communication Markov kernel $q_\theta$ defined on empirical distributions, let us come to natural language modeling more specifically, and describe a dedicated family of kernels.

Natural language processing in linguistics has been using more and more elaborate mathematical tools (a brief presentation of some of them is given by E. Stabler in [14]). The $n$-gram models are widely used, although they fail to grasp the recursive nature of natural languages, and do not use the syntactic properties of sentences. Efforts have been made to improve the performance of these models by introducing syntax (Della Pietra et al. 1994 [9]; Roark 2001 [12]; Tan et al. 2012 [15]). One way to do this is to use Context Free Grammars, also named phrase structure grammars, introduced by N. Chomsky as possible models for the logical structure of natural languages (see for example [4-6]), and their probabilistic variants (Chi 1999 [3]). Our proposal follows this trend, but with the goal to separate ourselves from classic $n$-grams, seeing syntax as equivalence classes between constituents, which we try to discover.

We consider some dictionary of words $D$. Each statistical sample, as explained in the introduction, is made of a set of sentences. Each sentence is a sequence of words of $D$. The sentences may be of variable length. To simplify notations we will use non normalized empirical measures. Thus, the state space of the communication Markov kernel $q_\theta$ will be
$$\mathcal{P}_n = \biggl\{ \sum_{i=1}^{n} \delta_{s_i},\ s_i \in D^+ \biggr\},$$
where we introduce the notation
$$D^+ = \bigcup_{j=1}^{\infty} D^j.$$
We will call $\mathcal{P}_n$ the set of texts of length $n$. Let us notice that for us, texts are unordered sets of sentences. The question of generating meaningful ordered sequences of sentences is also of interest, but will not be addressed in this study.

In order to define the communication kernel, we will describe random transformations on texts, related to the notion of Context Free Grammars. Let us start with an informal presentation. The communication kernel will perform random recombinations of sentences. Our point of view is to see a Context Free Grammar as the result of some fragmentation process applied to a set of sentences. Let us explain this on a simple example. Consider the sentence

This is my friend Peter.

Imagine we would like to represent this sentence as the result of pasting the expression "my friend" in its context, because we think language is built by cutting and pasting expressions drawn from some large set of memorized sentences. We can do this by introducing the simple Context Free Grammar

⟨0⟩ → This is ⟨1⟩ Peter .
⟨1⟩ → my friend

where we have used numbered symbols ⟨0⟩ and ⟨1⟩ (framed boxes in the original typesetting) for non terminal symbols, the start symbol being ⟨0⟩. The two rules mean that we can rewrite the start symbol ⟨0⟩ to obtain the right-hand side of the first rule, and that we can then rewrite the non terminal symbol ⟨1⟩ as the right-hand side of the second rule.

Since we want to see the rules of the grammar as the result of some splitting operation, we are going to use more symmetric notations. Instead of considering that we have described our original sentence with the help of two rules and two non terminal symbols ⟨0⟩ and ⟨1⟩, we may as well consider that we have split our original sentence into two new sentences using three non terminal symbols, namely "⟨0⟩ →", "⟨1⟩" and "⟨1⟩ →". To emphasize this interpretation, we can adopt more symmetric notations and write these three non terminal symbols as $[_0$, $]_1$ and $[_1$. With these new notations, the representation of our original sentence is now

[0 This is ]1 Peter .
[1 my friend

In this new representation, the rewriting rules can be replaced by merge operations of the type
$$a\, ]_i\, c + [_i\, b \mapsto abc.$$
We can make these merge operations even more symmetric if we consider that each expression can be represented by any of its circular permutations. Indeed, each expression contains exactly one non terminal symbol of the form $[_i$, and therefore is uniquely defined by any of its circular permutations (since, due to this feature, we can define the permutation in which the opening bracket $[_i$ comes first as the canonical form, and recover it from any other circular permutation). Using this convention, we can write $a\, ]_i\, c$ as $ca\, ]_i$ and describe the merge operation as
$$ca\, ]_i + [_i\, b \mapsto cab,$$
or, renaming $ca$ simply as $a$,
$$a\, ]_i + [_i\, b \mapsto ab.$$

Let us formalize what we have explained so far. Let $D$ be some dictionary of words (which can be, for the sake of this mathematical description, any finite set representing the words of the natural language to be modeled). Let us form the symbol set
$$\mathcal{S} = D \cup \bigl\{ [_i,\ ]_i,\ i \in \mathbb{N} \bigr\}.$$
Let us define the set of circular permutations of a sequence of symbols as
$$S(w_0, \dots, w_{\ell-1}) = \bigl\{ \bigl( w_{(i+j \bmod \ell)},\ i = 0, \dots, \ell-1 \bigr),\ j = 0, \dots, \ell-1 \bigr\},$$
so that for instance $S(w_0, w_1, w_2) = \{ w_0 w_1 w_2,\ w_1 w_2 w_0,\ w_2 w_0 w_1 \}$, and its support (the set of symbols included in the sequence) as
$$\operatorname{supp}(w_0, \dots, w_{\ell-1}) = \{ w_0, \dots, w_{\ell-1} \}.$$
Let $\mathcal{A}^+ = \bigcup_{n=1}^{\infty} \mathcal{A}^n$ for any set of symbols $\mathcal{A}$, let $]_+ = \bigl\{ ]_i,\ i \in \mathbb{N} \setminus \{0\} \bigr\}$, and consider the set of expressions
$$\mathcal{E} = \Bigl\{ e \in S\bigl( [_i\, a \bigr),\ i \in \mathbb{N},\ a \in \bigl( D \cup {]_+} \bigr)^+ \setminus {]_+} \Bigr\}.$$
In plain words, an expression is a circular permutation of a finite sequence of symbols starting with an opening bracket, containing no other opening bracket, and not reduced to an opening bracket followed by a closing bracket. This definition mirrors the fact that a given rule of a Context Free Grammar has exactly one "⟨i⟩ →" (the left side), and the right side of the rule cannot be just a non terminal symbol "⟨j⟩". Indeed, if we had allowed "⟨i⟩ → ⟨j⟩", or with our notations $[_i\, ]_j$, we could as well have replaced $i$ by $j$ everywhere.

Definition 2.1. The set of toric grammars is the set $\mathcal{G}$ of positive measures $G$ on $\mathcal{E}$ with finite support such that for any circular permutation $e' \in S(e)$ of any expression $e \in \mathcal{E}$, $G(e') = G(e)$.

In other words, a toric grammar $G$ is a positive measure with finite support on the set of expressions $\mathcal{E}$ satisfying $G(e) = |S(e)|^{-1}\, G\bigl(S(e)\bigr)$.
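The following short sketch illustrates circular permutations and the canonical form of an expression. It is our own encoding, not code from the paper: brackets are represented by string tokens such as '[0' and ']1', and words are assumed not to start with a bracket character.

    # Sketch (ours) of the circular permutations S(e) and of the canonical
    # form of an expression, with brackets encoded as string tokens.

    def circular_permutations(e):
        """The set S(e) of all rotations of the symbol tuple e."""
        return {e[j:] + e[:j] for j in range(len(e))}

    def canonical(e):
        """Rotate e so that its unique opening bracket '[i' comes first."""
        opens = [j for j, w in enumerate(e) if w.startswith('[')]
        assert len(opens) == 1, "an expression has exactly one opening bracket"
        j = opens[0]
        return e[j:] + e[:j]

    e = ('This', 'is', ']1', 'Peter', '.', '[0')   # a rotated global expression
    assert canonical(e) == ('[0', 'This', 'is', ']1', 'Peter', '.')
    assert all(canonical(p) == canonical(e) for p in circular_permutations(e))

Since every rotation of an expression canonicalizes to the same tuple, storing grammars through canonical representatives, as we do in the sketches below, factors out the circular symmetry of Definition 2.1.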
Let us remark that, in our definition of toric grammars, on top of choosing some special notations for Context Free Grammars, we also introduced positive weights, so that it is more the support of a toric grammar than the grammar itself that corresponds to the usual notion of Context Free Grammar. The weights will serve to keep track of word frequencies through the process of splitting a set of sentences to obtain a toric grammar.

Our aim is indeed to build a toric grammar from a text. To be consistent with our definition of grammars, we will also define texts as positive measures. Let us give a formal definition. We will forget the sentence order: a text will be an unordered set of sentences with possible repetitions.

Definition 2.2. The set $\mathcal{T}$ of texts is the set of toric grammars with integer weights supported by $S\bigl( [_0\, D^+ \bigr)$, that is the set of toric grammars with integer weights using only one non terminal symbol, the start symbol $[_0$.

In this definition, it should be understood that
$$[_0\, D^+ = \bigl\{ ([_0, w_1, \dots, w_k),\ k \in \mathbb{N} \setminus \{0\},\ w_i \in D,\ 1 \le i \le k \bigr\},$$
and that
$$S\bigl( [_0\, D^+ \bigr) = \bigcup_{e \in [_0 D^+} S(e).$$

3. A roadmap towards a communication model

We will use toric grammars as intermediate steps to define the transition probabilities of our communication model on texts. To this purpose, we will first introduce some general types of transformations on toric grammars (reminding the reader that in our formalism texts are some special subset of toric grammars). It will turn out that two types of expressions, global expressions and local expressions, will play different roles. Let us define them respectively as
$$\mathcal{E}_g = \mathcal{E} \cap S\bigl( [_0\, \mathcal{S}^+ \bigr), \qquad \mathcal{E}_\ell = \mathcal{E} \cap S\bigl( [_+\, \mathcal{S}^+ \bigr),$$
where we remind that $[_+ = \bigl\{ [_i,\ i \in \mathbb{N} \setminus \{0\} \bigr\}$ and $\mathcal{S}^+ = \bigcup_{j=1}^{\infty} \mathcal{S}^j$. Any toric grammar $G \in \mathcal{G}$ can be accordingly decomposed into $G = G_g + G_\ell$, where $G_g(A) = G(A \cap \mathcal{E}_g)$ and $G_\ell(A) = G(A \cap \mathcal{E}_\ell)$, for any subset $A \subset \mathcal{E}$.
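As a small illustration of this decomposition, here is a sketch in the same hypothetical encoding as above (expressions as tuples in canonical form, weights in a Counter):

    from collections import Counter

    # Sketch (ours): split a toric grammar, stored through canonical
    # representatives, into its global part G_g (expressions opening with
    # '[0') and its local part G_l (all the others).

    def decompose(G):
        G_g = Counter({e: w for e, w in G.items() if e[0] == '[0'})
        G_l = Counter({e: w for e, w in G.items() if e[0] != '[0'})
        return G_g, G_l

    G = Counter({('[0', 'This', 'is', ']1', 'Peter', '.'): 1,
                 ('[1', 'my', 'friend'): 1})
    G_g, G_l = decompose(G)
    assert list(G_g) == [('[0', 'This', 'is', ']1', 'Peter', '.')]
    assert list(G_l) == [('[1', 'my', 'friend')]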
The transitions of the communication chain with kernel $q_\theta(T, T')$ will be defined in two steps. The first step consists in learning from the text $T$ a toric grammar $G$. To this purpose we will split the sentences of $T$ into syntactic constituents. The second step consists in merging the constituents again to produce a random new text $T'$. The parameter $\theta = R$ of the communication kernel $q_\theta$ will also be a toric grammar. The role of this reference grammar $R$ will be to provide a stock of local expressions to be used when computing $G$ from $T$. We will discuss later the question of the estimation of $R$ itself. For the time being, we will assume that the reference grammar $R$ is a parameter of the communication chain, known to all involved speakers.

We could have defined a communication kernel $q_{\widehat{R}(T)}(T, T')$, where the reference grammar $\widehat{R}(T)$ itself is estimated at each step from the current text $T$, but we would have obtained a model with weaker properties, where, in particular, all the states are not necessarily recurrent states. On the other hand, the proof that the reachable set from any starting point is finite still holds for this modified model, so that it does provide an alternative way of defining a language model as described in the introduction.

We will still need an estimator $\widehat{R}(T)$ of the reference grammar, in order to provide a language estimator $\widehat{Q}_{\widehat{R}(T), T}$, where we are using the notations of eq. (1.1). The estimation $\widehat{R}(T)$ of the reference grammar will be achieved by running some fragmentation process on the text $T \in \mathcal{T}$.

4. Non stochastic syntax splitting and merging

Let us now describe the model, starting with the description of some non random grammar transformations. We already introduced a model for grammars that includes texts as a special case. We have now to describe how to generate a toric grammar from a text, with, or without, the help of a reference grammar to learn the local component of the grammar. The mechanism producing a grammar from a text will be some sort of random parse algorithm (or rather tentative parse algorithm). All of this will be achieved by two transformations on toric grammars that will respectively split and merge expressions (syntagms) of a toric grammar into smaller or bigger ones. We will first describe the sets of possible splits and merges from a given grammar. This will serve as a basis to define random transitions from one grammar to another in subsequent sections.

Let us first introduce some elementary operations involving toric grammars:
$$e \oplus f = \sum_{s \in S(e)} \delta_s + \sum_{s \in S(f)} \delta_s, \qquad e, f \in \mathcal{E},$$
$$e \ominus f = \sum_{s \in S(e)} \delta_s - \sum_{s \in S(f)} \delta_s, \qquad e, f \in \mathcal{E},$$
$$\rho \otimes e = \rho \sum_{s \in S(e)} \delta_s, \qquad \rho \in \mathbb{R},\ e \in \mathcal{E}.$$
The first operation builds a toric grammar containing expressions $e$ and $f$ with weights 1, and the third one builds a toric grammar containing expression $e$ with weight $\rho$. We can generalize these notations to be able to take the sum of a toric grammar and an expression, as well as the sum of two toric grammars:
$$G \oplus e = G + \sum_{s \in S(e)} \delta_s, \qquad G \in \mathcal{G},\ e \in \mathcal{E},$$
$$G \ominus e = G - \sum_{s \in S(e)} \delta_s, \qquad G \in \mathcal{G},\ e \in \mathcal{E},$$
$$G \oplus G' = G + G', \qquad G, G' \in \mathcal{G}.$$
With these notations, a split is described as
$$G' = G \ominus ab \oplus a\,]_i \oplus [_i\, b, \qquad G, G' \in \mathcal{G},$$
the fact that $G, G' \in \mathcal{G}$ implying that $i \in \mathbb{N} \setminus \{0\}$, that $ab, a\,]_i, [_i\, b \in \mathcal{E}$ and that $G(ab) \ge 1$. The (partial) order relation $G \le G'$ will also be defined by the rule
$$G \le G' \iff G' - G \in \mathcal{G},$$
or equivalently $G \le G' \iff G' - G \in \mathcal{M}_+(\mathcal{E})$.

Let us resume our example. Starting from the one sentence text

T = 1 ⊗ [0 This is my friend Peter .

we get after splitting the grammar

G = [0 This is ]1 Peter . ⊕ [1 my friend

which can also be written as

G = Peter . [0 This is ]1 ⊕ [1 my friend

In this example, as well as in the following, punctuation marks are treated as words, so that here the required dictionary has to include {is, my, friend, Peter, This, .}.
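The following sketch, in the same Counter-based encoding as above, carves the expression "my friend" out of the example sentence. The position arguments are our own convention for choosing the segment $b$ inside a canonical expression; they are not notation from the paper.

    from collections import Counter

    # Sketch (ours) of the split operation G' = G (-) ab (+) a]i (+) [i b,
    # on a grammar stored as a Counter over canonical expressions.

    def split(G, e, start, end, i):
        """Carve b = e[start:end] out of the canonical expression e,
        replacing it by the label ]i and creating the local expression [i b."""
        assert G[e] >= 1 and 1 <= start < end <= len(e)
        a_part = e[:start] + (']%d' % i,) + e[end:]
        b_part = ('[%d' % i,) + e[start:end]
        G2 = Counter(G)
        G2[e] -= 1
        G2.update([a_part, b_part])
        return +G2                      # drop zero-weight entries

    T = Counter({('[0', 'This', 'is', 'my', 'friend', 'Peter', '.'): 1})
    G = split(T, ('[0', 'This', 'is', 'my', 'friend', 'Peter', '.'), 3, 5, 1)
    assert G == Counter({('[0', 'This', 'is', ']1', 'Peter', '.'): 1,
                         ('[1', 'my', 'friend'): 1})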
Splitting a sentence providing a new label for each split does not create generalization, since it allows only to merge back two expressions that came from the same split. To create a grammar capable of yielding new sentences, we need some label identification scheme. We will perform label identification through the more general process of label remapping, identification being a consequence of the fact that the map may not be one to one.

Let $\mathcal{F} = \bigl\{ f : \mathbb{N} \to \mathbb{N} \text{ such that } f(0) = 0 \bigr\}$ be the set of label maps. For any symbol $]_i$ or $[_i$, $i \in \mathbb{N}$, let us define $f(]_i) = ]_{f(i)}$ and $f([_i) = [_{f(i)}$. Let us also define, for any word $w \in D$, $f(w) = w$, and for any expression $e = (w_0, \dots, w_{\ell-1})$, $f(e) = (f(w_0), \dots, f(w_{\ell-1}))$. Since any grammar $G \in \mathcal{G}$ is a measure on the set of expressions $\mathcal{E}$, we can define its image measure by $f$, considered as a map from $\mathcal{E}$ to $\mathcal{E}$. We will put $f(G) = G \circ f^{-1}$, meaning that $f(G)(A) = G\bigl(f^{-1}(A)\bigr)$, for any subset $A \subset \mathcal{E}$.

Definition 4.1. Two label maps $f$ and $g \in \mathcal{F}$ are said to be isomorphic if there is a one to one label map $h \in \mathcal{F}$ such that $g = h \circ f$. In this case $h^{-1} \in \mathcal{F}$ and $f = h^{-1} \circ g$. Two grammars $G$ and $G' \in \mathcal{G}$ are said to be isomorphic if there is a one to one label map $f \in \mathcal{F}$ such that $f(G) = G'$. In this case, $f^{-1}(G') = G$ and we will write $G \equiv G'$.

If $f$ and $g$ are two isomorphic label maps, then for any toric grammar $G \in \mathcal{G}$, $f(G)$ and $g(G)$ are isomorphic grammars. In the following of this paper, to ease notations and simplify exposition, we will freely identify isomorphic label maps and isomorphic grammars and often speak of them as if they were equal.

This being put, we proceed with the introduction of a set of grammar transformations $\beta$ that consist in a split with possible label remapping. The split will be the core component for generating a toric grammar from a text, by splitting the sentences into smaller parts (syntagms).

Definition 4.2 (Splitting rule). For any $G \in \mathcal{G}$, let us consider
$$\beta(G) = \bigl\{ f(G'),\ f \in \mathcal{F},\ G' \in \mathcal{G},\ G' = G \ominus ab \oplus a\,]_i \oplus [_i\, b \bigr\} \subset \mathcal{G}.$$
Let us remark that in this definition, necessarily, $ab, a\,]_i, [_i\, b \in \mathcal{E}$, $i \in \mathbb{N} \setminus \{0\}$, $1 \otimes ab \le G$, and $a\,]_i \oplus [_i\, b \le G'$. Let us put
$$\beta^*(G) = \bigcup_{n=0}^{+\infty} \underbrace{\beta \circ \cdots \circ \beta}_{n \text{ times}}(G),$$
the set of grammars that can be constructed from repeated invocations of $\beta$.

Lemma 4.1. Let us recall that $\mathcal{S} = D \cup \bigl\{ [_i,\ ]_i,\ i \in \mathbb{N} \bigr\}$ and let us put $\mathcal{S}^* = \bigcup_{n=0}^{+\infty} \mathcal{S}^n$. For any text $T \in \mathcal{T}$ and any $G \in \beta^*(T)$, $G$ is a toric grammar with integer weights,
$$G\bigl( [_i\, \mathcal{S}^* \bigr) = G\bigl( ]_i\, \mathcal{S}^* \bigr), \quad i \in \mathbb{N} \setminus \{0\},$$
$$G\bigl( w\, \mathcal{S}^* \bigr) = T\bigl( w\, \mathcal{S}^* \bigr), \quad w \in D \cup \{ [_0 \},$$
and in particular $G\bigl( [_0\, \mathcal{S}^* \bigr) = T\bigl( [_0\, \mathcal{S}^* \bigr)$, whereas
$$G\bigl( w\, \mathcal{S}^* \bigr) \le T\bigl( w\, \mathcal{S}^* \bigr), \quad w \in \bigl( D \cup \{ [_0 \} \bigr)^+.$$
This means that in any toric grammar obtained by splitting a text, the weights of expressions containing the two forms $]_i$ and $[_i$ of a label are balanced, the word frequencies are the same in the grammar and in the text, and the number of sentences contained in the text is given by the total weight of expressions containing the start symbol $[_0$ in the grammar.

Proof. For the first assertion, an induction on the number of applications of $\beta$ yields the result, since
$$T\bigl( [_i\, \mathcal{S}^* \bigr) = T\bigl( ]_i\, \mathcal{S}^* \bigr) = 0, \quad i \in \mathbb{N} \setminus \{0\},$$
and, for any $G' = G \ominus ab \oplus a\,]_i \oplus [_i\, b$ and any label $j \in \mathbb{N} \setminus \{0, i\}$,
$$G'\bigl( [_j\, \mathcal{S}^* \bigr) = G\bigl( [_j\, \mathcal{S}^* \bigr), \qquad (4.1)$$
$$G'\bigl( ]_j\, \mathcal{S}^* \bigr) = G\bigl( ]_j\, \mathcal{S}^* \bigr), \qquad (4.2)$$
whereas
$$G'\bigl( [_i\, \mathcal{S}^* \bigr) = G\bigl( [_i\, \mathcal{S}^* \bigr) + 1, \qquad (4.3)$$
$$G'\bigl( ]_i\, \mathcal{S}^* \bigr) = G\bigl( ]_i\, \mathcal{S}^* \bigr) + 1. \qquad (4.4)$$
For the second assertion, it suffices to remark that the weight of expressions beginning with a given word is invariant by application of $\beta$.
Indeed, any word symbol $w \in D \cup \{ [_0 \}$ appears the same number of times at the beginning of an expression of $1 \otimes ab$ and of $a\,]_i \oplus [_i\, b$.

This lemma is important, because we will subsequently impose restrictions on the splitting rule based on word frequencies. Our choice to define a new type of grammar as a positive measure on symbol sequences was made to keep track of word frequencies throughout the construction.

Let us now describe the reverse of a splitting transformation, which we will call a merge transformation. This transformation will be central in generating new texts from a toric grammar, by merging the syntagms into bigger ones, ending with a full sentence.

Definition 4.3 (Merge rule). For any toric grammar $G \in \mathcal{G}$ we consider the following set of allowed merge transformations:
$$\alpha(G) = \bigl\{ G' \in \mathcal{G},\ G' = G \ominus a\,]_i \ominus [_i\, b \oplus ab \bigr\}.$$
Let us remark that in this definition, necessarily $i \in \mathbb{N} \setminus \{0\}$, $a\,]_i, [_i\, b, ab \in \mathcal{E}$, and $a\,]_i \oplus [_i\, b \le G$.

The merge transformation is indeed the reverse of the split, in the sense that:

Lemma 4.2. For any $G, G' \in \beta^*(T)$, $G' \in \beta(G)$ if, and only if, there is $f \in \mathcal{F}$ such that $f(G) \in \alpha(G')$.

Proof. Let us suppose that $G' = f\bigl( G \oplus a\,]_i \oplus [_i\, b \ominus ab \bigr)$ is in $\beta(G)$. Then $G' = f(G) \oplus f(a\,]_i) \oplus f([_i\, b) \ominus f(ab)$, so that $f(a\,]_i), f([_i\, b) \in \operatorname{supp} G'$, $f(ab) \in \operatorname{supp} f(G)$, and consequently $f(a\,]_i)$, $f([_i\, b)$ and $f(ab) \in \mathcal{E}$. Moreover $f(G) = G' \oplus f(a) f(b) \ominus f(a)\,]_{f(i)} \ominus [_{f(i)}\, f(b)$, so that $f(G) \in \alpha(G')$.

On the other hand, if for some $f \in \mathcal{F}$, $f(G) \in \alpha(G')$, then $f(G) = G' \oplus ab \ominus a\,]_i \ominus [_i\, b$. Since $ab \in \operatorname{supp} f(G)$, there is $e \in \mathcal{E}$ such that $f(e) = ab$. But this implies that there are $c, d \in \mathcal{S}^+$ such that $a = f(c)$ and $b = f(d)$. We can then, if needed, modify $f$ outside $\bigl\{ j \in \mathbb{N} : G\bigl( [_j\, \mathcal{S}^* \bigr) > 0 \bigr\}$ to make sure that $i \in f(\mathbb{N})$. Let $f(j) = i$. We now get that $f(G) = G' \oplus f(c) f(d) \ominus f(c)\,]_{f(j)} \ominus [_{f(j)}\, f(d)$, so that $G' = f\bigl( G \oplus c\,]_j \oplus [_j\, d \ominus cd \bigr)$, proving that $G' \in \beta(G)$.

Another useful property of the merge rule is given by the following lemma:

Lemma 4.3. For any $f \in \mathcal{F}$ and any $G \in \mathcal{G}$, $f\bigl( \alpha(G) \bigr) \subset \alpha\bigl( f(G) \bigr)$.

Proof. Indeed, any $G' \in f\bigl( \alpha(G) \bigr)$ is of the form
$$G' = f\bigl( G \oplus ab \ominus a\,]_i \ominus [_i\, b \bigr) = f(G) \oplus f(a) f(b) \ominus f(a)\,]_{f(i)} \ominus [_{f(i)}\, f(b) \in \alpha\bigl( f(G) \bigr).$$

Unfortunately, repeating the merge transformation will not provide a text in all circumstances. Indeed, we can end up with some expressions of the type $[_i\, a\, ]_i\, b$. However, since an expression is allowed to contain only one opening bracket, we are sure that $[_0 \notin \operatorname{supp}\bigl( [_i\, a\, ]_i\, b \bigr)$.
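A sketch of the merge rule in the same encoding (ours, mirroring the `split` sketch above):

    from collections import Counter

    # Sketch (ours) of the merge rule a]i + [i b -> ab: the local expression
    # is substituted back for the matching label token ']i'.

    def merge(G, a_part, b_part):
        """G' = G (-) a]i (-) [i b (+) ab, gluing b_part into a_part at ]i."""
        i = b_part[0]                      # opening token '[k' of b_part
        j = a_part.index(']' + i[1:])      # position of the matching ']k'
        assert G[a_part] >= 1 and G[b_part] >= 1
        ab = a_part[:j] + b_part[1:] + a_part[j + 1:]
        G2 = Counter(G)
        G2[a_part] -= 1
        G2[b_part] -= 1
        G2.update([ab])
        return +G2

    G = Counter({('[0', 'This', 'is', ']1', 'Peter', '.'): 1,
                 ('[1', 'my', 'friend'): 1})
    T = merge(G, ('[0', 'This', 'is', ']1', 'Peter', '.'),
              ('[1', 'my', 'friend'))
    assert T == Counter({('[0', 'This', 'is', 'my', 'friend', 'Peter', '.'): 1})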
To continue the discussion, we will switch to a random context, where split and merge transformations are performed according to some probability measure.

5. Random split and merge processes

The grammars we described so far are obtained using splitting rules. Texts can be reconstructed using merge transformations. The splitting rules as well as the merge rules allow for multiple choices at each step. We will account for this by introducing random processes where these choices are made at random. We will describe two types of random grammar transformations. Each of these will appear as a finite length Markov chain, where the length of the chain is given by a uniformly bounded stopping time.

− The learning process (or splitting process) will start with a text and build a grammar through iterated splits;
− the production process will start with a grammar and produce a text through iterated merge operations.

These two types of processes may be combined into a split and merge process, going back and forth between texts and toric grammars. Let us give more formal definitions. Learning and parsing processes will be some special kinds of splitting processes, to be defined hereafter.

Definition 5.1 (Splitting process). Given some restricted splitting rule $\beta_r : \mathcal{G} \to 2^{\mathcal{G}}$, from the set of grammars to the set of subsets of $\mathcal{G}$, such that for any $G \in \mathcal{G}$, $\beta_r(G) \subset \beta(G)$, a splitting process is a time homogeneous stopped Markov chain $S_t$, $0 \le t \le \tau$, defined on $\mathcal{G}$, such that
$$\tau = \inf \bigl\{ t \in \mathbb{N} : \beta_r(S_t) = \emptyset \bigr\},$$
$$\mathbb{P}\bigl( S_t = G' \,\big|\, S_{t-1} = G \bigr) > 0 \iff G' \in \beta_r(G).$$

Definition 5.2 (Production process). A production process is a time homogeneous stopped Markov chain $P_t$, $0 \le t \le \sigma$, defined on $\mathcal{G}$, such that
$$\sigma = \inf \bigl\{ t \in \mathbb{N} : \alpha(P_t) = \emptyset \bigr\},$$
$$\mathbb{P}\bigl( P_t = G' \,\big|\, P_{t-1} = G \bigr) > 0 \iff G' \in \alpha(G).$$

Definition 5.3 (Split and merge process). Given a splitting process $(S_t,\ t \in \mathbb{N})$ and a production process $(P_t,\ t \in \mathbb{N})$, a split and merge process is a Markov chain $(G_t \in \mathcal{G},\ t \in \mathbb{N})$ with transitions
$$\mathbb{P}\bigl( G_{2t+1} = G' \,\big|\, G_{2t} = G \bigr) = \mathbb{P}\bigl( S_\tau = G' \,\big|\, S_0 = G \bigr), \quad t \in \mathbb{N},$$
$$\mathbb{P}\bigl( G_{2t} = G' \,\big|\, G_{2t-1} = G \bigr) = \mathbb{P}\bigl( P_\sigma = G' \,\big|\, P_0 = G,\ P_\sigma \in \mathcal{T} \bigr), \quad t \in \mathbb{N} \setminus \{0\},$$
whose initial distribution is a probability measure on texts, so that almost surely $G_0 \in \mathcal{T}$.

Let us remark that we have to impose the condition $P_\sigma \in \mathcal{T}$, because the production process does not produce a true text with probability one. On the other hand, it can yield back $G_{2t-2}$ with positive probability when started at $G_{2t-1}$, as will be proved later on. Therefore $\mathbb{P}(P_\sigma \in \mathcal{T} \,|\, P_0 = G) > 0$ for any $G$ such that $\mathbb{P}(G_{2t-1} = G) > 0$. One way to simulate $\mathbb{P}(G_{2t} \,|\, G_{2t-1})$ is to use a rejection method, simulating repeatedly from the production process until a true text is produced. In the experiments we made, $\mathbb{P}(P_\sigma \in \mathcal{T} \,|\, P_0 = G)$ was close to one and rejection a rare event.
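Assembled from the hedged building blocks above, one full transition $T \to T'$ of the chain could be simulated as follows. This is our own sketch: `split_moves`, `merge_moves` and `is_text` are hypothetical callbacks enumerating $\beta_r(G)$, $\alpha(G)$ and testing membership in $\mathcal{T}$ for the chosen encoding, and the uniform choice among allowed moves is just one transition probability compatible with the definitions, which only constrain the support.

    import random

    # Sketch (ours) of one transition T -> T' of the communication chain:
    # iterate uniformly chosen allowed splits until beta_r is empty (time tau),
    # then allowed merges until alpha is empty (time sigma), rejecting
    # production runs that do not end on a text.

    def run_stopped_chain(state, moves):
        """Stopped chain: apply uniformly drawn moves until none applies."""
        while True:
            allowed = moves(state)         # list of successor grammars
            if not allowed:
                return state
            state = random.choice(allowed)

    def communication_step(T, split_moves, merge_moves, is_text):
        """One step of the kernel q(T, .), with rejection of non-texts."""
        G = run_stopped_chain(T, split_moves)        # splitting phase
        while True:                                  # production phase
            T2 = run_stopped_chain(G, merge_moves)
            if is_text(T2):                          # condition P_sigma in T
                return T2

The rejection loop in `communication_step` is exactly the simulation device mentioned in the remark above; it terminates almost surely whenever $\mathbb{P}(P_\sigma \in \mathcal{T} \,|\, P_0 = G) > 0$.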
Proposition 5.1. Let $S_t$, $P_t$ and $G_t$ be a splitting process, a production process and the corresponding split and merge process, starting from $G_0 = T \in \mathcal{T}$. For any $G \in \mathcal{G}$ and any $T' \in \mathcal{T}$ such that $\sum_{t \in \mathbb{N}} \mathbb{P}(G_{2t+1} = G) > 0$ and $\sum_{t \in \mathbb{N}} \mathbb{P}(G_{2t} = T') > 0$,
$$\mathbb{P}\Bigl( \tau \le 2\bigl( T(D\, \mathcal{S}^*) - T([_0\, \mathcal{S}^*) \bigr) \,\Big|\, S_0 = T' \Bigr) = 1, \qquad (5.1)$$
$$\mathbb{P}\Bigl( \sigma \le 2\bigl( T(D\, \mathcal{S}^*) - T([_0\, \mathcal{S}^*) \bigr) \,\Big|\, P_0 = G \Bigr) = 1. \qquad (5.2)$$
In other words, the lengths of all the splitting and production processes involved in the split and merge process have a uniform bound, given by twice the difference between the number of words and the number of sentences in the original text.

Proof. This proof is a bit lengthy and is based on some invariants of the split and merge operations. It has been put off to appendix A.1.

Proposition 5.2. If $G_t$ is a split and merge process starting almost surely from the text $G_0 = T \in \mathcal{T}$, there is a finite subset of toric grammars $\mathcal{G}_T$ such that, with probability equal to one, there is for each time $t$ a grammar $G'_t$ isomorphic to $G_t$ such that $G'_t \in \mathcal{G}_T$.

Thus, after identification of isomorphic grammars, we can analyze the split and merge process as a finite state Markov chain, since the reachable set from any starting point is finite. We should however keep in mind that the finite state space $\mathcal{G}_T$ depends on the initial state $T$, so the state space is still infinite, although any trajectory will almost surely stay in a finite subset of reachable states.

Proof. Let us assume that the labels of $G$ are taken from $\llbracket 0, W_\ell(G) \rrbracket$, meaning that $G([_i\, \mathcal{S}^*) = 0$ for $i > W_\ell(G)$. This can be achieved, up to grammar isomorphisms, by applying to $G$ a suitable label map. Let us define the set of canonical expressions
$$\mathcal{E}_c = \mathcal{E} \cap \biggl( \bigcup_{i \in \mathbb{N}} [_i\, \mathcal{S}^* \biggr),$$
and the canonical decomposition of $G$,
$$G = \sum_{e \in \mathcal{E}_c} G(e) \otimes e.$$
We see that $G$ can be described by the concatenation of the canonical expressions, each repeated a number of times equal to its weight, to form a sequence of symbols of length $W_s(G)$. From the proof of the previous proposition, we know that
$$W_s(G) \le M = 5\, W_w(T) - 3\, W_e(T) = 5\, T(D\, \mathcal{S}^*) - 3\, T([_0\, \mathcal{S}^*).$$
We can represent $G$ by a sequence of exactly $M$ symbols by padding with trailing $[_0$ symbols the representation described above. Let us give an example:
$$G = 2 \otimes [_0\, w_1\, ]_1\, w_2 \ \oplus\ [_1\, w_3 \ \oplus\ [_1\, w_4$$
can be coded as
$$[_0\, w_1\, ]_1\, w_2\ \ [_0\, w_1\, ]_1\, w_2\ \ [_1\, w_3\ \ [_1\, w_4\ \ [_0\ [_0\ [_0$$
in the case when $M = 15$. Let us consider the set of symbols
$$\mathcal{S}_T = D \cup \bigl\{ [_0 \bigr\} \cup \bigl\{ [_i,\ ]_i,\ 0 < i \le 2\bigl( T(D\, \mathcal{S}^*) - T([_0\, \mathcal{S}^*) \bigr) \bigr\}.$$
Since $G$ uses only those symbols, we see from the proposed coding of $G$ that it can take at most $|\mathcal{S}_T|^M$ different values. Since
$$|\mathcal{S}_T| = |D| + 1 + 4\bigl( T(D\, \mathcal{S}^*) - T([_0\, \mathcal{S}^*) \bigr),$$
we have proved that
$$|\mathcal{G}_T| \le \Bigl( |D| + 1 + 4\bigl( T(D\, \mathcal{S}^*) - T([_0\, \mathcal{S}^*) \bigr) \Bigr)^{5\, T(D \mathcal{S}^*) - 3\, T([_0 \mathcal{S}^*)}.$$
Let us notice that this bound, while being finite, is very large.

6. Splitting rules and label identification

In the previous section, we introduced some class of random processes and studied some of their general properties. In this section, we are going to describe some more specific schemes and go further in the description of split and merge processes that can learn toric grammars in a satisfactory way.

The choice of splitting rules and label identification rules has a decisive influence on the way syntactic categories and syntactic rules are learnt by the split and merge process. While it is necessary as a starting point to consider rules learnt from the text to be parsed itself, it will also be fruitful to consider the case when a previously learnt grammar $R \in \mathcal{G}$ can be used to govern the splits.

To make things easier to grasp, let us explain on some example the basics of syntactic generalization by label identification. Let us start with the simple text with two sentences

G0 = T = [0 This is my friend Peter . ⊕ [0 This is my neighbour John .

If we split "my friend" and "my neighbour" in the two sentences using the same label, we will form after two splits the grammar

G1 = [0 This is ]1 Peter . ⊕ [0 This is ]1 John . ⊕ [1 my friend ⊕ [1 my neighbour
If no more splits are allowed, and we have therefore reached the stopping time of the splitting process, so that $\tau = 2$, we can proceed to the production process, and reach after two more steps the new text $G_2$, which can either be $G_2 = G_0$ or

G2 = [0 This is my neighbour Peter . ⊕ [0 This is my friend John .

Now is a good time to remind the reader of the distinction made in section 3 about local and global expressions. Legitimate local expressions will be provided by the reference grammar $R$, whereas global expressions will be deduced from the text itself. This approach will be particularly efficient in the case when the set of local expressions is smaller than the set of global expressions.

We will need two different kinds of split processes, one to learn the reference grammar from a text and the other one to perform the first part of the transitions of the communication Markov chain. These split processes may be viewed as performing some parsing of the text they are applied to. Here, we do not use parsing as it is usually used, to discover whether a sentence is correct or not; we use it instead to discover new expressions.

We will start by defining the parsing rules to be used in the communication chain. We will call them narrow parsing rules. We will then proceed to the definition of a broad parsing rule suitable for learning the reference grammar $\widehat{R}(T)$ from a text.

Definition 6.1. Let us define the narrow parsing rule with reference grammar $R$ as
$$\beta_n(G, R) = \bigl\{ G' \in \mathcal{G} : G' = G \oplus a\,]_i \oplus [_i\, b \ominus ab,\ ab \in \mathcal{E}_g,\ R\bigl( [_i\, b \bigr) > 0 \bigr\}, \quad G \in \mathcal{G}.$$
Let us remark that, due to the definition of the set of expressions $\mathcal{E}$ and of $\mathcal{G} \subset \mathcal{M}_+(\mathcal{E})$, the fact that $G$ and $G' \in \mathcal{G}$ implies that $i \in \mathbb{N} \setminus \{0\}$ in this definition, since necessarily $a\,]_i, [_i\, b \in \mathcal{E}$. It implies also that $[_0 \in \operatorname{supp}(a)$, a condition equivalent to $ab \in \mathcal{E}_g$. The narrow parsing rule depends on $R$ only through $\operatorname{supp}(R) \cap \mathcal{E}_\ell$.

Let us define the broad parsing rule as
$$\beta_b(G, R) = \bigl\{ G' \in \mathcal{G} : G' = G \oplus a\,]_i \oplus [_i\, b \ominus ab,\ R\bigl( a\,]_i \bigr) + R\bigl( [_i\, b \bigr) > 0,\ R\bigl( a\, \mathcal{S}^* \bigr) \le \mu_1\, R\bigl( [_0\, \mathcal{S}^* \bigr),\ \text{and}\ R\bigl( b\, \mathcal{S}^* \bigr) \le \mu_2\, R\bigl( [_0\, \mathcal{S}^* \bigr) \bigr\}, \quad G, R \in \mathcal{G},$$
where $\mu_1, \mu_2 \in \mathbb{R}_+$ are two positive real parameters.

Since the reference grammar is under construction during broad parsing, we will mainly use this rule with $R = G$, as will be explained later. The same learning parameters $\mu_1$ and $\mu_2$ are present here and in the innovation rule to be described next. They serve to split expressions into sufficiently infrequent halves, in order to constrain the model.

Let us now define maximal sequences, a notion that will be needed to define learning rules.

Definition 6.2. Given some toric grammar $G$, we will say that $a \in \mathcal{S}^+$ is $G$-maximal and write $a \in \max(G)$ when
$$G\bigl( a\, \mathcal{S}^* \bigr) > \max \bigl\{ G\bigl( aw\, \mathcal{S}^* \bigr),\ G\bigl( wa\, \mathcal{S}^* \bigr),\ w \in \mathcal{S} \bigr\}.$$
In other words, $a$ is a maximal subsequence among the subsequences with the same weight in $G$. Note that if $a$ is $G$-maximal, usually $G(a) = 0$ (meaning that $a$ is not an expression of the grammar, but only a subexpression), and if the grammar $G$ has integer weights (which will be the case if it has been produced by a split and merge process), then $G(a\, \mathcal{S}^*) \ge 2$.
Definition 6.3 (Innovation rule). Using the notations $[_+ = \bigl\{ [_i,\ i \in \mathbb{N} \setminus \{0\} \bigr\}$ and $]_+ = \bigl\{ ]_i,\ i \in \mathbb{N} \setminus \{0\} \bigr\}$, let us define the innovation rule with reference grammar $R$ as
$$\beta_i(G, R) = \bigl\{ G' \in \mathcal{G} : G' = G \oplus a\,]_i \oplus [_i\, b \ominus ab,\ R\bigl( [_i\, \mathcal{S}^* \bigr) = 0,\ \{a, b\} \cap \max(R) \ne \emptyset,\ R\bigl( a\, \mathcal{S}^* \bigr) \le \mu_1\, R\bigl( [_0\, \mathcal{S}^* \bigr),\ \text{and}\ R\bigl( b\, \mathcal{S}^* \bigr) \le \mu_2\, R\bigl( [_0\, \mathcal{S}^* \bigr) \bigr\}.$$
Here again, the rule will be used while learning the reference grammar with $R = G$.

We will now introduce a label map that identifies the labels appearing in the same context.

Definition 6.4 (Label identification through context). Given some toric grammar $G \in \mathcal{G}$, let us consider the relation $C \subset \bigl( \mathbb{N} \setminus \{0\} \bigr)^2$ defined as
$$C = \biggl\{ (i, j) \in \bigl( \mathbb{N} \setminus \{0\} \bigr)^2 : \sum_{a \in \mathcal{S}^*} G\bigl( a\,]_i \bigr)\, G\bigl( a\,]_j \bigr) + G\bigl( [_i\, a \bigr)\, G\bigl( [_j\, a \bigr) > 0 \biggr\}.$$
The smallest equivalence relation containing $C$ defines a partition of $\mathbb{N} \setminus \{0\}$ into equivalence classes. Let $(A_k)_{k \in \mathbb{N} \setminus \{0\}}$ be an arbitrary indexing of this partition. Each positive integer falls in a unique class of the partition, so that the relation $i \in A_{\chi_G(i)}$ defines a label map $\chi_G : \mathbb{N} \to \mathbb{N}$ in a non ambiguous way. The choice of the indexing of the partition $(A_k)_{k \in \mathbb{N} \setminus \{0\}}$ does not matter, since two different choices lead to two isomorphic label maps. When applying $\chi_G$ to $G$ itself, we will use the short notation $\chi(G) \overset{\mathrm{def}}{=} \chi_G(G)$.

Let us consider the evolution of the number of labels used by $G$:
$$L(G) = \bigl| \bigl\{ i \in \mathbb{N} : G\bigl( ]_i\, \mathcal{S}^* \bigr) > 0 \bigr\} \bigr|.$$
It is easy to see that $L\bigl( \chi(G) \bigr) \le L(G)$ and that $\chi(G) \equiv G$ if and only if $L\bigl( \chi(G) \bigr) = L(G)$, where the symbol $\equiv$ means isomorphic. Accordingly, there is $k \in \mathbb{N}$ such that $\chi^{k+1}(G) \equiv \chi^k(G)$, and we can take it to be the smallest integer such that $L\bigl( \chi^{k+1}(G) \bigr) = L\bigl( \chi^k(G) \bigr)$. Consequently, $k$ is such that for any $n \ge k$, $\chi^n(G) \equiv \chi^k(G)$. With a slight abuse of notation, we will define $\chi(G) = \chi^k(G)$, up to grammar isomorphisms (so that $\chi(G)$ belongs to $\mathcal{G}/\equiv$ rather than to $\mathcal{G}$ itself).

A characterisation in terms of more elementary label maps will be established in Proposition A.6. This characterization provides an algorithm to compute $\chi$ in practice.

We are now ready to define a learning rule.

Definition 6.5. Let us define the learning rule
$$\beta_\ell(G) = \begin{cases} \beta_i(G, G), & \text{when } \beta_b(G, G) = \emptyset, \\ \bigl\{ \chi(G'),\ G' \in \beta_b(G, G) \bigr\}, & \text{otherwise.} \end{cases}$$

We will define two kinds of splitting processes, based on two different choices of the restricted splitting rule $\beta_r$.

Definition 6.6 (Learning process). A learning process is a splitting process with restricted splitting rule $\beta_r(G) = \beta_\ell(G)$.

Definition 6.7 (Parsing process). A parsing process with reference grammar $R \in \mathcal{G}$ is a splitting process with restricted splitting rule $\beta_r(G) = \beta_n(G, R)$.
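Definition 6.4 amounts to taking connected components of a "shared context" graph over labels. The sketch below (our code, in the same token encoding as before) reads the left contexts $a\,]_i$ and right contexts $[_i\, a$ off the canonical expressions, contexts being taken circularly, and merges labels with a union-find; it only returns representatives for the labels actually used, which suffices for a sketch.

    from collections import Counter, defaultdict

    # Sketch (ours) of Definition 6.4: labels i and j are identified when some
    # a satisfies G(a]i)G(a]j) > 0 or G([i a)G([j a) > 0 (transitive closure).

    def find(parent, i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]            # path halving
            i = parent[i]
        return i

    def identify_labels(G):
        """Map each used label to a class representative."""
        contexts = defaultdict(set)                  # context -> labels seen
        for e in G:
            if e[0] != '[0':                         # right context [i a
                contexts[('right', e[1:])].add(int(e[0][1:]))
            for k, w in enumerate(e):                # circular left context a]i
                if w.startswith(']'):
                    contexts[('left', e[k + 1:] + e[:k])].add(int(w[1:]))
        labels = {i for s in contexts.values() for i in s}
        parent = {i: i for i in labels}
        for s in contexts.values():                  # union labels per context
            s = sorted(s)
            for i in s[1:]:
                parent[find(parent, i)] = find(parent, s[0])
        return {i: find(parent, i) for i in labels}

    G = Counter({('[0', 'This', 'is', ']1'): 1, ('[0', 'This', 'is', ']2'): 1,
                 ('[1', 'my', 'friend'): 1, ('[2', 'my', 'neighbour'): 1})
    # ']1' and ']2' share the circular left context '[0 This is', so 1 ~ 2:
    assert len(set(identify_labels(G).values())) == 1

Iterating this map until the number of labels stabilizes gives the fixpoint $\chi^k$ of the discussion above.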
Before we reach the aim of this paper and describe our statistical language model, we need to explore some of the properties of the production, learning and parsing processes introduced so far.

7. Parsing and generalization

Let us introduce some notations for the output of parsing, learning and production processes.

Definition 7.1. Let $S_t$ be a parsing process with reference grammar $R \in \mathcal{G}$. We will use the following notation for the distribution of $S_\tau$:
$$\mathbb{G}_{T, R} = \mathbb{P}\bigl( S_\tau \in \cdot \,\big|\, S_0 = T \bigr), \quad T \in \mathcal{T}.$$
We will also use a short notation for the distribution of the output of a production process:
$$\mathbb{T}_G = \mathbb{P}\bigl( P_\sigma \in \cdot \,\big|\, P_0 = G,\ P_\sigma \in \mathcal{T} \bigr), \quad G \in \mathcal{G}.$$
Eventually, $\mathbb{G}_T$ will be the probability distribution of the output of a learning process $S_t$, according to the definition
$$\mathbb{G}_T = \mathbb{P}\bigl( S_\tau \in \cdot \,\big|\, S_0 = T \bigr), \quad T \in \mathcal{T}.$$

At this point we obviously may consider different notions of parsing that we have to connect together. Namely, we would like to make a link between the following statements:

− $\mathbb{T}_G(T) > 0$: the grammar $G$ can produce the text $T$;
− $\mathbb{G}_{T, R}(G) > 0$: the text $T$ can generate the grammar $G$ when parsed with the help of the grammar $R$;
− $\mathbb{G}_T(G) > 0$: the grammar $G$ can be learnt from the text $T$.

Lemma 7.1. The previous parse notions are related in the following way. For any $G, R \in \mathcal{G}$ and any $T \in \mathcal{T}$,
$$\mathbb{G}_T(G) > 0 \implies \mathbb{T}_G(T) > 0,$$
$$\mathbb{G}_{T, R}(G) > 0 \implies \mathbb{T}_G(T) > 0,$$
$$\mathbb{T}_G(T) > 0 \implies \mathbb{G}_{T, G}(G) > 0.$$
Consequently, for any $G, R \in \mathcal{G}$ such that $\operatorname{supp}(G) \cap \mathcal{E}_\ell \subset \operatorname{supp}(R)$, and any $T \in \mathcal{T}$,
$$\mathbb{T}_G(T) > 0 \iff \mathbb{G}_{T, R}(G) > 0.$$

Proof. This is one of the core lemmas of this work. The proof is given in appendix A.2, on account of its length.

It has the following important implication.

Proposition 7.2. Given a parsing process $S_t$ based on a reference grammar $R \in \mathcal{G}$ and a production process $P_t$, the corresponding split and merge process $G_t$ is weakly reversible, in the sense that for any $T \in \mathcal{T}$ and any $G \in \bigcup_{t \in \mathbb{N}} \operatorname{supp}\bigl( \mathbb{P}_{G_{2t+1}} \bigr)$,
$$\mathbb{P}\bigl( G_1 = G \,\big|\, G_0 = T \bigr) > 0 \iff \mathbb{P}\bigl( G_2 = T \,\big|\, G_1 = G \bigr) > 0.$$
Consequently, for any $T, T' \in \mathcal{T}$ and any $G, G' \in \bigcup_{t \in \mathbb{N}} \operatorname{supp}\bigl( \mathbb{P}_{G_{2t+1}} \bigr)$,
$$\mathbb{P}\bigl( G_2 = T' \,\big|\, G_0 = T \bigr) > 0 \iff \mathbb{P}\bigl( G_2 = T \,\big|\, G_0 = T' \bigr) > 0,$$
$$\mathbb{P}\bigl( G_3 = G' \,\big|\, G_1 = G \bigr) > 0 \iff \mathbb{P}\bigl( G_3 = G \,\big|\, G_1 = G' \bigr) > 0.$$
In other words, the two processes $G_{2t}$ and $G_{2t+1}$ are weakly reversible time homogeneous Markov chains. As we already proved that the set of reachable states from any starting point is finite, this shows that they are recurrent Markov chains: they partition their respective state spaces into positive recurrent communicating classes.

Proof. Let us remark first that
$$\mathbb{P}\bigl( G_1 = G \,\big|\, G_0 = T \bigr) > 0 \iff \mathbb{G}_{T, R}(G) > 0,$$
$$\mathbb{P}\bigl( G_2 = T \,\big|\, G_1 = G \bigr) > 0 \iff \mathbb{T}_G(T) > 0.$$
Moreover, since $G \in \operatorname{supp}\bigl( \mathbb{P}_{G_{2t+1}} \bigr)$ for some $t \in \mathbb{N}$, there is $T' \in \mathcal{T}$ such that $\mathbb{G}_{T', R}(G) > 0$, implying that $\operatorname{supp}(G) \cap \mathcal{E}_\ell \subset \operatorname{supp}(R)$. This ends the proof, according to the last statement of the previous lemma.

8. Expectation of a random toric grammar

In section 7, given some text $T \in \mathcal{T}$, we defined a random distribution on toric grammars $\mathbb{G}_T$ that we would like to use to learn a grammar from a text. The most obvious way to do this is to draw a toric grammar at random according to the distribution $\mathbb{G}_T$, and we already saw an algorithm, described by a Markov chain and a stopping time, to do this.

The distribution $\mathbb{G}_T$ will in general be spread on many grammars. This is a kind of instability that we would like to avoid, if possible. A natural way to get rid of this instability would be to simulate the expectation of $\mathbb{G}_T$. To do this, we are facing a problem: the usual definition of the expectation of $\mathbb{G}_T$, that is
$$\int \mathbf{G} \, \mathrm{d}\mathbb{G}_T(\mathbf{G}),$$
although well defined from a mathematical point of view, is a meaningless toric grammar, due to the possible fluctuations of the label mapping. To get a meaningful notion of expectation, we need to define in a meaningful way the sum of two toric grammars.
We will achieve this in two steps. Let us introduce first the disjoint sum of two toric grammars. We will do this with the help of two disjoint label maps. Let us define the even and odd label maps $f_e$ and $f_o$ as
$$f_e(i) = 2i, \qquad f_o(i) = \max\{0,\ 2i - 1\}, \qquad i \in \mathbb{N}.$$

Definition 8.1. The disjoint sum of two toric grammars $G, G' \in \mathcal{G}$ is defined as
$$G \boxplus G' = f_e(G) + f_o(G').$$

Definition 8.2. Given a probability measure $\mathbb{G} \in \mathcal{M}_+^1(\mathcal{G})$ with finite support, we define the mean of $\mathbb{G}$ as
$$\oint \mathbf{G} \, \mathrm{d}\mathbb{G}(\mathbf{G}) = \chi\biggl( \mathop{\boxplus}_{\mathbf{G} \in \mathcal{G}} \mathbb{G}(\mathbf{G})\, \mathbf{G} \biggr).$$

Lemma 8.1. If $G_i$ is an i.i.d. sequence of random grammars distributed according to $\mathbb{G}$, then almost surely
$$\lim_{n \to +\infty} \frac{1}{n}\, \chi\biggl( \mathop{\boxplus}_{i=1}^{n} G_i \biggr) = \oint \mathbf{G} \, \mathrm{d}\mathbb{G}(\mathbf{G}).$$

Proof. The proof of this result is quite lengthy, and postponed till appendix A.3.

9. Language models

We are now ready to define the language model announced in the introduction. Given a reference grammar $R$, and the corresponding split and merge process $(G_t)_{t \in \mathbb{N}}$ with reference $R$, we define the communication kernel $q_R(T, T')$ on $\mathcal{T}^2$ as
$$q_R(T, T') = \mathbb{P}\bigl( G_2 = T' \,\big|\, G_0 = T \bigr).$$
According to Proposition 5.2 and Proposition 7.2, $q_R$ has finite reachable sets and is weakly reversible, so that all texts $T \in \mathcal{T}$ are positive recurrent states of the communication kernel $q_R$. Thus to each text $T \in \mathcal{T}$ corresponds a unique invariant text distribution $\widehat{q}_R(T, \cdot)$, as explained in the introduction. As all states are positive recurrent, $\widehat{q}_R(T, \cdot)$ is the unique invariant measure of $q_R$ on the communicating class containing $T$. Moreover, from the ergodic theorem,
$$\mathbb{P}\biggl( \widehat{q}_R(T, \cdot) = \lim_{t \to \infty} \frac{1}{t} \sum_{j=1}^{t} \delta_{G_{2j}} \,\bigg|\, G_0 = T \biggr) = 1,$$
showing that $\widehat{q}_R(T, \cdot)$ can be computed by an almost surely convergent Monte-Carlo simulation.

Eventually, from the invariant probability measure on texts $\widehat{q}_R(T, \cdot)$, we deduce a probability measure on sentences $\widehat{Q}_{R, T}$, as explained in the introduction, according to the formula
$$\widehat{Q}_{R, T} = T\bigl( [_0\, \mathcal{S}^* \bigr)^{-1} \sum_{T' \in \mathcal{T}} \widehat{q}_R(T, T')\, T'.$$
(This is the same formula as in the introduction, taking into account the fact that texts in the support of $\widehat{q}_R(T, \cdot)$ are non normalized empirical measures with the same total mass, equal to $T([_0\, \mathcal{S}^*)$, the number of sentences in the text $T$.)

To obtain a true language estimator, there remains to estimate $R$ by some estimator $\widehat{R}(T)$. We will do this as described in section 8, putting
$$\widehat{R}(T) = \oint \mathbf{G} \, \mathrm{d}\mathbb{G}_T(\mathbf{G}).$$
Let us remark that, according to Lemma 8.1, $\widehat{R}(T)$ can be computed from repeated simulations from the distribution $\mathbb{G}_T$.
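In practice, $\widehat{Q}_{R, T}$ can be approximated by the ergodic average above. Here is a minimal sketch (ours), reusing the hypothetical `communication_step` of section 5, wrapped so that it only takes the current text as argument:

    from collections import Counter

    # Sketch (ours) of the Monte-Carlo estimation of Q_hat_{R,T}: average the
    # sentence counts of the texts G_2, G_4, ... visited by the chain.

    def estimate_language(T, communication_step, iterations=50):
        counts = Counter()
        state = T
        for _ in range(iterations):
            state = communication_step(state)
            counts.update(state.elements())    # each sentence of the text
        total = sum(counts.values())
        return {sentence: w / total for sentence, w in counts.items()}

The 50-iteration default mirrors the approximation used in the experiment of section 11; convergence is almost sure by the ergodic theorem quoted above.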
10. Comparison with other models

10.1. Comparison with Context Free Grammars

Given a toric grammar $G \in \beta^*(T)$, we may consider the split and merge process $G_t$ with reference grammar $G$ starting at $G_1 = G$ (so here we start at time 1 with an initial state that is a grammar, instead of starting at time 0 with an initial state that is a text). Due to the weak reversibility of Proposition 7.2, $G_2$ almost surely falls in the same recurrent communicating class of $t \mapsto G_{2t}$, and the unique invariant probability measure supported by this recurrent communicating class defines a probability measure $\mathbb{T}_G$ on texts, and therefore a stochastic language model.

This way of defining the language generated by the grammar $G$ can be compared to the usual definition of the language generated by a Context Free Grammar. Indeed, the support of $G$ is a Context Free Grammar, so it is meaningful to consider the language generated by this grammar and to compare it with the support of our stochastic language model.

None of these two sets of sentences is contained in the other one. In our stochastic model, the number of times a rule can be used is bounded, so if the recursive use of some rules is possible, the deterministic language will in this sense be larger. On the other hand, the stochastic model uses both production and parsing to build new sentences, whereas the deterministic model uses only production rules. In this respect, the stochastic model may, at least in some cases, define a much broader language, as we will show on the following example.

Let us take as dictionary the set $D = \{ +, = \} \cup \llbracket 1, N \rrbracket$, where $\llbracket 1, N \rrbracket = \{ i \in \mathbb{N},\ 1 \le i \le N \}$, and consider the toric grammar
$$G = N^2 \otimes \bigl( [_0\ ]_N = N \bigr) \ \oplus\ \bigoplus_{i=1}^{N} N \otimes \bigl( [_i\ i \bigr) \ \oplus\ \bigoplus_{i=2}^{N} N(i-1) \otimes \bigl( [_i\ ]_{i-1} + 1 \bigr),$$
and the text
$$T = N \otimes \bigoplus_{i=1}^{N} \bigl( [_0\ i \underbrace{{}+ 1 + \dots + 1}_{N - i \text{ times}} = N \bigr).$$
It is easy to check that $\mathbb{T}_G(T) > 0$ (so that $G \in \beta^*(T)$), that indeed the support of $T$ is the language generated by $\operatorname{supp}(G)$, seen as a Context Free Grammar, and that the stochastic language $\mathbb{T}_G$ generated by $G$ is able to produce with positive probability a set of sentences
$$\operatorname{supp}^2(\mathbb{T}_G) \overset{\mathrm{def}}{=} \bigcup_{T' \in \operatorname{supp}(\mathbb{T}_G)} \operatorname{supp}(T'),$$
equal to
$$\operatorname{supp}^2(\mathbb{T}_G) = \Bigl\{ [_0\ x_1 + \dots + x_i = x_{i+1} + \dots + x_j,\ 1 \le i < j \le 2N,\ x_k \in \llbracket 1, N \rrbracket,\ 1 \le k \le j,\ \sum_{k=1}^{i} x_k = \sum_{k=i+1}^{j} x_k = N \Bigr\}.$$
Here, the number of sentences produced by the underlying Context Free Grammar is $|\operatorname{supp}(T)| = N$, whereas the number of sentences produced by our stochastic language model is $|\operatorname{supp}^2(\mathbb{T}_G)| = 2^{2(N-1)}$. Thus, in this small example based on arithmetic expressions (admittedly closer to a computing language than it is to a natural language), our new definition of the generated language induces a huge increase in the number of generated sentences.

Note that with usual Context Free Grammar notations, $\operatorname{supp}(G)$ would have been described as

⟨0⟩ → ⟨N⟩ = N
⟨i⟩ → i, i = 1, ..., N,
⟨i⟩ → ⟨i−1⟩ + 1, i = 2, ..., N,

where ⟨0⟩ is the start symbol and ⟨i⟩, i = 1, ..., N, are other non terminal symbols.

To count the number of elements of $\operatorname{supp}^2(\mathbb{T}_G)$, one can remark that the number of ways $N$ can be written as $\sum_{k=1}^{i} x_k$ with an arbitrary number of terms is also the number of increasing integer sequences $0 < s_1 < \dots < s_{i-1} < N$ of arbitrary length, which is also the number of subsets $\{ s_1, \dots, s_{i-1} \}$ of $\{ 1, \dots, N-1 \}$, that is $2^{N-1}$.
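The counting argument is easy to check numerically. The sketch below (ours) enumerates the ordered compositions of $N$ and verifies both cardinalities for a small $N$:

    # Numerical check (ours): the ordered ways to write N as x_1 + ... + x_i
    # with positive integer terms number 2**(N-1), hence 2**(2*(N-1))
    # sentences x_1+...+x_i = x_{i+1}+...+x_j in supp^2.

    def compositions(N):
        """All ordered tuples of positive integers summing to N."""
        if N == 0:
            return [()]
        return [(x,) + rest
                for x in range(1, N + 1) for rest in compositions(N - x)]

    N = 5
    left = compositions(N)
    assert len(left) == 2 ** (N - 1)
    sentences = {l + ('=',) + r for l in left for r in left}
    assert len(sentences) == 2 ** (2 * (N - 1))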
Intuitively speaking, the underlying Context Free Grammar $\operatorname{supp}(G)$ is limited to producing a small set of global expressions of the form $[_0\ i + 1 + \dots + 1 = N$, whereas the stochastic language model incorporates some crude logical reasoning that is capable of deducing from them a large set of new global expressions. Let us remark also that, when we start as here from a text made of true arithmetic statements, the language generated by our language model is also made of true arithmetic statements. This shows that our approach to language modeling is capable of some sort of logical reasoning.

10.2. Comparison with Markov models

The kind of reasoning illustrated in the previous section is related to the fact that we analyse global syntactic structures, represented by the global expressions of our toric grammars. In order to give another point of comparison, we would like in this section to make a qualitative comparison with Markov models, which do not share this feature. To make a parallel between toric grammars and Markov models, we are going to show how a Markov model could be described in terms of toric grammars and label identification rules.

To build a Markov model in our framework, we have to use a deterministic splitting (or parsing) rule. This is because in a Markov model, conditional probabilities are specified from left to right in a rigid, data independent way. Let us introduce the Markov splitting rule
$$\beta_m(G) = \bigl\{ G' \in \mathcal{G},\ G' = G \ominus [_0\, a w\, ]_i \oplus [_0\, a\, ]_j \oplus [_j\, w\, ]_i,\ i, j \in \mathbb{N} \setminus \{0\},\ a \in D^+,\ w \in D,\ G\bigl( [_j\, \mathcal{S}^* \bigr) = 0 \bigr\}.$$

We will now describe label identification rules using concepts introduced in appendix A.3. Let us say that the pair of labels $p \in \bigl( \mathbb{N} \setminus \{0\} \bigr)^2$ is $G$-Markov if there is $w \in D$ such that $G\bigl( w\,]_{p_1}\, \mathcal{S}^* \bigr)\, G\bigl( w\,]_{p_2}\, \mathcal{S}^* \bigr) > 0$. Let us say that the sequence of pairs of labels $p_1, \dots, p_k$ is $G$-Markov if $p_j$ is $\xi_{p_1, \dots, p_{j-1}}(G)$-Markov. It can be proved, as in the case of congruent sequences, that if $\sigma$ is a permutation and $p$ is $G$-Markov, then $p \circ \sigma$ is also $G$-Markov. It can also be proved that if $p$ and $q$ are maximal $G$-Markov sequences, then $\xi_p \equiv \xi_q$, and therefore $\xi_p(G) \equiv \xi_q(G)$. We will call $\xi_p(G) \in \mathcal{G}/\equiv$ the Markov closure of $G$ and use the notation $\xi_p(G) \overset{\mathrm{def}}{\equiv} \mu(G)$, where $\mu(G)$ is the Markov counterpart of $\chi(G)$ in the construction of toric grammars.

Let $S_t$, $0 \le t \le \tau$, be a splitting process based on the restricted splitting rule
$$\beta_r(G) = \bigl\{ \mu(G'),\ G' \in \beta_m(G) \bigr\}.$$
It is not very difficult to check that the support of $S_\tau$ is contained in a single isomorphism class of grammars, so that, up to label remapping, the result of this splitting process is deterministic. More specifically, starting from a text
$$T = \bigoplus_{j=1}^{n} [_0\, w^j_1 \dots w^j_{\ell(j)},$$
where $w^j_i \in D \setminus \{.\}$, $1 \le i < \ell(j)$, $1 \le j \le n$, and $w^j_{\ell(j)} = .$, $1 \le j \le n$, so that all sentences end with a period, we obtain a grammar isomorphic to
$$G = \bigoplus_{j=1}^{n} \biggl( [_0\, w^j_1\, ]_{w^j_1} \ \oplus\ \bigoplus_{i=2}^{\ell(j)-1} [_{w^j_{i-1}}\, w^j_i\, ]_{w^j_i} \ \oplus\ [_{w^j_{\ell(j)-1}}\, w^j_{\ell(j)} \biggr),$$
where we have used words as labels instead of integers, since in this model, due to the label identification rule, labels are functions of words (namely $]_w$ is the non terminal symbol following the word $w \in D$).
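The grammar above is straightforward to compute. The following sketch (our code, with words used as labels as in the formula, and sentences assumed to have at least two words and to end with a period) builds it from a list of sentences:

    from collections import Counter

    # Sketch (ours) of the grammar produced by the Markov splitting rule:
    # each sentence w_1 ... w_l contributes [0 w_1 ]w_1, then the bigram
    # expressions [w_{i-1} w_i ]w_i, and finally [w_{l-1} w_l.

    def markov_grammar(sentences):
        G = Counter()
        for s in sentences:
            G[('[0', s[0], ']' + s[0])] += 1
            for prev, w in zip(s[:-2], s[1:-1]):
                G[('[' + prev, w, ']' + w)] += 1
            G[('[' + s[-2], s[-1])] += 1
        return G

    G = markov_grammar([('Paul', 'is', 'walking', '.'),
                        ('Peter', 'is', 'walking', '.')])
    assert G[('[is', 'walking', ']walking')] == 2
    assert G[('[walking', '.')] == 2

The weights of this grammar are exactly the bigram counts of the training text, which is what makes the production mechanism described next equivalent to an empirical Markov chain at the word level.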
We can now define a Markov production mechanism, to replace the production process. It is described as a Markov chain $X_i$, $i \in \mathbb{N}$, where $X_i \in D \cup \{\Delta\}$, and where $\Delta \notin D$ is a padding symbol used to embed finite sentences into infinite sequences of symbols, all equal to $\Delta$ for indices larger than the sentence length. The distribution of the Markov chain $X_i$ is as follows. Its initial distribution is
$$\mathbb{P}(X_0 = w) = \frac{G\bigl( [_0\, w\, ]_w \bigr)}{G\bigl( [_0\, \mathcal{S}^* \bigr)},$$
and its transition probabilities are
$$\mathbb{P}\bigl( X_i = \Delta \,\big|\, X_{i-1} = .\, \bigr) = 1,$$
$$\mathbb{P}\bigl( X_i = .\ \big|\, X_{i-1} = w \bigr) = \frac{G\bigl( [_w\ .\, \bigr)}{G\bigl( [_w\, \mathcal{S}^* \bigr)}, \qquad w \in D \setminus \{.\},$$
$$\mathbb{P}\bigl( X_i = w' \,\big|\, X_{i-1} = w \bigr) = \frac{G\bigl( [_w\, w'\, ]_{w'} \bigr)}{G\bigl( [_w\, \mathcal{S}^* \bigr)}, \qquad w, w' \in D \setminus \{.\}.$$
Roughly speaking, the difference with the production process $P_t$ defined previously is that in the production process the production rules are drawn at random without replacement, whereas here the production rules are drawn with replacement. It is easy to see that the initial distribution and transition probabilities of the Markov chain $X_i$ are the empirical initial distribution and empirical transition probabilities of the training text $T$.

In conclusion, to build a Markov model using the same framework as for toric grammars, we had to modify two steps in a dramatic way:

− we had to change the splitting process, and replace the random splitting process of toric grammars with a non random splitting process which chains forward transitions in a linear way;
− we had to change in a dramatic way the label identification rule, to replace the forward and backward global condition of toric grammars with a backward only local condition.

(The modification of the production process is less crucial and boils down to drawing production rules with or without replacement.)

We hope that this discussion of Markov models will help the reader realize that our model proposal is indeed really different from the Markov model at sentence level. We could easily have extended the discussion to Markov models of higher order, or to more general context tree models. We let the reader figure out the details. All these more sophisticated models show the same differences from toric grammars: a more rigid splitting process and local backward label identification rules.

11. A small experiment

Let us end this study with a small example. Here we use a small text that is meant to mimic what could be found in a tutorial to learn English as a foreign language. We have added a more elaborate sentence at the end of the text to show its impact. More systematic experiments are yet to be carried out, although the conception of this model was guided by experimental trial and error with models starting with variable length Markov chains, before we tried global rules leading to grammars.

This is the training text T (each line shows an expression, starting with its weight):

1 [0 He is a clever guy .
1 [0 He is doing some shopping .
1 [0 He is laughing .
1 [0 He is not interested in sports .
1 [0 He is walking .
1 [0 He likes to walk in the streets .
1 [0 I am driving a car .
1 [0 I am riding a horse too .
1 [0 I am running .
1 [0 Paul is crossing the street .
1 [0 Paul is driving a car .
1 [0 Paul is riding a horse .
1 [0 Paul is walking .
1 [0 Peter is walking .
1 [0 While I was walking , I saw Paul crossing the street .

And now, the new sentences produced by the model (that is, by $\widehat{Q}_{\widehat{R}, T}$, approximated on 50 iterations of the communication chain with kernel $q_{\widehat{R}}$):

1 [0 Paul is driving a car too .
1 [0 Paul is doing some shopping .
1 [0 Paul is laughing .
1 [0 Paul is riding a horse too .
1 [0 Paul is running too .
1 [0 Paul is running .
1 [0 Paul is not interested in sports too .
1 [0 Paul is not interested in sports .
1 [0 Paul is a clever guy too .
1 [0 Paul is a clever guy .
1 [0 Paul is walking too .
1 [0 Peter is driving a car too .
1 [0 Peter is driving a car .
1 [0 Peter is doing some shopping .
1 [0 Peter is laughing .
1 [0 Peter is riding a horse too .
1 [0 Peter is riding a horse .
1 [0 Peter is running too .
1 [0 Peter is running .
1 [0 Peter is not interested in sports .
1 [0 Peter is a clever guy .
1 [0 Peter is crossing the street .
1 [0 He is driving a car too .
1 [0 He is driving a car .
1 [0 He is riding a horse too .
1 [0 He is riding a horse .
1 [0 He is running too .
1 [0 He is running .
1 [0 He is not interested in sports too .
1 [0 He is crossing the street too .
1 [0 He is crossing the street .
1 [0 He is walking too .
1 [0 I am driving a car too .
1 [0 I am doing some shopping .
1 [0 I am laughing too .
1 [0 I am laughing .
1 [0 I am riding a horse .
1 [0 I am not interested in sports .
1 [0 I am a clever guy .
1 [0 I am crossing the street too .
1 [0 I am crossing the street .
1 [0 I am walking too .
1 [0 I am walking .
1 [0 While I was driving a car , I saw Paul doing some shopping too .
1 [0 While I was driving a car , I saw Paul doing some shopping .
1 [0 While I was driving a car , I saw Paul riding a horse .
1 [0 While I was driving a car , I saw Paul crossing the street .
1 [0 While I was driving a car , I saw Paul walking .
1 [0 While I was driving a car , I saw Peter riding a horse .
1 [0 While I was doing some shopping , I saw Paul riding a horse .
1 [0 While I was doing some shopping , I saw Paul walking .
1 [0 While I was laughing too , I saw Peter crossing the street .
1 [0 While I was laughing , I saw Peter riding a horse .
1 [0 While I was riding a horse , I saw Paul driving a car too .
1 [0 While I was riding a horse , I saw Paul driving a car .
1 [0 While I was riding a horse , I saw Paul laughing .
1 [0 While I was riding a horse , I saw Paul running .
1 [0 While I was riding a horse , I saw Paul walking .
1 [0 While I was riding a horse , I saw Peter not interested in sports .
1 [0 While I was running , I saw Paul laughing .
1 [0 While I was running , I saw Paul not interested in sports .
1 [0 While I was running , I saw Paul a clever guy .
1 [0 While I was running , I saw Paul walking .
1 [0 While I was not interested in sports , I saw Paul driving a car .
1 [0 While I was not interested in sports , I saw Paul riding a horse .
1 [0 While I was a clever guy , I saw Paul running .
1 [0 While I was a clever guy , I saw Paul crossing the street .
1 [0 While I was a clever guy , I saw Paul walking .
1 [0 While I was crossing the street , I saw Paul riding a horse .
1 [0 While I was crossing the street , I saw Paul running .
1 [0 While I was crossing the street , I saw Paul crossing the street .
1 [0 While I was crossing the street , I saw Paul walking .
1 [0 While I was crossing the street , I saw Peter walking .
1 [0 While I was walking , I saw Paul driving a car .
1 [0 While I was walking , I saw Paul laughing .
1 [0 While I was walking , I saw Paul riding a horse .
1 [0 While I was walking , I saw Paul running .
1 [0 While I was walking , I saw Paul not interested in sports .
1 [0 While I was walking , I saw Paul crossing the street too .
1 [0 While I was walking , I saw Paul walking .
1 [0 While I was walking , I saw Peter not interested in sports .
1 [0 While I was walking , I saw Peter walking .

The reference grammar was learnt first, and was computed from 10 samples of $\mathbb{G}_T$. (We did not normalize the weights, since we were interested in the support of the local expressions only.)
10 [0 He likes to walk ]6 ]3 streets .
2 [0 ]1 ]8 clever guy .
2 [0 ]1 doing some shopping .
2 [0 ]1 laughing .
2 [0 ]1 not interested ]6 sports .
2 [0 ]1 riding ]8 horse .
2 [0 ]1 riding ]8 horse ]2 .
2 [0 ]1 running .
24 [0 ]7 am ]5 .
28 [0 Paul is ]5 .
40 [0 He is ]5 .
4 [0 ]1 crossing ]3 street .
4 [0 ]1 driving ]8 car .
5 [0 ]4 is ]5 .
6 [0 ]1 walking .
7 [0 Peter is ]5 .
8 [0 While ]7 was ]5 , ]7 saw ]4 ]5 .
10 [1 He is
2 [1 Peter is
2 [1 While ]7 was ]5 , ]7 saw ]4
6 [1 ]7 am
8 [1 Paul is
2 [2 too
30 [3 the
14 [4 Paul
1 [4 Peter
16 [5 crossing ]3 street
16 [5 driving ]8 car
16 [5 riding ]8 horse
34 [5 walking
8 [5 ]5 too
8 [5 ]8 clever guy
8 [5 doing some shopping
8 [5 laughing
8 [5 not interested ]6 sports
8 [5 running
20 [6 in
50 [7 I
50 [8 a

Although we have not yet made the software development effort required to test large text corpora, we learnt a few interesting things from what we already tried:

− As it is, the model requires the inclusion of a sufficient number of simple and redundant sentences to start generalizing. At this stage, we do not know whether this could be avoided by changing the learning rules. We made quite a few attempts in this direction. All of them resulted in the production of grammatical nonsense. Breaking the global constraints that are enforced by the model seems to have a dramatic effect on grammatical coherence. This could be a clue that these global conservation rules reflect some fundamental feature of the syntactic structure of natural languages. Including a bunch of "simple" sentences made of frequent words may be seen as introducing a pinch of supervision into the learning process.

− The constraints on subexpression frequencies in learning rules 6.1 (page 15) and 6.3 were added to avoid some unwanted generalizations. For instance, here we took $\mu_1 R\big([_0 S^*\big) = \mu_2 R\big([_0 S^*\big) = 5$. If we had chosen 10 instead of 5, sentences of the kind

[0 While I was walking , I saw He crossing the street .

would have emerged, where the pronoun "He" is substituted for a noun in the wrong place. We deliberately wrote the training text in such a way that "He" is more frequent than any noun, since we expect that to be true for any reasonably large corpus. Doing so, we were able to rule out the wrong construction by lowering the frequency constraint, avoiding the unwanted substitution.

− Despite all the limitations of this small example, it shows that the model is able to find out non-trivial new constructs, like

[0 While I was laughing too , I saw Peter crossing the street .

where it has discovered that "too" can be added to the subordinate clause opening the sentence.

We are quite pleased to see that such things can be learnt with very general label identification rules, while all the generalized sentences remain, if not all grammatically correct, at least all grammatically plausible. Of course this judgement is purely subjective. But since we have no mathematical or otherwise quantitative definition of what natural languages are, we have to be content with a subjective evaluation of models. Studying how this learning model scales with large corpora remains work to be done (it will require us to optimize our code so that it can run efficiently on large data sets).
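To illustrate how such sentences arise, the following Python sketch (ours, and a deliberate simplification: it expands non-terminals with replacement, in the style of a weighted Context Free Grammar, whereas the production process of the paper draws expressions without replacement from a finite multiset) regenerates sentences from a few rules transcribed from the reference grammar listed above.

```python
import random

# A few weighted rules transcribed from the reference grammar above;
# each right-hand side mixes words (str) and non-terminal labels (int),
# so an int k stands for the non-terminal ]k of the listing.
RULES = {
    0: [(40, ['He', 'is', 5, '.']),
        (28, ['Paul', 'is', 5, '.']),
        (8,  ['While', 7, 'was', 5, ',', 7, 'saw', 4, 5, '.'])],
    3: [(30, ['the'])],
    4: [(14, ['Paul']), (1, ['Peter'])],
    5: [(34, ['walking']),
        (16, ['crossing', 3, 'street']),
        (16, ['riding', 8, 'horse']),
        (8,  [5, 'too']),
        (8,  ['laughing'])],
    7: [(50, ['I'])],
    8: [(50, ['a'])],
}

def expand(label, rng):
    """Rewrite one non-terminal by drawing a rule with probability
    proportional to its weight, then expand the result recursively."""
    weights, bodies = zip(*RULES[label])
    body = rng.choices(bodies, weights)[0]
    out = []
    for tok in body:
        out.extend(expand(tok, rng) if isinstance(tok, int) else [tok])
    return out

rng = random.Random(1)
print(' '.join(expand(0, rng)))
# e.g. "While I was walking , I saw Paul riding a horse too ."
```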
12. Conclusion

We have built in this paper a new statistical framework for the syntactic analysis of natural languages.

The main idea pervading our approach is that trying to estimate the distribution of an isolated random sentence is hopeless. Instead, we propose to build a Markov chain on sets of sentences (called texts in this paper), with non-trivial recurrent communicating classes, and to define our language model as the invariant measures of this Markov chain on each of these recurrent communicating classes. At each step, the Markov chain recombines the set of sentences constituting its current state, using cut and paste operations described by grammar rules. In this way, we define the probability distribution of an isolated random sentence only indirectly. We replace the hard question of generating a random sentence with the hopefully simpler one of recombining a set of sentences in a way that keeps the desired distribution invariant.

The strong points of our approach are:

− a decisive departure from Markov models, which are known to fail to catch the recursive structure of natural languages;

− a new "communication model" concept that defines a Markov chain on texts, and in parallel on toric grammars. This results in a new definition of the language generated by what appears as a weighted Context Free Grammar (called a toric grammar in this paper). This new perspective on language production may help to overcome the challenge of weak stimulus learning;

− in this respect, the split and merge process with reference grammar R is the major mathematical achievement of the paper. It has non-trivial mathematical properties, proving that it can be simulated using a bounded number of operations at each step, and that the state space is divided into recurrent communicating classes, each including a finite number of states;

− preliminary experiments on small corpora are encouraging. They give the (acknowledgedly subjective) feeling that the model catches the structure of the natural languages we tried (French and English). Some inflection rules and other grammatical subtleties may be missed, but experimental outputs nevertheless give us the impression that we are heading in the right direction.

On the other hand, the model needs some refinements. In particular, our proposal to build a reference grammar from a text $T \in \mathcal{T}$ through the grammar expectation
$$\widehat{R}(T) = \int G \,\mathrm{d}\mathbb{G}_T(G)$$
is clearly only a first foray into unknown territory. We hope to be able to elaborate more on this part of our research program in the future.

Appendix A: Proofs

A.1. Bound on the length of splitting and production processes

Proof of Proposition 5.1 on page 12. Let us define the length of an expression $e \in S^k \cap E$ as $\ell(e) = k$. Let us introduce some remarkable weights associated with a grammar $G \in \beta^*(T)$:
$$W_s(G) = \sum_{e \in E} G(e), \qquad W_e(G) = \sum_{e \in E} G(e)\, \ell(e)^{-1},$$
$$W_l(G) = \sum_{i=1}^{+\infty} G\big([_i S^*\big), \qquad W_w(G) = \sum_{w \in D} G\big(w S^*\big).$$
Let us define the set of canonical expressions as
$$E_c = E \cap \bigg( \bigcup_{i \in \mathbb{N}} [_i S^* \bigg).$$
Using previously introduced notation, we can write the grammar as
$$G = \sum_{e \in E_c} G(e) \otimes e.$$
We will call this the canonical decomposition of $G$.

The two weights $W_s(G)$ and $W_e(G)$ are better understood in terms of this canonical decomposition. They can be expressed as
$$W_s(G) = \sum_{e \in E_c} G(e)\, \ell(e), \qquad W_e(G) = \sum_{e \in E_c} G(e).$$
This shows that $W_s(G)$ counts the "number of symbols" in the canonical decomposition of $G$, whereas $W_e(G)$ counts the number of expressions (that is, $G(E_c)$, the weight put by the grammar on canonical expressions). We can also see from the definitions that $W_l(G)$ counts the number of canonical expressions starting with a positive (that is, non-terminal) label, which we will call for short the number of labels, and that $W_w(G)$ counts the number of words.
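As a sanity check on these definitions, the following Python sketch (ours; the tuple encoding of expressions is an assumption of the sketch, not the paper's data structure) stores a grammar in canonical form as a multiset, computes the four weights, and verifies on a one-sentence text that a single split leaves $W_s - 2W_e$, $W_e - W_l$ and $W_w$ unchanged, as used in the invariance argument below.

```python
from collections import Counter

# A canonical expression is a tuple starting with an opening label
# ('[', i); the other tokens are words (str) or non-terminals (']', i).
def weights(G):
    """Compute (W_s, W_e, W_l, W_w) for a grammar given in canonical
    form as a Counter mapping expressions to integer weights."""
    Ws = sum(n * len(e) for e, n in G.items())        # symbols
    We = sum(G.values())                              # expressions
    Wl = sum(n for e, n in G.items() if e[0][1] > 0)  # labels
    Ww = sum(n * sum(isinstance(t, str) for t in e)   # words
             for e, n in G.items())
    return Ws, We, Wl, Ww

def split(G, e, cut, j):
    """Split expression e at position cut with the fresh label j:
    replace e = [_i a b by [_i a ]_j and [_j b."""
    head = e[:cut] + ((']', j),)
    tail = (('[', j),) + e[cut:]
    H = G.copy()
    H[e] -= 1
    H += Counter([head, tail])
    return H

T = Counter({(('[', 0), 'a', 'b', 'c', 'd'): 1})    # the text [0 abcd
G = split(T, (('[', 0), 'a', 'b', 'c', 'd'), 3, 1)  # [0 ab ]1 and [1 cd
Ws, We, Wl, Ww = weights(G)
ws, we, wl, ww = weights(T)
assert (Ws - 2 * We, We - Wl, Ww) == (ws - 2 * we, we - wl, ww)
```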
Since a split increases the number of canonical expressions by one, the number of symbols in canonical expressions by two, and the number of labels by one, and keeps the number of words constant, whereas a merge decreases these quantities in the same proportions, the following quantities are invariant for all the toric grammars involved: for any $G \in \mathcal{G}$ such that $\sum_{t \in \mathbb{N}} P(G_t = G) > 0$,
$$W_s(G) - 2 W_e(G) = W_s(T) - 2 W_e(T),$$
$$W_e(G) - W_l(G) = W_e(T) - W_l(T) = W_e(T),$$
$$W_w(G) = W_w(T).$$
Moreover, for the same reasons, for any $T' \in \mathcal{T}$ and $G \in \mathcal{G}$ such that $\sum_{t \in \mathbb{N}} P(G_{2t} = T') > 0$ and $\sum_{t \in \mathbb{N}} P(G_{2t+1} = G) > 0$,
$$P\big(\tau = W_l(S_\tau) \,\big|\, S_0 = T'\big) = 1, \qquad P\big(\sigma = W_l(G) \,\big|\, P_0 = G,\ P_\sigma \in \mathcal{T}\big) = 1.$$
Thus, we will prove the lemma if we can bound $W_l(G)$ (or equivalently $W_l(S_\tau)$ when $S_0 = T'$, since $S_\tau$ almost surely satisfies the conditions imposed on $G$). We can then remark that
$$\sum_{e \in E_c} G(e)\, \mathbb{1}\big(\ell(e) \ge 3\big) \le \sum_{e \in E_c} G(e)\big(\ell(e) - 2\big) = W_s(G) - 2 W_e(G),$$
$$\sum_{e \in E_c} G(e)\, \mathbb{1}\big(\ell(e) = 2\big) = \sum_{e \in E} G(e)\, \mathbb{1}\big(\ell(e) = 2\big) \sum_{w \in D} \mathbb{1}\big(e \in w S^*\big) \le \sum_{e \in E} G(e) \sum_{w \in D} \mathbb{1}\big(e \in w S^*\big) = W_w(G),$$
because any canonical expression of length 2 is of the form $e = [_i\, w$, with $i \in \mathbb{N}$ and $w \in D$, so that for any $e \in E_c$ of length 2,
$$\sum_{e' \in \mathcal{S}(e)} \sum_{w \in D} \mathbb{1}\big(e' \in w S^*\big) = 1.$$
Thus $W_e(G) \le W_w(G) + W_s(G) - 2 W_e(G)$, and consequently we can bound $W_l(G)$ by the split and merge invariant bound
$$W_l(G) \le W_l(G) - W_e(G) + W_w(G) + W_s(G) - 2 W_e(G).$$
This, added to the fact that $W_l(T) = 0$ and $W_s(T) = W_w(T) + W_e(T)$, proves that
$$W_l(G) \le 2 W_w(T) - W_e(T).$$
This ends the proof, since $W_w(T) = T\big(D S^*\big)$ and $W_e(T) = T\big([_0 S^*\big)$.
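As a quick sanity check (our numbers, borrowed from the example used in the next subsection): for the one-sentence text $T = 1 \otimes [_0\, abcd$ we have $W_w(T) = 4$ and $W_e(T) = 1$, so the bound reads $W_l(G) \le 2 \cdot 4 - 1 = 7$; the fully split grammar $[_0\, a\, ]_1 \oplus [_1\, b\, ]_2 \oplus [_2\, c\, ]_3 \oplus [_3\, d$ indeed has $W_l = 3 \le 7$.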
A.2. Parsing Relations

Proof of Lemma 7.1 on page 17. The implication $\mathbb{T}_G(T) > 0 \implies \mathbb{G}_{T,G}(G) > 0$ is less trivial than it may seem. Indeed, we can reverse the path of the splitting process $S_t$, be it a parsing or a learning process, to obtain a path followed with positive probability by the production process, but reversing the production process does not give a parsing process. Let us illustrate this difficulty with a simple example. Consider
$$T = 1 \otimes [_0\, abcd \qquad \text{and} \qquad G = [_0\, a\, ]_1 \oplus [_1\, b\, ]_2 \oplus [_2\, c\, ]_3 \oplus [_3\, d.$$
The production path
$$G, \quad [_0\, ab\, ]_2 \oplus [_2\, c\, ]_3 \oplus [_3\, d, \quad [_0\, ab\, ]_2 \oplus [_2\, cd, \quad T$$
has positive probability. The reverse path may have a positive probability for the learning process, but not for the parsing process with reference $G$, since neither of the expressions $[_0\, ab\, ]_2$ and $[_2\, cd$ belongs to the support of $G$.

To parse $T$ according to $G$, one can instead follow with positive probability a path such as
$$T, \quad [_0\, abc\, ]_3 \oplus [_3\, d, \quad [_0\, ab\, ]_2 \oplus [_2\, c\, ]_3 \oplus [_3\, d, \quad G.$$
To prove the lemma, we will have to show that it is always possible to find such an alternative parsing path. This property is fundamental to our approach, since it proves that the toric grammars we build can be used to parse the texts they can produce.

Let us start with the easiest part of the proof. Assume that $\mathbb{G}_{T,R}(G) > 0$. This means that there is a path $G_0, \dots, G_k$ such that $G_0 = T$, $G_k = G$, and $G_t \in \beta_n(G_{t-1}, R)$. Anyhow, it is easy to check that
$$G_t \in \beta_n(G_{t-1}, R) \implies G_{t-1} \in \alpha(G_t),$$
so that the reverse path is followed with a positive probability by the production process. This means that $\mathbb{T}_G(T) > 0$.

In the case of the learning process, if $\mathbb{G}_T(G) > 0$, there is a path $G_t$, $0 \le t \le k$, such that $G_t \in \beta_\ell(G_{t-1})$, $G_0 = T$ and $G_k = G$; consequently, there is a label map $f_t \in F$ such that $f_t(G_{t-1}) \in \alpha(G_t)$. We can then remark that
$$f_k \circ \dots \circ f_t(G_{t-1}) \in \alpha\big(f_k \circ \dots \circ f_{t+1}(G_t)\big),$$
because, as already proved in Lemma 4.3 on page 11, $f\big(\alpha(G)\big) \subset \alpha\big(f(G)\big)$. Let us consider the path
$$\widetilde{G}_t = f_k \circ \dots \circ f_{k-t+1}\big(G_{k-t}\big).$$
It begins at $\widetilde{G}_0 = G_k = G$ and ends at $\widetilde{G}_k = f_k \circ \dots \circ f_1(G_0) = T$. According to the previous remark, this path is followed by the production process with positive probability, proving that $\mathbb{T}_G(T) > 0$.

Let us now come to the proof of the third implication of the lemma. For this, let us assume now that $\mathbb{T}_G(T) > 0$. Consider a path $G_0, \dots, G_k$ such that $G_0 = G$, $G_k = T$ and $G_t \in \alpha(G_{t-1})$. We are going to define some decorated path $\widetilde{G}_0, \dots, \widetilde{G}_k$ with some added parentheses. Introduce a new set of symbols $B = \big\{ (_i,\, )_i \,:\, i \in \mathbb{N} \setminus \{0\} \big\}$ and assume that it is disjoint from the other symbols used so far, so that $B \cap S = \varnothing$. Consider the set of toric grammars $\widetilde{\mathcal{S}}$ based on the enlarged dictionary $D \cup B$, and the projection $\pi : \widetilde{\mathcal{G}} \to \mathcal{G}$ defined, with the help of the canonical decomposition of toric grammars, as
$$\pi\bigg( \sum_{e \in \widetilde{E}_c} G(e) \otimes e \bigg) = \sum_{e \in \widetilde{E}_c} G(e) \otimes \pi(e),$$
where $\widetilde{E}_c$ is the set of canonical expressions based on the enlarged dictionary $D \cup B$, and where $\pi(e)$ is obtained by removing from the sequence of symbols $e$ the symbols belonging to the decoration set $B$ (that is, the parentheses).

Let us put $\widetilde{G}_0 = G$ and define $\widetilde{G}_t$ for $t = 1, \dots, k$ by induction. We will check on the go that $\pi\big(\widetilde{G}_t\big) = G_t$. This is obviously true for $\widetilde{G}_0$, because $\widetilde{G}_0 \in \mathcal{S}$, so that $\pi\big(\widetilde{G}_0\big) = \widetilde{G}_0 = G_0$. That said, let us describe the construction of $\widetilde{G}_t$, assuming that $\widetilde{G}_{t-1}$ is already defined and satisfies $\pi\big(\widetilde{G}_{t-1}\big) = G_{t-1}$. Consider the sequences of symbols $a, b \in S^*$ and the index $i \in \mathbb{N} \setminus \{0\}$ such that
$$G_t = G_{t-1} \oplus ab \ominus a\, ]_i \ominus [_i\, b.$$
Since $\pi\big(\widetilde{G}_{t-1}\big) = G_{t-1}$, and since $a\, ]_i \oplus [_i\, b \le G_{t-1}$, there are $\widetilde{a}, \widetilde{b} \in \widetilde{S}^*$ such that $\pi(\widetilde{a}) = a$, $\pi(\widetilde{b}) = b$, and $\widetilde{a}\, ]_i \oplus [_i\, \widetilde{b} \le \widetilde{G}_{t-1}$. (The choice of $\widetilde{a}$ and $\widetilde{b}$ may not be unique, in which case we can make an arbitrary choice.) Let us define
$$\widetilde{G}_t = \widetilde{G}_{t-1} \oplus \widetilde{a}\, (_i\, \widetilde{b}\, )_i \ominus \widetilde{a}\, ]_i \ominus [_i\, \widetilde{b}.$$
Since $\pi\big(\widetilde{a}\, (_i\, \widetilde{b}\, )_i\big) = \pi\big(\widetilde{a}\, \widetilde{b}\big) = ab$,
$$\pi\big(\widetilde{G}_t\big) = \pi\big(\widetilde{G}_{t-1}\big) \oplus \pi\big(\widetilde{a}\, (_i\, \widetilde{b}\, )_i\big) \ominus \pi\big(\widetilde{a}\, ]_i\big) \ominus \pi\big([_i\, \widetilde{b}\big) = G_{t-1} \oplus ab \ominus a\, ]_i \ominus [_i\, b = G_t,$$
where we have used the obvious fact that $\pi$ is linear.

We are now going to define another mapping between grammars, which allows us to recover $G$ from any $\widetilde{G}_t$ (obviously, the decorations were added to keep track of $G$). Let us define $\psi : \widetilde{\mathcal{S}}' \to \mathcal{S}$ on the set of decorated grammars $\widetilde{\mathcal{S}}'$ which are supported by expressions where the parentheses $(_i\, )_i$ are matched (at the same level), by the formula
$$\psi\bigg( \sum_{e \in \widetilde{E}_c} \widetilde{G}(e) \otimes e \bigg) = \sum_{e \in \widetilde{E}_c} \widetilde{G}(e)\, \psi(e),$$
where $\psi(e)$ is defined by the rules
$$\psi(e) = \begin{cases} \psi\big([_i\, a\, ]_j\, c\big) + \psi\big([_j\, b\big), & \text{if } e = [_i\, a\, (_j\, b\, )_j\, c, \text{ with } a, b, c \in \widetilde{S}^*, \\ 1 \otimes e, & \text{otherwise.} \end{cases}$$
It is easy to check that this definition is not ambiguous, and that
$$\psi(e) = \psi'(e) \oplus \bigoplus_{(_i\, a\, )_i \in \mathrm{supp}(e)} [_i\, \psi'(a),$$
where $\psi'(e)$ is the expression obtained from $e$ by replacing all the sequences between outer parentheses pairs $(_j\, a\, )_j$ by $]_j$. This may be easier to grasp on an example:
$$\psi\big([_0\, a\, (_1\, b\, (_2\, c\, )_2\, d\, )_1\, e\, (_3\, f\, (_4\, g\, )_4\, )_3\, h\big) = [_0\, a\, ]_1\, e\, ]_3\, h \oplus [_1\, b\, ]_2\, d \oplus [_2\, c \oplus [_3\, f\, ]_4 \oplus [_4\, g.$$
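Since $\psi$ is defined by recursion on parenthesized spans, it may help to see it executed. The following Python sketch (ours; the token encoding is an assumption) implements $\psi$ on a single decorated expression and reproduces the worked example above. It removes one outermost pair at a time; as the paper notes, the definition is unambiguous, so which outer pair is removed first does not matter.

```python
def psi(e):
    """Apply the rule psi([i a (j b )j c) = psi([i a ]j c) + psi([j b)
    until no parentheses remain; returns the list of plain expressions
    of the grammar psi(e)."""
    for pos, tok in enumerate(e):
        if isinstance(tok, tuple) and tok[0] == '(':
            j, depth = tok[1], 0
            for end in range(pos, len(e)):   # find the matching ')j'
                if isinstance(e[end], tuple) and e[end][0] == '(':
                    depth += 1
                elif isinstance(e[end], tuple) and e[end][0] == ')':
                    depth -= 1
                    if depth == 0:
                        break
            outer = e[:pos] + ((']', j),) + e[end + 1:]
            inner = (('[', j),) + e[pos + 1:end]
            return psi(outer) + psi(inner)
    return [e]  # no parentheses left: a plain expression

# The worked example: [0 a (1 b (2 c)2 d)1 e (3 f (4 g)4)3 h
e = (('[', 0), 'a', ('(', 1), 'b', ('(', 2), 'c', (')', 2), 'd',
     (')', 1), 'e', ('(', 3), 'f', ('(', 4), 'g', (')', 4), (')', 3), 'h')
for piece in psi(e):
    print(piece)
# yields [0 a ]1 e ]3 h, [1 b ]2 d, [2 c, [3 f ]4, [4 g (in some order)
```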
It is easy to check by induction that $\widetilde{G}_t \in \widetilde{\mathcal{S}}'$. Let us check, moreover, that $\psi\big(\widetilde{G}_t\big) = G$. Indeed, $\psi\big(\widetilde{G}_0\big) = \psi(G) = G$ and
$$\psi\big(\widetilde{G}_t\big) = \psi\big(\widetilde{G}_{t-1}\big) \oplus \psi\big(\widetilde{a}\, (_i\, \widetilde{b}\, )_i\big) \ominus \psi\big(\widetilde{a}\, ]_i\big) \ominus \psi\big([_i\, \widetilde{b}\big) = \psi\big(\widetilde{G}_{t-1}\big),$$
since $\psi$ is linear and $\psi\big(\widetilde{a}\, (_i\, \widetilde{b}\, )_i\big) = \psi\big(\widetilde{a}\, ]_i\big) \oplus \psi\big([_i\, \widetilde{b}\big)$.

We are now going to define a continuation of the path $\widetilde{G}_t$, $0 \le t \le k$, that will bring us back to $G$. We will maintain during our inductive construction two properties: $\psi\big(\widetilde{G}_t\big) = G$, and $\mathrm{supp}\big(\widetilde{G}_t\big) \cap \widetilde{E}_l \subset E_l$, where $\widetilde{E}_l$ is the set of local decorated expressions, so that $\widetilde{E}_l = \big\{ e \in \widetilde{E} \,:\, [_0 \notin \mathrm{supp}(e) \big\}$. We have already proved that the first property is satisfied by $\widetilde{G}_k$. As $\pi\big(\widetilde{G}_k\big) = G_k = T$, $\mathrm{supp}\big(\widetilde{G}_k\big) \cap \widetilde{E}_l = \varnothing$, so that the second condition is also satisfied.

Let us assume that, for some $t > k$, $\widetilde{G}_{t-1}$ has been defined and satisfies the two conditions above, and let us proceed to the construction of $\widetilde{G}_t$. As long as $\widetilde{G}_{t-1} \notin \mathcal{S}$ (and this will be the case for $t < 2k$), find some canonical expression $e \in \widetilde{E}_c \setminus E_c$ such that $\widetilde{G}_{t-1}(e) \ge 1$. From our induction hypotheses, we see that necessarily $[_0 \in \mathrm{supp}(e)$. Our continuation will be such that each such expression has matching parentheses with matching labels, and we will check this on the go while building it by induction. Among those matching pairs of parentheses, there is necessarily at least one inner pair. We can, for instance, choose the one starting with the last opening parenthesis $(_j$ of the sequence $e$. This choice makes it obvious that the subsequence of $e$ enclosed between $(_j$ and $)_j$ contains no further parentheses. Since $\psi$ is linear and preserves positive measures,
$$G \ominus \psi(e) = \psi\big(\widetilde{G}_{t-1}\big) \ominus \psi(e) = \psi\big(\widetilde{G}_{t-1} \ominus e\big) \ge 0.$$
On the other hand, $e$ has the form $e = [_0\, a\, (_j\, b\, )_j\, c$, where $\psi(b) = b$ (since $(_j\, )_j$ is an inner pair of parentheses in $e$). As $\psi(e) = \psi\big([_0\, a\, ]_j\, c\big) + \psi\big([_j\, b\big)$ and $\psi\big([_j\, b\big) = [_j\, b$, this shows that $[_j\, b \le G$, and therefore that $G\big([_j\, b\big) > 0$. Let us now define
$$\widetilde{G}_t = \widetilde{G}_{t-1} \ominus e \oplus [_j\, b \oplus [_0\, a\, ]_j\, c.$$
Applying $\psi$ to $\widetilde{G}_t$, we see as previously that $\psi\big(\widetilde{G}_t\big) = \psi\big(\widetilde{G}_{t-1}\big) = G$.

As $\widetilde{G}_k$ contains $k$ pairs of parentheses, and we consume one pair at each step $t > k$, we see that $\widetilde{G}_{2k}$ contains no more parentheses, so that $\widetilde{G}_{2k} \in \mathcal{G}$ and $\widetilde{G}_{2k} = \psi\big(\widetilde{G}_{2k}\big) = G$. Let us now put $G_t = \pi\big(\widetilde{G}_t\big)$, for $t = k+1, \dots, 2k$. We see that
$$G_t = G_{t-1} \ominus [_0\, abc \oplus [_0\, a\, ]_j\, c \oplus [_j\, b,$$
where $[_0\, abc, [_0\, a\, ]_j\, c \in E$ and $G\big([_j\, b\big) > 0$, so that $G_t \in \beta_n(G_{t-1}, G)$. Therefore $G_k = T, \dots, G_{2k} = G$ is a path of positive probability under the parsing process with reference $G$, leading from $T$ to $G$; in other words, $\mathbb{G}_{T,G}(G) > 0$, as required.

A.3. Convergence to the expectation of a random toric grammar, proof of Lemma 8.1 on page 19

The proof of this result is based on the fact that the operation $(G, G') \mapsto \chi\big(G \boxplus G'\big)$ is associative. Let us begin the proof with several definitions and lemmas.

For any grammar $G \in \mathcal{G}$ and any pair of indices $p = (p_1, p_2) \in (\mathbb{N} \setminus \{0\})^2$, we will say that $p$ is $G$-congruent when there is $a \in S^*$ such that $G\big(a\, ]_{p_1}\big)\, G\big(a\, ]_{p_2}\big) > 0$ or $G\big([_{p_1}\, a\big)\, G\big([_{p_2}\, a\big) > 0$. Let us define the label map $\xi_p$ as
$$\xi_p(i) = \begin{cases} i, & \text{when } i \notin \{p_1, p_2\}, \\ \min\{p_1, p_2\}, & \text{when } i \in \{p_1, p_2\}. \end{cases}$$
For any sequence $(p_1, \dots, p_k) \in (\mathbb{N} \setminus \{0\})^{2k}$ of pairs of indices, let us define the label map $\xi_{p_1, \dots, p_k}$ as
$$\xi_{p_1, \dots, p_k} = \xi_{\xi_{p_1, \dots, p_{k-1}}(p_k)} \circ \xi_{p_1, \dots, p_{k-1}},$$
where $f\big((i, j)\big) = \big(f(i), f(j)\big)$ for any $(i, j) \in (\mathbb{N} \setminus \{0\})^2$. Let us say that $(p_1, \dots, p_k) \in (\mathbb{N} \setminus \{0\})^{2k}$ is $G$-congruent if $\xi_{p_1, \dots, p_{j-1}}(p_j)$ is $\xi_{p_1, \dots, p_{j-1}}(G)$-congruent for any $j \le k$, and that it is maximal $G$-congruent if it is $G$-congruent and any $G$-congruent sequence of the form $(p_1, \dots, p_k, p_{k+1})$ is such that
$$\xi_{p_1, \dots, p_k}\big(p^1_{k+1}\big) = \xi_{p_1, \dots, p_k}\big(p^2_{k+1}\big),$$
or equivalently such that $\xi_{p_1, \dots, p_{k+1}} = \xi_{p_1, \dots, p_k}$.
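The composed label maps $\xi_{p_1, \dots, p_k}$ can be computed with a union-find structure in which each class is represented by its smallest label; Lemma A.2 below states that, up to relabeling, the result does not depend on the order of the pairs. The following Python sketch (ours) implements this and checks the permutation invariance on a toy example.

```python
from itertools import permutations

def xi(pairs, labels):
    """Return the map xi_{p_1,...,p_k} on the given labels: merge the
    classes of each pair in turn, representing a class by its minimum."""
    parent = {i: i for i in labels}
    def find(i):                      # representative of i's class
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for a, b in pairs:
        ra, rb = find(a), find(b)
        # the new representative is the smaller one, as in xi_p
        parent[max(ra, rb)] = min(ra, rb)
    return {i: find(i) for i in labels}

labels = range(1, 7)
pairs = [(2, 3), (5, 6), (1, 3)]
maps = {tuple(sorted(xi(list(p), labels).items()))
        for p in permutations(pairs)}
assert len(maps) == 1    # Lemma A.2: independent of the pair order
print(xi(pairs, labels))  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 5, 6: 5}
```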
Lemma A.1. For any sequence $(p_1, \dots, p_\ell) \in (\mathbb{N} \setminus \{0\})^{2\ell}$ and any $k < \ell$,
$$\xi_{p_1, \dots, p_\ell} = \xi_{\xi_{p_1, \dots, p_k}(p_{k+1}, \dots, p_\ell)} \circ \xi_{p_1, \dots, p_k}.$$

Proof. By induction on $\ell$, for $k$ fixed. This is true from the definition for $\ell = k + 1$. Assuming we have established the lemma for $\ell - 1$, we can write
$$\xi_{p_1, \dots, p_\ell} = \xi_{\xi_{p_1, \dots, p_{\ell-1}}(p_\ell)} \circ \xi_{p_1, \dots, p_{\ell-1}} = \xi_{\xi_{\xi_{p_1, \dots, p_k}(p_{k+1}, \dots, p_{\ell-1})} \circ \xi_{p_1, \dots, p_k}(p_\ell)} \circ \xi_{\xi_{p_1, \dots, p_k}(p_{k+1}, \dots, p_{\ell-1})} \circ \xi_{p_1, \dots, p_k} = \xi_{\xi_{p_1, \dots, p_k}(p_{k+1}, \dots, p_\ell)} \circ \xi_{p_1, \dots, p_k}.$$

Lemma A.2. For any permutation $\sigma$ of $\{1, \dots, k\}$, $\xi_{p_1, \dots, p_k} \equiv \xi_{p_{\sigma(1)}, \dots, p_{\sigma(k)}}$.

Proof. Let us consider the smallest equivalence relation containing the set $\{p_1, \dots, p_k\}$. Let $\pi_k : \mathbb{N} \setminus \{0\} \to C$ be the corresponding projection of each label to its equivalence class. Let us define the label map $\overline{\pi}_k$ by $\overline{\pi}_k(0) = 0$ and $\overline{\pi}_k(i) = \min \pi_k(i)$, $i > 0$. We are going to prove by induction on $k$ that $\xi_{p_1, \dots, p_k} \equiv \overline{\pi}_k$. Since $\overline{\pi}_k$ is invariant under permutations of the sequence $(p_1, \dots, p_k)$, this will prove the lemma. Let us remark now that $\xi_{p_1, \dots, p_k} \equiv \overline{\pi}_k$ if and only if
$$\xi_{p_1, \dots, p_k}(i) = \xi_{p_1, \dots, p_k}(j) \iff \overline{\pi}_k(i) = \overline{\pi}_k(j), \qquad i, j > 0.$$
So we are going to prove this equivalence. It is easy to see from the previous lemma that for any integer $m = 1, \dots, k$,
$$\xi_{p_1, \dots, p_k}\big(p^1_m\big) = \xi_{p_1, \dots, p_k}\big(p^2_m\big). \tag{A.1}$$
Indeed,
$$\xi_{p_1, \dots, p_k} = \xi_{\xi_{p_1, \dots, p_m}(p_{m+1}, \dots, p_k)} \circ \xi_{\xi_{p_1, \dots, p_{m-1}}(p_m)} \circ \xi_{p_1, \dots, p_{m-1}},$$
so that, changing $p_m$ for $\xi_{p_1, \dots, p_{m-1}}(p_m)$, we are back to proving the result when $m = k = 1$, where it is obvious from the definitions. Now, eq. (A.1) and the minimality of $\pi_k$ imply that
$$\overline{\pi}_k(i) = \overline{\pi}_k(j) \implies \xi_{p_1, \dots, p_k}(i) = \xi_{p_1, \dots, p_k}(j), \qquad i, j > 0.$$
Let us assume conversely that $\xi_{p_1, \dots, p_k}(i) = \xi_{p_1, \dots, p_k}(j)$, and let
$$m = \min\big\{ \ell \,:\, \xi_{p_1, \dots, p_\ell}(i) = \xi_{p_1, \dots, p_\ell}(j) \big\}.$$
Since $\xi_{p_1, \dots, p_m} = \xi_{\xi_{p_1, \dots, p_{m-1}}(p_m)} \circ \xi_{p_1, \dots, p_{m-1}}$, we see that necessarily
$$\xi_{p_1, \dots, p_{m-1}}\big(\{i, j\}\big) = \xi_{p_1, \dots, p_{m-1}}\big(\{p^1_m, p^2_m\}\big),$$
and that this set contains two distinct elements. Exchanging the roles of $i$ and $j$ if necessary, we can assume without loss of generality that $\xi_{p_1, \dots, p_{m-1}}\big((i, j)\big) = \xi_{p_1, \dots, p_{m-1}}(p_m)$. From the induction hypothesis, this implies that $\overline{\pi}_{m-1}\big((i, j)\big) = \overline{\pi}_{m-1}(p_m)$. Since the equivalence relation defined by $\pi_{m-1}$ is a subset of the equivalence relation defined by $\pi_k$, this implies that $\overline{\pi}_k\big((i, j)\big) = \overline{\pi}_k(p_m)$. Since, moreover, $\overline{\pi}_k\big(p^1_m\big) = \overline{\pi}_k\big(p^2_m\big)$, this implies that $\overline{\pi}_k(i) = \overline{\pi}_k(j)$.

Lemma A.3. For any $f \in F$ and any sequence of pairs of positive labels $p_1, \dots, p_k$, there is a label map $g \in F$ such that
$$\xi_{f(p_1, \dots, p_k)} \circ f = g \circ \xi_{p_1, \dots, p_k}.$$

Proof. We have to prove that
$$\xi_{p_1, \dots, p_k}(i) = \xi_{p_1, \dots, p_k}(j) \implies \xi_{f(p_1, \dots, p_k)} \circ f(i) = \xi_{f(p_1, \dots, p_k)} \circ f(j), \qquad i, j > 0.$$
From the proof of the previous lemma, it is enough to check that the right-hand side holds when $(i, j) = p_m$, $m = 1, \dots, k$, which is then obvious.

Lemma A.4. If $f \in F$ and $(p_1, \dots, p_k)$ is $G$-congruent, then $\big(f(p_1), \dots, f(p_k)\big)$ is also $f(G)$-congruent.

Proof. Assume that, for some $a \in S^*$,
$$\xi_{p_1, \dots, p_{m-1}}(G)\Big( a\, ]_{\xi_{p_1, \dots, p_{m-1}}(p^1_m)} \Big) > 0.$$
Then $\xi_{f(p_1, \dots, p_{m-1})} \circ f = g \circ \xi_{p_1, \dots, p_{m-1}}$, and
$$\xi_{f(p_1, \dots, p_{m-1})} \circ f(G)\Big( g(a)\, ]_{\xi_{f(p_1, \dots, p_{m-1})} \circ f(p^1_m)} \Big) = g \circ \xi_{p_1, \dots, p_{m-1}}(G)\Big( g(a)\, ]_{g \circ \xi_{p_1, \dots, p_{m-1}}(p^1_m)} \Big)$$
$$= \xi_{p_1, \dots, p_{m-1}}(G)\Big( g^{-1} \circ g\big( a\, ]_{\xi_{p_1, \dots, p_{m-1}}(p^1_m)} \big) \Big) \ge \xi_{p_1, \dots, p_{m-1}}(G)\Big( a\, ]_{\xi_{p_1, \dots, p_{m-1}}(p^1_m)} \Big) > 0.$$
The same is true when $p^1_m$ is replaced with $p^2_m$, and when $a\, ]_{\xi_{p_1, \dots, p_{m-1}}(p^1_m)}$ is replaced with $[_{\xi_{p_1, \dots, p_{m-1}}(p^2_m)}\, a$. The lemma is a straightforward consequence of these remarks and of the definition of a congruent sequence.

Lemma A.5. If $(p_1, \dots, p_k)$ and $(q_1, \dots, q_\ell)$ are both $G$-congruent, then $(p_1, \dots, p_k, q_1, \dots, q_\ell)$ is $G$-congruent.

Proof. According to the previous lemma, $\xi_{p_1, \dots, p_k}(q_1, \dots, q_\ell)$ is $\xi_{p_1, \dots, p_k}(G)$-congruent. Coming back to the definition, this proves that
$$\xi_{\xi_{p_1, \dots, p_k}(q_1, \dots, q_{\ell-1})} \circ \xi_{p_1, \dots, p_k}(q_\ell) \quad \text{is} \quad \xi_{\xi_{p_1, \dots, p_k}(q_1, \dots, q_{\ell-1})} \circ \xi_{p_1, \dots, p_k}(G)\text{-congruent}.$$
In Lemma A.1 we have, moreover, proved that
$$\xi_{p_1, \dots, p_k, q_1, \dots, q_{\ell-1}} = \xi_{\xi_{p_1, \dots, p_k}(q_1, \dots, q_{\ell-1})} \circ \xi_{p_1, \dots, p_k}.$$
This identity, applied to the above statement, shows that $(p_1, \dots, p_k, q_1, \dots, q_\ell)$ satisfies the definition of a $G$-congruent sequence.
Proposition A.6. If $(p_1, \dots, p_k)$ and $(q_1, \dots, q_\ell)$ are both maximal $G$-congruent, then
$$\xi_{p_1, \dots, p_k}(G) \equiv \xi_{q_1, \dots, q_\ell}(G) \equiv \chi(G).$$

Proof. From the previous lemma, $(p_1, \dots, p_k, q_1, \dots, q_\ell)$ is $G$-congruent. Since $p$ is maximal, $\xi_{p_1, \dots, p_k, q_1, \dots, q_\ell} = \xi_{p_1, \dots, p_k}$. In the same way, $\xi_{q_1, \dots, q_\ell, p_1, \dots, p_k} = \xi_{q_1, \dots, q_\ell}$. We have seen, moreover, in a previous lemma that $\xi_{p_1, \dots, p_k, q_1, \dots, q_\ell} \equiv \xi_{q_1, \dots, q_\ell, p_1, \dots, p_k}$. This proves that $\xi_{p_1, \dots, p_k} \equiv \xi_{q_1, \dots, q_\ell}$. We see from the definition of $\chi$ (see Definition 6.4 on page 16) that there is some maximal $G$-congruent sequence $r_1, \dots, r_m$ such that $\chi(G) = \xi_{r_1, \dots, r_m}(G)$. Therefore
$$\chi(G) \equiv \xi_{p_1, \dots, p_k}(G) \equiv \xi_{q_1, \dots, q_\ell}(G).$$

Proposition A.7. For any $G, G' \in \mathcal{G}$,
$$\chi\big(\chi(G) \boxplus G'\big) = \chi\big(G \boxplus G'\big).$$
Consequently, for any $G, G', G'' \in \mathcal{G}$,
$$\chi\big(\chi(G \boxplus G') \boxplus G''\big) = \chi\big(G \boxplus G' \boxplus G''\big).$$

Proof. Let us assume that $G$, $G'$ and $\chi(G)$ use disjoint label sets, so that
$$\chi\big(\chi(G) \boxplus G'\big) \equiv \chi\big(\chi(G) + G'\big), \qquad \chi\big(G \boxplus G'\big) \equiv \chi\big(G + G'\big).$$
Let $p_1, \dots, p_k$ be some maximal $G$-congruent sequence. It is also $G + G'$-congruent, and since the label sets are disjoint,
$$\xi_{p_1, \dots, p_k}\big(G + G'\big) = \xi_{p_1, \dots, p_k}(G) + G'.$$
Let us continue the sequence $p_1, \dots, p_k$ to form a maximal $G + G'$-congruent sequence $p_1, \dots, p_\ell$. Let $(q_{k+1}, \dots, q_\ell)$ be defined as $q_m = \xi_{p_1, \dots, p_k}(p_m)$, $m = k+1, \dots, \ell$. We see from the definitions that $(q_{k+1}, \dots, q_\ell)$ is a maximal $\xi_{p_1, \dots, p_k}(G + G')$-congruent sequence, and therefore a maximal $\xi_{p_1, \dots, p_k}(G) + G'$-congruent sequence. Consequently,
$$\chi\big(\chi(G) + G'\big) \equiv \xi_{q_{k+1}, \dots, q_\ell}\big(\xi_{p_1, \dots, p_k}(G) + G'\big) = \xi_{q_{k+1}, \dots, q_\ell} \circ \xi_{p_1, \dots, p_k}\big(G + G'\big) = \xi_{\xi_{p_1, \dots, p_k}(p_{k+1}, \dots, p_\ell)} \circ \xi_{p_1, \dots, p_k}\big(G + G'\big) = \xi_{p_1, \dots, p_\ell}\big(G + G'\big) \equiv \chi\big(G + G'\big),$$
proving the proposition.

Proof of Lemma 8.1 on page 19. Let $\pi$ be the projection of $\mathcal{G}$ on $\mathcal{G}/\equiv$. From the law of large numbers we have, for all $G \in \mathcal{G}$,
$$\frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\big(G_i \equiv G\big) \xrightarrow[n \to \infty]{} \mathbb{G}\big(\pi(G)\big).$$
Let us now remark that
$$\mathop{\boxplus}_{i=1}^{n} n^{-1} G_i = \mathop{\boxplus}_{\mathbf{G} \in \mathcal{G}/\equiv}\ \mathop{\boxplus}_{i \,:\, G_i \in \mathbf{G}} n^{-1} G_i.$$
Thus
$$\frac{1}{n} \chi\bigg( \mathop{\boxplus}_{i=1}^{n} G_i \bigg) = \chi\bigg( \mathop{\boxplus}_{\mathbf{G} \in \mathcal{G}/\equiv} \chi\bigg( \mathop{\boxplus}_{i \,:\, G_i \in \mathbf{G}} n^{-1} G_i \bigg) \bigg) = \chi\bigg( \mathop{\boxplus}_{\mathbf{G} \in \mathcal{G}/\equiv} \bigg( \sum_{i=1}^{n} n^{-1} \mathbb{1}\big(G_i \in \mathbf{G}\big) \bigg) \chi(\mathbf{G}) \bigg) = \chi\bigg( \mathop{\boxplus}_{\mathbf{G} \in \mathcal{G}/\equiv} \bigg( \sum_{i=1}^{n} n^{-1} \mathbb{1}\big(G_i \in \mathbf{G}\big) \bigg) \mathbf{G} \bigg).$$
We used here Proposition A.7 and the fact that, for any $a, b \in \mathbb{R}_+$,
$$\chi\big( (a G) \boxplus (b G) \big) = (a + b)\, \chi(G),$$
which comes from the following reasoning. Suppose that $\{1, \dots, d\} = \big\{ i \,:\, G\big([_i S^*\big) > 0 \big\}$, and let $p_i = (2i, 2i - 1)$. Since each $p_i$ is $(a G) \boxplus (b G)$-congruent, $(p_1, \dots, p_d)$ is also $(a G) \boxplus (b G)$-congruent, from Lemma A.5. It is quite straightforward to see that
$$\xi_{p_1, \dots, p_d}\big( (a G) \boxplus (b G) \big) \equiv (a + b)\, G.$$
This implies that
$$\chi\big( (a G) \boxplus (b G) \big) = \chi \circ \xi_{p_1, \dots, p_d}\big( (a G) \boxplus (b G) \big) = \chi\big( (a + b)\, G \big) = (a + b)\, \chi(G).$$
To take the limit inside $\chi$, we need to prove that $\chi$ is continuous in a suitable sense. Actually, $G \mapsto \chi(G)$ is continuous on sets of fixed support, and this is what is required to conclude. Indeed, for any sequence $(G_i)$ with fixed support for $n$ large enough, there is a fixed label map $f$ (depending on the support) such that, for $n$ large enough, $\chi(G_i) = f(G_i)$, and the result follows from the fact that $G \mapsto f(G)$ is continuous, since $f(G)(A) = G\big(f^{-1}(A)\big)$. Consequently,
$$\lim_{n \to \infty} \frac{1}{n} \chi\bigg( \mathop{\boxplus}_{i=1}^{n} G_i \bigg) = \chi\bigg( \mathop{\boxplus}_{\mathbf{G} \in \mathcal{G}/\equiv} \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\big(G_i \in \mathbf{G}\big)\, \mathbf{G} \bigg) = \chi\bigg( \mathop{\boxplus}_{\mathbf{G} \in \mathcal{G}/\equiv} \mathbb{G}(\mathbf{G})\, \mathbf{G} \bigg) = \chi\bigg( \mathop{\boxplus}_{\mathbf{G} \in \mathcal{G}/\equiv} \mathbb{G}(\mathbf{G})\, \chi(\mathbf{G}) \bigg)$$
$$= \chi\bigg( \mathop{\boxplus}_{\mathbf{G} \in \mathcal{G}/\equiv}\ \mathop{\boxplus}_{G \in \mathbf{G}} \mathbb{G}(G)\, \chi(G) \bigg) = \chi\bigg( \mathop{\boxplus}_{G \in \mathcal{G}} \mathbb{G}(G)\, \chi(G) \bigg) = \chi\bigg( \mathop{\boxplus}_{G \in \mathcal{G}} \mathbb{G}(G)\, G \bigg) = \int G \,\mathrm{d}\mathbb{G}(G).$$

Appendix B: Language produced by a Toric Grammar

In this appendix, we make a deterministic study of the language produced by a toric grammar $G \in \beta^*(T)$. More precisely, we are interested in the support of the distribution $\mathbb{T}_G$ of the final state of the production process.

Lemma B.1. Let $T \in \mathcal{T}$ be some text and $G \in \beta^*(T)$ be some grammar obtained by splitting this text a finite number of times. The number of splits performed can be read in $G$, and is equal to
$$n = \sum_{i=1}^{+\infty} G\big(]_i S^*\big).$$
Let us put $\overline{\alpha}(G) = \alpha^n(G)$. Then $T \in \overline{\alpha}(G) \subset \mathcal{T}$; moreover, $\overline{\alpha}(G) = \mathrm{supp}\big(\mathbb{T}_G\big)$.

Proof. The grammar $G$ is obtained by making a succession of splits. Each of those splits adds one $[_i$ and one $]_i$ to the grammar, whereas in the original text there are no $[_i$ nor $]_i$, except for the $[_0$ at the beginning of each sentence. Since the application of an element of $F$ does not change the number of such symbols, they may be used to count the number of splits performed.

Let us then take a sequence of toric grammars $T = G_0, \dots, G_n = G$ such that $G_k \in \beta(G_{k-1})$. From Lemma 4.2 on page 10, there is a sequence $f_1, \dots, f_n \in F$ such that $f_k(G_{k-1}) \in \alpha\big(G_k\big)$. Let us prove by induction that, for any $k = 0, \dots, n$,
$$f_k \circ \dots \circ f_1(T) \in \alpha^k\big(G_k\big).$$
Indeed, this is true for $k = 0$, since $G_0 = T$. Moreover, assuming that the assertion holds for $k - 1$, we deduce that
$$f_k \circ \dots \circ f_1(T) \in f_k\Big( \alpha^{k-1}\big(G_{k-1}\big) \Big) \subset \alpha^{k-1}\Big( f_k\big(G_{k-1}\big) \Big) \subset \alpha^k\big(G_k\big),$$
showing that if the assertion holds for $k - 1$, it also holds for $k$. For $k = n$, we obtain that $f_n \circ \dots \circ f_1(T) \in \alpha^n\big(G_n\big)$. As $f_n \circ \dots \circ f_1(T) = T$, since $T$ is a text, and $G_n = G$, we get that $T \in \alpha^n(G)$.

Let us now consider $G' \in \overline{\alpha}(G)$. Let $(G = G_0, \dots, G_n = G')$ be the chain of grammars leading to $G'$. Then, for any $k = 0, \dots, n$,
$$\sum_{i=1}^{+\infty} G_k\big(]_i S^*\big) = n - k,$$
since $G_k \in \alpha(G_{k-1})$ and each merge takes away one $[_i$ and one $]_i$. This implies that $\sum_{i=1}^{+\infty} G'\big(]_i S^*\big) = 0$, and thus $G' \in \mathcal{T}$.

Note that, as remarked above, repeated merges may create elements of the type $[_i\, a\, ]_i\, b$. However, this will not happen if $n$ successful merges can be performed. Indeed, in the case when expressions of the form $[_i\, a\, ]_i\, b$ remain unmatched during the merge process, we will get $\alpha\big(G_k\big) = \varnothing$ for some $k < n$.