On the minimal penalty for Markov order estimation


Authors: Ramon van Handel

ON THE MINIMAL PENALTY FOR MARKOV ORDER ESTIMATION

By Ramon van Handel

Princeton University

We show that large-scale typicality of Markov sample paths implies that the likelihood ratio statistic satisfies a law of iterated logarithm uniformly to the same scale. As a consequence, the penalized likelihood Markov order estimator is strongly consistent for penalties growing as slowly as $\log\log n$ when an upper bound is imposed on the order which may grow as rapidly as $\log n$. Our method of proof, using techniques from empirical process theory, does not rely on the explicit expression for the maximum likelihood estimator in the Markov case and could therefore be applicable in other settings.

1. Introduction. For the purposes of this paper, a Markov chain is a discrete time stochastic process $(X_k)_{k\ge1}$, taking values in a state space $A$ of finite cardinality $|A|<\infty$, such that the conditional law of $X_k$ given the past $X_1,\ldots,X_{k-1}$ depends on the most recent $r$ states $X_{k-r},\ldots,X_{k-1}$ only. The smallest number $r$ for which this assumption is satisfied is called the order of the Markov chain. It is evident that the order of a Markov chain determines the most parsimonious representation of the law of the process. Thus estimation of the order from observed data is a problem of practical interest, which moreover raises interesting mathematical questions at the intersection of probability, statistics and information theory.

Denote by $P(x_{1:n})$ the probability of the sequence $x_{1:n}\in A^n$ under the law $P$, and denote by $\Theta_r$ the collection of all laws of Markov chains whose order is at most $r$. As the parameter spaces $\Theta_r\subset\Theta_{r+1}$ are increasing, the naive maximum likelihood estimate of the order $\hat r_n = \operatorname{argmax}_r \sup_{P\in\Theta_r} P(x_{1:n})$ fails to be consistent.
Instead, we introduce the penalized likelihood order estimator

$$\hat r_n = \operatorname*{argmax}_{0\le r<\kappa(n)} \left\{ \sup_{P\in\Theta_r} \log P(x_{1:n}) - \operatorname{pen}(n,r) \right\},$$

where $\operatorname{pen}(n,r)$ is a penalty function and $\kappa(n)$ is a cutoff function. The estimator is called strongly consistent if $\hat r_n \to r_\star$ $P_\star$-a.s. as $n\to\infty$ whenever the law of the observations $P_\star$ is the law of a Markov chain whose order is $r_\star$. We aim to understand which penalties and cutoffs yield a strongly consistent estimator.

AMS 2000 subject classifications: Primary 62M05; secondary 60E15, 60F15, 60G42, 60J10.
Keywords and phrases: order estimation, uniform law of iterated logarithm, martingale inequalities, empirical process theory, large-scale typicality, Markov chains.

Results of this type date back to Finesso [4], who considers the case where the order $r_\star$ of the Markov chain $P_\star$ is known a priori to be bounded above by some constant $r_\star < K$. In this setting, Finesso shows that the penalty and cutoff

$$\operatorname{pen}(n,r) = C|A|^r \log\log n, \qquad \kappa(n) = K$$

yield a strongly consistent order estimator for a sufficiently large constant $C$ (by [1], p. 592, it suffices to choose $C > 2|A|$). It can be argued from the law of iterated logarithm for martingales that a penalty of this form is the minimal penalty that achieves strong consistency, so that the result is essentially optimal (in the sense that the probability of underestimation of the order is minimized). However, the requirement imposed by the knowledge of an a priori upper bound on the order is a significant drawback and is unrealistic in many applications. Order estimation in the absence of an upper bound has been investigated, for example, by Kieffer [5]. However, the penalty used there is significantly larger than the minimal penalty in the case of an a priori upper bound.
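For a finite alphabet the estimator above is straightforward to compute: it is a standard fact (and precisely the explicit expression our proofs will avoid relying on) that the inner supremum is attained by the empirical transition probabilities, with the optimal initial law putting unit mass on the observed initial path. The following minimal Python sketch is purely illustrative and not part of the paper's argument; the constant $C=3$ in the $\log\log n$ penalty and the toy sequence are hypothetical choices.

```python
from collections import Counter
from math import log

def max_loglik(x, r):
    """sup over Theta_r of log P(x_{1:n}): empirical transition probabilities
    maximize the likelihood, and the optimal initial law puts unit mass on
    the observed initial path x_{1:r}."""
    trans, ctx = Counter(), Counter()
    for i in range(r, len(x)):
        c = tuple(x[i - r:i])        # context of length r preceding position i
        trans[(c, x[i])] += 1
        ctx[c] += 1
    return sum(v * log(v / ctx[c]) for (c, _), v in trans.items())

def order_estimate(x, pen, cutoff):
    """Penalized likelihood order estimator over 0 <= r < cutoff."""
    n = len(x)
    return max(range(cutoff), key=lambda r: max_loglik(x, r) - pen(n, r))

# a deterministic order-1 chain: the alternating sequence 0,1,0,1,...
x = [0, 1] * 20
pen = lambda n, r: 3.0 * (2 ** r) * log(log(n))   # C|A|^r log log n, C = 3 illustrative
print(order_estimate(x, pen, cutoff=4))           # prints 1: the order is recovered
```

For the alternating sequence the empirical transitions are deterministic, so the maximal log-likelihood is exactly $0$ for every $r\ge1$, and the penalty selects the smallest such order.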
Kieffer's conjecture that the well known BIC penalty $\operatorname{pen}(n,r) = \frac12|A|^r(|A|-1)\log n$ yields a strongly consistent order estimator was proved by Csiszár and Shields [3]. The best result to date, due to Csiszár [2], shows that the penalty and cutoff

$$\operatorname{pen}(n,r) = c|A|^r\log n, \qquad \kappa(n) = \infty$$

yield a strongly consistent order estimator for any choice of the constant $c>0$. However, this penalty is still larger than the minimal penalty obtained by Finesso in the case of an a priori upper bound on the order. These results raise a basic question [2, 3]: is the $\log n$ growth of the penalty the necessary price to be paid for the lack of a prior upper bound on the order, or is the minimal possible penalty $\log\log n$ already sufficient for consistency in the absence of a prior upper bound?

1.1. Results of this paper. The purpose of this paper is twofold.

First, we will show that a penalty of order $\log\log n$ does indeed suffice for consistency of the Markov order estimator, provided we impose a cutoff of order $\kappa(n)\sim\log n$. Remarkably, this is precisely the same cutoff as is required to establish the consistency of minimum description length (MDL) order estimators [2], of which the BIC penalty is an approximation. As the $\log\log n$ penalty is much smaller than the BIC penalty for large $n$, this constitutes a significant improvement over previous results. However, the basic question posed above is only partially resolved, as our results fall short of establishing consistency of the $\log\log n$ penalty in the absence of a cutoff ($\kappa(n)=\infty$) as is done in [2, 3] for the BIC penalty.

Second, we introduce a new approach for proving consistency of order estimators in the absence of a prior upper bound on the order.
The techniques used in previous work [2, 3] rely heavily on rather delicate explicit computations which exploit the availability of a closed form expression for the maximum likelihood estimator in the Markov case. In contrast, our method of proof, which uses techniques from empirical process theory [6, 7], is entirely different and can be applied much more generally. The present approach could therefore provide a possible starting point for extending the results of Csiszár and Shields to problems where an explicit expression for the maximum likelihood is not available, such as the challenging problem of order estimation in hidden Markov models (see [1], Chapter 15).

1.2. Comparison with the approach of Csiszár and Shields. A direct consequence of our main result is that the penalty and cutoff

$$\operatorname{pen}(n,r) = C_\star|A|^r\log\log n, \qquad \kappa(n) = \alpha_\star\log n$$

with suitable constants $C_\star$ and $\alpha_\star$, where $\alpha_\star$ depends on the observation law $P_\star$, yield a strongly consistent penalized likelihood estimator (in order to obtain a strongly consistent order estimator which does not require prior knowledge of $P_\star$ it suffices to choose $\kappa(n) = o(\log n)$). The upper bound $\kappa(n) = \alpha_\star\log n$ is inherited directly from the large-scale typicality property which plays a central role also in [2, 3]. Our main result states that if large-scale typicality holds with an upper bound $r < \kappa(2n)$ on the order, then the likelihood ratio statistic satisfies a law of iterated logarithm uniformly for $r < \kappa(n)$ (the details are in the following section). Strong consistency of the penalized likelihood order estimator then follows directly.

It is instructive to make a comparison with the approach of [2, 3] for the penalty $\operatorname{pen}(n,r) = c|A|^r\log n$. The proof of strong consistency in this setting consists of two parts.
First, large-scale typicality is used to prove strong consistency of the estimator with cutoff $\kappa(n) = \alpha_\star\log n$. Next, a separate argument is employed to show that the larger orders $r \ge \alpha_\star\log n$ are negligible. Our result improves the first part of the proof, as we show that the conclusion already holds for the smaller penalty $\operatorname{pen}(n,r) = C_\star|A|^r\log\log n$. However, the second part of the proof is missing in our setting, and it is unclear whether such a result could in fact be established. The resolution of this problem should effectively identify the minimal penalty for Markov order estimation in the absence of a cutoff.

Let us also note that the first part of the proof in [2] makes use of a sort of truncated law of iterated logarithm for the empirical transition probabilities of the Markov chain. However, the result in [2] implies that the likelihood ratio statistic grows as $\log\log n$ only for orders as large as $\log\log n$, while the bound grows as $\log n$ for orders as large as $\log n$. Our main result shows that such a bound is not the best possible, resolving in the negative a question posed in [2], p. 1621.

1.3. Organization of the paper. In Section 2, we set up the notation to be used throughout the paper and state our main results. In Section 3, we reduce the proof of our main result to the problem of establishing a suitable deviation bound. The requisite deviation bound is proved in Section 4. The proof is based on an extension of a maximal inequality of van de Geer [7], which can be found in the Appendix.

2. Main results. Let us fix once and for all the alphabet $A$ of finite cardinality $|A|<\infty$ and the canonical space $\Omega = A^{\mathbb N}$ endowed with its Borel $\sigma$-field and coordinate process $(X_k)_{k\ge1}$ ($X_k(\omega) = \omega(k)$ for $\omega\in\Omega$). We will write $x_{m:n}$ for a sequence $(x_m,\ldots,x_n)\in A^{n-m+1}$.
Moreover, for any probability measure $P$ on $\Omega$, we will write $P(x_{m:n})$ and $P(x_{m:n}|x_{r:s})$ instead of $P(X_{m:n} = x_{m:n})$ and $P(X_{m:n} = x_{m:n}|X_{r:s} = x_{r:s})$, respectively, whenever no confusion can arise.

A Markov chain is defined by a probability measure $P$ such that for some $r\ge0$

$$P(x_{1:n}) = P(x_{1:r})\prod_{i=r+1}^n P(x_i|x_{i-r:i-1}) \qquad\text{for all } n\ge r,\ x_{1:n}\in A^n.$$

We will always presume that our Markov chains are time homogeneous: $P(X_i = x_{r+1}|X_{i-r:i-1} = x_{1:r}) = P(x_{r+1}|x_{1:r})$ for all $i > r$, $x_{1:r+1}\in A^{r+1}$. We denote by $\Theta_r$ the set of all probability measures that satisfy these conditions for the given value of $r$ ($\Theta_0$ is the class of all i.i.d. processes). Note that $\Theta_r\subset\Theta_{r+1}$ for all $r$. The order of a Markov chain $P$ is the smallest $r\ge0$ such that $P\in\Theta_r$.

Throughout the paper we fix a distinguished Markov chain $P_\star$ of order $r_\star$, representing the true probability law of an observed process. We assume that $P_\star$ is stationary and irreducible. On the basis of a sequence of observations $x_{1:n}$ we obtain an estimate $\hat r_n$ of the true order $r_\star$ by maximizing the penalized likelihood

$$\hat r_n = \operatorname*{argmax}_{0\le r<\kappa(n)}\left\{\sup_{P\in\Theta_r}\log P(x_{1:n}) - \operatorname{pen}(n,r)\right\},$$

where $\operatorname{pen}(n,r)$ is a penalty function and $\kappa(n)$ is a cutoff function. If $\hat r_n\to r_\star$ $P_\star$-a.s. as $n\to\infty$, the estimator is called strongly consistent.

REMARK 2.1. As discussed in [3], the assumption that $P_\star$ is irreducible is necessary for the order estimation problem to be well posed, while stationarity of $P_\star$ entails no loss of generality. In particular, the latter claim follows from the fact that any irreducible Markov chain $P$ is absolutely continuous with respect to a stationary Markov chain $P^s$ with the same transition probabilities, so that strong consistency under $P^s$ automatically holds under $P$ also.
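The analysis below is driven by empirical block frequencies: the large-scale typicality property recalled in Definition 2.2 requires the empirical frequency of every short block $a_{1:r}$ to track its stationary probability $P_\star(a_{1:r})$ uniformly. The following Python sketch of the relevant deviation is purely illustrative; the map `p_star` of block probabilities and the toy sequence are hypothetical inputs.

```python
from collections import Counter

def block_counts(x, r, n):
    """N_n(a_{1:r}): number of occurrences of each r-block x_{i-r:i-1} for
    i = r+1,...,n (0-based: blocks x[i-r:i] for i = r,...,n-1)."""
    counts = Counter()
    for i in range(r, n):
        counts[tuple(x[i - r:i])] += 1
    return counts

def typicality_deviation(x, r, n, p_star):
    """max over observed blocks a of |N_n(a)/((n - r) p_star(a)) - 1|,
    the quantity that Definition 2.2 requires to stay below eta."""
    counts = block_counts(x, r, n)
    return max(abs(c / ((n - r) * p_star[a]) - 1) for a, c in counts.items())

# alternating 0,1,0,1,...: the stationary law puts mass 1/2 on each symbol
x = [0, 1] * 20
dev = typicality_deviation(x, 1, 40, {(0,): 0.5, (1,): 0.5})
print(dev)   # ~0.026: the empirical 1-block frequencies are already nearly typical
```

The point of the typicality property is that such deviations remain uniformly small simultaneously over all block lengths $r$ up to a cutoff growing with $n$, not just for a fixed $r$.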
Define for any sequence $a_{1:r}\in A^r$ and $n\ge1$ the random variable

$$N_n(a_{1:r}) = \sum_{i=r+1}^n \mathbf 1_{x_{i-r:i-1} = a_{1:r}},$$

that is, $N_n(a_{1:r})$ is the number of times the sequence $a_{1:r}$ appears as a subsequence of $x_{1:n-1}$. By the ergodic theorem, the approximation $N_n(a_{1:r})/(n-r)\approx P_\star(a_{1:r})$ holds for large $n$. The large-scale typicality property essentially requires that this approximation holds uniformly for all $a_{1:r}$ with $r<\rho(n)$. As in [2, 3], this idea plays an essential role in the proof of our main result.

DEFINITION 2.2. The process $P_\star$ is said to satisfy the large-scale typicality property with cutoff $\rho(n)$ if there exists a constant $\eta<1$ such that

$$\left|\frac{1}{P_\star(a_{1:r})}\frac{N_n(a_{1:r})}{n-r} - 1\right| < \eta \qquad\text{for all } a_{1:r}\in A^r \text{ with } P_\star(a_{1:r})>0,\ r<\rho(n)$$

eventually as $n\to\infty$, $P_\star$-a.s.

We are now ready to state the main result of this paper, which can be viewed as a law of iterated logarithm for the likelihood ratio statistic. A similar result was established in [4], Lemma 3.4.1 for the case of a fixed order $r>r_\star$. Our key innovation is that here the result holds uniformly over the order $r_\star<r<\kappa(n)$, where $\kappa(2n)$ is a cutoff for which the large-scale typicality property holds.

THEOREM 2.3. Let $\kappa(n)\le n/4$ be an increasing function, such that the process $P_\star$ satisfies the large-scale typicality property with cutoff $\kappa(2n)$. Then there is a nonrandom constant $C_0>0$ (depending only on $\eta$) such that

$$\sup_{r_\star<r<\kappa(n)}\frac{1}{|A|^r\log\log n}\left\{\sup_{P\in\Theta_r}\log P(x_{1:n}) - \sup_{P\in\Theta_{r_\star}}\log P(x_{1:n})\right\}\le C_0$$

eventually as $n\to\infty$, $P_\star$-a.s.

COROLLARY 2.4. Suppose that $P_\star$ satisfies the large-scale typicality property with cutoff $\kappa(2n)$, where $\kappa(n)\le n/4$ is increasing and $\kappa(n)\to\infty$. Suppose also that the penalty satisfies $\operatorname{pen}(n,r)/n\to0$ as $n\to\infty$ for every fixed $r$, while $\operatorname{pen}(n,r)-\operatorname{pen}(n,r_\star)>C_0|A|^r\log\log n$ eventually for every $r>r_\star$. Then the penalized likelihood order estimator is strongly consistent.

PROOF. It is a standard consequence of the ergodic theorem and the strict positivity of the relative entropy rate between distinct Markov laws that, $P_\star$-a.s., eventually as $n\to\infty$

$$\frac1n\left\{\sup_{P\in\Theta_{r_\star}}\log P(x_{1:n}) - \sup_{P\in\Theta_r}\log P(x_{1:n})\right\}\ge K$$

for some constant $K>0$ and all $r<r_\star$. As $\operatorname{pen}(n,r)/n\to0$ as $n\to\infty$, this implies that $P_\star$-a.s. we have eventually as $n\to\infty$

$$\sup_{P\in\Theta_r}\log P(x_{1:n}) - \operatorname{pen}(n,r) < \sup_{P\in\Theta_{r_\star}}\log P(x_{1:n}) - \operatorname{pen}(n,r_\star)\qquad\forall\, r<r_\star.$$

As $\kappa(n)\ge r_\star$ for $n$ sufficiently large, this shows that $\liminf_{n\to\infty}\hat r_n\ge r_\star$ $P_\star$-a.s.
On the other hand, it is shown in [2, 3] that the large-scale typicality property holds with cutoff $\kappa(2n)\le\alpha_\star\log_2 n$ for some constant $\alpha_\star$ which depends on $P_\star$ (the constant $\eta$ in Definition 2.2 may be fixed arbitrarily). By Theorem 2.3 and the assumption on the penalty, $P_\star$-a.s. eventually as $n\to\infty$

$$\sup_{P\in\Theta_r}\log P(x_{1:n}) - \sup_{P\in\Theta_{r_\star}}\log P(x_{1:n})\le C_0|A|^r\log\log n < \operatorname{pen}(n,r)-\operatorname{pen}(n,r_\star)$$

for all $r>r_\star$, so we find that $P_\star$-a.s. we have eventually as $n\to\infty$

$$\sup_{P\in\Theta_r}\log P(x_{1:n}) - \operatorname{pen}(n,r) < \sup_{P\in\Theta_{r_\star}}\log P(x_{1:n}) - \operatorname{pen}(n,r_\star)$$

for all $r_\star<r<\kappa(n)$. Thus $\limsup_{n\to\infty}\hat r_n\le r_\star$ $P_\star$-a.s.

REMARK 2.5. The proofs of large-scale typicality in [2, 3] actually establish a slightly stronger result, where the constant $\eta$ in Definition 2.2 is replaced by $n^{-\beta}$ for some $\beta>0$. This improvement is not needed for Theorem 2.3 to hold.

REMARK 2.6. Theorem 2.3 states that the constant $C_0$ depends only on the value of $\eta$ in Definition 2.2. Unfortunately, the constants obtained by our method of proof are expected to be far from optimal; one can read off a value for $C_0$ of order $10^6$ in the proof of Theorem 2.3, which is likely excessively large.

REMARK 2.7. It is not difficult to establish that there is a constant $C$ such that

$$\frac1n\left\{\sup_{P\in\Theta_r}\log P(x_{1:n}) - \sup_{P\in\Theta_{r_\star}}\log P(x_{1:n})\right\}\le C$$

for all $n$ and $r$. It follows that

$$\sup_{r>(\log|A|)^{-1}\log n}\frac{1}{\operatorname{pen}(n,r)}\left\{\sup_{P\in\Theta_r}\log P(x_{1:n}) - \sup_{P\in\Theta_{r_\star}}\log P(x_{1:n})\right\}\le\frac{|A|-1}{2|A|}$$

eventually as $n\to\infty$. In order to obtain a version of Corollary 2.4 with $\kappa(n)=\infty$, the key difficulty is therefore to deal with orders in the range $\alpha_\star\log n\le r\le(\log|A|)^{-1}\log n$. It is an open question whether it is possible to close this gap.

3. Reduction to a deviation bound. The proof of Theorem 2.3 consists of two steps. In this section, we will prove the result assuming that the likelihood ratio statistic satisfies a certain deviation bound.
The requisite deviation bound, which is stated in the following proposition, will be proved in the next section.

PROPOSITION 3.1. Define $F_n = G_n\cap G_{2n}$, where $G_n$ denotes the event

$$\left\{\left|\frac{1}{P_\star(a_{1:r})}\frac{N_n(a_{1:r})}{n-r}-1\right|\le\eta\text{ for all } a_{1:r}\in A^r \text{ with } P_\star(a_{1:r})>0,\ r<\rho(n)\right\},$$

with $\rho(n)$ increasing and $\rho(n)\le n/2$. Then there exist constants $C_1, C_1', C_2>0$, which can be chosen to depend only on $\eta$, such that

$$P_\star\left[F_n\cap\left\{\max_{i=n,\ldots,2n}\left(\sup_{P\in\Theta_r}\log P(x_{1:i})-\log P_\star(x_{1:i}|x_{1:r})\right)\ge\varepsilon\right\}\right]\le C_1' e^{-\varepsilon/C_1}$$

for all $n\ge1$, $r_\star<r<\rho(n)$, and $\varepsilon\ge C_2|A|^r$.

Conceptually, this result can be understood as follows. It is well known in classical statistics that, in "regular" cases, the likelihood ratio statistic

$$\sup_{P\in\Theta_r}\log P(x_{1:n})-\log P_\star(x_{1:n})$$

converges weakly as $n\to\infty$ to a $\chi^2$-distributed random variable. Therefore, we expect the likelihood ratio statistic to possess exponential tails at least for large $n$. Proposition 3.1 provides a precise nonasymptotic description of this phenomenon.

We now prove Theorem 2.3 presuming that Proposition 3.1 holds.

PROOF OF THEOREM 2.3. We clearly need only consider sequences $x_{1:n}$ with $P_\star(x_{1:n})>0$. We begin with some straightforward estimates. As $\sup_{P\in\Theta_{r_\star}}\log P(x_{1:n})\ge\log P_\star(x_{1:n}) = \log P_\star(x_{1:n}|x_{1:r})+\log P_\star(x_{1:r})$ for $r>r_\star$, we have

$$\sup_{P\in\Theta_r}\log P(x_{1:n})-\sup_{P\in\Theta_{r_\star}}\log P(x_{1:n})\le\sup_{P\in\Theta_r}\log P(x_{1:n})-\log P_\star(x_{1:n}|x_{1:r})-\log P_\star(x_{1:r}).$$

As $P_\star$ is stationary and irreducible, there is a constant $\lambda>0$ such that $P_\star(x_{1:r})>\lambda^r$ whenever $P_\star(x_{1:r})>0$, so that

$$\sup_{r>r_\star}\frac{-\log P_\star(x_{1:r})}{|A|^r}\le C:=\log(1/\lambda)\sup_{r>r_\star}\frac{r}{|A|^r}<\infty.$$

We conclude that it suffices to prove that

$$\sup_{r_\star<r<\kappa(i)}\frac{1}{|A|^r\log\log i}\left\{\sup_{P\in\Theta_r}\log P(x_{1:i})-\log P_\star(x_{1:i}|x_{1:r})\right\}\le C_0'$$

eventually as $i\to\infty$, $P_\star$-a.s., for some nonrandom constant $C_0'$ depending only on $\eta$. To this end we use a blocking argument: applying Proposition 3.1 along the blocks $[2^n,2^{n+1}]$ with $\varepsilon = C'|A|^r\log\log 2^n$ for a constant $C'>C_1$, and summing the resulting bound over the orders $r_\star<r<\kappa(2^n)$, we find that

$$\sum_{n=1}^\infty P_\star\left[F_{2^n}\cap\left\{\max_{2^n\le i\le2^{n+1}}\frac{1}{\log\log i}\sup_{r_\star<r<\kappa(2^n)}\frac{1}{|A|^r}\left(\sup_{P\in\Theta_r}\log P(x_{1:i})-\log P_\star(x_{1:i}|x_{1:r})\right)\ge C'\right\}\right]<\infty.$$

The claim then follows from the Borel–Cantelli lemma, as the large-scale typicality property with cutoff $\kappa(2n)$ guarantees that the events $F_{2^n}$ hold eventually. The blocking argument applies equally with blocks $[\gamma^n,\gamma^{n+1}]$ for any $\gamma>1$. In this manner, one can establish that the result is still valid under the weaker assumption that the large-scale typicality property holds with cutoff $\kappa(\gamma n)$ for some $\gamma>1$. However, this does not appear to lead to a substantially different conclusion for the order estimation problem.
In order to keep the notation and proofs as transparent as possible we have restricted our results to the case $\gamma=2$, but the necessary modifications for the case of arbitrary $\gamma>1$ are easily implemented.

4. Proof of Proposition 3.1. The longest part of the proof of Theorem 2.3 consists of the proof of Proposition 3.1. To establish this result, we adapt an approach using techniques from empirical process theory [6, 7] that was originally developed to obtain rates of convergence for nonparametric maximum likelihood estimators in the i.i.d. setting. At the heart of the proof of Proposition 3.1 lies an extension of a maximal inequality for families of martingales under bracketing entropy conditions, due to van de Geer [7], Theorem 8.13. The extension of this result that is needed for our purposes is developed in the Appendix.

4.1. Preliminary computations. Any measure $P\in\Theta_r$ is uniquely determined by its initial probability $P(x_{1:r})$ and its transition probability $P(x_{r+1}|x_{1:r})$. It is easily seen that the measure which maximizes the log-likelihood $\log P(x_{1:n})$ over $P\in\Theta_r$ assigns unit probability to the observed initial path $x_{1:r}$. Thus for $r>r_\star$

$$\sup_{P\in\Theta_r}\log P(x_{1:n})-\log P_\star(x_{1:n}|x_{1:r}) = \sup_{P\in\Theta_r}\sum_{i=r+1}^n\log\left(\frac{P(x_i|x_{i-r:i-1})}{P_\star(x_i|x_{i-r:i-1})}\right).$$

The family of functions $\log(P(x_i|x_{i-r:i-1})/P_\star(x_i|x_{i-r:i-1}))$ ($P\in\Theta_r$) is $P_\star$-a.s. uniformly bounded from above but not from below. To avoid problems later on, we apply a standard trick. For any $P\in\Theta_r$, define

$$\tilde P(x_i|x_{i-r:i-1}) = \frac{P(x_i|x_{i-r:i-1})+P_\star(x_i|x_{i-r:i-1})}{2}.$$
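The two elementary properties of this mixture that the argument relies on can be spot-checked numerically: the ratio $\tilde P/P_\star = (P+P_\star)/(2P_\star)$ is bounded below by $1/2$, so the log-ratios are uniformly bounded below, and by concavity of the logarithm (equivalently, the AM–GM inequality) $\log(P/P_\star)\le2\log(\tilde P/P_\star)$ pointwise. A minimal sketch, on hypothetical transition probability values:

```python
from math import log

def mixture_facts(p, q):
    """p = P(x|context), q = P_star(x|context) > 0. Checks the two bounds used
    in Section 4.1: (i) the mixture ratio p_tilde/q is >= 1/2, and
    (ii) log(p/q) <= 2 log(p_tilde/q), with a small float tolerance."""
    p_tilde = (p + q) / 2
    lower = p_tilde / q >= 0.5
    upper = (p == 0) or (log(p / q) <= 2 * log(p_tilde / q) + 1e-12)
    return lower and upper

# check on a grid of candidate transition probabilities
grid = [k / 20 for k in range(21)]
assert all(mixture_facts(p, q) for p in grid for q in grid if q > 0)
```

Inequality (ii) holds since $((p+q)/2)^2\ge pq$, which after dividing by $q^2$ and taking logarithms gives $2\log((p+q)/(2q))\ge\log(p/q)$.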
Thus $\tilde P$ is a Markov chain whose transition probabilities are an equal mixture of the transition probabilities of $P$ and $P_\star$ (the initial probabilities of $\tilde P$ are irrelevant for our purposes and need not be defined). By concavity of the logarithm, we find

$$\sup_{P\in\Theta_r}\log P(x_{1:n})-\log P_\star(x_{1:n}|x_{1:r})\le2\sup_{P\in\Theta_r}\sum_{i=r+1}^n\log\left(\frac{\tilde P(x_i|x_{i-r:i-1})}{P_\star(x_i|x_{i-r:i-1})}\right).$$

It therefore suffices to obtain a deviation bound for the right hand side of this expression, whose summands are $P_\star$-a.s. uniformly bounded above and below.

4.2. Peeling. The first part of the proof of Proposition 3.1 aims to reduce the problem to a deviation inequality for martingales. To this end we employ a peeling device from the theory of weighted empirical processes. Define the natural filtration $\mathcal F_n = \sigma\{X_1,\ldots,X_n\}$. For any $P\in\Theta_r$, we define

$$M_n^P = \sum_{i=r+1}^n\left\{\log\left(\frac{\tilde P(x_i|x_{i-r:i-1})}{P_\star(x_i|x_{i-r:i-1})}\right)-\mathbf E_\star\left[\left.\log\left(\frac{\tilde P(x_i|x_{i-r:i-1})}{P_\star(x_i|x_{i-r:i-1})}\right)\right|\mathcal F_{i-1}\right]\right\},$$

which is a martingale (under $P_\star$) by construction. It is easily seen that

$$M_n^P = \sum_{i=r+1}^n\log\left(\frac{\tilde P(x_i|x_{i-r:i-1})}{P_\star(x_i|x_{i-r:i-1})}\right)+D_n^P,$$

where we have defined

$$D_n^P = -\sum_{i=r+1}^n\sum_{a_i\in A}P_\star(a_i|x_{i-r:i-1})\log\left(\frac{\tilde P(a_i|x_{i-r:i-1})}{P_\star(a_i|x_{i-r:i-1})}\right).$$

We also define for any $P,P'\in\Theta_r$ the quantity

$$H_n(P,P') = \sum_{i=r+1}^n\sum_{a_i\in A}\left(\tilde P(a_i|x_{i-r:i-1})^{1/2}-\tilde P'(a_i|x_{i-r:i-1})^{1/2}\right)^2.$$

Note that $\sqrt{H_n(P,P')}$ defines a random distance on $\Theta_r$. As we will see below, the role of the set $F_n$ (and hence the large-scale typicality assumption) in the proof of Proposition 3.1 is that it allows us to control this random distance.

LEMMA 4.1.
For any $\varepsilon>0$, $n\ge1$ and $r>r_\star$

$$P_\star\left[F_n\cap\left\{\max_{i=n,\ldots,2n}\left(\sup_{P\in\Theta_r}\log P(x_{1:i})-\log P_\star(x_{1:i}|x_{1:r})\right)\ge\varepsilon\right\}\right]\le\sum_{k=0}^\infty P_\star\left[F_n\cap\left\{\sup_{P\in\Theta_r}\mathbf 1_{H_n(P,P_\star)\le2^k\varepsilon}\max_{i=n,\ldots,2n}M_i^P\ge2^{k-1}\varepsilon\right\}\right].$$

PROOF. From the discussion above, it is clear that

$$P_\star\left[F_n\cap\left\{\max_{i=n,\ldots,2n}\left(\sup_{P\in\Theta_r}\log P(x_{1:i})-\log P_\star(x_{1:i}|x_{1:r})\right)\ge\varepsilon\right\}\right]\le P_\star\left[F_n\cap\left\{\max_{i=n,\ldots,2n}\sup_{P\in\Theta_r}\sum_{\ell=r+1}^i\log\left(\frac{\tilde P(x_\ell|x_{\ell-r:\ell-1})}{P_\star(x_\ell|x_{\ell-r:\ell-1})}\right)\ge\frac\varepsilon2\right\}\right] = P_\star\left[F_n\cap\left\{\max_{i=n,\ldots,2n}\sup_{P\in\Theta_r}\left\{M_i^P-D_i^P\right\}\ge\frac\varepsilon2\right\}\right].$$

Now note that as $-\log x\ge2-2\sqrt x$ for $x>0$,

$$D_n^P\ge2\sum_{i=r+1}^n\sum_{a_i\in A}P_\star(a_i|x_{i-r:i-1})\left(1-\frac{\tilde P(a_i|x_{i-r:i-1})^{1/2}}{P_\star(a_i|x_{i-r:i-1})^{1/2}}\right) = H_n(P,P_\star).$$

Therefore, we can estimate

$$P_\star\left[F_n\cap\left\{\max_{i=n,\ldots,2n}\left(\sup_{P\in\Theta_r}\log P(x_{1:i})-\log P_\star(x_{1:i}|x_{1:r})\right)\ge\varepsilon\right\}\right]\le P_\star\left[F_n\cap\left\{\max_{i=n,\ldots,2n}\sup_{P\in\Theta_r}\left\{M_i^P-H_i(P,P_\star)\right\}\ge\frac\varepsilon2\right\}\right]\le P_\star\left[F_n\cap\left\{\sup_{P\in\Theta_r}\left(\max_{i=n,\ldots,2n}M_i^P-H_n(P,P_\star)\right)\ge\frac\varepsilon2\right\}\right].$$

We now partition the space $\Theta_r$ into an inner ring $\{P\in\Theta_r: H_n(P,P_\star)\le\varepsilon\}$ and a collection of concentric rings $\{P\in\Theta_r: 2^{k-1}\varepsilon\le H_n(P,P_\star)\le2^k\varepsilon\}$ (note that this is a random partition, as the quantity $H_n(P,P')$ depends on the observed path). Applying the union bound gives the estimates

$$P_\star\left[F_n\cap\left\{\max_{i=n,\ldots,2n}\left(\sup_{P\in\Theta_r}\log P(x_{1:i})-\log P_\star(x_{1:i}|x_{1:r})\right)\ge\varepsilon\right\}\right]\le P_\star\left[F_n\cap\left\{\sup_{P\in\Theta_r}\left(\max_{i=n,\ldots,2n}M_i^P-H_n(P,P_\star)\right)\mathbf 1_{H_n(P,P_\star)\le\varepsilon}\ge\frac\varepsilon2\right\}\right]+\sum_{k=1}^\infty P_\star\left[F_n\cap\left\{\sup_{P\in\Theta_r}\left(\max_{i=n,\ldots,2n}M_i^P-H_n(P,P_\star)\right)\mathbf 1_{2^{k-1}\varepsilon\le H_n(P,P_\star)\le2^k\varepsilon}\ge\frac\varepsilon2\right\}\right]\le\sum_{k=0}^\infty P_\star\left[F_n\cap\left\{\sup_{P\in\Theta_r}\mathbf 1_{H_n(P,P_\star)\le2^k\varepsilon}\max_{i=n,\ldots,2n}M_i^P\ge2^{k-1}\varepsilon\right\}\right].$$

The proof is complete.

4.3. Control of $H_n$. Our next task is to control the quantity $H_n(P,P')$.
First, we show that on the event $F_n$ the quantity $H_n$ is comparable to

$$H(P,P') = \sum_{a_{1:r+1}\in A^{r+1}}P_\star(a_{1:r})\left(\tilde P(a_{r+1}|a_{1:r})^{1/2}-\tilde P'(a_{r+1}|a_{1:r})^{1/2}\right)^2,$$

which is a nonrandom squared distance on $\Theta_r$.

LEMMA 4.2. There exist constants $C_3, C_4$ such that for any $n\ge1$, we have $H_{2n}(P,P')\le C_3H_n(P,P')$ and

$$(n-r)C_4^{-1}H(P,P')\le H_n(P,P')\le(n-r)C_4H(P,P')$$

for all $P,P'\in\Theta_r$ and $r_\star<r<\rho(n)$ on the event $F_n$.

PROOF. It is easily seen that for any $n\ge1$

$$H_n(P,P') = \sum_{a_{1:r+1}\in A^{r+1}}N_n(a_{1:r})\left(\tilde P(a_{r+1}|a_{1:r})^{1/2}-\tilde P'(a_{r+1}|a_{1:r})^{1/2}\right)^2.$$

On the event $F_n$, we have by construction

$$(1-\eta)P_\star(a_{1:r})\le\frac{N_n(a_{1:r})}{n-r}\le(1+\eta)P_\star(a_{1:r})\quad\text{and}\quad(1-\eta)P_\star(a_{1:r})\le\frac{N_{2n}(a_{1:r})}{2n-r}\le(1+\eta)P_\star(a_{1:r})$$

for all $a_{1:r}\in A^r$ and $r<\rho(n)$. Here we have used that $\rho(n)\le\rho(2n)$ as $\rho(n)$ is presumed to be increasing. In particular, we have

$$N_{2n}(a_{1:r})\le\frac{1+\eta}{1-\eta}\cdot\frac{2n-r}{n-r}\,N_n(a_{1:r})\le4\,\frac{1+\eta}{1-\eta}\,N_n(a_{1:r}),$$

where we have used that $n-r>n/2$ as $r<\rho(n)\le n/2$. The result follows directly provided we choose $C_3, C_4$ (depending only on $\eta$) sufficiently large.

Next, we control the quantity $H_n(P,P_\star)$ in terms of the "Bernstein norm" needed in order to apply the results developed in the Appendix. As in the Appendix, we define the function $\phi(x) = e^x-x-1$.

LEMMA 4.3. Define for any $P\in\Theta_r$, $r>r_\star$ and $n\ge1$

$$R_n^P = 8\sum_{i=r+1}^n\mathbf E_\star\left[\left.\phi\left(\frac12\left|\log\left(\frac{\tilde P(x_i|x_{i-r:i-1})}{P_\star(x_i|x_{i-r:i-1})}\right)\right|\right)\right|\mathcal F_{i-1}\right].$$

Then $R_n^P\le8H_n(P,P_\star)$ for any $P\in\Theta_r$, $r>r_\star$ and $n\ge1$.

PROOF. Note that $\log(\tilde P(x_i|x_{i-r:i-1})/P_\star(x_i|x_{i-r:i-1}))\ge-\log(2)$.
By [7], Lemma 7.1, we have $\phi(|x|)\le(e^x-1)^2$ for any $x\ge-\log(2)/2$. Therefore

$$R_n^P\le8\sum_{i=r+1}^n\mathbf E_\star\left[\left.\left(\frac{\tilde P(x_i|x_{i-r:i-1})^{1/2}}{P_\star(x_i|x_{i-r:i-1})^{1/2}}-1\right)^2\right|\mathcal F_{i-1}\right] = 8\sum_{i=r+1}^n\sum_{a_i\in A}P_\star(a_i|x_{i-r:i-1})\left(\frac{\tilde P(a_i|x_{i-r:i-1})^{1/2}}{P_\star(a_i|x_{i-r:i-1})^{1/2}}-1\right)^2.$$

The result follows immediately.

Together with Lemma 4.1, we obtain the following.

COROLLARY 4.4. Define for any $\sigma>0$ the ball $\Theta_r(\sigma) = \{P\in\Theta_r: H(P,P_\star)\le\sigma\}$. Then for any $\varepsilon>0$, $n\ge1$ and $r_\star<r<\rho(n)$

$$P_\star\left[F_n\cap\left\{\max_{i=n,\ldots,2n}\left(\sup_{P\in\Theta_r}\log P(x_{1:i})-\log P_\star(x_{1:i}|x_{1:r})\right)\ge\varepsilon\right\}\right]\le\sum_{k=0}^\infty P_\star\left[F_n\cap\left\{\sup_{P\in\Theta_r(C_42^k\varepsilon/(n-r))}\mathbf 1_{R_{2n}^P\le C_32^{k+3}\varepsilon}\max_{i\le2n}M_i^P\ge2^{k-1}\varepsilon\right\}\right].$$

The proof is straightforward and is therefore omitted.

4.4. Control of the bracketing entropy. We have now reduced the proof of Proposition 3.1 to the problem of estimating the summands in Corollary 4.4. We aim to do this by applying Proposition A.2 in the Appendix with $\Theta\subseteq\Theta_r$,

$$\xi_i^P = \begin{cases}\log(\tilde P(x_i|x_{i-r:i-1})/P_\star(x_i|x_{i-r:i-1})) & \text{for } i>r,\\ 0 & \text{for } i\le r,\end{cases}$$

and $K=2$. To this end, the main remaining difficulty is to estimate the bracketing entropy of Definition A.1. This is our next order of business.

LEMMA 4.5. Given $c>0$, there exists $C_5>0$ depending only on $c$ such that

$$\log N(2n,\Theta_r(\sigma),F_n,2,\delta)\le|A|^{r+1}\log\left(\frac{C_5\sqrt{(2n-r)\sigma}}{\delta}\right)$$

for all $n\ge1$, $r_\star<r<\rho(n)$, $\sigma>0$ and $0<\delta\le c\sqrt{(2n-r)\sigma}$.

PROOF. Fix $n\ge1$, $r_\star<r<\rho(n)$, $\sigma>0$ and $0<\delta\le c\sqrt{(2n-r)\sigma}$ throughout the proof. We begin by defining the family of functions

$$T_\beta = \{p: A^{r+1}\to\mathbb R_+ : P_\star(a_{1:r})^{1/2}p(a_{1:r+1})^{1/2}\in\beta\mathbb Z_+\ \forall\,a_{1:r+1}\in A^{r+1}\},$$

where $\beta>0$ is to be determined in due course.
We claim that for any $P\in\Theta_r$, there exist $\lambda_P,\gamma_P\in T_\beta$ such that for all $a_{1:r+1}\in A^{r+1}$ with $P_\star(a_{1:r})>0$

$$\lambda_P(a_{1:r+1})\le P(a_{r+1}|a_{1:r})\le\gamma_P(a_{1:r+1})\quad\text{and}\quad\gamma_P(a_{1:r+1})^{1/2}-\lambda_P(a_{1:r+1})^{1/2}\le\frac{\beta}{P_\star(a_{1:r})^{1/2}}.$$

Indeed, this follows immediately by setting

$$\lambda_P(a_{1:r+1}) = \left(\frac{\lfloor\beta^{-1}P_\star(a_{1:r})^{1/2}P(a_{r+1}|a_{1:r})^{1/2}\rfloor}{\beta^{-1}P_\star(a_{1:r})^{1/2}}\right)^2,\qquad\gamma_P(a_{1:r+1}) = \left(\frac{\lceil\beta^{-1}P_\star(a_{1:r})^{1/2}P(a_{r+1}|a_{1:r})^{1/2}\rceil}{\beta^{-1}P_\star(a_{1:r})^{1/2}}\right)^2$$

for all $a_{1:r+1}\in A^{r+1}$ with $P_\star(a_{1:r})>0$. Therefore, $P_\star$-a.s.

$$\Lambda_i^P:=\log\left(\frac{\tilde\lambda_P(x_i|x_{i-r:i-1})}{P_\star(x_i|x_{i-r:i-1})}\right)\le\xi_i^P\le\log\left(\frac{\tilde\gamma_P(x_i|x_{i-r:i-1})}{P_\star(x_i|x_{i-r:i-1})}\right)=:\Upsilon_i^P$$

for all $P\in\Theta_r$, $i>r$ (we set $\Lambda_i^P=\Upsilon_i^P=0$ for $i\le r$), where we have defined $\tilde\gamma_P(x_i|x_{i-r:i-1}) = \{\gamma_P(x_{i-r:i})+P_\star(x_i|x_{i-r:i-1})\}/2$ and $\tilde\lambda_P(x_i|x_{i-r:i-1}) = \{\lambda_P(x_{i-r:i})+P_\star(x_i|x_{i-r:i-1})\}/2$. Moreover, we can estimate

$$8\sum_{i=1}^{2n}\mathbf E\left[\left.\phi\left(\frac{\Upsilon_i^P-\Lambda_i^P}2\right)\right|\mathcal F_{i-1}\right]\le4\sum_{i=1}^{2n}\mathbf E\left[\left.\left(\frac{\tilde\gamma_P(x_i|x_{i-r:i-1})^{1/2}}{\tilde\lambda_P(x_i|x_{i-r:i-1})^{1/2}}-1\right)^2\right|\mathcal F_{i-1}\right]\le8\sum_{i=r+1}^{2n}\sum_{a_i\in A}\left(\tilde\gamma_P(a_i|x_{i-r:i-1})^{1/2}-\tilde\lambda_P(a_i|x_{i-r:i-1})^{1/2}\right)^2\le4\sum_{a_{1:r+1}\in A^{r+1}}N_{2n}(a_{1:r})\left(\gamma_P(a_{1:r+1})^{1/2}-\lambda_P(a_{1:r+1})^{1/2}\right)^2\le4\beta^2\sum_{a_{1:r+1}\in A^{r+1}}\frac{N_{2n}(a_{1:r})}{P_\star(a_{1:r})},$$

where we have used that $\phi(x)\le(e^x-1)^2/2$ for $x\ge0$ and [7], Lemma 4.2. As in the proof of Lemma 4.2, we find that for any $P\in\Theta_r$

$$8\sum_{i=1}^{2n}\mathbf E\left[\left.\phi\left(\frac{\Upsilon_i^P-\Lambda_i^P}2\right)\right|\mathcal F_{i-1}\right]\le4C_4(2n-r)|A|^{r+1}\beta^2$$

on the event $F_n$ (as $r<\rho(n)$ by assumption).
Therefore, if we choose

$$\beta = \frac{\delta}{\sqrt{4C_4(2n-r)|A|^{r+1}}},$$

then $\{(\Lambda_i^P,\Upsilon_i^P)_{1\le i\le2n}\}_{P\in\Theta_r(\sigma)}$ is a $(2n,\Theta_r(\sigma),F_n,2,\delta)$-bracketing set. To complete the proof we must estimate the cardinality of this set.

We approach this problem through a well known geometric device. We can represent any function from $A^{r+1}$ to $\mathbb R$ as a vector in $\mathbb R^{|A|^{r+1}}$ in the obvious fashion. In particular, for any $p: A^{r+1}\to\mathbb R$, denote by $\iota[p]$ the representative in $\mathbb R^{|A|^{r+1}}$ of the function $\tilde p(a_{1:r+1}) = P_\star(a_{1:r})^{1/2}p(a_{1:r+1})^{1/2}$. Then by [7], Lemma 4.2

$$\iota[\Theta_r(\sigma)]\subseteq B(x_0,4\sqrt\sigma)\cap\mathbb R_{++}^{|A|^{r+1}},\qquad x_0 = \iota[P_\star(a_{r+1}|a_{1:r})],$$

where $B(x,h)$ denotes the Euclidean ball in $\mathbb R^{|A|^{r+1}}$ with center $x$ and radius $h$. On the other hand, we clearly have $\iota[T_\beta] = (\beta\mathbb Z_+)^{|A|^{r+1}}\subset\mathbb R^{|A|^{r+1}}$. Define for any $x,x'\in\mathbb R^{|A|^{r+1}}$ with $x'\succ x$ the cube $[x,x'] := \{\tilde x\in\mathbb R^{|A|^{r+1}}: x\preceq\tilde x\preceq x'\}$. Let

$$\Xi_\beta := \{x\in(\beta\mathbb Z_+)^{|A|^{r+1}}: [x,x+\beta\mathbf 1]\cap B(x_0,4\sqrt\sigma)\ne\varnothing\},$$

where $\mathbf 1\in\mathbb R^{|A|^{r+1}}$ denotes the vector all of whose entries are one. Then clearly

$$\iota[\Theta_r(\sigma)]\subseteq B(x_0,4\sqrt\sigma)\cap\mathbb R_{++}^{|A|^{r+1}}\subseteq\bigcup_{x\in\Xi_\beta}[x,x+\beta\mathbf 1],$$

and, in particular, it is easily established from our previous computations that $N(2n,\Theta_r(\sigma),F_n,2,\delta)\le|\Xi_\beta|$. Now suppose that $x'\in[x,x+\beta\mathbf 1]$ for some $x\in\Xi_\beta$. Then there is an $x''\in[x,x+\beta\mathbf 1]$ such that $x''\in B(x_0,4\sqrt\sigma)$. In particular, we have $\|x'-B(x_0,4\sqrt\sigma)\|_\infty\le\beta$, and therefore $\|x'-B(x_0,4\sqrt\sigma)\|_2\le|A|^{(r+1)/2}\beta$, for every $x'\in[x,x+\beta\mathbf 1]$, $x\in\Xi_\beta$. We conclude that

$$\bigcup_{x\in\Xi_\beta}[x,x+\beta\mathbf 1]\subseteq B\left(x_0,4\sqrt\sigma+|A|^{(r+1)/2}\beta\right).$$
Therefore, we can estimate

$$|\Xi_\beta|\,\beta^{|A|^{r+1}} = \operatorname{vol}\left(\bigcup_{x\in\Xi_\beta}[x,x+\beta\mathbf 1]\right)\le\operatorname{vol}\left(B\left(x_0,4\sqrt\sigma+|A|^{(r+1)/2}\beta\right)\right) = \left(4\sqrt\sigma+|A|^{(r+1)/2}\beta\right)^{|A|^{r+1}}\operatorname{vol}(B(0,1)).$$

But from [6], p. 249 we have the estimate

$$\operatorname{vol}(B(0,1))\le\left(\frac{\sqrt{2\pi e}}{|A|^{(r+1)/2}}\right)^{|A|^{r+1}}.$$

Substituting the expression for $\beta$ and rearranging, we find that

$$|\Xi_\beta|\le\left(\frac{\{(8\sqrt{C_4}+c)\sqrt{2\pi e}\}\sqrt{(2n-r)\sigma}}{\delta}\right)^{|A|^{r+1}},$$

where we have used that $\delta\le c\sqrt{(2n-r)\sigma}$. The proof is easily completed.

4.5. End of the proof. To complete the proof of Proposition 3.1, it remains to put together the results obtained above with Proposition A.2 in the Appendix.

PROOF OF PROPOSITION 3.1. In the following, we will always apply Lemma 4.5 and Proposition A.2 with the same constants $c,c_0,c_1>0$. The appropriate values of these constants will be determined below. We will also fix $n\ge1$, $r_\star<r<\rho(n)$ and $\varepsilon\ge C_2|A|^r$, with the constant $C_2$ to be determined.

To apply Corollary 4.4, we invoke Proposition A.2 with $K=2$, $\alpha=2^{k-1}\varepsilon$, and $R=C_32^{k+3}\varepsilon$ (fixing $k\ge0$ for the time being). We find that

$$P_\star\left[F_n\cap\left\{\sup_{P\in\Theta_r(C_42^k\varepsilon/(n-r))}\mathbf 1_{R_{2n}^P\le C_32^{k+3}\varepsilon}\max_{i\le2n}M_i^P\ge2^{k-1}\varepsilon\right\}\right]\le2\exp\left[-\frac{2^{k-5}\varepsilon}{C_3C^2(c_1+1)}\right],$$

provided that $c_0^2\ge C^2(c_1+1)$ and

$$c_0\int_0^{\sqrt{C_32^{k+3}\varepsilon}}\sqrt{\log N\left(2n,\Theta_r\left(\frac{C_42^k\varepsilon}{n-r}\right),F_n,2,u\right)}\,du\le2^{k-1}\varepsilon\le c_1C_32^{k+2}\varepsilon.$$

To ensure that the second inequality holds, it suffices to choose $c_1 = (8C_3)^{-1}$, and the condition on $c_0$ is satisfied by choosing $c_0 = C\sqrt{(8C_3)^{-1}+1}$. To simplify the first inequality, choose $c = \sqrt{8C_3/C_4}$.
Then the variable $u$ in the integral satisfies

$$u \le \sqrt{C_3 2^{k+3}\varepsilon} \le c\sqrt{(2n-r)\, C_4 2^k\varepsilon/(n-r)},$$

so by Lemma 4.5 it suffices to ensure that

$$2^{k-1}\varepsilon \ge |A|^{(r+1)/2}\, C\sqrt{(8C_3)^{-1}+1} \int_0^{\sqrt{C_3 2^{k+3}\varepsilon}} \sqrt{\log\Bigg(\frac{(4C_4)^{1/2} C_5 \sqrt{2^k\varepsilon}}{u}\Bigg)}\, du,$$

where we have used that $r < \rho(n) \le n/2$ implies $(2n-r)/(n-r) \le 4$. Defining

$$C_6 := \int_0^{\sqrt{8C_3}} \sqrt{\log\Bigg(\frac{(4C_4)^{1/2} C_5}{v}\Bigg)}\, dv < \infty,$$

a simple change of variables shows that the above inequality is equivalent to

$$2^{k-1}\varepsilon \ge |A|^{(r+1)/2}\, C_6\, C\sqrt{(8C_3)^{-1}+1}\, \sqrt{2^k\varepsilon},$$

or, equivalently, $2^k\varepsilon \ge 4C_6^2 C^2((8C_3)^{-1}+1)|A|^{r+1}$. But this is always satisfied if we choose $C_2 = 4C_6^2 C^2((8C_3)^{-1}+1)|A|$.

With these choices of $c, c_0, c_1, C_2$, we have thus shown that by Corollary 4.4

$$P^\star\Bigg[F_n \cap \Bigg\{\max_{i=n,\dots,2n}\Big(\sup_{P\in\Theta_r}\log P(x_{1:i}) - \log P^\star(x_{1:i}|x_{1:r})\Big) \ge \varepsilon\Bigg\}\Bigg] \le 2\sum_{k=0}^\infty \exp\Bigg[-\frac{2^k\varepsilon}{2^5 C^2(C_3 + 1/8)}\Bigg] \le C_1' \exp\Big(-\frac{\varepsilon}{C_1}\Big)$$

with

$$C_1 = 2^5 C^2(C_3 + 1/8), \qquad C_1' = \frac{2}{1 - e^{-C_2/2^5 C^2(C_3+1/8)}},$$

where we have used $\varepsilon \ge C_2$. This completes the proof.

APPENDIX A: A MAXIMAL INEQUALITY FOR MARTINGALES

The purpose of this Appendix is to obtain a deviation bound on the supremum of an uncountable family of martingales, extending a result of van de Geer [7]. We work on a filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_i\}_{i\ge 0}, \mathbf{P})$. We are given a parameter set $\Theta$ and a collection $(\xi^\theta_i)_{i\ge 1,\, \theta\in\Theta}$ of random variables such that $\xi^\theta_i$ is $\mathcal{F}_i$-measurable for all $i, \theta$. This setting will be presumed throughout the Appendix. In the following we will frequently use the function $\phi(x) = e^x - x - 1$.

DEFINITION A.1. Let $n \in \mathbb{N}$, $F \in \mathcal{F}$, $K > 0$ and $\delta > 0$ be given.
A finite collection $\{(\Lambda^j_i, \Upsilon^j_i)_{1\le i\le n}\}_{j=1,\dots,N}$ of random variables is called an $(n, \Theta, F, K, \delta)$-bracketing set if $\Lambda^j_i, \Upsilon^j_i$ are $\mathcal{F}_i$-measurable for all $i, j$, and for every $\theta \in \Theta$, there is a $1 \le j \le N$ (the map $\theta \mapsto j$ is nonrandom) such that $\mathbf{P}$-a.s.

$$\Lambda^j_i \le \xi^\theta_i \le \Upsilon^j_i \quad \text{for all } i = 1, \dots, n$$

and such that

$$2K^2 \sum_{i=1}^n \mathbf{E}\Bigg[\phi\Bigg(\frac{|\Upsilon^j_i - \Lambda^j_i|}{K}\Bigg) \,\Bigg|\, \mathcal{F}_{i-1}\Bigg] \le \delta^2 \quad \text{on } F.$$

We denote by $N(n, \Theta, F, K, \delta)$ the cardinality $N$ of the smallest $(n, \Theta, F, K, \delta)$-bracketing set ($\log N(n, \Theta, F, K, \delta)$ is called the bracketing entropy).

The following extends a result of van de Geer [7], Theorem 8.13.

PROPOSITION A.2. Fix $K > 0$, and define for all $i \ge 0$

$$M^\theta_i = \sum_{\ell=1}^i \big\{\xi^\theta_\ell - \mathbf{E}[\xi^\theta_\ell | \mathcal{F}_{\ell-1}]\big\}, \qquad R^\theta_i = 2K^2 \sum_{\ell=1}^i \mathbf{E}\Bigg[\phi\Bigg(\frac{|\xi^\theta_\ell|}{K}\Bigg) \,\Bigg|\, \mathcal{F}_{\ell-1}\Bigg].$$

There is a universal constant $C > 0$ such that for any $n \in \mathbb{N}$, $R < \infty$ and $F \in \mathcal{F}$

$$\mathbf{P}\Bigg[F \cap \Bigg\{\sup_{\theta\in\Theta} \mathbf{1}_{R^\theta_n \le R} \max_{i\le n} M^\theta_i \ge \alpha\Bigg\}\Bigg] \le 2\exp\Bigg[-\frac{\alpha^2}{C^2(c_1+1)R}\Bigg]$$

for any $\alpha, c_0, c_1 > 0$ such that $c_0^2 \ge C^2(c_1+1)$ and

$$c_0 \int_0^{\sqrt{R}} \sqrt{\log N(n, \Theta, F, K, u)}\, du \le \alpha \le \frac{c_1 R}{K}.$$

[For example, the choice $C = 100$ works.]

REMARK A.3. Throughout, all uncountable suprema should be interpreted as essential suprema under the measure $\mathbf{P}$. Thus measurability problems are avoided.

For our purposes, the key improvement over [7], Theorem 8.13 is that the bound in this result is given for $\max_{i\le n} M^\theta_i$ rather than $M^\theta_n$. This is essential in order to employ the blocking procedure in the proof of Theorem 2.3. Rather than repeat the proof of [7], Theorem 8.13 here with the necessary modifications, we take the opportunity to obtain a more general result from which Proposition A.2 follows.¹

THEOREM A.4. Fix $K > 0$, and define for all $i \ge 0$

$$M^\theta_i = \sum_{\ell=1}^i \big\{\xi^\theta_\ell - \mathbf{E}[\xi^\theta_\ell | \mathcal{F}_{\ell-1}]\big\}, \qquad R^\theta_i = 2K^2 \sum_{\ell=1}^i \mathbf{E}\Bigg[\phi\Bigg(\frac{|\xi^\theta_\ell|}{K}\Bigg) \,\Bigg|\, \mathcal{F}_{\ell-1}\Bigg].$$
Then we have for any $n \in \mathbb{N}$, $R < \infty$, $F \in \mathcal{F}$ and $x > 0$

$$\mathbf{P}\Bigg[F \cap \Bigg\{\sup_{\theta\in\Theta} \mathbf{1}_{R^\theta_n \le R} \max_{i\le n} M^\theta_i \ge 16H + 32\sqrt{Rx} + 16Kx\Bigg\}\Bigg] \le 2e^{-x},$$

where we have written

$$H = K \log N(n, \Theta, F, K, \sqrt{R}) + 4\int_0^{\sqrt{R}} \sqrt{\log N(n, \Theta, F, K, u)}\, du.$$

Before we proceed, let us prove Proposition A.2 using Theorem A.4.

¹A closer look at the proof of [7], Theorem 8.13 reveals a few inconsistencies which are corrected here. For example, equation (A.12) in [7] seems to presuppose that $X \ge 0$ on an event $A$ implies that $\mathbf{E}[X|\mathcal{G}] \ge 0$ on $A$, which need not be the case. The bracketing condition given in [7], Definition 8.1 therefore seems too weak to give the desired result. Similarly, the version of Bernstein's inequality given as [7], Lemma 8.9 does not appear to be the one used in the proof of Theorem 8.13.

PROOF OF PROPOSITION A.2. Let $\alpha = \sqrt{C^2(c_1+1)Rx}$ and assume that the given bounds on $\alpha$ hold. Then we can estimate

$$x = \frac{\alpha^2}{C^2(c_1+1)R} \le \frac{c_1 R}{K} \times \frac{\alpha}{C^2(c_1+1)R} \le \frac{\alpha}{C^2 K}, \qquad \alpha = (\sqrt{\alpha})^2 \le \sqrt{\frac{c_1 R\alpha}{K}}.$$

On the other hand, as $N(n, \Theta, F, K, \delta)$ is nonincreasing in $\delta$, we have

$$c_0\sqrt{R \log N(n, \Theta, F, K, \sqrt{R})} \le c_0 \int_0^{\sqrt{R}} \sqrt{\log N(n, \Theta, F, K, u)}\, du \le \alpha.$$

Applying Theorem A.4, we find that

$$\mathbf{P}\Bigg[F \cap \Bigg\{\sup_{\theta\in\Theta} \mathbf{1}_{R^\theta_n \le R} \max_{i\le n} M^\theta_i \ge \Bigg(\frac{16c_1}{c_0^2} + \frac{64}{c_0} + \frac{32}{\sqrt{C^2(c_1+1)}} + \frac{16}{C^2}\Bigg)\alpha\Bigg\}\Bigg] \le 2\exp\Bigg[-\frac{\alpha^2}{C^2(c_1+1)R}\Bigg].$$

But using $c_0^2 \ge C^2(c_1+1) \ge C^2$, we can estimate

$$\frac{16c_1}{c_0^2} + \frac{64}{c_0} + \frac{32}{\sqrt{C^2(c_1+1)}} + \frac{16}{C^2} \le \frac{32}{C^2} + \frac{96}{C} \le 1$$

for $C$ sufficiently large (e.g., $C = 100$).

The remainder of the Appendix is devoted to the proof of Theorem A.4.
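The final step in the proof of Proposition A.2 is elementary arithmetic and can be verified directly. In this sketch $C_3 = 2$ is an arbitrary placeholder, used only to generate an admissible pair $(c_0, c_1)$ via the choices made in the proof of Proposition 3.1:

```python
import math

C = 100.0
C3 = 2.0                       # arbitrary placeholder
c1 = 1.0 / (8.0 * C3)          # the choice c1 = (8 C3)^{-1}
c0 = C * math.sqrt(c1 + 1.0)   # smallest c0 allowed by c0^2 >= C^2 (c1 + 1)

total = 16.0 * c1 / c0**2 + 64.0 / c0 + 32.0 / (C * math.sqrt(c1 + 1.0)) + 16.0 / C**2
# the coefficient of alpha in Theorem A.4 is then at most 1
assert total <= 32.0 / C**2 + 96.0 / C <= 1.0
```

With $C = 100$ the middle quantity equals $0.0032 + 0.96 = 0.9632$, so the coefficient of $\alpha$ is indeed at most one.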
It should be emphasized that the approach taken here is entirely standard in empirical process theory: the notion of bracketing entropy for martingales and the proof of the requisite form of Bernstein's inequality follow van de Geer [7], while the relatively transparent proof of Theorem A.4 closely follows the proof given by Massart [6], Theorem 6.8 in the i.i.d. setting. The full proofs are given here for completeness. Note also that we have made no effort to optimize the constants in the proof (the constants are necessarily somewhat larger than those obtained in [6] due to the presence of the additional maximum $\max_{i\le n} M^\theta_i$).

A.1. A variant of Bernstein's inequality. The following result is a variant of Bernstein's inequality for martingales. It slightly improves on [7], Lemma 8.11 in that we do not assume that $\mathbf{E}[\xi_i|\mathcal{F}_{i-1}] = 0$ for all $i$ (though it appears that this version is implicitly used in the proof of [7], Theorem 8.13).

PROPOSITION A.5. Let $(\xi_i)_{i\ge 1}$ be a sequence of random variables such that $\xi_i$ is $\mathcal{F}_i$-measurable for all $i$, and define the martingale

$$M_j = \sum_{i=1}^j \big\{\xi_i - \mathbf{E}[\xi_i|\mathcal{F}_{i-1}]\big\} \quad \text{for all } j \ge 0.$$

Fix $K > 0$, and let $(Z_j)_{j\ge 0}$ be predictable (i.e., $Z_j$ is $\mathcal{F}_{j-1}$-measurable) such that

$$\sum_{i=1}^j \mathbf{E}[|\xi_i|^m \,|\, \mathcal{F}_{i-1}] \le m!\, K^m Z_j \quad \text{for all } m \ge 2,\ j \ge 0.$$

Then we have for all $\alpha > 0$ and $Z > 0$

$$\mathbf{P}[M_j \ge \alpha \text{ and } Z_j \le Z \text{ for some } j] \le \exp\Bigg[-\frac{\alpha^2}{2K(\alpha + 2KZ)}\Bigg].$$

PROOF. Given $\lambda^{-1} > K$ we define the process $(S_j)_{j\ge 0}$ as $S_j = e^{\lambda M_j - Z^\lambda_j}$, where $Z^\lambda_j = \sum_{i=1}^j \mathbf{E}[\phi(\lambda|\xi_i|)|\mathcal{F}_{i-1}]$. Using $e^{\lambda\xi_j} = 1 + \lambda\xi_j + \phi(\lambda\xi_j)$, we find

$$\frac{S_j}{S_{j-1}} = e^{\lambda\xi_j - \mathbf{E}[\lambda\xi_j|\mathcal{F}_{j-1}] - \mathbf{E}[\phi(\lambda|\xi_j|)|\mathcal{F}_{j-1}]} = \big\{1 + \lambda\xi_j + \phi(\lambda\xi_j)\big\}\, e^{-\mathbf{E}[\lambda\xi_j|\mathcal{F}_{j-1}] - \mathbf{E}[\phi(\lambda|\xi_j|)|\mathcal{F}_{j-1}]}.$$
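As an aside, the algebra behind the calibration $\lambda^{-1} = K + 2K^2Z/\alpha$ used to complete this proof can be verified numerically: that choice turns the exponent $\lambda\alpha - \lambda^2K^2Z/(1-\lambda K)$ into exactly $\alpha^2/(2K(\alpha+2KZ))$. A sketch over arbitrary parameter ranges:

```python
import random

random.seed(0)
for _ in range(1000):
    K = random.uniform(0.1, 5.0)
    Z = random.uniform(0.1, 5.0)
    alpha = random.uniform(0.1, 5.0)
    lam = 1.0 / (K + 2.0 * K * K * Z / alpha)   # the calibration made in the proof
    assert lam < 1.0 / K                         # needed for the geometric series bound
    exponent = lam * alpha - lam**2 * K**2 * Z / (1.0 - lam * K)
    target = alpha**2 / (2.0 * K * (alpha + 2.0 * K * Z))
    assert abs(exponent - target) <= 1e-9 * target
```

In other words, the stated bound is precisely the supermartingale bound evaluated at its optimizing $\lambda$.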
Now using the basic properties $\phi(x) \le \phi(|x|)$ and $1 + x \le e^x$, we have

$$\mathbf{E}\Bigg[\frac{S_j}{S_{j-1}} \,\Bigg|\, \mathcal{F}_{j-1}\Bigg] \le \big\{1 + \mathbf{E}[\lambda\xi_j|\mathcal{F}_{j-1}] + \mathbf{E}[\phi(\lambda|\xi_j|)|\mathcal{F}_{j-1}]\big\}\, e^{-\mathbf{E}[\lambda\xi_j|\mathcal{F}_{j-1}] - \mathbf{E}[\phi(\lambda|\xi_j|)|\mathcal{F}_{j-1}]} \le 1.$$

Thus $S_j$ is a positive supermartingale. To proceed, define the stopping time $\tau = \min\{j : M_j \ge \alpha \text{ and } Z_j \le Z\}$. Then $\{M_j \ge \alpha \text{ and } Z_j \le Z \text{ for some } j\} = \{\tau < \infty\}$. Moreover, as $\lambda^{-1} > K$,

$$Z^\lambda_j = \sum_{\ell=2}^\infty \frac{\lambda^\ell}{\ell!} \sum_{i=1}^j \mathbf{E}\big[|\xi_i|^\ell \,\big|\, \mathcal{F}_{i-1}\big] \le Z_j \sum_{\ell=2}^\infty (\lambda K)^\ell = \frac{\lambda^2 K^2}{1 - \lambda K}\, Z_j \quad \text{for all } j.$$

Therefore $Z^\lambda_\tau \le \lambda^2 K^2 Z_\tau/(1-\lambda K)$, and we can estimate

$$S_\tau = e^{\lambda M_\tau - Z^\lambda_\tau} \ge e^{\lambda M_\tau - \lambda^2 K^2 Z_\tau/(1-\lambda K)} \ge e^{\lambda\alpha - \lambda^2 K^2 Z/(1-\lambda K)} \quad \text{on } \{\tau < \infty\}.$$

We obtain, using the supermartingale property,

$$\mathbf{P}[\tau < \infty] \le \mathbf{E}\big[\mathbf{1}_{\tau<\infty}\, e^{\lambda^2 K^2 Z/(1-\lambda K) - \lambda\alpha}\, S_\tau\big] \le e^{\lambda^2 K^2 Z/(1-\lambda K) - \lambda\alpha}.$$

The proof is completed by choosing $\lambda^{-1} = K + 2K^2 Z/\alpha$.

COROLLARY A.6. Let $(\xi_i)_{1\le i\le n}$ be a sequence of random variables such that $\xi_i$ is $\mathcal{F}_i$-measurable for all $i$, and fix $K > 0$. Define $(M_j)_{0\le j\le n}$ and $(R_j)_{0\le j\le n}$ as

$$M_j = \sum_{i=1}^j \big\{\xi_i - \mathbf{E}[\xi_i|\mathcal{F}_{i-1}]\big\}, \qquad R_j = 2K^2 \sum_{i=1}^j \mathbf{E}\Bigg[\phi\Bigg(\frac{|\xi_i|}{K}\Bigg) \,\Bigg|\, \mathcal{F}_{i-1}\Bigg].$$

Then we have for all $\alpha > 0$ and $R > 0$

$$\mathbf{P}\Big[\max_{j\le n} M_j \ge \alpha \text{ and } R_n \le R\Big] \le \exp\Bigg[-\frac{\alpha^2}{2(K\alpha + R)}\Bigg].$$

If in addition $\|\xi_i\|_\infty \le 3U$ for all $i$, then for all $\alpha > 0$ and $R > 0$

$$\mathbf{P}\Big[\max_{j\le n} M_j \ge \alpha \text{ and } R_n \le R\Big] \le \exp\Bigg[-\frac{\alpha^2}{2(U\alpha + R)}\Bigg].$$

PROOF. To obtain the first inequality, note that for any $m \ge 2$ and $j \ge 0$

$$\frac{1}{m!\, K^m} \sum_{i=1}^j \mathbf{E}[|\xi_i|^m|\mathcal{F}_{i-1}] \le \sum_{\ell=2}^\infty \frac{1}{\ell!\, K^\ell} \sum_{i=1}^j \mathbf{E}[|\xi_i|^\ell|\mathcal{F}_{i-1}] = \frac{R_j}{2K^2}.$$

We can therefore apply Proposition A.5 with $Z_j = R_j/2K^2$. For the second inequality, note that $\|\xi_i\|_\infty \le 3U$ implies that for all $m \ge 2$ and $j \ge 0$

$$\sum_{i=1}^j \mathbf{E}[|\xi_i|^m|\mathcal{F}_{i-1}] \le (3U)^{m-2} \sum_{i=1}^j \mathbf{E}\big[|\xi_i|^2\,\big|\,\mathcal{F}_{i-1}\big] \le (3U)^{m-2} R_j \le \frac{m!\, U^m R_j}{2U^2},$$

where we used that $m! \ge 2 \times 3^{m-2}$ for $m \ge 2$. We can therefore apply Proposition A.5 with $Z_j = R_j/2U^2$. It remains to use that $R_j$ is nondecreasing.

A.2. Maximal inequalities for finite sets. The following result allows us to control finite families of random variables that satisfy a Bernstein-type deviation inequality. A sharper form of this result can be obtained using an estimate on the moment generating function of the random variables, see [6], Lemma 2.3, but we do not have such an estimate for the maximum $\max_{i\le n} M^\theta_i$. Throughout the remainder of the Appendix, we define $\mathbf{E}_A[X] = \mathbf{E}[\mathbf{1}_A X]/\mathbf{P}[A]$ for any event $A \in \mathcal{F}$.

LEMMA A.7. Let $X_1, \dots, X_N$ be random variables such that

$$\mathbf{P}[|X_i| \ge \alpha] \le \exp\Bigg[-\frac{\alpha^2}{2(K\alpha + R)}\Bigg] \quad \text{for all } 1 \le i \le N.$$

Then we have for any event $A \in \mathcal{F}$

$$\mathbf{E}_A\Big[\max_{i=1,\dots,N} |X_i|\Big] \le \sqrt{8R \log\Big(1 + \frac{N}{\mathbf{P}[A]}\Big)} + 8K \log\Big(1 + \frac{N}{\mathbf{P}[A]}\Big).$$

PROOF. Let $\psi(x)$ be a Young function. Then

$$\psi\Bigg(\frac{\mathbf{E}_A[\max_{i\le N}|X_i|]}{\max_{i\le N}\|X_i\|_\psi}\Bigg) \le \mathbf{E}_A\Bigg[\max_{i\le N} \psi\Bigg(\frac{|X_i|}{\|X_i\|_\psi}\Bigg)\Bigg] \le \sum_{i\le N} \mathbf{E}_A\Bigg[\psi\Bigg(\frac{|X_i|}{\|X_i\|_\psi}\Bigg)\Bigg] \le \frac{1}{\mathbf{P}[A]} \sum_{i\le N} \mathbf{E}\Bigg[\psi\Bigg(\frac{|X_i|}{\|X_i\|_\psi}\Bigg)\Bigg] \le \frac{N}{\mathbf{P}[A]},$$

where $\|\cdot\|_\psi$ denotes the Orlicz norm. Therefore

$$\mathbf{E}_A\Big[\max_{i=1,\dots,N}|X_i|\Big] \le \psi^{-1}\Big(\frac{N}{\mathbf{P}[A]}\Big) \max_{i=1,\dots,N} \|X_i\|_\psi.$$

To proceed, note that for $1 \le i \le N$

$$\mathbf{P}\big[|X_i|\mathbf{1}_{|X_i|\le R/K} \ge \alpha\big] = \mathbf{P}[R/K \ge |X_i| \ge \alpha] \le \exp\Bigg[-\frac{\alpha^2}{4R}\Bigg],$$
$$\mathbf{P}\big[|X_i|\mathbf{1}_{|X_i|\ge R/K} \ge \alpha\big] = \mathbf{P}[|X_i| \ge \alpha \vee R/K] \le \exp\Big[-\frac{\alpha}{4K}\Big].$$

By [8], Lemma 2.2.1, $\|X_i\mathbf{1}_{|X_i|\le R/K}\|_{\psi_2} \le \sqrt{8R}$ and $\|X_i\mathbf{1}_{|X_i|\ge R/K}\|_{\psi_1} \le 8K$ for all $i$, where $\psi_p(x) = e^{x^p} - 1$. The proof is easily completed.

COROLLARY A.8. Let $(\xi^h_i)_{1\le i\le n}$, $h = 1, \dots, N$ be random variables such that $\xi^h_i$ is $\mathcal{F}_i$-measurable for all $i, h$. Fix $K > 0$, and define

$$M^h_j = \sum_{i=1}^j \big\{\xi^h_i - \mathbf{E}[\xi^h_i|\mathcal{F}_{i-1}]\big\}, \qquad R^h_j = 2K^2 \sum_{i=1}^j \mathbf{E}\Bigg[\phi\Bigg(\frac{|\xi^h_i|}{K}\Bigg) \,\Bigg|\, \mathcal{F}_{i-1}\Bigg].$$

Then we have

$$\mathbf{E}_A\Bigg[\max_{h=1,\dots,N} \mathbf{1}_{R^h_n \le R} \max_{j\le n} M^h_j\Bigg] \le \sqrt{8R \log\Big(1 + \frac{N}{\mathbf{P}[A]}\Big)} + 8K \log\Big(1 + \frac{N}{\mathbf{P}[A]}\Big)$$

for any event $A \in \mathcal{F}$. If in addition $\|\xi^h_i\|_\infty \le 3U$ for all $i, h$, then

$$\mathbf{E}_A\Bigg[\max_{h=1,\dots,N} \mathbf{1}_{R^h_n \le R} \max_{j\le n} M^h_j\Bigg] \le \sqrt{8R \log\Big(1 + \frac{N}{\mathbf{P}[A]}\Big)} + 8U \log\Big(1 + \frac{N}{\mathbf{P}[A]}\Big)$$

for any event $A \in \mathcal{F}$.

PROOF. Apply the previous lemma with $X_h = \mathbf{1}_{R^h_n \le R} \max_{j\le n} M^h_j$. Note that as $M^h_0 = 0$, certainly $X_h \ge 0$. Therefore $X_h = |X_h|$, and the requisite tail bounds are obtained immediately from Corollary A.6 above.

A.3. Proof of Theorem A.4. We now proceed to the proof of Theorem A.4. We follow closely the proof given by Massart [6], Theorem 6.8 in the i.i.d. setting. The general approach, by means of a chaining device with bracketing with adaptive truncation, is standard in empirical process theory.

Before we proceed to the proof, let us define the function

$$\Phi(x) := 16H + 32\sqrt{Rx} + 16Kx,$$

where $H$ is as defined in Theorem A.4. We claim that in order to prove the Theorem, it actually suffices to prove the estimate

$$\mathbf{E}_A\Bigg[\sup_{\theta\in\Theta} \mathbf{1}_{R^\theta_n \le R} \max_{i\le n} M^\theta_i\Bigg] \le \Phi\Bigg(\log\Big(1 + \frac{1}{\mathbf{P}[A]}\Big)\Bigg)$$

for any event $A \subseteq F$. Indeed, if this is the case, then choosing

$$A = F \cap \Bigg\{\sup_{\theta\in\Theta} \mathbf{1}_{R^\theta_n \le R} \max_{i\le n} M^\theta_i \ge \Phi(x)\Bigg\}$$

allows us to estimate

$$\Phi(x) \le \mathbf{E}_A\Bigg[\sup_{\theta\in\Theta} \mathbf{1}_{R^\theta_n \le R} \max_{i\le n} M^\theta_i\Bigg] \le \Phi\Bigg(\log\Big(\frac{2}{\mathbf{P}[A]}\Big)\Bigg),$$

from which the conclusion of Theorem A.4 is immediate. We therefore concentrate without loss of generality on obtaining the above estimate.

PROOF OF THEOREM A.4. We fix $n \in \mathbb{N}$, $K, R < \infty$, $F \in \mathcal{F}$ and $A \subseteq F$ throughout the proof. Define $\delta_j = 2^{-j}\sqrt{R}$ and $N_j = N(n, \Theta, F, K, \delta_j)$ for $j \ge 0$. We assume that $N_j < \infty$ for all $j$, otherwise there is nothing to prove.
Therefore, for each $j$, we can choose a collection $B_j = \{(\Lambda^{j,\rho}_i, \Upsilon^{j,\rho}_i)_{1\le i\le n}\}_{\rho=1,\dots,N_j}$ that satisfies the conditions of Definition A.1, and these will remain fixed throughout the proof. In particular, for every $j, \theta$, there exists $\rho(j,\theta)$ such that

$$\Lambda^{j,\rho(j,\theta)}_i \le \xi^\theta_i \le \Upsilon^{j,\rho(j,\theta)}_i \quad \text{for all } i = 1, \dots, n.$$

For notational simplicity, we will write

$$\Pi^{j,\theta}_i = \Upsilon^{j,\rho(j,\theta)}_i, \qquad \Delta^{j,\theta}_i = \Upsilon^{j,\rho(j,\theta)}_i - \Lambda^{j,\rho(j,\theta)}_i.$$

At the heart of the proof is a chaining device: we introduce the telescoping sum

$$\xi^\theta_i = \big\{\xi^\theta_i - \Pi^{\tau^\theta_i,\theta}_i \wedge \Pi^{\tau^\theta_i - 1,\theta}_i\big\} + \big\{\Pi^{\tau^\theta_i,\theta}_i \wedge \Pi^{\tau^\theta_i-1,\theta}_i - \Pi^{\tau^\theta_i-1,\theta}_i\big\} + \sum_{j=1}^{\tau^\theta_i - 1} \big\{\Pi^{j,\theta}_i - \Pi^{j-1,\theta}_i\big\} + \Pi^{0,\theta}_i,$$

where by convention $\Pi^{-1,\theta}_i = \Pi^{0,\theta}_i$. The length of the chain is chosen adaptively:

$$\tau^\theta_i = \min\{j \ge 0 : \Delta^{j,\theta}_i > a_j\} \wedge J.$$

The levels $a_j > 0$ and $J \ge 1$ will be determined later on (we will choose $a_j$ to control the second term in Corollary A.8, and we will ultimately let $J \to \infty$). It will be convenient to split the chain into three parts:

$$\xi^\theta_i = \Pi^{0,\theta}_i + \sum_{j=0}^J \big(\xi^\theta_i - \Pi^{j,\theta}_i \wedge \Pi^{j-1,\theta}_i\big)\mathbf{1}_{\tau^\theta_i = j} \tag{A.1}$$
$$\qquad\qquad + \sum_{j=1}^J \Big\{\big(\Pi^{j,\theta}_i \wedge \Pi^{j-1,\theta}_i - \Pi^{j-1,\theta}_i\big)\mathbf{1}_{\tau^\theta_i = j} + \big(\Pi^{j,\theta}_i - \Pi^{j-1,\theta}_i\big)\mathbf{1}_{\tau^\theta_i > j}\Big\}. \tag{A.2}$$

Denote by $b^{j,\theta}_i$ the summands in (A.1), by $c^{j,\theta}_i$ the summands in (A.2), and define the martingales $A^\theta_i = \sum_{\ell=1}^i \{\Pi^{0,\theta}_\ell - \mathbf{E}[\Pi^{0,\theta}_\ell|\mathcal{F}_{\ell-1}]\}$, $B^{j,\theta}_i = \sum_{\ell=1}^i \{b^{j,\theta}_\ell - \mathbf{E}[b^{j,\theta}_\ell|\mathcal{F}_{\ell-1}]\}$, and $C^{j,\theta}_i = \sum_{\ell=1}^i \{c^{j,\theta}_\ell - \mathbf{E}[c^{j,\theta}_\ell|\mathcal{F}_{\ell-1}]\}$. We will control each martingale separately.

Control of $A^\theta$. As $\phi$ is convex and nondecreasing, and as $|\Pi^{0,\theta}_\ell - \xi^\theta_\ell| \le |\Delta^{0,\theta}_\ell|$,

$$\phi\Bigg(\frac{|\Pi^{0,\theta}_\ell|}{2K}\Bigg) \le \phi\Bigg(\frac{|\Pi^{0,\theta}_\ell - \xi^\theta_\ell| + |\xi^\theta_\ell|}{2K}\Bigg) \le \frac{1}{2}\phi\Bigg(\frac{|\Delta^{0,\theta}_\ell|}{K}\Bigg) + \frac{1}{2}\phi\Bigg(\frac{|\xi^\theta_\ell|}{K}\Bigg).$$

Using Definition A.1, we find that

$$R^{0,\theta}_n := 8K^2 \sum_{\ell=1}^n \mathbf{E}\Bigg[\phi\Bigg(\frac{|\Pi^{0,\theta}_\ell|}{2K}\Bigg) \,\Bigg|\, \mathcal{F}_{\ell-1}\Bigg] \le 2(\delta_0^2 + R) = 4R \quad \text{on } \{R^\theta_n \le R\} \cap F.$$

Therefore

$$\mathbf{E}_A\Bigg[\sup_{\theta\in\Theta} \mathbf{1}_{R^\theta_n \le R} \max_{i\le n} A^\theta_i\Bigg] \le \mathbf{E}_A\Bigg[\sup_{\theta\in\Theta} \mathbf{1}_{R^{0,\theta}_n \le 2(\delta_0^2 + R)} \max_{i\le n} A^\theta_i\Bigg] \le \sqrt{32R \log\Big(1 + \frac{N_0}{\mathbf{P}[A]}\Big)} + 16K \log\Big(1 + \frac{N_0}{\mathbf{P}[A]}\Big)$$

by Corollary A.8, where we have used that $A \subseteq F$.

Control of $B^\theta$. Note that $b^{j,\theta}_\ell \le 0$, so that

$$b^{j,\theta}_\ell - \mathbf{E}[b^{j,\theta}_\ell|\mathcal{F}_{\ell-1}] \le \mathbf{E}\big[(\Pi^{j,\theta}_\ell \wedge \Pi^{j-1,\theta}_\ell - \xi^\theta_\ell)\mathbf{1}_{\tau^\theta_\ell = j} \,\big|\, \mathcal{F}_{\ell-1}\big] \le \mathbf{E}\big[\Delta^{j,\theta}_\ell \mathbf{1}_{\tau^\theta_\ell = j} \,\big|\, \mathcal{F}_{\ell-1}\big].$$

Consider first the case that $j < J$. When $\tau^\theta_\ell = j$, we have $\Delta^{j,\theta}_\ell > a_j$. Thus

$$b^{j,\theta}_\ell - \mathbf{E}[b^{j,\theta}_\ell|\mathcal{F}_{\ell-1}] \le \frac{1}{a_j} \mathbf{E}\big[|\Delta^{j,\theta}_\ell|^2 \,\big|\, \mathcal{F}_{\ell-1}\big] \le \frac{2K^2}{a_j} \mathbf{E}\Bigg[\phi\Bigg(\frac{|\Delta^{j,\theta}_\ell|}{K}\Bigg) \,\Bigg|\, \mathcal{F}_{\ell-1}\Bigg],$$

where we have used $|x|^2 \le 2K^2\phi(|x|/K)$. In particular,

$$B^{j,\theta}_i \le \frac{2K^2}{a_j} \sum_{\ell=1}^i \mathbf{E}\Bigg[\phi\Bigg(\frac{|\Delta^{j,\theta}_\ell|}{K}\Bigg) \,\Bigg|\, \mathcal{F}_{\ell-1}\Bigg] \le \frac{\delta_j^2}{a_j} \quad \text{on } F,$$

where we have applied Definition A.1. As $A \subseteq F$, it follows that

$$\mathbf{E}_A\Bigg[\sup_{\theta\in\Theta} \mathbf{1}_{R^\theta_n \le R} \max_{i\le n} B^{j,\theta}_i\Bigg] \le \frac{\delta_j^2}{a_j} \quad \text{for } j < J.$$

Now consider the case $j = J$. We can estimate

$$B^{J,\theta}_i \le \sum_{\ell=1}^i \mathbf{E}[\Delta^{J,\theta}_\ell|\mathcal{F}_{\ell-1}] \le \Bigg[i \sum_{\ell=1}^i \mathbf{E}\big[|\Delta^{J,\theta}_\ell|^2\,\big|\,\mathcal{F}_{\ell-1}\big]\Bigg]^{1/2} \le \delta_J \sqrt{i} \quad \text{on } F,$$

where we have applied the same computations as above. It follows that

$$\mathbf{E}_A\Bigg[\sup_{\theta\in\Theta} \mathbf{1}_{R^\theta_n \le R} \max_{i\le n} B^{J,\theta}_i\Bigg] \le \delta_J \sqrt{n},$$

where we have used that $A \subseteq F$.

Control of $C^\theta$. As $\Pi^{j,\theta}_\ell - \Pi^{j-1,\theta}_\ell = \Pi^{j,\theta}_\ell - \xi^\theta_\ell + \xi^\theta_\ell - \Pi^{j-1,\theta}_\ell$, we have

$$-\Delta^{j-1,\theta}_\ell \le \Pi^{j,\theta}_\ell - \Pi^{j-1,\theta}_\ell \le \Delta^{j,\theta}_\ell, \qquad -\Delta^{j-1,\theta}_\ell \le \Pi^{j,\theta}_\ell \wedge \Pi^{j-1,\theta}_\ell - \Pi^{j-1,\theta}_\ell \le 0.$$

Therefore $-\Delta^{j-1,\theta}_\ell \mathbf{1}_{\tau^\theta_\ell \ge j} \le c^{j,\theta}_\ell \le \Delta^{j,\theta}_\ell \mathbf{1}_{\tau^\theta_\ell > j}$. As $\Delta^{j,\theta}_\ell \le a_j$ whenever $\tau^\theta_\ell > j$, we find that $\|c^{j,\theta}_\ell\|_\infty \le a_{j-1} \vee a_j$. Moreover, as $|c^{j,\theta}_\ell| \le \Delta^{j-1,\theta}_\ell \vee \Delta^{j,\theta}_\ell \le \Delta^{j-1,\theta}_\ell + \Delta^{j,\theta}_\ell$, we obtain using that $\phi$ is convex and nondecreasing (in the same manner as above for the control of $A^\theta$)

$$R^{j,\theta}_n := 8K^2 \sum_{\ell=1}^n \mathbf{E}\Bigg[\phi\Bigg(\frac{|c^{j,\theta}_\ell|}{2K}\Bigg) \,\Bigg|\, \mathcal{F}_{\ell-1}\Bigg] \le 2(\delta_{j-1}^2 + \delta_j^2) \quad \text{on } F,$$

where we have used Definition A.1. As $A \subseteq F$, we can therefore estimate

$$\mathbf{E}_A\Bigg[\sup_{\theta\in\Theta} \mathbf{1}_{R^\theta_n \le R} \max_{i\le n} C^{j,\theta}_i\Bigg] \le \mathbf{E}_A\Bigg[\sup_{\theta\in\Theta} \mathbf{1}_{R^{j,\theta}_n \le 2(\delta_{j-1}^2+\delta_j^2)} \max_{i\le n} C^{j,\theta}_i\Bigg].$$

Now note that $c^{j,\theta}_\ell$ depends on $\theta$ only through the values of $\rho(0,\theta), \dots, \rho(j,\theta)$. In particular, for fixed $j$, the supremum of $\mathbf{1}_{R^{j,\theta}_n \le 2(\delta_{j-1}^2+\delta_j^2)} \max_{i\le n} C^{j,\theta}_i$ as $\theta$ varies over $\Theta$ is in fact only the maximum over a finite collection of random variables, whose cardinality is bounded above by the quantity $\bar N_j := \prod_{p=0}^j N_p$. We therefore obtain the estimate

$$\mathbf{E}_A\Bigg[\sup_{\theta\in\Theta} \mathbf{1}_{R^\theta_n \le R} \max_{i\le n} C^{j,\theta}_i\Bigg] \le \sqrt{16(\delta_{j-1}^2+\delta_j^2)\log\Big(1 + \frac{\bar N_j}{\mathbf{P}[A]}\Big)} + \frac{8}{3}(a_{j-1}\vee a_j)\log\Big(1 + \frac{\bar N_j}{\mathbf{P}[A]}\Big),$$

where we have applied Corollary A.8.

End of the proof. Note that by construction

$$M^\theta_i = A^\theta_i + \sum_{j=0}^J B^{j,\theta}_i + \sum_{j=1}^J C^{j,\theta}_i \quad \text{for all } i, \theta.$$

Collecting the above estimates gives

$$\mathbf{E}_A\Bigg[\sup_{\theta\in\Theta} \mathbf{1}_{R^\theta_n \le R} \max_{i\le n} M^\theta_i\Bigg] \le \delta_J\sqrt{n} + \delta_0\sqrt{32\log\Big(1+\frac{N_0}{\mathbf{P}[A]}\Big)} + 16K\log\Big(1+\frac{N_0}{\mathbf{P}[A]}\Big) + \sum_{j=0}^{J-1}\frac{\delta_j^2}{a_j} + \sum_{j=1}^J \Bigg\{\delta_j\sqrt{80\log\Big(1+\frac{\bar N_j}{\mathbf{P}[A]}\Big)} + \frac{8}{3}(a_{j-1}\vee a_j)\log\Big(1+\frac{\bar N_j}{\mathbf{P}[A]}\Big)\Bigg\}.$$

We aim to choose $a_j$ such that the $\log(1+\bar N_j/\mathbf{P}[A])$ terms disappear. Set

$$a_j = \delta_j\Bigg(\frac{8}{3}\log\Big(1+\frac{\bar N_{j+1}}{\mathbf{P}[A]}\Big)\Bigg)^{-1/2}.$$

Then $a_j$ is decreasing with increasing $j$, so $a_{j-1}\vee a_j = a_{j-1}$ and

$$\mathbf{E}_A\Bigg[\sup_{\theta\in\Theta} \mathbf{1}_{R^\theta_n \le R} \max_{i\le n} M^\theta_i\Bigg] \le \delta_J\sqrt{n} + 16K\log\Big(1+\frac{N_0}{\mathbf{P}[A]}\Big) + 16\sum_{j=0}^J \delta_j\sqrt{\log\Big(1+\frac{\bar N_j}{\mathbf{P}[A]}\Big)}.$$

We now estimate as follows:

$$\sum_{j=0}^J \delta_j\sqrt{\log\Big(1+\frac{\bar N_j}{\mathbf{P}[A]}\Big)} \le \sum_{j=0}^J \delta_j\sqrt{\log\Big(1+\frac{1}{\mathbf{P}[A]}\Big)} + \sum_{j=0}^J \delta_j \sum_{p=0}^j \sqrt{\log N_p},$$

and

$$\sum_{j=0}^J \delta_j \sum_{p=0}^j \sqrt{\log N_p} \le \sum_{p=0}^\infty \sqrt{\log N_p} \sum_{j=p}^\infty \delta_j = 4\sum_{p=0}^\infty (\delta_p - \delta_{p+1})\sqrt{\log N_p} \le 4\int_0^{\sqrt{R}} \sqrt{\log N(n,\Theta,F,K,u)}\, du.$$
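The last display rests on two elementary facts about the dyadic levels $\delta_j = 2^{-j}\sqrt{R}$: the tail identity $\sum_{j\ge p}\delta_j = 2\delta_p = 4(\delta_p - \delta_{p+1})$, and the domination of the dyadic sum by the entropy integral when the integrand is nonincreasing. Both can be checked numerically; the entropy function below is a toy nonincreasing placeholder, not the paper's $N(n,\Theta,F,K,\cdot)$:

```python
import math

R = 4.0

def delta(j):
    # dyadic levels delta_j = 2^{-j} sqrt(R)
    return 2.0 ** (-j) * math.sqrt(R)

# identity: sum_{j >= p} delta_j = 2 delta_p = 4 (delta_p - delta_{p+1})
for p in range(10):
    tail = sum(delta(j) for j in range(p, p + 200))   # numerically the infinite tail
    assert abs(tail - 4.0 * (delta(p) - delta(p + 1))) < 1e-12

def logN(u):
    # a toy nonincreasing "bracketing entropy" u -> log N(u)
    return max(0.0, math.log(10.0 / u))

# domination: sum_p (delta_p - delta_{p+1}) sqrt(logN(delta_p)) <= int_0^{sqrt(R)} sqrt(logN(u)) du
lhs = sum((delta(p) - delta(p + 1)) * math.sqrt(logN(delta(p))) for p in range(60))
n = 100000
h = math.sqrt(R) / n
integral = sum(math.sqrt(logN((i + 0.5) * h)) * h for i in range(n))
assert lhs <= integral + 1e-6
```

The domination holds because on each interval $[\delta_{p+1}, \delta_p]$ the nonincreasing integrand is at least its value at the right endpoint $\delta_p$.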
We obtain

$$\mathbf{E}_A\Bigg[\sup_{\theta\in\Theta} \mathbf{1}_{R^\theta_n \le R} \max_{i\le n} M^\theta_i\Bigg] \le \delta_J\sqrt{n} + \Phi\Bigg(\log\Big(1 + \frac{1}{\mathbf{P}[A]}\Big)\Bigg).$$

The result follows by letting $J \to \infty$.

REFERENCES

[1] Cappé, O., Moulines, E. and Rydén, T. (2005). Inference in Hidden Markov Models. Springer Series in Statistics. Springer, New York. With Randal Douc's contributions to Chapter 9 and Christian P. Robert's to Chapters 6, 7 and 13, with Chapter 14 by Gersende Fort, Philippe Soulier and Moulines, and Chapter 15 by Stéphane Boucheron and Elisabeth Gassiat.

[2] Csiszár, I. (2002). Large-scale typicality of Markov sample paths and consistency of MDL order estimators. IEEE Trans. Inform. Theory 48, 6, 1616–1628. Special issue on Shannon theory: perspective, trends, and applications.

[3] Csiszár, I. and Shields, P. C. (2000). The consistency of the BIC Markov order estimator. Ann. Statist. 28, 6, 1601–1619.

[4] Finesso, L. (1990). Consistent estimation of the order for Markov and hidden Markov chains. Ph.D. thesis, Univ. of Maryland.

[5] Kieffer, J. C. (1993). Strongly consistent code-based identification and order estimation for constrained finite-state model classes. IEEE Trans. Inform. Theory 39, 3, 893–902.

[6] Massart, P. (2007). Concentration Inequalities and Model Selection. Lecture Notes in Mathematics, Vol. 1896. Springer, Berlin. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003, with a foreword by Jean Picard.

[7] van de Geer, S. A. (2000). Applications of Empirical Process Theory. Cambridge Series in Statistical and Probabilistic Mathematics, Vol. 6. Cambridge University Press, Cambridge.

[8] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer-Verlag, New York. With applications to statistics.
Department of Operations Research and Financial Engineering
Princeton University
Princeton, NJ 08544, USA
E-mail: rvan@princeton.edu
