On Sequential Estimation and Prediction for Discrete Time Series

The problem of extracting as much information as possible from a sequence of observations of a stationary stochastic process $X_0,X_1,...X_n$ has been considered by many authors from different points of view. It has long been known through the work o…

Authors: G. Morvai, B. Weiss

Gusztáv Morvai and Benjamin Weiss, Stochastics and Dynamics, Vol. 7, No. 4, pp. 417-437, 2007.

Abstract. The problem of extracting as much information as possible from a sequence of observations of a stationary stochastic process $X_0, X_1, \ldots, X_n$ has been considered by many authors from different points of view. It has long been known through the work of D. Bailey that no universal estimator for $P(X_{n+1} \mid X_0, X_1, \ldots, X_n)$ can be found which converges to the true estimator almost surely. Despite this result, for restricted classes of processes, or for sequences of estimators along stopping times, universal estimators can be found. We present here a survey of some of the recent work that has been done along these lines.

1 Introduction

In a short communication that appeared in the Proceedings of the First International IEEE-USSR Information Workshop [7], Tom Cover formulated a number of problems that have generated a substantial literature during the past thirty years. We plan to survey a portion of these works, biased to be sure by our own interests. We begin by quoting from Cover's paper and recalling his first two questions:

"1. A Question on the Prediction of Ergodic Processes. The statement that 'we can learn the statistics of an ergodic process from a sample function with probability 1' is being investigated for operational significance. Let $\{X_n\}_{-\infty}^{\infty}$ be a stationary binary ergodic process with conditional probability distributions $p(x_{n+1} \mid x_n, \ldots, x_1)$, $n = 1, 2, \ldots$. We know that we can learn the statistics with probability 1, but can we learn $p$ fast enough? In other words, does there exist an estimate $\hat{p} : \mathcal{X} \times \mathcal{X}^{\star} \to [0,1]$, where $\mathcal{X}^{\star}$ is the collection of all finite strings, for which $\hat{p}(X_{n+1} \mid X_n, \ldots, X_1) - p(X_{n+1} \mid X_n, \ldots, X_1) \to 0$ with probability 1? Does there also exist a predictor $\hat{p}$ yielding the convergence $\hat{p}(X_0 \mid X_{-1}, X_{-2}, \ldots, X_{-n}) \to p(X_0 \mid X_{-1}, X_{-2}, \ldots)$? Since the statement of this problem, Bailey and Ornstein have obtained some as yet unpublished results on this question that indicate a negative answer to the first question and a positive answer to the second."

Since the processes are stationary, the (second) backward prediction problem is equivalent to the (first) forward prediction problem as far as convergence in probability is concerned. However, for almost sure results it turns out that they are far from being the same. Ornstein [30] gave a rather complicated algorithm for the backward prediction problem, whereas Bailey provided a proof of the nonexistence of a universal algorithm guaranteeing almost sure convergence in the forward estimation problem. To do this, Bailey in [5], assuming the existence of a universal algorithm, used Ornstein's technique of cutting and stacking [31] to construct a "counterexample" process for which the algorithm fails to converge (see Shields [34] for more details on this method).

The problem came to life again in the late eighties with the work of Ryabko [33]. He used a simpler technique, namely relabelling a countable state Markov chain, in order to prove the nonexistence of a universal estimator for Cover's first problem (cf. also Györfi, Morvai and Yakowitz [11]). In addition there was a growing interest in universal algorithms of various kinds in information theory and elsewhere; see Feder and Merhav [10] for a survey. Three approaches evolved in an attempt to obtain positive results for the problem of forward estimation in the face of Bailey's theorem.
The first modifies the almost sure convergence to convergence in probability, or to almost sure convergence of the Cesaro averages. This was done already by Bailey in his thesis; cf. Algoet [2, 3] and Weiss [36]. The second gives up on trying to estimate the distribution of the next output at all time moments $n$, and concentrates on guaranteeing prediction only at certain stopping times, while the third restricts the class of processes for which the scheme is shown to succeed.

Our interest in this circle of ideas began with the PhD thesis of the first author [15], in which he gave an algorithm for backward prediction that was much simpler than Ornstein's original scheme (cf. Morvai, Yakowitz and Györfi [27]). Before describing briefly the contents of the survey, we will present this scheme with a sketch of the proof of its validity.

Let $\{X_n\}_{n=-\infty}^{\infty}$ be a stationary and ergodic time series taking values from $\mathcal{X} = \{0, 1\}$. (Note that all stationary time series $\{X_n\}_{n=0}^{\infty}$ can be thought of as two-sided time series, that is, $\{X_n\}_{n=-\infty}^{\infty}$.) For notational convenience, let $X_m^n = (X_m, \ldots, X_n)$, where $m \le n$.

Here is the algorithm. For $k = 1, 2, \ldots$, define sequences $\lambda_{k-1}$ and $\tau_k$ recursively. Set $\lambda_0 = 1$ and let $\tau_k$ be the time between the occurrence of the pattern $X_{-\lambda_{k-1}}^{-1}$ at time $-1$ and the last occurrence of the same pattern prior to time $-1$. Formally, let
$$\tau_k = \min\{ t > 0 : X_{-\lambda_{k-1}-t}^{-1-t} = X_{-\lambda_{k-1}}^{-1} \}.$$
Put $\lambda_k = \tau_k + \lambda_{k-1}$, where $\lambda_k$ is the length of the pattern $X_{-\lambda_k}^{-1} = X_{-\lambda_{k-1}-\tau_k}^{-1-\tau_k} X_{-\tau_k}^{-1}$. The observed vector $X_{-\lambda_{k-1}}^{-1}$ almost surely takes a value of positive probability; thus by stationarity, the string $X_{-\lambda_{k-1}}^{-1}$ must appear in the sequence $X_{-\infty}^{-2}$ almost surely.
One denotes the $k$th estimate of $P(X_0 = 1 \mid X_{-\infty}^{-1})$ by $P_k$, and defines it to be
$$P_k = \frac{1}{k} \sum_{j=1}^{k} X_{-\tau_j}.$$
As in Ornstein [30], the estimate $P_k$ is calculated from observations of random size. Here the random sample size is $\lambda_k$. To obtain a fixed sample-size $0 < t < \infty$ version, we apply the same method as in Algoet [1]; that is, let $\kappa_t$ be the maximum of the integers $k$ for which $\lambda_k \le t$. Formally,
$$\kappa_t = \max\{ k \ge 0 : \lambda_k \le t \}.$$
Now put $\hat{P}_{-t} = P_{\kappa_t}$. The following theorem was established in the PhD thesis of Morvai [15].

Theorem 1.1 (Morvai [15]) For any stationary and ergodic binary time series $\{X_n\}$,
$$\lim_{t \to \infty} \hat{P}_{-t} = P(X_0 = 1 \mid X_{-\infty}^{-1}) \quad \text{almost surely.}$$

Proof. We have
$$P_k - P(X_0 = 1 \mid X_{-\infty}^{-1}) = \frac{1}{k} \sum_{j=1}^{k} \left[ X_{-\tau_j} - P(X_{-\tau_j} = 1 \mid X_{-\lambda_{j-1}}^{-1}) \right] + \frac{1}{k} \sum_{j=1}^{k} P(X_{-\tau_j} = 1 \mid X_{-\lambda_{j-1}}^{-1}) - P(X_0 = 1 \mid X_{-\infty}^{-1}).$$
Observe that the first term is an average of a bounded martingale difference sequence, and by Azuma's exponential bound for bounded martingale differences [4] we get that the first term tends to zero. Morvai showed in his PhD thesis that
$$P(X_{-\tau_j} = 1 \mid X_{-\lambda_{j-1}}^{-1}) = P(X_0 = 1 \mid X_{-\lambda_{j-1}}^{-1}).$$
This observation is the key to handling the second term:
$$\frac{1}{k} \sum_{j=1}^{k} P(X_{-\tau_j} = 1 \mid X_{-\lambda_{j-1}}^{-1}) - P(X_0 = 1 \mid X_{-\infty}^{-1}) = \frac{1}{k} \sum_{j=1}^{k} P(X_0 = 1 \mid X_{-\lambda_{j-1}}^{-1}) - P(X_0 = 1 \mid X_{-\infty}^{-1}).$$
By the martingale convergence theorem, $P(X_0 = 1 \mid X_{-\lambda_{j-1}}^{-1}) \to P(X_0 = 1 \mid X_{-\infty}^{-1})$ almost surely, and since ordinary convergence implies Cesaro convergence, this completes the proof of the theorem. $\Box$

In this survey we will restrict ourselves to finite or countably valued processes. Some of the directions that we survey have been generalized to real valued processes, and some even to processes taking values in more general metric spaces.
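The recursion above is easy to experiment with. The following is a minimal Python sketch (our own illustration, not from the paper; the infinite past is truncated to a finite chunk, with `past[-1]` playing the role of $X_{-1}$):

```python
def backward_estimate(past, k_max):
    """Sketch of the backward estimator: tau_k is the recurrence time of the
    pattern X_{-lambda_{k-1}}^{-1}, lambda_k = tau_k + lambda_{k-1}, and
    P_k averages the symbols X_{-tau_1}, ..., X_{-tau_k}."""
    lam = 1                      # lambda_0 = 1
    taus = []
    while len(taus) < k_max:
        pattern = past[-lam:]    # the observed pattern X_{-lambda_{k-1}}^{-1}
        t = 1
        while lam + t <= len(past) and past[-(lam + t):-t] != pattern:
            t += 1
        if lam + t > len(past):
            break                # the finite chunk of the past is exhausted
        taus.append(t)           # tau_k
        lam += t                 # lambda_k = tau_k + lambda_{k-1}
    # P_k = (1/k) sum_{j=1}^k X_{-tau_j}, for each k the data allowed
    return [sum(past[-taus[j]] for j in range(k)) / k
            for k in range(1, len(taus) + 1)]
```

On deterministic inputs the behaviour is transparent: for the all-ones process every $P_k$ equals 1, and for the alternating process ending in a one every $\tau_k = 2$ and every $P_k$ equals 0, matching the true conditional probabilities.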
Some of the key papers in these directions are Algoet [1, 2, 3], Morvai et al. [27, 26], Weiss [36] and Nobel [28].

We turn now to a brief description of the contents of our survey. In §2 we will describe some classes of processes that will play an important role for us. Next, §3 will contain a scheme for forward prediction at all $n$ which can be shown to converge to the optimal prediction for the class of processes with continuous conditional probabilities. This class of course includes $k$-step Markov chains for any $k$. In §4 we turn to a description of a sequence of stopping times together with estimators which converge along that sequence to the conditional probability estimator for all processes. This sequence of stopping times grows rather quickly, and we give a sequence with a slower growth rate, but we can demonstrate the convergence only for processes whose conditional probabilities are almost surely continuous. Then in §5, for finitarily Markovian processes, we give stopping times with an even slower growth rate. The following section considers this class in more detail with respect to the problem of estimating the length of the memory word that occurs as the context at time $n$. We conclude with a series of constructions and examples in §§7-9 that show the optimality of many of these results. Along the way several open questions are mentioned, since much remains to be done before we achieve a complete understanding of what is possible and what is not.

2 Preliminaries - Classes of Stochastic Processes

Let $\mathcal{X}$ be a discrete (finite or countably infinite) alphabet. Let $\{X_n\}$ be a stationary and ergodic time series. For notational convenience, let $p(x_{-k}^0)$ and $p(y \mid x_{-k}^0)$ denote the distribution $P(X_{-k}^0 = x_{-k}^0)$ and the conditional distribution $P(X_1 = y \mid X_{-k}^0 = x_{-k}^0)$, respectively.

Definition 1.
For a stationary time series $\{X_n\}$ the (random) length $K(X_{-\infty}^0)$ of the memory of the sample path $X_{-\infty}^0$ is the smallest possible $0 \le K < \infty$ such that for all $i \ge 1$, all $y \in \mathcal{X}$, and all $z_{-K-i+1}^{-K} \in \mathcal{X}^i$,
$$p(y \mid X_{-K+1}^0) = p(y \mid z_{-K-i+1}^{-K}, X_{-K+1}^0)$$
provided $p(z_{-K-i+1}^{-K}, X_{-K+1}^0, y) > 0$, and $K(X_{-\infty}^0) = \infty$ if there is no such $K$. Note that we denote random variables by capital letters and particular realizations by lower case letters. For example, $p(y \mid X_{-K+1}^0)$ denotes the random variable which is a function of the random variables $X_{-K+1}^0$, taking the value $P(X_1 = y \mid X_{-k}^0 = x_{-k}^0)$ when $X_{-k}^0 = x_{-k}^0$.

Definition 2. The stationary time series $\{X_n\}$ is said to be finitarily Markovian if $K(X_{-\infty}^0)$ is finite (though not necessarily bounded) almost surely.

This class of course includes all finite order Markov chains, but also many other processes, such as the finitarily determined processes of Kalikow, Katznelson and Weiss [13], which serve to represent all isomorphism classes of zero entropy processes. For some concrete examples that are not Markovian consider the following:

Example 1. Let $\{M_n\}$ be any stationary and ergodic first order Markov chain with finite or countably infinite state space $\mathcal{S}$. Let $s \in \mathcal{S}$ be an arbitrary state with $P(M_1 = s) > 0$. Now let $X_n = I_{\{M_n = s\}}$. By Shields ([35], Chapter I.2.c.1), the binary time series $\{X_n\}$ is stationary and ergodic. It is also finitarily Markovian. (Indeed, the conditional probability $P(X_1 = 1 \mid X_{-\infty}^0)$ does not depend on values beyond the first (going backwards) occurrence of a one in $X_{-\infty}^0$, which identifies the first (going backwards) occurrence of the state $s$ in the Markov chain $\{M_n\}$.) The resulting time series $\{X_n\}$ is not a Markov chain of any order in general.
(Indeed, consider the Markov chain $\{M_n\}$ with state space $\mathcal{S} = \{0, 1, 2\}$ and transition probabilities $P(M_2 = 1 \mid M_1 = 0) = P(M_2 = 2 \mid M_1 = 1) = 1$ and $P(M_2 = 0 \mid M_1 = 2) = P(M_2 = 1 \mid M_1 = 2) = 0.5$. This yields a stationary and ergodic Markov chain $\{M_n\}$; cf. Example I.2.8 in Shields [35]. Clearly, the resulting time series $X_n = I_{\{M_n = 0\}}$ will not be Markov of any order: the conditional probability $P(X_1 = 0 \mid X_{-\infty}^0)$ depends on whether, until the first (going backwards) occurrence of a one, you see an even or odd number of zeros.) These examples include all stationary and ergodic binary renewal processes with finite expected inter-arrival times, a basic class for many applications. (A stationary and ergodic binary renewal process is defined as a stationary and ergodic binary process such that the times between occurrences of ones are independent and identically distributed with finite expectation; cf. Chapter I.2.c.1 in Shields [35].)

Let $\mathcal{X}^{*-}$ be the set of all one-sided sequences, that is,
$$\mathcal{X}^{*-} = \{ (\ldots, x_{-1}, x_0) : x_i \in \mathcal{X} \text{ for all } -\infty < i \le 0 \}.$$
Let $f : \mathcal{X} \to (-\infty, \infty)$ be bounded, but otherwise arbitrary. Define the function $F : \mathcal{X}^{*-} \to (-\infty, \infty)$ as
$$F(x_{-\infty}^0) = E( f(X_1) \mid X_{-\infty}^0 = x_{-\infty}^0 ).$$
For example, if $f(x) = 1_{\{x = z\}}$ for a fixed $z \in \mathcal{X}$ then $F(y_{-\infty}^0) = P(X_1 = z \mid X_{-\infty}^0 = y_{-\infty}^0)$. If $\mathcal{X}$ is a countably infinite subset of the reals and $f(x) = x$ then $F(y_{-\infty}^0) = E(X_1 \mid X_{-\infty}^0 = y_{-\infty}^0)$.

Define the distance $d^*(\cdot, \cdot)$ on $\mathcal{X}^{*-}$ as follows. For $x_{-\infty}^0, y_{-\infty}^0 \in \mathcal{X}^{*-}$ let
$$d^*(x_{-\infty}^0, y_{-\infty}^0) = \sum_{i=0}^{\infty} 2^{-i-1} 1_{\{x_{-i} \ne y_{-i}\}}.$$

Definition 2.1 We say that $F(X_{-\infty}^0)$ is continuous if a version of the function $F(X_{-\infty}^0)$ on the whole set $\mathcal{X}^{*-}$ is continuous with respect to the metric $d^*(\cdot, \cdot)$.

As we have already mentioned, any $k$-step Markov chain satisfies this, but there are also many examples with unbounded memory. S.
Kalikow showed in [12] that the class can also be characterized as those processes which can be constructed as random Markov chains. In this procedure, given a past $X_{-\infty}^0$, one invokes an auxiliary independent process which chooses a random memory length $K$, and then $X_1$ is chosen according to a fixed transition table from $\mathcal{X}^K$ to $\mathcal{X}$.

Definition 2.2 We say that $F(X_{-\infty}^0)$ is almost surely continuous if for some set $C \subseteq \mathcal{X}^{*-}$ which has probability one, a version of the function $F(X_{-\infty}^0)$ restricted to this set $C$ is continuous with respect to the metric $d^*(\cdot, \cdot)$.

This class is strictly larger than the processes with continuous conditional distributions. It contains many of the examples that have been used to demonstrate the limitations of universal schemes. In particular, it contains the class of finitarily Markovian processes, where the usual continuity may not hold (cf. Morvai and Weiss [17]).

3 Forward estimation for processes with continuous conditional distributions

For simplicity we will restrict our detailed presentation to the case where $\{X_n\}$ is a stationary and ergodic binary time series. As we have remarked, since we are interested primarily in pointwise results, the restriction to ergodic processes doesn't lead to any loss of generality, while the extension to finite state processes is completely routine. Our goal is to estimate the conditional probability $P(X_{n+1} = 1 \mid X_0^n)$ knowing only the samples $X_0^n$ but not the nature of the process.

The following algorithm, which was introduced in Morvai and Weiss [18], has several nice features. For processes with continuous conditional distribution the algorithm will almost surely give better and better predictions for $X_{n+1}$, while for all other processes some type of convergence will obtain.
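The continuity assumptions throughout refer to the metric $d^*$ introduced in §2. As a small concrete aid, here is a sketch of $d^*$ on truncated pasts (our own helper; the most recent coordinate is stored first, and the infinite sum is cut off at a finite depth):

```python
def d_star(x, y, depth=60):
    """d*(x, y) = sum_{i>=0} 2^{-i-1} * 1{x_{-i} != y_{-i}}, truncated at
    `depth` coordinates; x[0] plays the role of x_0, x[1] of x_{-1}, etc.
    The ignored tail contributes at most 2^{-depth}."""
    m = min(depth, len(x), len(y))
    return sum(2.0 ** (-i - 1) for i in range(m) if x[i] != y[i])
```

In particular, two pasts that agree in their most recent $n$ coordinates are within $2^{-n}$ of each other, which is exactly why $d^*$-continuity of $F$ means that long enough observed contexts nearly determine the conditional distribution.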
For $k \ge 1$ define the random variables $\tau_i^k(n)$ which indicate where the $k$-block $X_{n-k+1}^n$ occurs previously in the time series $\{X_n\}$. Formally, we set $\tau_0^k(n) = 0$ and for $i \ge 1$ let
$$\tau_i^k(n) = \min\{ t > \tau_{i-1}^k(n) : X_{n-k+1-t}^{n-t} = X_{n-k+1}^n \}.$$
Let $K_n \ge 1$ and $J_n \ge 1$ be sequences of nondecreasing positive integers tending to $\infty$, which will be fixed later. Define $\kappa_n$ as the largest $1 \le k \le K_n$ such that there are at least $J_n$ occurrences of the block $X_{n-k+1}^n$ in the data segment $X_0^n$, that is,
$$\kappa_n = \max\{ 1 \le k \le K_n : \tau_{J_n}^k(n) \le n - k + 1 \}$$
if there is such a $k$, and $0$ otherwise. Define $\lambda_n$ as the number of occurrences of the block $X_{n-\kappa_n+1}^n$ in the data segment $X_0^n$, that is,
$$\lambda_n = \max\{ 1 \le j : \tau_j^{\kappa_n} \le n - \kappa_n + 1 \}$$
if $\kappa_n > 0$, and zero otherwise. Observe that if $\kappa_n > 0$ then $\lambda_n \ge J_n$. Our estimate $g_n$ for $P(X_{n+1} = 1 \mid X_0^n)$ is defined as $g_0 = 0$ and, for $n \ge 1$,
$$g_n = \frac{1}{\lambda_n} \sum_{i=1}^{\lambda_n} X_{n - \tau_i^{\kappa_n}(n) + 1}$$
if $\kappa_n > 0$, and zero otherwise.

Theorem (Morvai and Weiss [18]) Let $\{X_n\}$ be a stationary and ergodic time series taking values from a finite alphabet $\mathcal{X}$. Assume $K_n = \max(1, \lfloor 0.1 \log_{|\mathcal{X}|} n \rfloor)$ and $J_n = \max(1, \lceil n^{0.5} \rceil)$. Then

(A) if the conditional expectation $P(X_1 = 1 \mid X_{-\infty}^0)$ is continuous with respect to the metric $d^*(\cdot, \cdot)$ then
$$\lim_{n \to \infty} | g_n - P(X_{n+1} = 1 \mid X_0^n) | = 0 \quad \text{almost surely,}$$

(B) without any continuity assumption,
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} | g_i - P(X_{i+1} = 1 \mid X_0^i) | = 0 \quad \text{almost surely,}$$

(C) without any continuity assumption, for arbitrary $\epsilon > 0$,
$$\lim_{n \to \infty} P( | g_n - P(X_{n+1} = 1 \mid X_0^n) | > \epsilon ) = 0.$$

Remarks: We note that from the proofs of Ryabko [33] and Györfi, Morvai, Yakowitz [11] it is clear that the continuity condition in the first part of the Theorem cannot be relaxed.
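A direct, brute-force Python sketch of $g_n$ follows (our own illustration; the quadratic-time block matching is for clarity, not efficiency, and $K_n$, $J_n$ are the choices from the theorem above):

```python
import math

def g_estimate(x, alphabet_size=2):
    """Sketch of the estimate g_n of P(X_{n+1} = 1 | X_0^n), at n = len(x)-1:
    take the longest block length kappa <= K_n whose terminal block recurs
    at least J_n times, and average the symbols following those recurrences."""
    n = len(x) - 1
    if n < 1:
        return 0.0
    K_n = max(1, int(0.1 * math.log(n, alphabet_size)))
    J_n = max(1, math.ceil(n ** 0.5))
    kappa, times = 0, []
    for k in range(1, K_n + 1):
        block = x[n - k + 1:n + 1]
        # shifts t with X_{n-k+1-t}^{n-t} = X_{n-k+1}^n
        ts = [t for t in range(1, n - k + 2)
              if x[n - k + 1 - t:n + 1 - t] == block]
        if len(ts) >= J_n:
            kappa, times = k, ts
    if kappa == 0:
        return 0.0
    # g_n = (1/lambda_n) sum_i X_{n - tau_i + 1}
    return sum(x[n - t + 1] for t in times) / len(times)
```

For the deterministic alternating sequence the estimate reproduces the true one-step conditional probability exactly (0 after a one, 1 after a zero), and for the all-ones sequence it returns 1.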
Even for the class of all stationary and ergodic binary time series with merely almost surely continuous conditional probability $P(X_1 = 1 \mid \ldots, X_{-1}, X_0)$ one cannot achieve the convergence as in part (A). We do not know if the shifted version of our proposed scheme $g_n$ solves the backward estimation problem or not. That is, in the case when $g_n$ is evaluated on $(X_{-n}, \ldots, X_0)$ rather than on $(X_0, \ldots, X_n)$, we expect convergence to hold for all processes, but we have been unable to prove this.

It is known that when the algorithms of Ornstein [30], Algoet [1], and Morvai, Yakowitz and Györfi [27] for the backward estimation problem are shifted forward, parts (B) and (C) hold. For part (C) this is immediate from stationarity, while for part (B) it follows from a generalized ergodic theorem, usually attributed to Breiman but first proved by Maker [14]. Thus there is no novelty in the existence of some scheme with these properties. However, for the above algorithm all three properties hold. We should also point out that if one knows that the process is $k$-step Markov for some fixed $k$, then of course it is not very hard to see that the empirical distributions of the $(k+1)$-blocks converge almost surely by the ergodic theorem, and this easily forms the basis of a scheme which will succeed in the forward prediction of these processes.

4 Estimating Along Stopping Times

The forward prediction problem for a binary time series $\{X_n\}_{n=0}^{\infty}$ is to estimate the probability that $X_{n+1} = 1$ based on the observations $X_i$, $0 \le i \le n$, without prior knowledge of the distribution of the process $\{X_n\}$. It is known that this is not possible if one estimates at all values of $n$. Morvai [16] presented a simple procedure which will attempt to make such a prediction infinitely often, at carefully selected stopping times chosen by the algorithm.
The growth rate of the stopping times can be determined. Here is his scheme. Let $\{X_n\}_{n=-\infty}^{\infty}$ denote a two-sided stationary and ergodic binary time series. For $k = 1, 2, \ldots$, define the sequences $\{\tau_k\}$ and $\{\lambda_k\}$ recursively. Set $\lambda_0 = 0$. Let
$$\tau_k = \min\{ t > 0 : X_t^{\lambda_{k-1}+t} = X_0^{\lambda_{k-1}} \}$$
and $\lambda_k = \tau_k + \lambda_{k-1}$. (By stationarity, the string $X_0^{\lambda_{k-1}}$ must appear in the sequence $X_1^{\infty}$ almost surely.) The $k$th estimate of $P(X_{\lambda_k+1} = 1 \mid X_0^{\lambda_k})$ is denoted by $P_k$, and is defined as
$$P_k = \frac{1}{k-1} \sum_{j=1}^{k-1} X_{\lambda_j + 1}.$$

Theorem 4.1 (Morvai [16]) For all stationary and ergodic binary time series $\{X_n\}$,
$$\lim_{k \to \infty} \left( P_k - P(X_{\lambda_k+1} = 1 \mid X_0^{\lambda_k}) \right) = 0 \quad \text{almost surely.}$$

For some extensions of the algorithm see Morvai and Weiss [19]. One of the drawbacks of this scheme is that the growth of the stopping times $\{\lambda_k\}$ is rather rapid.

Theorem 4.2 (Morvai [16]) Let $\{X_n\}$ be a stationary and ergodic binary time series. Suppose that $H > 0$, where
$$H = \lim_{n \to \infty} -\frac{1}{n+1} E \log p(X_0, \ldots, X_n)$$
is the process entropy. Let $0 < \epsilon < H$ be arbitrary. Then for $k$ large enough,
$$\lambda_k(\omega) \ge c^{c^{\cdot^{\cdot^{c}}}} \quad \text{almost surely,}$$
where the height of the tower is $k - d$, $d(\omega)$ is a finite number which depends on $\omega$, and $c = 2^{H-\epsilon}$.

Morvai and Weiss [17] exhibited an estimator which is consistent on a certain stopping time sequence for a restricted class of stationary time series, but which has a much slower rate of growth. Define the stopping times now as follows. Set $\zeta_0 = 0$. For $k = 1, 2, \ldots$, define the sequences $\eta_k$ and $\zeta_k$ recursively. Let
$$\eta_k = \min\{ t > 0 : X_{\zeta_{k-1}-(k-1)+t}^{\zeta_{k-1}+t} = X_{\zeta_{k-1}-(k-1)}^{\zeta_{k-1}} \}$$
and $\zeta_k = \zeta_{k-1} + \eta_k$. One denotes the $k$th estimate of $P(X_{\zeta_k+1} = 1 \mid X_0^{\zeta_k})$ by $g_k$, and defines it to be
$$g_k = \frac{1}{k} \sum_{j=0}^{k-1} X_{\zeta_j + 1}.$$

Theorem 4.3 (Morvai and Weiss [17]) Let $\{X_n\}$ be a stationary binary time series.
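Morvai's scheme above can be sketched in a few lines of Python (our own illustration on a finite sample); it returns the stopping times $\lambda_0, \ldots, \lambda_k$ and the estimate $P_k$:

```python
def stopping_scheme(x, k_max):
    """Sketch of Morvai's scheme: the prefix X_0^{lambda_{k-1}} is matched
    further along the sequence, lambda_k = tau_k + lambda_{k-1} marks where
    the next match ends, and P_k averages the symbols X_{lambda_j + 1}."""
    lams = [0]                          # lambda_0 = 0
    for _ in range(k_max):
        lam = lams[-1]
        prefix = x[:lam + 1]            # X_0^{lambda_{k-1}}
        t = 1
        while x[t:t + lam + 1] != prefix:
            t += 1
            if t + lam + 1 > len(x):
                return lams, None       # finite sample exhausted
        lams.append(lam + t)
    k = len(lams) - 1
    # P_k = (1/(k-1)) sum_{j=1}^{k-1} X_{lambda_j + 1}
    p = sum(x[lams[j] + 1] for j in range(1, k)) / (k - 1)
    return lams, p
```

On the alternating sequence $0, 1, 0, 1, \ldots$ one gets $\lambda_k = 2k$ and $P_k = 1$, which is the correct conditional probability (each $X_{\lambda_k} = 0$, so the next symbol is surely a one); the matching patterns also illustrate how quickly $\lambda_k$ can grow once the prefixes become long.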
Then
$$\lim_{k \to \infty} \left| g_k - P(X_{\zeta_k+1} = 1 \mid X_0^{\zeta_k}) \right| = 0 \quad \text{almost surely,}$$
provided that the conditional probability $P(X_1 = 1 \mid X_{-\infty}^0)$ is almost surely continuous.

Remark. We note that for all stationary binary time series, the estimation scheme described above is consistent in probability.

Next we will give some universal estimates for the growth rate of the stopping times $\zeta_k$ in terms of the entropy rate of the process. This is natural since the $\zeta_k$ are defined by recurrence times for blocks of length $k$, and these are known to grow exponentially with the entropy rate.

Theorem 4.4 (Morvai and Weiss [17]) Let $\{X_n\}$ be a stationary and ergodic binary time series. Then for arbitrary $\epsilon > 0$,
$$\zeta_k < 2^{k(H+\epsilon)} \quad \text{eventually almost surely,}$$
where $H$ denotes the entropy rate associated with the time series $\{X_n\}$.

This upper bound is much more favourable than the lower bound in Morvai [16]. For some extensions of this algorithm see Morvai and Weiss [24].

5 Some Improvements for Finitarily Markovian Processes

Let $\{X_n\}_{n=-\infty}^{\infty}$ be a stationary and ergodic (not necessarily finitarily Markovian) time series taking values from a discrete (finite or countably infinite) alphabet $\mathcal{X}$. Morvai and Weiss [23] provided the following algorithm, which improves the performance of the previous one in case the process turns out to be finitarily Markovian.

For $k \ge 1$, let $1 \le l_k \le k$ be a nondecreasing unbounded sequence of integers, that is, $1 = l_1 \le l_2 \le \ldots$ and $\lim_{k \to \infty} l_k = \infty$. Define auxiliary stopping times (similarly to Morvai and Weiss [17]) as follows. Set $\zeta_0 = 0$. For $n = 1, 2, \ldots$, let
$$\zeta_n = \zeta_{n-1} + \min\{ t > 0 : X_{\zeta_{n-1}-(l_n-1)+t}^{\zeta_{n-1}+t} = X_{\zeta_{n-1}-(l_n-1)}^{\zeta_{n-1}} \}.$$
Note that if $l_n = n$ then one recovers the $\zeta_n$ of Morvai and Weiss [17]. The point here is that $l_n$ may grow slowly.
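The slower-growing stopping times $\zeta_k$ admit the same kind of sketch (again our own code, not the paper's): at stage $k$ only the last $k$ symbols up to $\zeta_{k-1}$ are matched forward, which is why the recurrence times stay of order $2^{kH}$ instead of towering.

```python
def zeta_scheme(x, k_max):
    """Sketch of the Morvai-Weiss stopping times: eta_k is the recurrence
    time of the length-k block X_{zeta_{k-1}-(k-1)}^{zeta_{k-1}},
    zeta_k = zeta_{k-1} + eta_k, and g_k averages the symbols X_{zeta_j+1}."""
    zetas = [0]                           # zeta_0 = 0
    for k in range(1, k_max + 1):
        z = zetas[-1]                     # zeta_{k-1} >= k-1, so no negatives
        block = x[z - k + 1:z + 1]        # last k symbols up to zeta_{k-1}
        t = 1
        while x[z - k + 1 + t:z + 1 + t] != block:
            t += 1
            if z + 1 + t > len(x):
                return zetas, None        # finite sample exhausted
        zetas.append(z + t)               # zeta_k
    # g_k = (1/k) sum_{j=0}^{k-1} X_{zeta_j + 1}
    g = sum(x[zetas[j] + 1] for j in range(k_max)) / k_max
    return zetas, g
```

On the alternating sequence the matched blocks recur after only two steps at every stage, so $\zeta_k = 2k$ grows linearly, in contrast with the tower-function lower bound for the full-prefix scheme of Theorem 4.2.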
Among other things, using $\zeta_n$ and $l_n$ we can define a very useful process $\{\tilde{X}_n\}_{n=-\infty}^{0}$ as a function of $X_0^{\infty}$ as follows. Let
$$J(n) = \min\{ j \ge 1 : l_{j+1} > n \}$$
and define $\tilde{X}_{-i} = X_{\zeta_{J(i)} - i}$ for $i \ge 0$. In order to estimate $K(\tilde{X}_{-\infty}^0)$ we need to define some explicit statistics. Define
$$\Delta_k(\tilde{X}_{-k+1}^0) = \sup_{i \ge 1} \; \sup_{\{ z_{-k-i+1}^{-k} \in \mathcal{X}^i,\ x \in \mathcal{X} \,:\, p(z_{-k-i+1}^{-k}, \tilde{X}_{-k+1}^0, x) > 0 \}} \left| p(x \mid \tilde{X}_{-k+1}^0) - p(x \mid z_{-k-i+1}^{-k}, \tilde{X}_{-k+1}^0) \right|.$$
We will divide the data segment $X_0^n$ into two parts: $X_0^{\lceil n/2 \rceil - 1}$ and $X_{\lceil n/2 \rceil}^n$. Let $L_{n,k}^{(1)}$ denote the set of strings of length $k+1$ which appear at all in $X_0^{\lceil n/2 \rceil - 1}$. That is,
$$L_{n,k}^{(1)} = \{ x_{-k}^0 \in \mathcal{X}^{k+1} : \exists\, k \le t \le \lceil n/2 \rceil - 1 : X_{t-k}^t = x_{-k}^0 \}.$$
For a fixed $0 < \gamma < 1$ let $L_{n,k}^{(2)}$ denote the set of strings of length $k+1$ which appear more than $n^{1-\gamma}$ times in $X_{\lceil n/2 \rceil}^n$. That is,
$$L_{n,k}^{(2)} = \{ x_{-k}^0 \in \mathcal{X}^{k+1} : \#\{ \lceil n/2 \rceil + k \le t \le n : X_{t-k}^t = x_{-k}^0 \} > n^{1-\gamma} \}.$$
Let $L_k^n = L_{n,k}^{(1)} \cap L_{n,k}^{(2)}$. We define the empirical version of $\Delta_k$ as follows:
$$\hat{\Delta}_k^n(\tilde{X}_{-k+1}^0) = \max_{1 \le i \le n} \; \max_{(z_{-k-i+1}^{-k}, \tilde{X}_{-k+1}^0, x) \in L_{k+i}^n} 1_{\{ \zeta_{J(k)} \le \lceil n/2 \rceil - 1 \}} \left| \frac{\#\{ \lceil n/2 \rceil + k \le t \le n : X_{t-k}^t = (\tilde{X}_{-k+1}^0, x) \}}{\#\{ \lceil n/2 \rceil + k - 1 \le t \le n - 1 : X_{t-k+1}^t = \tilde{X}_{-k+1}^0 \}} - \frac{\#\{ \lceil n/2 \rceil + k + i \le t \le n : X_{t-k-i}^t = (z_{-k-i+1}^{-k}, \tilde{X}_{-k+1}^0, x) \}}{\#\{ \lceil n/2 \rceil + k + i - 1 \le t \le n - 1 : X_{t-k-i+1}^t = (z_{-k-i+1}^{-k}, \tilde{X}_{-k+1}^0) \}} \right|.$$
Note that the cutoff $1_{\{ \zeta_{J(k)} \le \lceil n/2 \rceil - 1 \}}$ ensures that $\tilde{X}_{-k+1}^0$ is defined from $X_0^{\lceil n/2 \rceil - 1}$. Observe that, by ergodicity, for any fixed $k$, $\liminf_{n \to \infty} \hat{\Delta}_k^n \ge \Delta_k$ almost surely. We define an estimate $\chi_n$ for $K(\tilde{X}_{-\infty}^0)$ from the samples $X_0^n$ as follows. Let $0 < \beta < (1-\gamma)/2$ be arbitrary.
Set $\chi_0 = 0$, and for $n \ge 1$ let $\chi_n$ be the smallest $0 \le k_n < n$ such that $\hat{\Delta}_{k_n}^n \le n^{-\beta}$. Observe that if $\zeta_j \le \lceil n/2 \rceil - 1 < \zeta_{j+1}$ then $\chi_n \le l_{j+1}$. The idea here is that if $K(\tilde{X}_{-\infty}^0) < \infty$ then $\chi_n$ will eventually equal $K(\tilde{X}_{-\infty}^0)$, and if $K(\tilde{X}_{-\infty}^0) = \infty$ then $\chi_n \to \infty$.

Now we define the sequence of stopping times $\lambda_n$ along which we will be able to estimate. Set $\lambda_0 = \zeta_0$, and for $n \ge 1$, if $\zeta_j \le \lambda_{n-1} < \zeta_{j+1}$ then put
$$\lambda_n = \min\{ t > \lambda_{n-1} : X_{t-\chi_t+1}^t = X_{\zeta_j - \chi_t + 1}^{\zeta_j} \}$$
and $\kappa_n = \chi_{\lambda_n}$. Observe that if $\zeta_j \le \lambda_{n-1} < \zeta_{j+1}$ then $\zeta_j \le \lambda_{n-1} < \lambda_n \le \zeta_{j+1}$. If $\chi_{\lambda_{n-1}+1} = 0$ then $\lambda_n = \lambda_{n-1} + 1$. Note that $\lambda_n$ is a stopping time and $\kappa_n$ is our estimate for $K(\tilde{X}_{-\infty}^0)$ from the samples $X_0^{\lambda_n}$.

Let $f : \mathcal{X} \to (-\infty, \infty)$ be bounded. One denotes the $n$th estimate of $E( f(X_{\lambda_n+1}) \mid X_0^{\lambda_n} )$ from the samples $X_0^{\lambda_n}$ by $f_n$, and defines it to be
$$f_n = \frac{1}{n} \sum_{j=0}^{n-1} f(X_{\lambda_j+1}).$$
Fix positive real numbers $0 < \beta, \gamma < 1$ such that $2\beta + \gamma < 1$, fix a sequence $l_n$ with $1 = l_1 \le l_2 \le \ldots$ and $l_n \to \infty$, and fix a bounded function $f(\cdot) : \mathcal{X} \to (-\infty, \infty)$; with these numbers, this sequence and this function, define $\zeta_n$, $\chi_n$, $\kappa_n$, $\lambda_n$ and $F(\cdot)$ as described above. For the resulting $f_n$ we have the following theorem:

Theorem 5.1 (Morvai and Weiss [23]) Let $\{X_n\}$ be a stationary and ergodic time series taking values from a finite or countably infinite set $\mathcal{X}$. If the conditional expectation $F(X_{-\infty}^0)$ is almost surely continuous then almost surely,
$$\lim_{n \to \infty} f_n = F(\tilde{X}_{-\infty}^0) \quad \text{and} \quad \lim_{n \to \infty} \left| f_n - E( f(X_{\lambda_n+1}) \mid X_0^{\lambda_n} ) \right| = 0.$$
For arbitrary $\delta > 0$ and $0 < \epsilon_2 < \epsilon_1$, let
$$l_n = \min\left( n, \max\left( 1, \left\lfloor \frac{2+\delta}{\epsilon_1 - \epsilon_2} \log_2 n \right\rfloor \right) \right).$$
Then
$$\lambda_n < n^{\frac{2+\delta}{\epsilon_1 - \epsilon_2} (H + \epsilon_1)} \quad \text{eventually almost surely,}$$
and the upper bound is a polynomial whenever the stationary and ergodic time series $\{X_n\}$ has finite entropy rate $H$.
If the stationary and ergodic time series $\{X_n\}$ turns out to be finitarily Markovian then
$$\lim_{n \to \infty} \frac{\lambda_n}{n} = \frac{1}{p\left( \tilde{X}_{-K(\tilde{X}_{-\infty}^0)+1}^0 \right)} < \infty \quad \text{almost surely.}$$
Moreover, if the stationary and ergodic time series $\{X_n\}$ turns out to be independent and identically distributed then $\lambda_n = \lambda_{n-1} + 1$ eventually almost surely.

6 Estimation for Finitarily Markovian Processes

In this section we broaden the scope of the estimation question that we will discuss, and describe first how well we can detect the presence of a memory word in a finitarily Markovian process (cf. Morvai and Weiss [25]). This problem has often been discussed in the context of modelling processes. Here we will show how it relates to prediction questions. Recall that $K$ was the minimal length of the context that defines the conditional probability. We take up the problem of estimating the value of $K$, both in the backward sense and in the forward sense, where one observes successive values of $\{X_n\}$ for $n \ge 0$ and asks for the least value $K$ such that the conditional distribution of $X_{n+1}$ given $\{X_i\}_{i=n-K+1}^n$ is the same as the conditional distribution of $X_{n+1}$ given $\{X_i\}_{i=-\infty}^n$. We will consider both finite and countably infinite alphabet sizes.

For the case of finite alphabet, finite order Markov chains, similar questions have been studied by Bühlmann and Wyner in [6]. However, the fact that we want to treat countable alphabets complicates matters significantly. The point is that while finite alphabet Markov chains have exponential rates of convergence of empirical distributions, for countable alphabet Markov chains no universal rates are available at all.
This problem appears in Morvai and Weiss [21], where a universal estimator for the order of a Markov chain on a countable state space is given, and some of the techniques that are used in the proofs of the results described here have their origin in that paper. We note in passing that in Morvai and Weiss [20] it is shown that there is no classification rule for discriminating the class of finitarily Markovian processes from other ergodic processes.

The key notion is that of a memory word, which can be defined as follows.

Definition 6.1 We say that $w_{-k+1}^0$ is a memory word if for all $i \ge 1$, all $y \in \mathcal{X}$, and all $z_{-k-i+1}^{-k} \in \mathcal{X}^i$,
$$p(y \mid w_{-k+1}^0) = p(y \mid z_{-k-i+1}^{-k}, w_{-k+1}^0)$$
provided $p(z_{-k-i+1}^{-k}, w_{-k+1}^0, y) > 0$.

Define $\mathcal{W}^k$ as the set of memory words $w_{-k+1}^0$ of length $k$, that is,
$$\mathcal{W}^k = \{ w_{-k+1}^0 \in \mathcal{X}^k : w_{-k+1}^0 \text{ is a memory word} \}.$$
Our first result is a solution of the backward estimation problem, namely determining the value of $K(X_{-\infty}^0)$ from observations of increasing length of the data segments $X_{-n}^0$. We will give in the next subsection a universal consistent estimator which will converge almost surely to the memory length $K(X_{-\infty}^0)$ for any ergodic finitarily Markovian process on a countable state space. The detailed proofs in Morvai and Weiss [25] are quite explicit, and given some information on the average length of a memory word and the extent to which the stationary distribution diffuses over the state space, one could extract rates for the convergence of the estimators. We concentrate, however, on the more universal aspects of the problem.

As is usual in these kinds of questions, the problem of forward estimation, namely trying to determine $K(X_{-\infty}^n)$ from successive observations of $X_0^n$, is more difficult. The stationarity means that results in probability can be carried over automatically.
However, almost sure results present serious problems, as we have already said. For some more results in this circle of ideas of what can be learned about processes by forward observations, see Ornstein and Weiss [32], Dembo and Peres [9], Nobel [29], and Csiszár and Talata [8].

Recently, in Csiszár and Talata [8] the authors define a finite context to be a memory word $w$ of minimal length, that is, no proper suffix of $w$ is a memory word. An infinite context for a process is an infinite string all of whose finite suffixes have positive probability, but none of them is a memory word. They treat there the problem of estimating the entire context tree in case the size of the alphabet is finite. For a bounded depth context tree the process is Markovian, while for an unbounded depth context tree the universal pointwise consistency result there is obtained only for the truncated trees, which are again finite in size. This is in contrast to our results, which deal with infinite alphabet size and consistency in estimating memory words of arbitrary length. This is what forces us to consider estimating at specially chosen times. In the second subsection we will present a scheme which depends upon a positive parameter $\epsilon$, and we guarantee that the set of times along which the estimates are being given has density at least $1 - \epsilon$.

The last two subsections are devoted to seeing how this memory length estimation can be applied to estimating conditional probabilities. We do this first for finitarily Markovian processes, along a sequence of stopping times which achieves density $1 - \epsilon$. We do not know if the $\epsilon$ can be dropped in this case for the estimation of conditional probabilities. We can dispense with $\epsilon$ in the Markovian case.
For this we use an earlier result of ours on a universal estimator for the order of a finite order Markov chain on a countable alphabet, in order to estimate the conditional probabilities along a sequence of stopping times of density one.

6.1 Backward Estimation of the Memory Length for Finitarily Markovian Processes

Let $\{X_n\}$ be stationary and ergodic finitarily Markovian with finite or countably infinite alphabet. In order to estimate $K(X^0_{-\infty})$ we need to define some explicit statistics. The first is a measurement of the failure of $w^0_{-k+1}$ to be a memory word. Define
$$\Delta_k(w^0_{-k+1}) = \sup_{i \ge 1} \; \sup_{\{z^{-k}_{-k-i+1} \in \mathcal{X}^i,\, x \in \mathcal{X} \,:\, p(z^{-k}_{-k-i+1}, w^0_{-k+1}, x) > 0\}} \left| p(x \mid w^0_{-k+1}) - p(x \mid z^{-k}_{-k-i+1}, w^0_{-k+1}) \right|.$$
Clearly this will vanish precisely when $w^0_{-k+1}$ is a memory word. We need to define an empirical version of this based on the observation of a finite data segment $X^0_{-n}$. To this end, first define the empirical version of the conditional probability as
$$\hat p_n(x \mid w^0_{-k+1}) = \frac{\#\{-n+k-1 \le t \le -1 : X^{t+1}_{t-k+1} = (w^0_{-k+1}, x)\}}{\#\{-n+k-1 \le t \le -1 : X^t_{t-k+1} = w^0_{-k+1}\}}.$$
These empirical distributions, as well as the sets we are about to introduce, are functions of $X^0_{-n}$, but we suppress the dependence to keep the notation manageable. For a fixed $0 < \gamma < 1$, let $\mathcal{L}^n_k$ denote the set of strings of length $k+1$ which appear more than $n^{1-\gamma}$ times in $X^0_{-n}$. That is,
$$\mathcal{L}^n_k = \{ x^0_{-k} \in \mathcal{X}^{k+1} : \#\{-n+k \le t \le 0 : X^t_{t-k} = x^0_{-k}\} > n^{1-\gamma} \}.$$
Finally, define the empirical version of $\Delta_k$ as follows:
$$\hat\Delta^n_k(w^0_{-k+1}) = \max_{1 \le i \le n} \; \max_{(z^{-k}_{-k-i+1},\, w^0_{-k+1},\, x) \in \mathcal{L}^n_{k+i}} \left| \hat p_n(x \mid w^0_{-k+1}) - \hat p_n(x \mid z^{-k}_{-k-i+1}, w^0_{-k+1}) \right|.$$
Let us agree by convention that if the smallest of the sets over which we are maximizing is empty, then $\hat\Delta^n_k = 0$.
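A minimal sketch of these empirical statistics, under our own simplifying assumptions: the data segment is stored left to right as a Python list, the index conventions are reduced to plain occurrence counts, and the loop over the extension length $i$ stops as soon as no frequent extension of the required length remains.

```python
from collections import Counter

def p_hat(data, x, w):
    """Empirical conditional probability p-hat(x | w): among occurrences
    of the word w that have a successor in the sample, the fraction
    followed by x."""
    k, num, den = len(w), 0, 0
    for t in range(k - 1, len(data) - 1):
        if tuple(data[t - k + 1:t + 1]) == tuple(w):
            den += 1
            if data[t + 1] == x:
                num += 1
    return num / den if den else 0.0

def frequent_words(data, length, gamma):
    """Analogue of L^n_k: words of the given length occurring more than
    n^(1-gamma) times in the sample."""
    n = len(data) - 1
    counts = Counter(tuple(data[t:t + length])
                     for t in range(len(data) - length + 1))
    return {u for u, c in counts.items() if c > n ** (1 - gamma)}

def delta_hat(data, w, gamma=0.5):
    """Empirical discrepancy Delta-hat: maximal change in p-hat(x | .)
    over frequent extensions (z, w, x) of the candidate word w."""
    w, best = tuple(w), 0.0
    for i in range(1, len(data)):
        cands = [u for u in frequent_words(data, len(w) + i + 1, gamma)
                 if u[i:-1] == w]
        if not cands:
            break  # longer extensions are rarer still
        for u in cands:
            z, x = u[:i], u[-1]
            best = max(best, abs(p_hat(data, x, w) - p_hat(data, x, z + w)))
    return best
```

For i.i.d. fair coin flips every word is a memory word, so for a long sample `delta_hat` should be small for any candidate word, illustrating the statistic's intended behavior.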
Observe that, by the ergodic theorem, almost surely the empirical distributions $\hat p$ converge to the true distributions $p$, and so for any $w^0_{-k+1} \in \mathcal{X}^k$,
$$\liminf_{n \to \infty} \hat\Delta^n_k(w^0_{-k+1}) \ge \Delta_k(w^0_{-k+1}) \quad \text{almost surely}.$$
With this in hand we can give a test for $w^0_{-k+1}$ to be a memory word. Let $0 < \beta < \frac{1-\gamma}{2}$ be arbitrary. Let $NTEST_n(w^0_{-k+1}) = YES$ if $\hat\Delta^n_k(w^0_{-k+1}) \le n^{-\beta}$ and $NO$ otherwise. Note that $NTEST_n$ depends on $X^0_{-n}$.

Theorem 6.1 (Morvai and Weiss [25]) Eventually almost surely, $NTEST_n(w^0_{-k+1}) = YES$ if and only if $w^0_{-k+1}$ is a memory word.

We define an estimate $\chi_n$ for $K(X^0_{-\infty})$ from samples $X^0_{-n}$ as follows. Set $\chi_0 = 0$, and for $n \ge 1$ let $\chi_n$ be the smallest $0 \le k < n$ such that $NTEST_n(X^0_{-k+1}) = YES$ if there is such a $k$, and $n$ otherwise.

Theorem 6.2 (Morvai and Weiss [25]) $\chi_n = K(X^0_{-\infty})$ eventually almost surely.

6.2 Forward Estimation of the Memory Length for Finitarily Markovian Processes

Let $\{X_n\}$ be stationary and ergodic finitarily Markovian with finite or countably infinite alphabet. Define
$$PTEST_n(w^0_{-k+1})(X^n_0) = NTEST_n(w^0_{-k+1})(T^n X^n_0),$$
where $T$ is the left shift operator.

Theorem 6.3 (Morvai and Weiss [25]) Eventually almost surely, $PTEST_n(w^0_{-k+1}) = YES$ if and only if $w^0_{-k+1}$ is a memory word.

Define a list of words $\{w(0), w(1), w(2), \ldots, w(n), \ldots\}$ such that all words of all lengths are listed and a word cannot precede its suffix. Note that $w(0)$ is the empty word. Now define sets of indices $A^i_n$ as follows. Let $A^0_n = \{0, 1, \ldots, n\}$ and for $i > 0$ define
$$A^i_n = \{ |w(i)| - 1 \le j \le n : X^j_{j-|w(i)|+1} = w(i) \}. \tag{1}$$
Let $\epsilon > 0$ be fixed. Define $\theta_n(\epsilon) < n$ to be the minimal $j$ such that
$$\frac{\left| \bigcup_{i \le j \,:\, PTEST_n(w(i)) = YES} A^i_n \right|}{n+1} \ge 1 - \epsilon/2 \tag{2}$$
if there is such a $j$, and $n$ otherwise.
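One concrete way to produce such a word list, taking a binary alphabet for illustration, is to enumerate words in order of increasing length: every proper suffix of a word is strictly shorter, so no word can then precede any of its suffixes.

```python
from itertools import count, islice, product

def word_list(alphabet=(0, 1)):
    """Yield all finite words by increasing length, so each word appears
    only after all of its (strictly shorter) proper suffixes; the first
    entry w(0) is the empty word."""
    for length in count(0):
        for w in product(alphabet, repeat=length):
            yield w

words = list(islice(word_list(), 7))
# words == [(), (0,), (1,), (0, 0), (0, 1), (1, 0), (1, 1)]
```

Any other ordering with the same suffix property would serve equally well; only the constraint that suffixes come first matters for the scheme above.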
We estimate the length of the memory of $X^n_{-\infty}$ looking backwards if
$$n \in \bigcup_{i \le \theta_n(\epsilon),\; PTEST_n(w(i)) = YES} A^i_n.$$
The set of $n$'s for which this holds will be the set for which we estimate the memory, and we denote this set by $\mathcal{N}$. Note that the event $n \in \mathcal{N}$ depends only on $X^n_0$, and thus $\mathcal{N}$ can be thought of as a sequence of stopping times. For $n \in \mathcal{N}$, we define
$$\kappa_n = \min\{ i \ge 0 : X^n_{n-|w(i)|+1} = w(i),\ PTEST_n(w(i)) = YES \}.$$
For $n \in \mathcal{N}$ define $\rho_n(X^n_0) = |w(\kappa_n)|$. Note that $\rho_n$, $\theta_n$, $\kappa_n$ and $\mathcal{N}$ depend on $\epsilon$; however, we will not denote this dependence explicitly.

Theorem 6.4 (Morvai and Weiss [25]) Let $\epsilon > 0$ be fixed. Then for $n \in \mathcal{N}$,
$$\rho_n = K(X^n_{-\infty}) \quad \text{eventually almost surely}, \tag{3}$$
and
$$\liminf_{n \to \infty} \frac{|\mathcal{N} \cap \{0, 1, \ldots, n-1\}|}{n} \ge 1 - \epsilon. \tag{4}$$
For $n \in \mathcal{N}$, $X^n_{n-\rho_n+1}$ appears at least $n^{1-\gamma}$ times, eventually almost surely.

6.3 Forward Estimation of the Conditional Probability for Finitarily Markovian Processes

Let $\{X_n\}$ be stationary and ergodic finitarily Markovian with finite or countably infinite alphabet. Now our goal is to estimate the conditional probability $P(X_{n+1} = x \mid X^n_0)$ along stopping times in a pointwise sense. Let $\mathcal{N}$ be a sequence of stopping times such that eventually almost surely $X^n_{n-K(X^n_{-\infty})+1}$ appears at least $n^{1-\gamma}$ times in $X^n_0$. Let $\rho_n$ be any estimate of the length of the memory from samples $X^n_0$ such that $\rho_n - K(X^n_{-\infty}) \to 0$ on $\mathcal{N}$. Define our estimate $\hat q_n(x)$ of the conditional probability $P(X_{n+1} = x \mid X^n_0)$ on $\mathcal{N}$ as
$$\hat q_n(x) = \frac{\#\{\rho_n - 1 \le i < n : X^i_{i-\rho_n+1} = X^n_{n-\rho_n+1},\ X_{i+1} = x\}}{\#\{\rho_n - 1 \le i < n : X^i_{i-\rho_n+1} = X^n_{n-\rho_n+1}\}}.$$

Theorem 6.5 (Morvai and Weiss [25]) On $n \in \mathcal{N}$, $|\hat q_n(x) - P(X_{n+1} = x \mid X^n_0)| \to 0$ almost surely.
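A sketch of the estimator $\hat q_n$, under our own simplifying conventions: the sample $X_0, \ldots, X_n$ is a Python list, and the memory-length estimate `rho` is supplied by the caller rather than produced by the stopping-time scheme above.

```python
import random

def q_hat(data, rho, x):
    """Estimate P(X_{n+1} = x | X_0^n) by matching the current
    length-rho suffix of the sample against its earlier occurrences and
    recording how often each occurrence was followed by x."""
    n = len(data) - 1
    suffix = tuple(data[n - rho + 1:n + 1])
    num = den = 0
    for i in range(rho - 1, n):  # earlier occurrences, each with a successor
        if tuple(data[i - rho + 1:i + 1]) == suffix:
            den += 1
            if data[i + 1] == x:
                num += 1
    return num / den if den else 0.0

# Toy usage on a first-order binary chain (so the true memory length is 1):
random.seed(0)
p1 = {0: 0.1, 1: 0.6}  # illustrative P(next symbol = 1 | current symbol)
data = [0]
for _ in range(5000):
    data.append(1 if random.random() < p1[data[-1]] else 0)
est = q_hat(data, 1, 1)  # should be close to p1[data[-1]]
```

For a long enough sample, `est` approximates the true conditional probability of the current suffix, which is what Theorem 6.5 asserts for the estimator along the stopping times.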
Corollary 6.1 For the stopping times $\mathcal{N}$ and estimator $\rho_n$ in Theorem 6.4, Theorem 6.5 holds and the density of $\mathcal{N}$ is at least $1 - \epsilon$.

6.4 Forward Estimation of the Conditional Probability for Markov Processes

Let $\{X_n\}$ be a stationary and ergodic finite or countably infinite alphabet Markov chain with order $K$. Let $ORDEST_n$ be an estimator of the order from samples $X^n_0$ such that $ORDEST_n \to K$ almost surely. Such an estimator can be found, e.g., in Morvai and Weiss [21]. Let $n \in \mathcal{N}$ if $X^n_{n-ORDEST_n+1}$ appears at least $n^{1-\gamma}$ times in $X^n_0$. Then $\mathcal{N}$ is a sequence of stopping times. Let
$$\hat q_n(x) = \frac{\#\{ORDEST_n - 1 \le i < n : X^i_{i-ORDEST_n+1} = X^n_{n-ORDEST_n+1},\ X_{i+1} = x\}}{\#\{ORDEST_n - 1 \le i < n : X^i_{i-ORDEST_n+1} = X^n_{n-ORDEST_n+1}\}}.$$

Theorem 6.6 (Morvai and Weiss [25]) Assume $ORDEST_n$ equals the order eventually almost surely. Then on $n \in \mathcal{N}$,
$$|\hat q_n(x) - P(X_{n+1} = x \mid X^n_{n-K})| \to 0 \quad \text{almost surely},$$
and
$$\lim_{n \to \infty} \frac{|\mathcal{N} \cap \{0, 1, \ldots, n-1\}|}{n} = 1.$$
If the Markov chain turns out to take values from a finite set, then $\mathcal{N}$ contains all but finitely many positive integers.

7 Examples Illustrating Limitations

For the class of all stationary and ergodic binary Markov chains of some finite order, the forward estimation problem can be solved. Indeed, if the time series is a Markov chain of some finite order, we can estimate the order and count frequencies of blocks of length equal to the order. Bailey showed that one cannot test for membership in this class; cf. also Morvai and Weiss [20]. It is conceivable that one could improve the result of Morvai [16] or Morvai and Weiss [17] so that if the process happens to be Markovian, then one eventually estimates at all times. It has been shown in Morvai and Weiss [22] that this is not possible.
This puts some new restrictions on what can be achieved in estimating along stopping times.

Theorem 7.1 (Morvai and Weiss [22]) For any strictly increasing sequence of stopping times $\{\lambda_n\}$ such that for all stationary and ergodic binary Markov chains with arbitrary finite order, eventually $\lambda_{n+1} = \lambda_n + 1$, and for any sequence of estimators $\{h_n(X_0, \ldots, X_{\lambda_n})\}$, there is a stationary and ergodic binary time series $\{X_n\}$ with almost surely continuous conditional probability $P(X_1 = 1 \mid \ldots, X_{-1}, X_0)$, such that
$$P\left( \limsup_{n \to \infty} \left| h_n(X_0, \ldots, X_{\lambda_n}) - P(X_{\lambda_n+1} = 1 \mid X_0, \ldots, X_{\lambda_n}) \right| > 0 \right) > 0.$$

Remark: Bailey [5], among other things, proved that there is no sequence of functions $\{e_n(X^{n-1}_0)\}$ which, for all stationary and ergodic time series, would eventually be 1 if the series turns out to be a Markov chain and 0 otherwise. (That is, there is no test for the Markov property.) This result does not imply ours. On the other hand, our result implies Bailey's. (Indeed, if there were a test for Markov chains in the above sense, we could apply the estimator in Morvai [16] or Morvai and Weiss [17] if the time series is not a Markov chain of some finite order, and if the time series is a Markov chain of some finite order we could estimate the order of the Markov chain and count frequencies of blocks of length equal to the order.) Bailey [5] and Ryabko [33] proved less than our theorem: they proved the nonexistence of the desired estimator when the estimator must work for all stationary and ergodic binary time series and when all $\lambda_n = n$, that is, when we always require good prediction.

8 Memory Estimation for Markov Processes

In this section we shall examine how well one can estimate the local memory length for finite order Markov chains.
In the case of finite alphabets this can be done with stopping times that eventually cover all time epochs. (Indeed, assume $\{X_n\}$ is a Markov chain taking values in a finite set. Assume $ORDEST_n$ estimates the order in a pointwise sense from data $X^n_0$. Then let $\rho_n = \min\{0 \le t \le ORDEST_n : PTEST_n(X^n_{n-t+1}) = YES\}$ if there is such a $t$, and 0 otherwise. Since $ORDEST_n$ eventually gives the right order and there are finitely many possible strings of length not greater than the order, $\rho_n = K(X^n_{-\infty})$ eventually almost surely by Theorem 6.3.) However, as soon as one goes to a countable alphabet, even if the order is known to be two and we are just trying to decide whether $X_n$ alone is a memory word or not, there is no sequence of stopping times which is guaranteed to succeed eventually and whose density is one; cf. Morvai and Weiss [25]. This shows that the $\epsilon$ in the preceding sections cannot be eliminated.

Theorem 8.1 (Morvai and Weiss [25]) There is no strictly increasing sequence of stopping times $\{\lambda_n\}$ and sequence of estimators $\{h_n(X_0, \ldots, X_{\lambda_n})\}$ taking the values one and two, such that for all countable alphabet Markov chains of order two:
$$\lim_{n \to \infty} \frac{\lambda_n}{n} = 1 \quad \text{and} \quad \lim_{n \to \infty} \left| h_n(X_0, \ldots, X_{\lambda_n}) - K(X^{\lambda_n}_0) \right| = 0 \quad \text{with probability one}.$$

9 Limitations for Binary Finitarily Markovian Processes

In the preceding section we showed that we cannot achieve density one in the forward memory length estimation problem even in the class of Markov chains on a countable alphabet. In this section we shall show something similar in the class of binary (i.e. $\{0,1\}$-valued) finitarily Markovian processes.
We will assume that there is given a sequence of estimators and stopping times $(h_n, \lambda_n)$ that succeed in estimating the memory length for binary Markov chains of finite order, and construct a finitarily Markovian binary process on which the scheme fails infinitely often. Here is a precise statement:

Theorem 9.1 (Morvai and Weiss [25]) For any strictly increasing sequence of stopping times $\{\lambda_n\}$ and sequence of estimators $\{h_n(X_0, \ldots, X_{\lambda_n})\}$ such that for all stationary and ergodic binary Markov chains with arbitrary finite order,
$$\lim_{n \to \infty} \frac{\lambda_n}{n} = 1 \quad \text{and} \quad \lim_{n \to \infty} \left| h_n(X_0, \ldots, X_{\lambda_n}) - K(X^{\lambda_n}_0) \right| = 0 \quad \text{almost surely},$$
there is a stationary, ergodic finitarily Markovian binary time series such that on a set of process realizations of positive measure,
$$h_n(X_0, \ldots, X_{\lambda_n}) \ne K(X^{\lambda_n}_{-\infty}) \quad \text{infinitely often}.$$

In the final process $\{X_n\}$ that we constructed in Morvai and Weiss [25], $P(K(X^0_{-\infty}) = k)$ decays to zero exponentially fast and in particular is summable. It follows that with probability one, eventually $K(X^n_{-\infty}) \le n$, so that our failure to estimate the order correctly does not come about because we never even see the memory word. It is also worth pointing out that the set of moments at which the estimator fails has density zero. It follows fairly easily from the ergodic theorem that if one is willing to tolerate such failures, then a straightforward application of any backward estimation scheme will converge outside a set of density zero.

References

[1] P. Algoet, "Universal schemes for prediction, gambling and portfolio selection," Annals of Probability, vol. 20, pp. 901–941, 1992. Correction: ibid. vol. 23, pp. 474–478, 1995.

[2] P. Algoet, "The strong law of large numbers for sequential decisions under uncertainty," IEEE Transactions on Information Theory, vol.
40, pp. 609–634, 1994.

[3] P. Algoet, "Universal schemes for learning the best nonlinear predictor given the infinite past and side information," IEEE Transactions on Information Theory, vol. 45, pp. 1165–1185, 1999.

[4] K. Azuma, "Weighted sums of certain dependent random variables," Tohoku Mathematical Journal, vol. 19, pp. 357–367, 1967.

[5] D. H. Bailey, Sequential Schemes for Classifying and Predicting Ergodic Processes. Ph.D. thesis, Stanford University, 1976.

[6] P. Bühlmann and A. J. Wyner, "Variable-length Markov chains," Annals of Statistics, vol. 27, pp. 480–513, 1999.

[7] T. M. Cover, "Open problems in information theory," in 1975 IEEE Joint Workshop on Information Theory, pp. 35–36. New York: IEEE Press, 1975.

[8] I. Csiszár and Zs. Talata, "Context tree estimation for not necessarily finite memory processes via BIC and MDL," to appear in IEEE Transactions on Information Theory.

[9] A. Dembo and Y. Peres, "A topological criterion for hypothesis testing," Annals of Statistics, vol. 22, pp. 106–117, 1994.

[10] M. Feder and N. Merhav, "Universal prediction," IEEE Transactions on Information Theory, vol. 44, pp. 2124–2147, 1998.

[11] L. Györfi, G. Morvai, and S. Yakowitz, "Limits to consistent on-line forecasting for ergodic time series," IEEE Transactions on Information Theory, vol. 44, pp. 886–892, 1998.

[12] S. Kalikow, "Random Markov processes and uniform martingales," Israel Journal of Mathematics, vol. 71, pp. 33–54, 1990.

[13] S. Kalikow, Y. Katznelson and B. Weiss, "Finitarily deterministic generators for zero entropy systems," Israel Journal of Mathematics, vol. 79, pp. 33–45, 1992.

[14] Ph. T. Maker, "The ergodic theorem for a sequence of functions," Duke Mathematical Journal, vol. 6, pp. 27–30, 1940.

[15] G. Morvai, "Estimation of Conditional Distribution for Stationary Time Series," Ph.D. thesis, Technical University of Budapest, 1994.
[16] G. Morvai, "Guessing the output of a stationary binary time series," in Foundations of Statistical Inference (Eds. Y. Haitovsky, H. R. Lerche, Y. Ritov), Physica-Verlag, pp. 207–215, 2003.

[17] G. Morvai and B. Weiss, "Forecasting for stationary binary time series," Acta Applicandae Mathematicae, vol. 79, pp. 25–34, 2003.

[18] G. Morvai and B. Weiss, "Forward estimation for ergodic time series," Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, vol. 41, pp. 859–870, 2005.

[19] G. Morvai and B. Weiss, "Inferring the conditional mean," Theory of Stochastic Processes, vol. 11, pp. 112–120, 2005.

[20] G. Morvai and B. Weiss, "On classifying processes," Bernoulli, vol. 11, pp. 523–532, 2005.

[21] G. Morvai and B. Weiss, "Order estimation of Markov chains," IEEE Transactions on Information Theory, vol. 51, pp. 1496–1497, 2005.

[22] G. Morvai and B. Weiss, "Limitations on intermittent forecasting," Statistics and Probability Letters, vol. 72, pp. 285–290, 2005.

[23] G. Morvai and B. Weiss, "Prediction for discrete time series," Probability Theory and Related Fields, vol. 132, pp. 1–12, 2005.

[24] G. Morvai and B. Weiss, "Intermittent estimation of stationary time series," Test, vol. 13, pp. 525–542, 2004.

[25] G. Morvai and B. Weiss, "Estimating the memory for finitarily Markovian processes," Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, vol. 43, pp. 15–30, 2007.

[26] G. Morvai, S. Yakowitz, and P. Algoet, "Weakly convergent nonparametric forecasting of stationary time series," IEEE Transactions on Information Theory, vol. 43, pp. 483–498, 1997.

[27] G. Morvai, S. Yakowitz, and L. Györfi, "Nonparametric inference for ergodic, stationary time series," Annals of Statistics, vol. 24, pp. 370–379, 1996.

[28] A. Nobel, "On optimal sequential prediction for general processes," IEEE Transactions on Information Theory, vol. 49, no. 1, pp.
83–98, 2003.

[29] A. Nobel, "Limits to classification and regression estimation from ergodic processes," Annals of Statistics, vol. 27, pp. 262–273, 1999.

[30] D. S. Ornstein, "Guessing the next output of a stationary process," Israel Journal of Mathematics, vol. 30, pp. 292–296, 1978.

[31] D. S. Ornstein, Ergodic Theory, Randomness, and Dynamical Systems. Yale University Press, 1974.

[32] D. S. Ornstein and B. Weiss, "How sampling reveals a process," Annals of Probability, vol. 18, pp. 905–930, 1990.

[33] B. Ya. Ryabko, "Prediction of random sequences and universal coding," Problems of Information Transmission, vol. 24, pp. 87–96, 1988.

[34] P. C. Shields, "Cutting and stacking: a method for constructing stationary processes," IEEE Transactions on Information Theory, vol. 37, pp. 1605–1614, 1991.

[35] P. C. Shields, The Ergodic Theory of Discrete Sample Paths, volume 13 of Graduate Studies in Mathematics. American Mathematical Society, Providence, 1996.

[36] B. Weiss, Single Orbit Dynamics, American Mathematical Society, 2000.
