Guessing the output of a stationary binary time series

The forward prediction problem for a binary time series $\{X_n\}_{n=0}^{\infty}$ is to estimate the probability that $X_{n+1}=1$ based on the observations $X_i$, $0\le i\le n$, without prior knowledge of the distribution of the process $\{X_n\}$. It is known that this is not possible if one estimates at all values of $n$. We present a simple procedure which will attempt to make such a prediction infinitely often, at carefully selected stopping times chosen by the algorithm. The growth rate of the stopping times is also exhibited.

Author: Gusztáv Morvai

Gusztáv Morvai: Guessing the Output of a Stationary Binary Time Series. In: Foundations of Statistical Inference (Shoresh, 2000), pp. 207–215, Contrib. Statist., Physica, Heidelberg, 2003.

Abstract

The forward prediction problem for a binary time series $\{X_n\}_{n=0}^{\infty}$ is to estimate the probability that $X_{n+1}=1$ based on the observations $X_i$, $0\le i\le n$, without prior knowledge of the distribution of the process $\{X_n\}$. It is known that this is not possible if one estimates at all values of $n$. We present a simple procedure which will attempt to make such a prediction infinitely often, at carefully selected stopping times chosen by the algorithm. The growth rate of the stopping times is also exhibited.

1 Introduction

T. Cover in [3] asked two fundamental questions concerning estimation for stationary and ergodic binary processes. Cover's first question was as follows.

Question 1. Is there an estimation scheme $f_{n+1}$ for the value $P(X_1=1 \mid X_0, X_{-1}, \dots, X_{-n})$ such that $f_{n+1}$ depends solely on the observed data segment $X_0, X_{-1}, \dots, X_{-n}$ and
$$\lim_{n\to\infty} \bigl( f_{n+1}(X_0, X_{-1}, \dots, X_{-n}) - P(X_1=1 \mid X_0, X_{-1}, \dots, X_{-n}) \bigr) = 0$$
almost surely for all stationary and ergodic binary time series $\{X_n\}$?

This question was answered by Ornstein [7], who constructed such a scheme. (See also Bailey [2].) Ornstein's scheme is not a simple one, and the proof of consistency is rather sophisticated. A much simpler scheme and proof of consistency were provided by Morvai, Yakowitz and Györfi [6]. (See also Weiss [12].) Here is Cover's second question.

Question 2. Is there an estimation scheme $f_{n+1}$ for the value $P(X_{n+1}=1 \mid X_0, X_1, \dots, X_n)$ such that $f_{n+1}$ depends solely on the data segment $X_0, X_1, \dots, X_n$ and
$$\lim_{n\to\infty} \bigl( f_{n+1}(X_0, X_1, \dots, X_n) - P(X_{n+1}=1 \mid X_0, X_1, \dots, X_n) \bigr) = 0$$
almost surely for all stationary and ergodic binary time series $\{X_n\}$?

This question was answered by Bailey [2] in the negative: he showed that there is no such scheme. (Also see Ryabko [10], Györfi, Morvai and Yakowitz [4], and Weiss [12].) Bailey used the technique of cutting and stacking developed by Ornstein [8] (see also Shields [11]). Ryabko's construction was based on a function of an infinite-state Markov chain.

This negative result can be interpreted as follows. Consider a weather forecaster whose task is to predict the probability of the event 'there will be rain tomorrow' given the observations up to the present day. Bailey's result says that the difference between the estimate and the true conditional probability cannot eventually be small for all stationary weather processes: the difference will be big infinitely often.

These results show that there is a great difference between Questions 1 and 2. Question 1 was addressed by Morvai, Yakowitz and Algoet [5], where a very simple estimation scheme was given which satisfies the statement in Question 1 in probability rather than almost surely.

Now consider a less ambitious goal than Question 2:

Question 3. Is there a sequence of stopping times $\{\lambda_n\}$ and an estimation scheme $f_n$ which depends on the observed data segment $(X_0, X_1, \dots, X_{\lambda_n})$ such that
$$\lim_{n\to\infty} \bigl( f_n(X_0, X_1, \dots, X_{\lambda_n}) - P(X_{\lambda_n+1}=1 \mid X_0, X_1, \dots, X_{\lambda_n}) \bigr) = 0$$
almost surely for all stationary binary time series $\{X_n\}$?

It turns out that the answer is affirmative, and such a scheme will be exhibited below.
This result can be interpreted as saying that the weather forecaster may refrain from predicting: he may say that he does not want to predict today, but he will predict at infinitely many time instances, and the difference between the prediction and the true conditional probability will vanish almost surely along the stopping times.

2 Forward Estimation for Stationary Binary Time Series

Let $\{X_n\}_{n=-\infty}^{\infty}$ denote a two-sided stationary binary time series. For $n\ge m$, it will be convenient to use the notation $X_m^n = (X_m, \dots, X_n)$.

For $k=1,2,\dots$, define the sequences $\{\tau_k\}$ and $\{\lambda_k\}$ recursively. Set $\lambda_0=0$. Let
$$\tau_k = \min\{\, t>0 : X_t^{\lambda_{k-1}+t} = X_0^{\lambda_{k-1}} \,\}$$
and $\lambda_k = \tau_k + \lambda_{k-1}$. (By stationarity, the string $X_0^{\lambda_{k-1}}$ must appear in the sequence $X_1^{\infty}$ almost surely.) The $k$th estimate of $P(X_{\lambda_k+1}=1 \mid X_0^{\lambda_k})$ is denoted by $P_k$, and is defined as
$$P_k = \frac{1}{k-1} \sum_{j=1}^{k-1} X_{\lambda_j+1}. \tag{1}$$

For an arbitrary stationary binary time series $\{Y_n\}_{n=-\infty}^{0}$, for $k=1,2,\dots$, define the sequences $\hat\tau_k$ and $\hat\lambda_k$ recursively. Set $\hat\lambda_0=0$. Let
$$\hat\tau_k = \min\{\, t>0 : Y_{-\hat\lambda_{k-1}-t}^{-t} = Y_{-\hat\lambda_{k-1}}^{0} \,\}$$
and let $\hat\lambda_k = \hat\tau_k + \hat\lambda_{k-1}$. When there is ambiguity as to which time series $\hat\tau_k$ and $\hat\lambda_k$ are to be applied, we will use the notation $\hat\tau_k(Y_{-\infty}^0)$ and $\hat\lambda_k(Y_{-\infty}^0)$.

It will be useful to define another time series $\{\tilde X_n\}_{n=-\infty}^{0}$ as
$$\tilde X_{-\lambda_k}^{0} := X_0^{\lambda_k} \quad \text{for all } k\ge 1. \tag{2}$$
Since $X_{\lambda_{k+1}-\lambda_k}^{\lambda_{k+1}} = X_0^{\lambda_k}$, the above definition is consistent. Notice that it is immediate that $\hat\tau_k(\tilde X_{-\infty}^0) = \tau_k$ and $\hat\lambda_k(\tilde X_{-\infty}^0) = \lambda_k$.

Lemma 1. The two time series $\{\tilde X_n\}_{n=-\infty}^{0}$ and $\{X_n\}_{n=-\infty}^{\infty}$ have identical distribution, that is, for all $n\ge 0$ and $x_{-n}^0 \in \{0,1\}^{n+1}$,
$$P(\tilde X_{-n}^0 = x_{-n}^0) = P(X_{-n}^0 = x_{-n}^0).$$
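The recursion defining $\{\lambda_k\}$ and the estimator (1) translate directly into code. A minimal Python sketch (the function names `stopping_times` and `estimate` are my own, not the paper's; it assumes a sufficiently long one-sided sample $X_0, X_1, \dots$ is available):

```python
def stopping_times(xs, k_max):
    """Compute [lambda_1, ..., lambda_k_max] for the sequence xs = (X_0, X_1, ...).

    lambda_0 = 0; tau_k is the least t > 0 at which the prefix X_0^{lambda_{k-1}}
    reappears as X_t^{lambda_{k-1}+t}; lambda_k = lambda_{k-1} + tau_k.
    """
    lams = []
    prev = 0  # lambda_{k-1}
    for _ in range(k_max):
        pattern = xs[: prev + 1]  # the string X_0^{lambda_{k-1}}
        t = 1
        while xs[t : t + prev + 1] != pattern:
            t += 1
            if t + prev >= len(xs):
                raise ValueError("not enough data to find the next recurrence")
        prev += t  # lambda_k = lambda_{k-1} + tau_k
        lams.append(prev)
    return lams


def estimate(xs, k):
    """The k-th estimate (1): P_k = (1/(k-1)) * sum_{j=1}^{k-1} X_{lambda_j + 1}."""
    lams = stopping_times(xs, k - 1)
    return sum(xs[lam + 1] for lam in lams) / (k - 1)
```

For the period-two sequence $0,1,0,1,\dots$ this gives $\lambda_k = 2k$ and $P_k = 1$, which matches the true conditional probability that a $1$ follows the observed prefix.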
Proof. First we prove that
$$P(\tilde X_{-n}^0 = x_{-n}^0,\ \hat\lambda_k(\tilde X_{-\infty}^0) = n) = P(X_{-n}^0 = x_{-n}^0,\ \hat\lambda_k(X_{-\infty}^0) = n). \tag{3}$$
Indeed, by (2), $\tilde X_{-\hat\lambda_k(\tilde X_{-\infty}^0)}^{0} = X_0^{\lambda_k}$, which yields
$$P(\tilde X_{-n}^0 = x_{-n}^0,\ \hat\lambda_k(\tilde X_{-\infty}^0) = n) = P(X_0^n = x_{-n}^0,\ \lambda_k = n),$$
and by stationarity,
$$P(X_0^n = x_{-n}^0,\ \lambda_k = n) = P(X_{-n}^0 = x_{-n}^0,\ \hat\lambda_k(X_{-\infty}^0) = n),$$
and (3) is proved. Apply (3) in order to get
$$P(\tilde X_{-n}^0 = x_{-n}^0) = \sum_{j=n}^{\infty} P(\tilde X_{-n}^0 = x_{-n}^0,\ \hat\lambda_n(\tilde X_{-\infty}^0) = j) = \sum_{j=n}^{\infty} \sum_{x_{-j}^{-n-1}\in\{0,1\}^{j-n}} P(\tilde X_{-j}^0 = x_{-j}^0,\ \hat\lambda_n(\tilde X_{-\infty}^0) = j)$$
$$= \sum_{j=n}^{\infty} \sum_{x_{-j}^{-n-1}\in\{0,1\}^{j-n}} P(X_{-j}^0 = x_{-j}^0,\ \hat\lambda_n(X_{-\infty}^0) = j) = \sum_{j=n}^{\infty} P(X_{-n}^0 = x_{-n}^0,\ \hat\lambda_n(X_{-\infty}^0) = j) = P(X_{-n}^0 = x_{-n}^0),$$
and Lemma 1 is proved.

Since $\{X_n\}_{n=-\infty}^{\infty}$ is a stationary time series, by Lemma 1 so is $\{\tilde X_n\}_{n=-\infty}^{0}$. Since a stationary time series can always be extended to a two-sided time series, we have also defined $\{\tilde X_n\}_{n=-\infty}^{\infty}$.

Now we prove the universal consistency of the estimator $P_k$.

Theorem 1. For all stationary binary time series $\{X_n\}$ and the estimator defined in (1),
$$\lim_{k\to\infty} \bigl( P_k - P(X_{\lambda_k+1}=1 \mid X_0^{\lambda_k}) \bigr) = 0 \quad \text{almost surely}. \tag{4}$$
Moreover,
$$\lim_{k\to\infty} P_k = \lim_{k\to\infty} P(X_{\lambda_k+1}=1 \mid X_0^{\lambda_k}) = P(\tilde X_1 = 1 \mid \tilde X_{-\infty}^0) \tag{5}$$
almost surely.

Proof. Write
$$P_k - P(X_{\lambda_k+1}=1 \mid X_0^{\lambda_k}) = \frac{1}{k-1}\sum_{j=1}^{k-1} \bigl\{ X_{\lambda_j+1} - P(X_{\lambda_j+1}=1 \mid X_0^{\lambda_j}) \bigr\} + \frac{1}{k-1}\sum_{j=1}^{k-1} \bigl\{ P(X_{\lambda_j+1}=1 \mid X_0^{\lambda_j}) - P(X_{\lambda_k+1}=1 \mid X_0^{\lambda_k}) \bigr\}$$
$$= \frac{1}{k-1}\sum_{j=1}^{k-1} \Gamma_j + \frac{1}{k-1}\sum_{j=1}^{k-1} (\Delta_j - \Delta_k).$$
Observe that $\{\Gamma_j, \sigma(X_0^{\lambda_j+1})\}$ is a bounded martingale difference sequence for $1\le j<\infty$. To see this, note that $\sigma(X_0^{\lambda_j+1})$ is monotone increasing, $\Gamma_j$ is measurable with respect to $\sigma(X_0^{\lambda_j+1})$, and $E(\Gamma_j \mid X_0^{\lambda_{j-1}+1}) = 0$ for $1\le j<\infty$.
Now apply Azuma's exponential bound for bounded martingale differences (Azuma [1]) to get that for any $\epsilon>0$,
$$P\left( \left| \frac{1}{k-1}\sum_{j=1}^{k-1}\Gamma_j \right| > \epsilon \right) \le 2\exp(-\epsilon^2 (k-1)/2).$$
After summing the right side over $k$ and appealing to the Borel–Cantelli lemma for a sequence of $\epsilon$'s tending to zero, we get
$$\frac{1}{k-1}\sum_{j=1}^{k-1}\Gamma_j \to 0 \quad \text{almost surely}.$$
It remains to show
$$\frac{1}{k-1}\sum_{j=1}^{k-1}(\Delta_j - \Delta_k) \to 0 \quad \text{almost surely}.$$
Define
$$p_{k,n}(x_{-n}^0) = P(X_{\lambda_k+1}=1 \mid X_0^{\lambda_k} = x_{-n}^0,\ \lambda_k = n)$$
and (applying $\hat\lambda_k$ to the time series $\{\tilde X_n\}_{n=-\infty}^0$)
$$\tilde p_{k,n}(x_{-n}^0) = P(\tilde X_1 = 1 \mid \tilde X_{-\hat\lambda_k}^0 = x_{-n}^0,\ \hat\lambda_k = n).$$
Now the fact that $\lambda_k = \hat\lambda_k(\tilde X_{-\infty}^0)$ and Lemma 1 together imply
$$p_{k,n}(x_{-n}^0) = \tilde p_{k,n}(x_{-n}^0). \tag{6}$$
By (2) and (6),
$$\tilde p_{k,\lambda_k}(X_0^{\lambda_k}) = \tilde p_{k,\hat\lambda_k}(\tilde X_{-\hat\lambda_k}^0). \tag{7}$$
Combine (6) and (7) in order to get
$$P(X_{\lambda_k+1}=1 \mid X_0^{\lambda_k}) = P(\tilde X_1 = 1 \mid \tilde X_{-\hat\lambda_k}^0).$$
Notice that $\{P(\tilde X_1=1 \mid \tilde X_{-\hat\lambda_k}^0), \sigma(\tilde X_{-\hat\lambda_k}^0)\}$ is a bounded martingale, and so it converges almost surely to $P(\tilde X_1=1 \mid \tilde X_{-\infty}^0)$; hence so does $P(X_{\lambda_k+1}=1 \mid X_0^{\lambda_k})$. We have proved that $\Delta_j$ converges almost surely. Now the Toeplitz lemma yields that $\frac{1}{k-1}\sum_{j=1}^{k-1}(\Delta_j-\Delta_k) \to 0$ almost surely. The proof of Theorem 1 is complete.

3 The Growth Rate of the Stopping Times

The next result shows that the growth of the stopping times $\{\lambda_k\}$ is rather rapid. Let $p(x_{-n}^0) = P(X_{-n}^0 = x_{-n}^0)$.

Theorem 2. Let $\{X_n\}$ be a stationary and ergodic binary time series. Suppose that $H>0$, where
$$H = \lim_{n\to\infty} -\frac{1}{n+1} E \log p(X_{-n}^0)$$
is the process entropy. Let $0<\epsilon<H$ be arbitrary. Then for $k$ large enough,
$$\lambda_k(\omega) \ge c^{c^{\cdot^{\cdot^{\cdot^{c}}}}} \quad \text{almost surely}, \tag{8}$$
where the height of the tower is $k-K$, $K(\omega)$ is a finite number which depends on $\omega$, and $c = 2^{H-\epsilon}$.
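The tower in (8) arises by iterating the key inequality $\hat\lambda_{k+1} > \hat\tau_{k+1} > c^{\hat\lambda_k}$ established in the proof; a short unfolding (added here for clarity, not in the original text) is
$$\lambda_k > c^{\lambda_{k-1}} > c^{c^{\lambda_{k-2}}} > \cdots > c^{c^{\cdot^{\cdot^{\lambda_K}}}}, \qquad k > K(\omega),$$
a tower with $k-K$ exponentiations. Since $\lambda_K \ge K \ge 1$, the innermost exponent $\lambda_K$ may be replaced by $1$, leaving the tower of $c$'s of height $k-K$ appearing in (8).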
Proof. Since by (2) $\lambda_k = \hat\lambda_k(\tilde X_{-\infty}^0)$, and by Lemma 1 the time series $\{X_n\}_{n=-\infty}^{\infty}$ and $\{\tilde X_n\}_{n=-\infty}^{\infty}$ have identical distributions, and hence the same entropy, it is enough to prove the result for $\hat\lambda_k(\tilde X_{-\infty}^0)$. Now $\hat\tau_k$ and $\hat\lambda_k$ are evaluated on the process $\{\tilde X_n\}_{n=-\infty}^{0}$. For $0<l<\infty$ define
$$R(l) = \min\{\, j\ge l+1 : \tilde X_{-l-j}^{-j} = \tilde X_{-l}^{0} \,\}.$$
By Ornstein and Weiss [9],
$$\frac{1}{l+1}\log R(l) \to H \quad \text{almost surely}. \tag{9}$$
First we show that if $H>0$ then for $k$ large enough $\hat\tau_{k+1} > \hat\lambda_k$ almost surely. We argue by contradiction. Suppose that $\hat\tau_{k+1}\to\infty$ and $\hat\tau_{k+1}\le\hat\lambda_k$ infinitely often. Then $\tilde X_{-\hat\lambda_k}^{0} = \tilde X_{-\hat\lambda_k-\hat\tau_{k+1}}^{-\hat\tau_{k+1}}$ and $\hat\tau_{k+1}\le\hat\lambda_k$ infinitely often. Hence
$$\tilde X_{-\hat\tau_{k+1}+1}^{0} = \tilde X_{-\hat\tau_{k+1}-\hat\tau_{k+1}+1}^{-\hat\tau_{k+1}}$$
infinitely often, and $R(\hat\tau_{k+1}-1) \le \hat\tau_{k+1}$ infinitely often. Then by (9),
$$H = \lim_{k\to\infty} \frac{1}{\hat\tau_{k+1}} \log R(\hat\tau_{k+1}-1) \le \lim_{k\to\infty} \frac{1}{\hat\tau_{k+1}} \log \hat\tau_{k+1} = 0,$$
provided that $\hat\tau_k\to\infty$. Now assume that $\eta = \sup_{0<k<\infty} \hat\tau_k < \infty$. Then the nested strings $\tilde X_{-\hat\lambda_k}^0$ all recur at lags of at most $\eta$, so $\tilde X_{-\infty}^0$ is periodic with period at most $\eta$; for a periodic sequence $\frac{1}{l+1}\log R(l) \to 0$, which again contradicts $H>0$. Thus $H>0$ implies that for $k$ large enough $\hat\tau_{k+1} > \hat\lambda_k$ almost surely, and hence for $k$ large enough $R(\hat\lambda_k) = \hat\tau_{k+1}$ almost surely. Hence by (9),
$$\frac{1}{\hat\lambda_k+1}\log\hat\tau_{k+1} \to H \quad \text{almost surely}.$$
Thus for almost every $\omega\in\Omega$ there exists a positive finite integer $K(\omega)$ such that for $k\ge K(\omega)$,
$$\frac{1}{\hat\lambda_k+1}\log\hat\tau_{k+1} > H-\epsilon \quad \text{and so} \quad \hat\lambda_{k+1} > \hat\tau_{k+1} > c^{\hat\lambda_k},$$
and the proof of Theorem 2 is complete.

4 Guessing the Output at Stopping Time Instances

If the weather forecaster is pressed to say simply whether or not it will rain tomorrow, then we need a guessing scheme rather than a predictor. Define the guessing scheme $\{\bar X_{\lambda_k}\}$ for the values $\{X_{\lambda_k+1}\}$ as
$$\bar X_{\lambda_k} = 1_{\{P_k \ge 0.5\}}.$$
Let $X^*_{\lambda_k}$ denote the Bayes rule, that is,
$$X^*_{\lambda_k} = 1_{\{P(X_{\lambda_k+1}=1 \mid X_0^{\lambda_k}) \ge 0.5\}}.$$

Theorem 3. Let $\{X_n\}_{n=-\infty}^{\infty}$ be a stationary binary time series.
The proposed guessing scheme $\bar X_{\lambda_k}$ works on average at the stopping times $\lambda_k$ just as well as the Bayes rule, that is,
$$\lim_{n\to\infty} \left( \frac{1}{n}\sum_{k=1}^{n} 1_{\{\bar X_{\lambda_k} = X_{\lambda_k+1}\}} - \frac{1}{n}\sum_{k=1}^{n} 1_{\{X^*_{\lambda_k} = X_{\lambda_k+1}\}} \right) = 0 \tag{10}$$
almost surely. Moreover,
$$\lim_{k\to\infty} \bigl( P(\bar X_{\lambda_k} = X_{\lambda_k+1} \mid X_0^{\lambda_k}) - P(X^*_{\lambda_k} = X_{\lambda_k+1} \mid X_0^{\lambda_k}) \bigr) = 0 \tag{11}$$
almost surely.

Proof. Write
$$\frac{1}{n}\sum_{k=1}^{n} 1_{\{\bar X_{\lambda_k}=X_{\lambda_k+1}\}} - \frac{1}{n}\sum_{k=1}^{n} 1_{\{X^*_{\lambda_k}=X_{\lambda_k+1}\}} = \frac{1}{n}\sum_{k=1}^{n} \bigl[ 1_{\{\bar X_{\lambda_k}=X_{\lambda_k+1}\}} - P(\bar X_{\lambda_k}=X_{\lambda_k+1} \mid X_0^{\lambda_k}) \bigr]$$
$$- \frac{1}{n}\sum_{k=1}^{n} \bigl[ 1_{\{X^*_{\lambda_k}=X_{\lambda_k+1}\}} - P(X^*_{\lambda_k}=X_{\lambda_k+1} \mid X_0^{\lambda_k}) \bigr] + \frac{1}{n}\sum_{k=1}^{n} \bigl[ P(\bar X_{\lambda_k}=X_{\lambda_k+1} \mid X_0^{\lambda_k}) - P(X^*_{\lambda_k}=X_{\lambda_k+1} \mid X_0^{\lambda_k}) \bigr]$$
$$= \Gamma_n + \Theta_n + \Psi_n.$$
Now $\Gamma_n$ and $\Theta_n$ tend to zero, since they are averages of bounded martingale differences (cf. Azuma [1]). Concerning the third term $\Psi_n$, it is enough to prove that
$$\lim_{k\to\infty} \bigl( P(\bar X_{\lambda_k}=X_{\lambda_k+1} \mid X_0^{\lambda_k}) - P(X^*_{\lambda_k}=X_{\lambda_k+1} \mid X_0^{\lambda_k}) \bigr) = 0 \quad \text{almost surely}.$$
To see this, recall from Theorem 1 that
$$\lim_{k\to\infty} P_k = \lim_{k\to\infty} P(X_{\lambda_k+1}=1 \mid X_0^{\lambda_k}) = P(\tilde X_1=1 \mid \tilde X_{-\infty}^0)$$
almost surely, and apply this in order to get
$$\lim_{k\to\infty} \bigl[ P(\bar X_{\lambda_k}=X_{\lambda_k+1} \mid X_0^{\lambda_k}) - P(X^*_{\lambda_k}=X_{\lambda_k+1} \mid X_0^{\lambda_k}) \bigr]$$
$$= \lim_{k\to\infty} \Bigl\{ \bigl[ P(P(\tilde X_1=1 \mid \tilde X_{-\infty}^0) \ne 0.5,\ \bar X_{\lambda_k}=X_{\lambda_k+1} \mid X_0^{\lambda_k}) - P(P(\tilde X_1=1 \mid \tilde X_{-\infty}^0) \ne 0.5,\ X^*_{\lambda_k}=X_{\lambda_k+1} \mid X_0^{\lambda_k}) \bigr]$$
$$+ \bigl[ P(P(\tilde X_1=1 \mid \tilde X_{-\infty}^0) = 0.5,\ \bar X_{\lambda_k}=X_{\lambda_k+1} \mid X_0^{\lambda_k}) - P(P(\tilde X_1=1 \mid \tilde X_{-\infty}^0) = 0.5,\ X^*_{\lambda_k}=X_{\lambda_k+1} \mid X_0^{\lambda_k}) \bigr] \Bigr\} = 0.$$
The proof of Theorem 3 is now complete.

Acknowledgments. The author wishes to thank Benjamin Weiss for helpful discussions and suggestions. This paper was written under the auspices of the Hungarian National Eötvös Fund.

References

[1] K. Azuma, "Weighted sums of certain dependent random variables," Tohoku Mathematical Journal, vol. 19, pp. 357–367, 1967.

[2] D. H. Bailey, Sequential Schemes for Classifying and Predicting Ergodic Processes. Ph.D. thesis, Stanford University, 1976.

[3] T. M. Cover, "Open problems in information theory," in 1975 IEEE Joint Workshop on Information Theory, pp. 35–36. New York: IEEE Press, 1975.

[4] L. Györfi, G. Morvai, and S. Yakowitz, "Limits to consistent on-line forecasting for ergodic time series," IEEE Transactions on Information Theory, vol. 44, pp. 886–892, 1998.

[5] G. Morvai, S. Yakowitz, and P. Algoet, "Weakly convergent nonparametric forecasting of stationary time series," IEEE Transactions on Information Theory, vol. 43, pp. 483–498, 1997.

[6] G. Morvai, S. Yakowitz, and L. Györfi, "Nonparametric inference for ergodic, stationary time series," Annals of Statistics, vol. 24, pp. 370–379, 1996.

[7] D. S. Ornstein, "Guessing the next output of a stationary process," Israel Journal of Mathematics, vol. 30, pp. 292–296, 1978.

[8] D. S. Ornstein, Ergodic Theory, Randomness, and Dynamical Systems. Yale University Press, 1974.

[9] D. S. Ornstein and B. Weiss, "Entropy and data compression schemes," IEEE Transactions on Information Theory, vol. 39, pp. 78–83, 1993.

[10] B. Ya. Ryabko, "Prediction of random sequences and universal coding," Problems of Information Transmission, vol. 24, pp. 87–96, Apr.–June 1988.

[11] P. C. Shields, "Cutting and stacking: a method for constructing stationary processes," IEEE Transactions on Information Theory, vol. 37, pp. 1605–1614, 1991.

[12] B. Weiss, Single Orbit Dynamics, American Mathematical Society, 2000.
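A quick numerical illustration of the growth rate in Theorem 2 (this experiment is my own addition, not part of the paper): for an i.i.d. fair-coin process the entropy is $H=1$ bit, so $\hat\tau_{k+1}$ should be roughly $2^{\lambda_k}$ and only a handful of stopping times fit into any fixed sample. A Python sketch (sample size and seed are arbitrary choices):

```python
import random

random.seed(1)
n = 200_000
xs = [random.randint(0, 1) for _ in range(n)]  # i.i.d. fair coin, H = 1 bit

# Compute lambda_1, lambda_2, ... until the next recurrence of the prefix
# X_0^{lambda_{k-1}} cannot be found within the available data.
lams, prev = [], 0
while True:
    pattern = xs[: prev + 1]  # the string X_0^{lambda_{k-1}}
    t, found = 1, False
    while t + prev < n:
        if xs[t : t + prev + 1] == pattern:
            found = True
            break
        t += 1
    if not found:
        break
    prev += t  # lambda_k = lambda_{k-1} + tau_k
    lams.append(prev)

print(lams)  # a short, rapidly growing list, as Theorem 2 predicts
```

Each run produces only a few $\lambda_k$ before the next recurrence falls outside the sample, reflecting the tower-like growth.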
