Weakly Convergent Nonparametric Forecasting of Stationary Time Series

Gusztáv Morvai, Sidney Yakowitz and Paul Algoet

IEEE Transactions on Information Theory, Vol. 43, pp. 483-498, 1997.

Abstract

The conditional distribution of the next outcome given the infinite past of a stationary process can be inferred from finite but growing segments of the past. Several schemes are known for constructing pointwise consistent estimates, but they all demand prohibitive amounts of input data. In this paper we consider real-valued time series and construct conditional distribution estimates that make much more efficient use of the input data. The estimates are consistent in a weak sense, and the question whether they are pointwise consistent is still open. For finite-alphabet processes one may rely on a universal data compression scheme like the Lempel-Ziv algorithm to construct conditional probability mass function estimates that are consistent in expected information divergence. Consistency in this strong sense cannot be attained in a universal sense for all stationary processes with values in an infinite alphabet, but weak consistency can. Some applications of the estimates to on-line forecasting, regression and classification are discussed.

I. Introduction and Overview

We are motivated by some fundamental questions regarding inference of time series that were raised by T. Cover [9] and concerning which significant progress has been made during the intervening years. The time series is a stationary process $\{X_t\}$ with values in a set $\mathcal{X}$ which may be a finite set, the real line, or a finite-dimensional Euclidean space. For $t \ge 0$ let $X^t = (X_0, X_1, \ldots, X_{t-1})$ denote the $t$-past at time $t$. It is also convenient to consider the outcome $X = X_0$, the $t$-past $X^-_t = (X_{-t}, \ldots, X_{-1})$ and the infinite past $X^- = (\ldots, X_{-2}, X_{-1})$ at time 0.
The true process distribution $P$ is unknown a priori but is known to fall in the class $\mathcal{P}_s$ of stationary distributions on the sequence space $\mathcal{X}^{\mathbb{Z}}$. Cover's list of questions included the following: given that $\{X_t\}$ is a $\{0,1\}$-valued time series with an unknown stationary ergodic distribution $P$, is it possible to infer estimates $\hat P\{X_t = 1 \mid X^t\}$ of the conditional probabilities $P\{X_t = 1 \mid X^t\}$ from the past $X^t$ such that

$$[\hat P\{X_t = 1 \mid X^t\} - P\{X_t = 1 \mid X^t\}] \to 0 \quad P\text{-almost surely as } t \to \infty\,? \tag{1}$$

D. Bailey [5] used the cutting and stacking technique of ergodic theory to prove that the answer is negative. A simple proof of this negative result is outlined in Proposition 3 of Ryabko [30]. Bailey [5] also discussed a result of Ornstein [22] that provides a positive answer to a less demanding question of Cover [9], namely whether there exist estimates $\hat P\{X = 1 \mid X^-_t\}$ based on the past $X^-_t$ such that for all $P \in \mathcal{P}_s$,

$$\hat P\{X = 1 \mid X^-_t\} \to P\{X = 1 \mid X^-\} \quad P\text{-almost surely as } t \to \infty. \tag{2}$$

Ornstein constructed estimates $\hat P_k\{X = 1 \mid X^-_{\lambda(k)}\}$ which depend on finite past segments $X^-_{\lambda(k)} = (X_{-\lambda(k)}, \ldots, X_{-1})$ and which converge almost surely to $P\{X = 1 \mid X^-\}$ for every $P \in \mathcal{P}_s$. The length $\lambda(k)$ of the data record $X^-_{\lambda(k)}$ depends on the data itself, i.e. $\lambda(k)$ is a stopping time adapted to the filtration $\{\sigma(X^-_t) : t \ge 0\}$. To get estimates satisfying (2), simply define $\hat P\{X = 1 \mid X^-_t\}$ as the estimate $\hat P_k\{X = 1 \mid X^-_{\lambda(k)}\}$ where $k$ is the largest integer such that $\hat P_k\{X = 1 \mid X^-_{\lambda(k)}\}$ can be evaluated from the data $X^-_t$ (that is, $X^-_{\lambda(k)}$ is a suffix of the string $X^-_t$ but $X^-_{\lambda(k+1)}$ is not).
The true conditional probability $P\{X = 1 \mid X^-_t\}$ converges to $P\{X = 1 \mid X^-\}$ almost surely by the martingale convergence theorem, and the estimate $\hat P\{X = 1 \mid X^-_t\}$ converges to the same limit, hence

$$[\hat P\{X = 1 \mid X^-_t\} - P\{X = 1 \mid X^-_t\}] \to 0 \quad P\text{-almost surely and in } L_1(P). \tag{3}$$

An on-line estimate $\hat P\{X_t = 1 \mid X^t\}$ can be constructed at time $t$ from the past $X^t$ in the same way as $\hat P\{X = 1 \mid X^-_t\}$ was constructed from $X^-_t$. By (3) and stationarity,

$$[\hat P\{X_t = 1 \mid X^t\} - P\{X_t = 1 \mid X^t\}] \to 0 \quad \text{in } L_1(P) \text{ as } t \to \infty. \tag{4}$$

Thus the guessing scheme $\hat P\{X_t = 1 \mid X^t\}$ is universally consistent in the weak sense of (4), although no guessing scheme can be universally consistent in the pointwise sense of (1).

Ornstein's result can be generalized when $\{X_t\}$ is a stationary process with values in a complete separable metric (Polish) space $\mathcal{X}$. Algoet [1] constructed estimates $\hat P_k(dx \mid X^-_{\lambda(k)})$ that, with probability one under any $P \in \mathcal{P}_s$, converge in law to the true conditional distribution $P(dx \mid X^-)$ of $X = X_0$ given the infinite past. By setting $\hat P(dx \mid X^-_t) = \hat P_k(dx \mid X^-_{\lambda(k)})$ for $\lambda(k) \le t < \lambda(k+1)$, one obtains estimates $\hat P(dx \mid X^-_t)$ that almost surely converge in law to the random measure $P(dx \mid X^-)$ in the space of probability distributions on $\mathcal{X}$. Thus for any bounded continuous function $h(x)$ and any stationary distribution $P \in \mathcal{P}_s$,

$$\int h(x)\, \hat P(dx \mid X^-_t) \to \int h(x)\, P(dx \mid X^-) \quad P\text{-almost surely.} \tag{5}$$

A much simpler estimate $\hat P_k(dx \mid X^-_{\lambda(k)})$ and convergence proof were obtained by Morvai, Yakowitz and Györfi [21].
Their estimate $\hat P_k\{X \in B \mid X^-_{\lambda(k)}\}$ of the conditional probability of a subset $B \subseteq \mathcal{X}$ has the structure of a sample mean:

$$\hat P_k\{X \in B \mid X^-_{\lambda(k)}\} = \frac{1}{k} \sum_{1 \le i \le k} 1\{X_{-\tau(i)} \in B\}, \tag{6}$$

where the $X_{-\tau(i)}$ are samples of the process at selected instants in the past and $\lambda(k)$ is the smallest integer $t$ such that the indices $\{\tau(i) : 1 \le i \le k\}$ can be inferred from the segment $X^-_t$. From careful reading of [21], one can surmise that $\lambda(k)$ will be huge for relatively small values of the sample size $k$. Morvai [20] applied the ergodic theorem for recurrence times of Ornstein and Weiss [24] and argued that if $\{X_t\}$ is a stationary ergodic finite-alphabet process with positive entropy rate $H$ bits per symbol and $C$ is a constant such that $1 < C < 2^H$, then, with probability one,

$$\lambda(k) \ge C^{C^{\cdot^{\cdot^{C}}}} \quad \text{eventually for large } k, \tag{7}$$

where the height of the exponential tower is $k - k_0$ for some number $k_0$ that depends on the process realization but not on $k$. To our knowledge, none of the strongly consistent methods have been applied to any data sets, real or simulated.

Scarpellini [31] has applied the methods of Bailey [5] and Ornstein [22] to infer the conditional expectation $E\{X_\tau \mid \{X_s\}_{s \le 0}\}$ of the outcome $X_\tau$ at some fixed time $\tau > 0$ given the infinite past of a stationary real-valued continuous-time process $\{X_t\}$ from past experience. The outcomes $X_t$ are assumed to be bounded in absolute value by some fixed constant $K$. Scarpellini constructs estimates by averaging samples taken at a finite number of regularly spaced instants in the past and proves that the estimates converge almost surely to the desired limit $E\{X_\tau \mid \{X_s\}_{s \le 0}\}$. His generalization of Ornstein's result is not quite straightforward, and the difficulty seems to be caused more by the continuity of the range space $[-K, K]$ than by the continuity of the time index $t$.
These works are of considerable theoretical interest because they point to the limits of what can be achieved by way of time series prediction. Pointwise consistency can be attained for all stationary processes, but the estimates are based on enormous data records. It is hard to say how much raw data are really needed to get estimates with reasonable precision. The nonparametric class of all stationary ergodic processes is very rich and can model all sorts of complex nonlinear dynamics with long-range dependencies and periodicities at many different time scales. It is hopeless to get efficient estimates with bounds on the convergence rate unless one has a priori information that winnows the range of possibilities to some manageable subclass. In the literature on nonparametric estimation (e.g. see Györfi, Härdle, Sarda and Vieu [15] and also Marton and Shields [19]), one imposes mixing conditions on the time series and then finds that the standard methods are consistent and achieve stated asymptotic rates of convergence. These approaches are preferable to the universal methods when one is assured of the mixing hypotheses. On the other hand, there is essentially no methodology for testing for mixing.

In the present study we relax the strong consistency requirement and push in the direction of greater efficiency. Rather than demanding strong consistency or pointwise convergence in (5), we shall be satisfied with weak consistency or mean convergence in $L_1(P)$. (Note that mean convergence is equivalent to convergence in probability because the random variables are uniformly bounded.) Being more tolerant in this way enables us to significantly reduce the data demands of the algorithm.
The estimates will again be defined as empirical averages of sample values, but the length of the raw data segment that must be inspected to collect a given number of samples will grow only polynomially fast in the sample size (when $\mathcal{X}$ is a finite alphabet), rather than as a tower of exponentials as in (7).

For processes with values in a finite set $\mathcal{X}$, weak consistency means that for any stationary distribution $P$ on $\mathcal{X}^{\mathbb{Z}}$ and any $x \in \mathcal{X}$, the estimate $\hat P(x \mid X^-_t) = \hat P\{X = x \mid X^-_t\}$ will converge in mean to the true conditional probability $P(x \mid X^-) = P\{X = x \mid X^-\}$:

$$\hat P(x \mid X^-_t) \to P(x \mid X^-) \quad \text{in } L_1(P), \text{ for any } x \in \mathcal{X}. \tag{8}$$

There exist estimates that are universally consistent in a stronger sense. Given a universal data compression algorithm or a universal parsimonious modeling scheme for stationary processes with values in the finite alphabet $\mathcal{X}$, we shall design estimates $\hat P(x \mid X^-_t)$ that are consistent in expected information divergence for all stationary $P$. The expectation of the Kullback-Leibler divergence between the conditional probability mass function $P(x \mid X^-)$ and the estimate $\hat P(x \mid X^-_t)$ will vanish in the limit as $t \to \infty$ for all $P \in \mathcal{P}_s$:

$$E_P\{ I(P_{X \mid X^-} \,\|\, \hat P_{X \mid X^-_t}) \} \to 0, \tag{9}$$

where

$$I(P_{X \mid X^-} \,\|\, \hat P_{X \mid X^-_t}) = \sum_{x \in \mathcal{X}} P(x \mid X^-) \log\left( \frac{P(x \mid X^-)}{\hat P(x \mid X^-_t)} \right). \tag{10}$$

Consistency in expected information divergence implies consistency in mean as in (8), and is equivalent to the requirement that for any stationary $P \in \mathcal{P}_s$ we have mean convergence

$$\log \hat P(X \mid X^-_t) \to \log P(X \mid X^-) \quad \text{in } L_1(P). \tag{11}$$

The constructions of Ornstein [22] and Morvai, Yakowitz and Györfi [21] yield estimates $\hat P(x \mid X^-_t)$ such that (11) holds universally in the pointwise sense, but perhaps not in mean.
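As a point of reference for the divergence criterion (9)-(10), the per-realization divergence between a true conditional pmf and an estimate is straightforward to compute; the following helper is our illustration, not part of the paper's construction.

```python
import math

def info_divergence(p, q):
    """Kullback-Leibler divergence (10), in bits, between two pmfs on a
    finite alphabet, given as dicts mapping symbols to probabilities.
    Uses the convention 0 log 0 = 0; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)
```

For example, a uniform estimate against a deterministic truth gives one bit of divergence on a binary alphabet, while the divergence vanishes when the estimate matches the truth.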
No estimates $\hat P(x \mid X^-_t)$ can be consistent in expected information divergence for all stationary processes with values in a countably infinite alphabet, but weak consistency as in (8) is universally achievable. Barron, Györfi and van der Meulen [7] consider an unknown distribution $P(dx)$ on an abstract measurable space $\mathcal{X}$ and construct estimates from independent samples so that the estimates are consistent in information divergence and in expected information divergence whenever $P(dx)$ has finite Kullback-Leibler divergence $I(P \,\|\, M) < \infty$ relative to some known probability distribution $M(dx)$ on $\mathcal{X}$. In the present paper, the discussion of estimates that are consistent in expected information divergence is limited to the finite-alphabet case.

The organization of the paper is as follows. In Section II we describe an algorithm for constructing estimates $\hat P_k(dx \mid X^-_{\lambda(k)})$ and prove weak consistency for all stationary real-valued time series. The method and its proof apply to time series with values in any $\sigma$-compact Polish space. In Section III we transform the estimates $\hat P_k(dx \mid X^-_{\lambda(k)})$ into estimates $\hat P(dx \mid X^-_n)$ by letting $k$ depend on $n$. We choose an increasing sequence $k(n)$ and define the estimate $\hat P(dx \mid X^-_n)$ as $\hat P_{k(n)}(dx \mid X^-_{\lambda(k(n))})$ if $\lambda(k(n)) \le n$ and as some default measure $Q(dx)$ otherwise. If $k(n)$ grows sufficiently slowly with $n$ then the data requirement $\lambda(k(n))$ will seldom exceed the available length $n$ and the estimates $\hat P(dx \mid X^-_n)$ will be weakly consistent just like the estimates $\hat P_{k(n)}(dx \mid X^-_{\lambda(k(n))})$. Section IV is about modeling and data compression and about estimates that are consistent in expected information divergence for stationary processes with values in a finite alphabet.
In Section V, we shift $\hat P(dx \mid X^-_t)$ from time 0 to time $t$ and show that the shifted estimates $\hat P(dx_t \mid X^t)$ can be used for sequential forecasting or on-line prediction. We show that one can make sequential decisions based on the shifted estimates $\hat P(dx_t \mid X^t)$ so that the average loss per decision converges in mean to the minimum long-run average loss that could be attained if one could make decisions with knowledge of the true conditional distribution of the next outcome given the infinite past at each step. In particular, the average rate of incorrect guesses in classification and the average of the mean squared error in regression converge to the minimum that could be attained if the infinite past were known to begin with.

We would like to alert the reader to some of our notational conventions. Only one level of subscripts or superscripts is allowed in equations that are embedded in the text and so we are often forced to adopt the flat functional notation $\lambda(k)$, $\lambda(k(n))$, $\ell(k)$, $J(k)$, $\tau(k,j)$, etc. However, the equations sometimes look better with nested subscripts and superscripts and therefore we prefer to write $\lambda_k$, $\lambda_{k(n)}$, $\ell_k$, $J_k$, $\tau^k_j$, etc. in the displayed equations. We hope that mixing these notational conventions will not be a source of confusion but rather will improve the readability of the paper. Logarithms and entropy rates are taken in base 2 unless specified otherwise, and exponential growth rates are really doubling rates.

II. Learning the Conditional Distribution $P(dx \mid X^-)$

Let $\{X_t\}$ be a real-valued stationary time series. The process distribution is unknown but shift-invariant. We wish to infer the conditional distribution of $X = X_0$ given the infinite past $X^-$ from past experience.
We show that it is very easy to construct weakly consistent estimates $\hat P_k(dx \mid X^-_{\lambda(k)})$ depending on finite past data segments $X^-_{\lambda(k)}$ such that for every bounded continuous function $h(x)$ on $\mathcal{X}$ and any stationary distribution $P \in \mathcal{P}_s$,

$$\lim_k \int h(x)\, \hat P_k(dx \mid X^-_{\lambda(k)}) = \int h(x)\, P(dx \mid X^-) \quad \text{in } L_1(P). \tag{12}$$

The estimates $\hat P_k(dx \mid X^-_{\lambda(k)})$ will be defined in terms of quantized versions of the process $\{X_t\}$. Let $\mathcal{X}$ denote the real line and let $\{\mathcal{B}_k\}_{k \ge 1}$ be an increasing sequence of finite subfields that asymptotically generate the Borel $\sigma$-field on $\mathcal{X}$. Let $x \mapsto [x]_k$ denote the quantizer that maps any point $x \in \mathcal{X}$ to the atom of $\mathcal{B}_k$ that happens to contain $x$. For any integer $\ell \ge 1$ let $[X^-_\ell]_k$ denote the quantized sequence $([X_{-\ell}]_k, \ldots, [X_{-1}]_k)$.

Given any integer $J \ge 1$, one may search backwards in time and collect $J$ samples of the process at times when the quantized $\ell$-past looks exactly like the quantized $\ell$-past at time 0. Let $\lambda = \lambda(k, \ell, J)$ denote the length of the data segment $X^-_\lambda = (X_{-\lambda}, \ldots, X_{-1})$ that must be inspected to find these $J$ samples and let $\hat P_{k,\ell,J}(dx \mid X^-_\lambda)$ denote the empirical distribution of those samples. Then $\hat P_{k,\ell,J}(dx \mid X^-_\lambda)$ will be a good estimate of $P(dx \mid X^-)$ if the sample size $J$, the context length $\ell$ and the quantizer index $k$ are sufficiently large. In fact, if $k$ and $\ell$ are fixed and the sample size $J$ tends to infinity then by the ergodic theorem, $\hat P_{k,\ell,J}(dx \mid X^-_{\lambda(k,\ell,J)})$ will converge in law to $P(dx \mid [X^-_\ell]_k)$. If we now refine the context by increasing $k$ and $\ell$, then $P(dx \mid [X^-_\ell]_k)$ will converge in law to $P(dx \mid X^-)$ by the martingale convergence theorem. The question is how to turn this limit of limits into a single limit by letting $k$, $\ell$ and $J$ increase simultaneously to infinity.
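To make the quantization concrete, here is one possible choice of the fields $\mathcal{B}_k$, sketched in Python: dyadic cells of width $2^{-k}$ tiling $[-2^k, 2^k)$, plus two overflow atoms for the tails. This particular family is our choice, not the paper's; it is increasing in $k$ (each cell splits in two at the next level, and the overflow atoms shrink) and asymptotically generates the Borel $\sigma$-field.

```python
import math

def quantize(x, k):
    """Map a real x to a hashable label of the atom of B_k containing it.

    Hypothetical choice of B_k: dyadic cells of width 2**-k on
    [-2**k, 2**k), plus two overflow atoms.  Hashable labels let
    quantized contexts be compared for equality.
    """
    if x < -2**k:
        return ('lo',)
    if x >= 2**k:
        return ('hi',)
    return ('cell', math.floor(x * 2**k))  # index of the dyadic cell holding x

def quantized_past(xs, ell, k):
    """Quantized ell-past ([X_-ell]_k, ..., [X_-1]_k) of the sequence xs,
    where xs lists the observations with the most recent one last."""
    return tuple(quantize(x, k) for x in xs[-ell:])
```

Two quantized $\ell$-pasts match exactly when the tuples returned by `quantized_past` are equal, which is the matching test used by the backward search described above.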
We must make $k$ and $\ell$ large to reduce the bias and we must make $J$ large to reduce the variance of the estimates. We will let $\ell$ and $J$ grow with $k$ and show that if $\ell(k)$ and $J(k)$ are monotonically increasing to infinity then the empirical conditional distribution estimate $\hat P_k(dx \mid X^-_{\lambda(k)}) = \hat P_{k,\ell(k),J(k)}(dx \mid X^-_{\lambda(k,\ell(k),J(k))})$ converges weakly to $P(dx \mid X^-)$.

After this brief outline we now proceed with a detailed development. Let $\{\ell_k\}_{k \ge 1}$ and $\{J_k\}_{k \ge 1}$ be two nondecreasing unbounded sequences of positive integers. We often write $\ell(k)$ and $J(k)$ instead of $\ell_k$ and $J_k$. For fixed $k \ge 1$ let $\{-\tau^k_j\}_{j \ge 0}$ and $\{\tilde\tau^k_j\}_{j \ge 0}$ denote the sequences of past and future recurrence times of the pattern $[X^-_{\ell(k)}]_k$. Thus we set $\tau^k_0 = \tilde\tau^k_0 = 0$ and for $j = 1, 2, \ldots$ we inductively define

$$\tau^k_j = \min\{ t > \tau^k_{j-1} : ([X_{-\ell_k - t}]_k, \ldots, [X_{-1-t}]_k) = ([X_{-\ell_k}]_k, \ldots, [X_{-1}]_k) \}, \tag{13}$$

$$\tilde\tau^k_j = \min\{ t > \tilde\tau^k_{j-1} : ([X_{-\ell_k}]_k, \ldots, [X_{-1}]_k) = ([X_{-\ell_k + t}]_k, \ldots, [X_{-1+t}]_k) \}. \tag{14}$$

The random variables $\tau(k,j) = \tau^k_j$ and $\tilde\tau(k,j) = \tilde\tau^k_j$ are finite almost surely by Poincaré's recurrence theorem for the quantized process $\{[X_t]_k\}$, cf. Theorem 6.4.1 of Gray [14]. The lengths $\lambda_k = \lambda(k)$ and estimates $\hat P_k(dx \mid X^-_{\lambda(k)})$ are now defined by the formulas

$$\lambda_k = \lambda(k) = \ell(k) + \tau(k, J_k), \tag{15}$$

$$\hat P_k(dx \mid X^-_{\lambda_k}) = \frac{1}{J_k} \sum_{1 \le j \le J_k} \delta_{X_{-\tau(k,j)}}(dx), \tag{16}$$

where $\delta_\xi(dx)$ is the Dirac measure that places unit mass at the point $\xi \in \mathcal{X}$.
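For a finite alphabet, where the quantizer is the identity and $\ell_k = k$, the backward search (13) and the empirical estimate (16) can be sketched as follows; the function name and interface are ours, not the paper's.

```python
from collections import Counter

def recurrence_estimate(xs, k, J):
    """Empirical conditional pmf in the spirit of (16): search backwards
    for the J most recent recurrences of the k-block (X_-k, ..., X_-1)
    and average point masses at the symbols X_-tau(k,j) that followed.

    xs lists (..., X_-2, X_-1) with the most recent symbol last.
    Returns (pmf dict, search depth lambda_k = k + tau(k, J)), or None
    if the observed segment is too short to contain J recurrences.
    """
    n = len(xs)
    pattern = tuple(xs[-k:])
    taus = []
    t = 1
    while len(taus) < J:
        if k + t > n:                                   # ran out of data
            return None
        if tuple(xs[n - k - t: n - t]) == pattern:      # block at lag t matches
            taus.append(t)
        t += 1
    counts = Counter(xs[n - tau] for tau in taus)       # next symbols X_-tau
    pmf = {x: c / J for x, c in counts.items()}
    return pmf, k + taus[-1]
```

On the alternating sequence $0,1,0,1,\ldots$ the pattern $(0,1)$ is always followed by $0$, and the returned pmf puts all its mass there.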
Thus for any Borel set $B$, the conditional probability estimate

$$\hat P_k\{X \in B \mid X^-_{\lambda_k}\} = \frac{1}{J_k} \sum_{1 \le j \le J_k} 1\{X_{-\tau(k,j)} \in B\} \tag{17}$$

is obtained by searching for the $J_k$ most recent occurrences of the pattern $[X^-_{\ell(k)}]_k$ and calculating the relative frequency with which the next realized symbols $X_{-\tau(k,j)}$ hit the set $B$. We shall prove that $\hat P_k(dx \mid X^-_{\lambda(k)})$ is a weakly consistent estimate of $P(dx \mid X^-)$. The precise statement and the proof are broken down into two parts.

Theorem 1A. For any set $B$ in the generating field $\bigcup_k \mathcal{B}_k$ and any stationary process distribution $P \in \mathcal{P}_s$ we have mean convergence

$$\lim_k \hat P_k\{X \in B \mid X^-_{\lambda_k}\} = P\{X \in B \mid X^-\} \quad \text{in } L_1(P). \tag{18}$$

The proof is somewhat technical and is placed in the Appendix. In the second part we argue that the estimators $\hat P_k(dx \mid X^-_{\lambda(k)})$ can be employed to infer the regression function $E\{h(X) \mid X^-\} = \int h(x)\, P(dx \mid X^-)$ of any bounded continuous function $h(x)$ given the past.

Theorem 1B. Let $\{X_t\}$ be a real-valued stationary time series. If the fields $\mathcal{B}_k$ are generated by intervals and the estimator $\hat P_k(dx \mid X^-_{\lambda(k)})$ is defined as in (16) then for any bounded continuous function $h(x)$ on $\mathcal{X}$,

$$\lim_k \int h(x)\, \hat P_k(dx \mid X^-_{\lambda_k}) = \int h(x)\, P(dx \mid X^-) \quad \text{in } L_1(P). \tag{19}$$

Proof: Pick some bound $M$ such that $|h(x)| \le M$ on $\mathcal{X}$. Given $\epsilon > 0$ there exists an integer $\kappa$ and a finite interval $K$ in the field $\mathcal{B}_\kappa$ such that

$$P\{X \in K\} > 1 - \epsilon/M. \tag{20}$$

If necessary we increase $\kappa$ until $\kappa$ is sufficiently large so that there exists a $\mathcal{B}_\kappa$-measurable function $g(x)$ such that $|h(x) - g(x)| \le \epsilon$ on $K$. Assuming $g(x) = 0$ outside $K$, we have

$$|h(x) - g(x)| \le f(x) = \epsilon\, 1\{x \in K\} + M\, 1\{x \notin K\}. \tag{21}$$

Let $\hat P_k$ and $P^-$ be shorthand for $\hat P_k(dx \mid X^-_{\lambda(k)})$ and $P(dx \mid X^-)$.
Then

$$\left| \int h\, d\hat P_k - \int h\, dP^- \right| \le \int |h - g|\, d\hat P_k + \left| \int g\, d\hat P_k - \int g\, dP^- \right| + \int |g - h|\, dP^-. \tag{22}$$

The function $g(x)$ is a finite linear combination of indicator functions of $\mathcal{B}_\kappa$-measurable subsets, and Theorem 1A implies that $\int g\, d\hat P_k$ converges to $\int g\, dP^-$ in $L_1$:

$$E\left| \int g\, d\hat P_k - \int g\, dP^- \right| \to 0. \tag{23}$$

The function $f(x)$ is $\mathcal{B}_\kappa$-measurable and bounded, hence $\int f\, d\hat P_k$ converges to $\int f\, dP^-$ in $L_1$ and the expectations converge:

$$E \int f\, d\hat P_k \to E \int f\, dP^- = E f. \tag{24}$$

Since $|h - g| \le f$ and $E f \le \epsilon\, P\{X \in K\} + M\, P\{X \notin K\} < 2\epsilon$ by (20) and (21), it follows from (22), (23) and (24) that

$$E\left| \int h\, d\hat P_k - \int h\, dP^- \right| \le 2\epsilon + \epsilon + 2\epsilon \quad \text{eventually for large } k. \tag{25}$$

Thus $E\left| \int h\, d\hat P_k - \int h\, dP^- \right| \to 0$, and this is the desired conclusion (19).

Theorem 1B holds in general if $\mathcal{X}$ is a $\sigma$-compact Polish space and the fields $\mathcal{B}_k$ are suitably chosen. Indeed, let $\{K_k\}_{k \ge 1}$ be an increasing sequence of compact subsets with union $\bigcup_k K_k = \mathcal{X}$. For any fixed $k$ one may cover $K_k$ with a finite collection of open balls having diameter less than $\epsilon_k$, where $\epsilon_k \searrow 0$ as $k \to \infty$. Let $\mathcal{B}_k$ denote the smallest field containing $\mathcal{B}_{k-1}$ and the sets $B \cap K_k$ where $B$ ranges over all balls in the finite cover of $K_k$. (We start with the trivial field $\mathcal{B}_0 = \{\emptyset, \mathcal{X}\}$.) Any bounded continuous function $h(x)$ on $\mathcal{X}$ is uniformly continuous on each compact subset of $\mathcal{X}$. If $|h(x)| \le M$ and $\epsilon > 0$, then for sufficiently large $\kappa$ there exists some compact subset $K$ in $\mathcal{B}_\kappa$ such that $P\{X \notin K\} \le \epsilon/M$ and $h(x)$ oscillates less than $\epsilon$ on each atom of $\mathcal{B}_\kappa$ that is contained in $K$. Thus there exists a $\mathcal{B}_\kappa$-measurable function $g(x)$ such that $|h(x) - g(x)| < \epsilon$ on $K$ and $g(x) = 0$ outside $K$. We can then proceed as in the proof of Theorem 1B to prove that for any bounded continuous function $h(x)$,

$$\int h(x)\, \hat P_k(dx \mid X^-_{\lambda(k)}) \to E\{h(X) \mid X^-\} \quad \text{in } L_1. \tag{26}$$

III. Truncation of the Search Depth

The estimates $\hat P_k(dx \mid X^-_{\lambda(k)})$ are based on finite but random-length segments of the past. We shall transform these into estimates $\hat P(dx \mid X^-_n)$ that depend on finite past segments with deterministic length but that still are weakly consistent. The details are somewhat more involved than for the strongly consistent estimates in Section I. In terms of the empirical conditional distribution $\hat P_{k,\ell,J}(dx \mid X^-_{\lambda(k,\ell,J)})$ that was defined in the outline of Section II, the question is how fast $k$, $\ell$ and $J$ may increase with $n$ so that $\lambda(k(n), \ell(n), J(n)) \le n$ with high probability. The weak consistency of the estimates $\hat P_{k(n),\ell(n),J(n)}(dx \mid X^-_{\lambda(k(n),\ell(n),J(n))})$ will not suffer if we redefine the estimates by assigning some default measure $Q(dx)$ in those rare cases when the search depth $\lambda(k(n), \ell(n), J(n))$ exceeds the available record length $n$. It is difficult to say what the optimal growth path is for $k(n)$, $\ell(n)$ and $J(n)$ without prior information about the spatial and temporal dependency structure of the process.

The special case of finite-alphabet processes is most interesting and it is simpler because only 2 of the 3 parameters $k, \ell, J$ play a role. We do not need an index for subfields of $\mathcal{X}$ because the obvious choice for $\mathcal{B}_k$ is the field of all subsets of $\mathcal{X}$. Also, it is convenient to choose the block length $\ell_k$ equal to $k$ so that $\tau^k_j$ is the time for $j$ recurrences of $X^-_k$. In Section A we recall the ergodic theorem for recurrence times that was derived by Wyner and Ziv [34] and by Ornstein and Weiss [24] for finite-alphabet processes.
In Section B we define conditional probability mass function estimates $\hat P(x \mid X^-_n)$ and we prove consistency in mean if the block length $k(n)$ and the sample size $J_{k(n)}$ grow deterministically and sufficiently slowly with $n$. In Section C we discuss generalizations for real-valued processes.

A. Recurrence Times

Let $\{X_t\}$ be a stationary ergodic process with values in a finite set $\mathcal{X}$. Starting at time $\tau^k_0 = 0$, the successive recurrence times $\tau^k_j$ of the $k$-block $X^-_k$ are defined as follows:

$$\tau^k_j = \inf\{ t > \tau^k_{j-1} : (X_{-k-t}, \ldots, X_{-1-t}) = (X_{-k}, \ldots, X_{-1}) \}. \tag{27}$$

If $P\{X^-_k = x^-_k\} > 0$ then by the results of Kac [17] (see also Willems [33], Wyner and Ziv [34]),

$$E\{\tau^k_1 \mid X^-_k = x^-_k\} = \frac{1}{P\{X^-_k = x^-_k\}}. \tag{28}$$

Let $H$ denote the entropy rate of the stationary ergodic process $\{X_t\}$ in bits per symbol:

$$H = \lim_k -\frac{1}{k} E\{\log P(X^k)\} = \lim_k -\frac{1}{k} E\{\log P(X^-_k)\}. \tag{29}$$

Wyner and Ziv [34], Theorem 3, invoked Kac's result and the Shannon-McMillan-Breiman theorem to prove that $\tau^k_1$ cannot grow faster than exponentially with limiting rate $H$ ($\limsup_k k^{-1} \log \tau^k_1 \le H$ almost surely). Ornstein and Weiss [24] then argued that $\tau^k_1$ will grow exponentially fast almost surely with limiting rate exactly equal to $H$:

$$k^{-1} \log \tau^k_1 \to H \quad \text{almost surely.} \tag{30}$$

Now suppose a sample of size $J_k$ is desired. The total time needed to find $J_k = J(k) \ge 1$ instances of the pattern $X^-_k$ is equal to the recurrence time $\tau^k_{J(k)}$. The ratio $\tau^k_{J(k)}/J_k$ can be interpreted as the average inter-recurrence time:

$$\frac{\tau^k_{J(k)}}{J_k} = \frac{1}{J_k} \sum_{1 \le j \le J_k} (\tau^k_j - \tau^k_{j-1}). \tag{31}$$

We claim that, like $\tau^k_1$, the average inter-recurrence time $\tau^k_{J(k)}/J_k$ cannot grow faster than exponentially with limiting rate $H$.
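Before turning to the proof, Kac's formula (28) is easy to probe numerically. The following Monte Carlo sketch is ours, not from the paper: it draws i.i.d. Bernoulli($p$) sequences, conditions on a fixed $k$-block at the end, and averages the first recurrence time $\tau^k_1$, which should come out near $1/P\{X^-_k = x^-_k\} = p^{-k}$ for the all-ones block.

```python
import random

def first_recurrence_time(xs, k):
    """First recurrence time tau_1 of the k-block at the end of xs, as in (27);
    returns None if the block does not recur inside the observed segment."""
    pattern = xs[-k:]
    for t in range(1, len(xs) - k + 1):
        if xs[len(xs) - k - t: len(xs) - t] == pattern:
            return t
    return None

def kac_check(p=0.5, k=3, trials=4000, horizon=1000, seed=0):
    """Average tau_1 over i.i.d. Bernoulli(p) realizations whose final
    k-block is all ones; Kac's formula (28) predicts a mean near p**-k."""
    rng = random.Random(seed)
    target = [1] * k
    total, hits = 0, 0
    for _ in range(trials):
        xs = [1 if rng.random() < p else 0 for _ in range(horizon)]
        if xs[-k:] == target:
            tau = first_recurrence_time(xs, k)
            if tau is not None:       # discard the rare truncated searches
                total += tau
                hits += 1
    return total / hits
```

With $p = 1/2$ and $k = 3$ the predicted mean is $2^3 = 8$, and the simulated average lands close to it.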
The proof is based on Kac's result and the lemma that was developed by Algoet and Cover [3] to give a simple proof of the Shannon-McMillan-Breiman theorem and a more general ergodic theorem for the maximum exponential growth rate of compounded capital invested in a stationary market.

Theorem 2. Let $\{X_t\}$ be a stationary ergodic process with values in a finite set $\mathcal{X}$ and with entropy rate $H$ bits per symbol. If $\Delta_k = \Delta(k)$ is a sequence of numbers such that $\sum_k 2^{-\Delta(k)} < \infty$, then for arbitrary $J(k) = J_k > 0$ we have

$$\log\left( \frac{\tau^k_{J(k)}}{J_k} \right) \le -\log P(X^-_k) + \Delta_k \quad \text{eventually for large } k, \tag{32}$$

and consequently

$$\limsup_k \frac{1}{k} \log\left( \frac{\tau^k_{J(k)}}{J_k} \right) \le H \quad \text{almost surely.} \tag{33}$$

Proof: The inter-recurrence times $\tau^k_j - \tau^k_{j-1}$ are identically distributed with the same conditional distribution given $X^-_k$ as the first recurrence time $\tau^k_1$. By Kac's result,

$$E\{\tau^k_{J(k)} \mid X^-_k\}\, P(X^-_k) = J_k\, E\{\tau^k_1 \mid X^-_k\}\, P(X^-_k) = J_k. \tag{34}$$

(A referee pointed out that a result like this was also proved by Gavish and Lempel [13].) Thus the random variable $Z_k = P(X^-_k)\, \tau^k_{J(k)}/J_k$ has expectation

$$E\{Z_k\} = E\left\{ P(X^-_k)\, E\left( \frac{\tau^k_{J(k)}}{J_k} \,\middle|\, X^-_k \right) \right\} = 1. \tag{35}$$

By the Markov inequality,

$$P\{\log Z_k > \Delta_k\} = P\{Z_k > 2^{\Delta_k}\} \le 2^{-\Delta_k} E\{Z_k\} = 2^{-\Delta_k}, \tag{36}$$

and by the Borel-Cantelli lemma $\log Z_k \le \Delta_k$ eventually for large $k$. This proves (32). Assertion (33) follows from (32) upon dividing both sides by $k$ and taking the limsup as $k \to \infty$. Indeed, $-k^{-1} \log P(X^-_k) \to H$ almost surely by the Shannon-McMillan-Breiman theorem and one may choose $\Delta_k = 2 \log k$ so that $\Delta_k/k \to 0$.

It is worthwhile to observe that Theorem 2 can be generalized if the process $\{X_t\}$ is stationary but not necessarily ergodic. Let $P$ be a stationary distribution and let $P_\omega$ denote the ergodic mode of the actual process realization $\omega$.
Then by the ergodic decomposition theorem (see Theorem 7.4.1 of Gray [14]) and the monotone convergence theorem,

$$\begin{aligned}
P\{X^-_k = x^-_k\}\, E\{\tau^k_{J(k)} \mid X^-_k = x^-_k\}
&= \sum_{1 \le t < \infty} t\, P\{X^-_k = x^-_k,\ \tau^k_{J(k)} = t\} \\
&= \sum_{1 \le t < \infty} \int t\, P_\omega\{X^-_k = x^-_k,\ \tau^k_{J(k)} = t\}\, P(d\omega) \\
&= \int \sum_{1 \le t < \infty} t\, P_\omega\{X^-_k = x^-_k,\ \tau^k_{J(k)} = t\}\, P(d\omega) \\
&= \int P_\omega\{X^-_k = x^-_k\}\, E_\omega\{\tau^k_{J(k)} \mid X^-_k = x^-_k\}\, P(d\omega) \\
&= \int J_k\, P(d\omega) = J_k.
\end{aligned} \tag{37}$$

It follows that $E\{P(X^-_k)\, \tau^k_{J(k)}\} = J_k$ and

$$\log(\tau^k_{J(k)}/J_k) \le -\log P(X^-_k) + \Delta_k \quad \text{eventually for large } k. \tag{38}$$

The Shannon-McMillan-Breiman theorem for stationary nonergodic processes asserts that $P(X^-_k)$ decreases exponentially fast with limiting rate $H(P_\omega)$, so one may conclude that

$$\limsup_k \frac{1}{k} \log\left( \frac{\tau^k_{J(k)}}{J_k} \right) \le H(P_\omega) \quad \text{almost surely.} \tag{39}$$

Thus the average inter-recurrence time $\tau^k_{J(k)}/J_k$ cannot grow faster than exponentially with limiting rate $H(P_\omega)$, the entropy rate of the ergodic mode $P_\omega$.

B. Conditional Probability Mass Function Estimates

In the finite-alphabet case, the general estimator $\hat P_k(dx \mid X^-_{\lambda(k)})$ that was defined in (16) reduces to the conditional probability mass function estimate

$$\hat P_k(x \mid X^-_{\lambda(k)}) = \frac{1}{J_k} \sum_{1 \le j \le J_k} 1\{X_{-\tau(k,j)} = x\}. \tag{40}$$

Here $k = \ell_k$ is the block length and the sample size $J_k$ is monotonically increasing. The recurrence times $\tau^k_j$ of the $k$-block $X^-_k$ were defined inductively for $j = 1, 2, 3, \ldots$ in (27). We choose a slowly increasing sequence of block lengths $k(n)$ and set $\hat P(x \mid X^-_n)$ equal to $\hat P_{k(n)}(x \mid X^-_{\lambda(k(n))})$ if this estimate can be computed from the available data segment $X^-_n$. Otherwise, if $\lambda_{k(n)} > n$, we truncate the search and define $\hat P(x \mid X^-_n)$ as the default measure $Q(x) = 1/|\mathcal{X}|$.
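The truncation rule just described can be sketched as follows; this is a minimal illustration with hypothetical names, and the admissible growth of $k(n)$ and $J_{k(n)}$ is the subject of Theorem 3 below.

```python
def truncated_estimate(xs, alphabet, k, J):
    """Conditional pmf estimate with truncated search: the empirical pmf
    built from the J most recent recurrences of the k-block if the search
    depth lambda(k) = k + tau(k, J) fits inside the n observed symbols,
    otherwise the uniform default Q(x) = 1/|alphabet|."""
    n = len(xs)
    pattern = xs[-k:]
    taus = []
    for t in range(1, n - k + 1):
        if xs[n - k - t: n - t] == pattern:
            taus.append(t)
            if len(taus) == J:
                break
    if len(taus) < J:                       # lambda(k) > n: truncate the search
        return {x: 1 / len(alphabet) for x in alphabet}
    counts = {x: 0 for x in alphabet}
    for t in taus:
        counts[xs[n - t]] += 1              # next symbol X_-tau(k,j)
    return {x: c / J for x, c in counts.items()}
```

When the data segment is too short to supply $J$ recurrences, the estimator falls back to the uniform measure rather than returning a partial average.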
Thus for $n \ge 0$, we define

$$\hat P(x \mid X^-_n) = \begin{cases} \hat P_{k(n)}(x \mid X^-_{\lambda(k(n))}) & \text{if } \lambda(k(n)) \le n, \\ Q(x) & \text{otherwise.} \end{cases} \tag{41}$$

If $k(n)$ grows sufficiently slowly then truncation is a rare event and $\hat P(x \mid X^-_n)$ coincides most of the time with the weakly consistent estimator $\hat P_{k(n)}(x \mid X^-_{\lambda(k(n))})$. The question is how fast the block length $k(n)$ and the sample size $J_{k(n)}$ may grow to get consistent estimates. To answer this question, we use our results about recurrence times.

The inter-recurrence times $\tau^k_j - \tau^k_{j-1}$ have the same conditional distribution and hence the same conditional expectation given $X^-_k$ as the first recurrence time $\tau^k_1$. The expected inter-recurrence time is bounded as follows:

$$E\left\{ \frac{\tau^k_{J(k)}}{J_k} \right\} = E\{\tau^k_1\} = \sum_{x^-_k :\, P\{X^-_k = x^-_k\} > 0} P\{X^-_k = x^-_k\}\, E\{\tau^k_1 \mid X^-_k = x^-_k\} \le |\mathcal{X}|^k. \tag{42}$$

If $\epsilon_k > 0$ then by the Markov inequality

$$P\left\{ \frac{\tau^k_{J(k)}}{J_k} > \frac{|\mathcal{X}|^k}{\epsilon_k} \right\} \le \epsilon_k. \tag{43}$$

If $\epsilon_k \to 0$ then $P\{\tau^k_{J(k)} > J_k |\mathcal{X}|^k/\epsilon_k\} \to 0$, and if $\sum_k \epsilon_k < \infty$ then $\tau^k_{J(k)} \le J_k |\mathcal{X}|^k/\epsilon_k$ eventually for large $k$ by the Borel-Cantelli lemma. This is similar to (32) with $\epsilon_k = 2^{-\Delta(k)}$. Since $\lambda(k) = k + \tau^k_{J(k)}$, we see that

$$P\{\lambda(k(n)) \le n\} \to 1 \quad \text{as } n \to \infty \tag{44}$$

if $J_k$ and $k(n)$ are chosen so that for some $\epsilon_k > 0$ with $\epsilon_k \to 0$,

$$k(n) + J_{k(n)} |\mathcal{X}|^{k(n)}/\epsilon_{k(n)} \le n \quad \text{eventually for large } n. \tag{45}$$

It suffices that $k(n) = (1 - \epsilon) \log_{|\mathcal{X}|} n$ for some $0 < \epsilon < 1$ and $J_k = o(|\mathcal{X}|^{k\epsilon/(1-\epsilon)})$ so that $J_{k(n)} = o(n^\epsilon)$. (Noninteger values are rounded down to the nearest integer, as usual.) We can be slightly more aggressive.

Theorem 3. Let $\{X_t\}$ be a stationary process with values in a finite set $\mathcal{X}$ and choose $Q(x) = |\mathcal{X}|^{-1}$ as default measure in (41).
If the block length $k(n)$ and the sample size $J_{k(n)}$ are monotonically increasing to infinity and satisfy

$$J_{k(n)} |\mathcal X|^{k(n)} = O(n), \qquad (46)$$

then the estimates $\hat P(x \mid X^-_n)$ in (41) are consistent in mean:

$$\hat P(x \mid X^-_n) \to P(x \mid X^-) \quad\text{in } L^1(P). \qquad (47)$$

In particular, the estimates $\hat P(x \mid X^-_n)$ are consistent in mean if the block length is $k(n) = (1-\epsilon)\log_{|\mathcal X|} n$ and the sample size is $J_{k(n)} = n^\epsilon$ for some $0 < \epsilon < 1$.

Proof: If the entropy rate $H$ is strictly less than $\log |\mathcal X|$ and $R$ is any constant such that $H < R < \log |\mathcal X|$, then by (33), $\tau^k_{J(k)}$ is asymptotically bounded by $J_k 2^{Rk}$. It follows that

$$\tau^{k(n)}_{J(k(n))} \le J_{k(n)} 2^{R k(n)} \le J_{k(n)} |\mathcal X|^{k(n)} 2^{(R - \log|\mathcal X|) k(n)} = o(n). \qquad (48)$$

It is necessary for (46) that $k(n) < \log_{|\mathcal X|} n$ eventually for large $n$, since $J_{k(n)} \to \infty$ by assumption. Thus $\lambda(k(n)) = k(n) + \tau^{k(n)}_{J(k(n))} = o(n)$, and $\lambda(k(n))$ is upper bounded by $n$ eventually for large $n$. If $H = \log |\mathcal X|$ then there is no guarantee that we can collect $J_{k(n)}$ samples from $X^-_n$, but the estimate $\hat P(x \mid X^-_n)$ will nevertheless be consistent in mean if the default measure is $Q(x) = |\mathcal X|^{-1}$, because the outcomes $X_t$ happen to be independent and identically distributed according to this distribution $Q(x)$ when $H = \log |\mathcal X|$.

The estimates $\hat P_k(x \mid X^-_{\lambda(k)})$ in (40) are consistent in the pointwise sense under certain conditions. For example, if $\{X_t\}$ is a stationary finite-state Markov chain of order $K$, then the empirical estimates $\hat P_k(x \mid X^-_{\lambda(k)})$ are averages of bounded random variables $1\{X_{-\tau(k,j)} = x\}$ ($j = 1, 2, \ldots, J_k$) that are conditionally independent and identically distributed given $X^-_K$ when $k \ge K$.
It follows that the estimates $\hat P_k(x \mid X^-_{\lambda(k)})$ converge exponentially fast in the number of samples $J_k$ to the conditional probability $P\{x \mid X^-_K\} = P\{x \mid X^-\}$, and therefore the estimates are pointwise consistent. It is not known whether the estimates $\hat P_k(x \mid X^-_{\lambda(k)})$ converge in the pointwise sense for all finite-alphabet stationary time series.

If we know the entropy rate $H$ in advance, we can make use of it. In this case, weak consistency is guaranteed if $k(n) = (1-\epsilon)(\log n)/R$ for some $R > H$ and $J_{k(n)} = n^\epsilon$. Indeed, if $H < r < R$ then $\lambda(k(n)) < n$ eventually for large $n$, since

$$\lambda(k(n)) = k(n) + \tau^{k(n)}_{J(k(n))} \le k(n) + J_{k(n)} 2^{r k(n)} = O(\log n) + n^\epsilon n^{(1-\epsilon) r/R} = o(n). \qquad (49)$$

If the entropy rate is not known in advance, then we must be prepared to deal with the worst case of nearly maximum entropy rate. The estimates will be wasteful if the entropy rate is low, because they exploit only a small portion of the available data segment $X^-_n$ when $H < \log |\mathcal X|$. If $k(n) = (1-\epsilon)\log_{|\mathcal X|} n$ and $J_{k(n)} = n^\epsilon$, then the length of the useful portion is about

$$\tau^{k(n)}_{J(k(n))} \approx J_{k(n)} 2^{H k(n)} = n^{\epsilon + (1-\epsilon) H/\log|\mathcal X|} = n^\alpha, \qquad (50)$$

where $\alpha = \epsilon + (1-\epsilon) H/\log|\mathcal X|$ varies linearly between $\epsilon < \alpha \le 1$ as $0 < H \le \log|\mathcal X|$.

The length $\lambda(k) = k + \tau^k_{J(k)}$ of the data record $X^-_{\lambda(k)}$ that must be examined to collect $J_k$ samples of the pattern $X^-_k$ grows approximately like $J_k 2^{Hk}$, which is polynomial in $J_k$ if $J_k$ grows exponentially fast with $k$. Also, the length $n$ of the segment $X^-_n$ is just polynomial in the sample size $J_{k(n)}$ if $J_{k(n)} = n^\epsilon$. The strongly consistent estimates of Morvai, Yakowitz and Györfi [21] are much less efficient: they collect $J$ samples from a data record whose length grows like a tower of exponentials in (7).
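To make the finite-alphabet construction concrete, here is a minimal Python sketch of the estimator in the spirit of (40)-(41), together with the schedule $k(n) = (1-\epsilon)\log_{|\mathcal X|} n$, $J_{k(n)} = n^\epsilon$ of Theorem 3. The function names and the simple backward scan are our own; the paper defines the recurrence times inductively in (27), and this sketch merely searches the past for the first $J$ recurrences of the most recent $k$-block.

```python
import math
from collections import Counter

def schedule(n, alphabet_size, eps=0.5):
    """Block length k(n) ~ (1-eps) log_{|X|} n and sample size J ~ n^eps,
    rounded down to integers as in Theorem 3."""
    k = max(1, int((1 - eps) * math.log(max(n, 2), alphabet_size)))
    J = max(1, int(n ** eps))
    return k, J

def estimate_conditional_pmf(past, k, J, alphabet):
    """Empirical conditional pmf: scan the past (X_{-n}, ..., X_{-1}) for
    the first J recurrences of the most recent k-block and tally the
    symbol that followed each recurrence.  If J recurrences cannot be
    found, truncate and return the uniform default measure Q(x) = 1/|X|."""
    n = len(past)
    uniform = {a: 1.0 / len(alphabet) for a in alphabet}
    if k == 0 or k > n:
        return uniform
    block = tuple(past[n - k:])                # the k-block X_{-k}^{-1}
    samples = []
    t = 1                                      # candidate recurrence lag
    while len(samples) < J and t + k <= n:
        if tuple(past[n - t - k:n - t]) == block:
            samples.append(past[n - t])        # symbol following the match
        t += 1
    if len(samples) < J:
        return uniform                         # truncation event
    counts = Counter(samples)
    return {a: counts[a] / J for a in alphabet}
```

On a deterministic alternating sequence the estimator pins down the next symbol from a handful of recurrences; in general, of course, its accuracy is only guaranteed in the weak ($L^1$) sense of Theorem 3.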
Their samples are very sparse because extremely stringent demands are placed on the context where those samples are taken. For the weakly consistent estimates of the present study, the demands on context are much less severe, and so the samples are much more abundant, although perhaps less trustworthy. Thus universal prediction is not as hopelessly out of computational reach as it might seem for an algorithm whose input demands grow as a tower of exponentials in (7).

C. Weak Consistency for Real-valued Processes

When $\mathcal X$ is the real line or a $\sigma$-compact Polish space, the estimate $\hat P_k(dx \mid X^-_{\lambda(k)})$ is defined by the formula in (16). We now choose a nondecreasing unbounded sequence $k(n)$ and define $\hat P(dx \mid X^-_n)$ as the empirical conditional distribution $\hat P_{k(n)}(dx \mid X^-_{\lambda(k(n))})$ if this estimate can be computed from the available data segment $X^-_n$. Otherwise, if $\lambda_{k(n)} > n$, we truncate the search and define $\hat P(dx \mid X^-_n)$ as some default measure $Q(dx)$. Thus

$$\hat P(dx \mid X^-_n) = \begin{cases} \hat P_{k(n)}(dx \mid X^-_{\lambda_{k(n)}}) & \text{if } \lambda_{k(n)} \le n, \\ Q(dx) & \text{otherwise.} \end{cases} \qquad (51)$$

If $k(n)$ grows slowly then truncation is rare and $\hat P(dx \mid X^-_n)$ coincides most of the time with the estimator $\hat P_{k(n)}(dx \mid X^-_{\lambda(k(n))})$, which is weakly consistent. The question is how slowly the partition index $k(n)$, the block length $\ell(k(n))$ and the sample size $J(k(n))$ must grow with $n$ to get consistent estimates of $P(dx \mid X^-)$. It suffices that $P\{\lambda(k(n)) < n\} \to 1$.

Theorem 4. Let $\{X_t\}$ be a real-valued stationary ergodic time series and choose $\mathcal B_k$, $\ell_k$ and $J_k$ as before. Let $\Xi_k$ denote the set of atoms of the finite field $\mathcal B_k$, and choose a nondecreasing unbounded sequence of integers $k(n)$ and numbers $\epsilon_k \to 0$ such that

$$n \ge \ell_{k(n)} + J_{k(n)} |\Xi_{k(n)}|^{\ell_{k(n)}} / \epsilon_{k(n)} \quad\text{eventually for large } n. \qquad (52)$$
Then $P\{n \ge \lambda_{k(n)}\} \to 1$ as $n \to \infty$, and the estimates $\hat P(dx \mid X^-_n)$ are weakly consistent: for every set $B$ in the generating field $\bigcup_k \mathcal B_k$ we have

$$\hat P\{X \in B \mid X^-_n\} \to P\{X \in B \mid X^-\} \quad\text{in } L^1(P), \qquad (53)$$

and for every bounded continuous function $h(x)$ we have

$$\int h(x)\, \hat P(dx \mid X^-_n) \to \int h(x)\, P(dx \mid X^-) \quad\text{in } L^1(P). \qquad (54)$$

Proof: The inter-recurrence times $\tau^k_j - \tau^k_{j-1}$ ($j = 1, 2, 3, \ldots$) are identically distributed conditionally given the pattern $[X^-_{\ell(k)}]_k$. By Kac's result,

$$E\{\tau^k_{J(k)} \mid [X^-_{\ell_k}]_k\} = J_k\, E\{\tau^k_1 \mid [X^-_{\ell_k}]_k\} = \frac{J_k}{P([X^-_{\ell_k}]_k)}. \qquad (55)$$

It follows that

$$E\{\tau^k_{J(k)}\} = J_k\, E\{\tau^k_1\} = \sum_{[x^-_{\ell(k)}]_k} P([x^-_{\ell(k)}]_k)\, E\{\tau^k_{J(k)} \mid [x^-_{\ell(k)}]_k\} \le J_k |\Xi_k|^{\ell(k)}. \qquad (56)$$

(The sum is taken over $[x^-_{\ell(k)}]_k$ such that $P([x^-_{\ell(k)}]_k) = P\{[X^-_{\ell(k)}]_k = [x^-_{\ell(k)}]_k\}$ is strictly positive.) By the Markov inequality,

$$P\{\lambda_k > \ell_k + J_k |\Xi_k|^{\ell(k)}/\epsilon_k\} = P\{\tau^k_{J(k)} > J_k |\Xi_k|^{\ell(k)}/\epsilon_k\} \le \frac{E\{\tau^k_{J(k)}\}}{J_k |\Xi_k|^{\ell(k)}/\epsilon_k} \le \epsilon_k. \qquad (57)$$

Assertions (53) and (54) follow from Theorems 1A and 1B because $P\{\ell_k + J_k |\Xi_k|^{\ell_k}/\epsilon_k \ge \lambda_k\} \to 1$ and hence, in view of assumption (52),

$$P\{n \ge \ell_{k(n)} + J_{k(n)} |\Xi_{k(n)}|^{\ell_{k(n)}}/\epsilon_{k(n)} \ge \lambda_{k(n)}\} \to 1 \quad\text{as } n \to \infty. \qquad (58)$$

This completes the proof of the theorem.

The theorem remains valid in the stationary nonergodic case. Indeed, let $P$ be a stationary distribution and let $P_\omega$ denote the ergodic mode of $\omega$. Then one may argue as above that $P_\omega\{\lambda_k \le \ell_k + J_k |\Xi_k|^{\ell(k)}/\epsilon_k\} \to 1$. By the ergodic decomposition theorem and Lebesgue's dominated convergence theorem,

$$
\begin{aligned}
\lim_k P\{\lambda_k \le \ell_k + J_k |\Xi_k|^{\ell(k)}/\epsilon_k\}
&= \lim_k \int P_\omega\{\lambda_k \le \ell_k + J_k |\Xi_k|^{\ell(k)}/\epsilon_k\}\, P(d\omega) \\
&= \int \lim_k P_\omega\{\lambda_k \le \ell_k + J_k |\Xi_k|^{\ell(k)}/\epsilon_k\}\, P(d\omega) \\
&= \int 1\, P(d\omega) = 1. \qquad (59)
\end{aligned}
$$
Thus the conclusions of the theorem also hold for stationary nonergodic processes.

IV. The Information Theoretic Point of View

In this section we discuss conditional distribution estimates $\hat P(dx \mid X^-_n)$ that are consistent in expected information divergence. Such estimates are also weakly consistent, but the converse is not necessarily true. It is possible to construct estimator sequences that are consistent in expected information divergence for all stationary processes with values in a finite alphabet, but not for all stationary processes with values in a countably infinite alphabet. There are connections with universal gambling or modeling schemes and with universal noiseless data compression algorithms for finite-alphabet processes. For more information on these subjects see Rissanen and Langdon [28] and Algoet [1].

A. Consistency in Expected Information Divergence

The Kullback-Leibler information divergence between two probability distributions $P$ and $Q$ on a measurable space $\mathcal X$ is defined as follows: if $P$ is dominated by $Q$ then

$$I(P|Q) = E_P\!\left\{\log \frac{dP}{dQ}\right\}, \qquad (60)$$

otherwise $I(P|Q) = \infty$. The variational distance is defined as

$$\|P - Q\| = \sup_{-1 \le h(x) \le 1} \left| \int h\, dP - \int h\, dQ \right|, \qquad (61)$$

where the supremum is taken over all measurable functions $h(x)$ such that $|h(x)| \le 1$. If $p = dP/d\mu$ and $q = dQ/d\mu$ are the densities of $P$ and $Q$ relative to a dominating $\sigma$-finite measure $\mu$, then $\|P - Q\| = \int |p - q|\, d\mu$. Exercise 17 on p. 58 of Csiszár and Körner [11] asserts that

$$\frac{\log e}{2}\, \|P - Q\|^2 \le I(P|Q). \qquad (62)$$

It follows that $I(P|Q) \ge 0$, with equality iff $P = Q$. Pinsker [26, pp. 13-15] proved the existence of a universal constant $\Gamma > 0$ such that

$$I(P|Q) \le E_P\!\left\{\left|\log \frac{dP}{dQ}\right|\right\} \le I(P|Q) + \Gamma \sqrt{I(P|Q)}. \qquad (63)$$
Barron [6] simplified Pinsker's argument and proved that the constant $\Gamma = \sqrt 2$ is best possible when natural logarithms are used in the definition of $I(P|Q)$.

Let $\{X_t\}$ be a stationary process with values in a complete separable metric space $\mathcal X$. The divergence between the true conditional distribution $P(dx \mid X^-)$ and an estimate $\hat P(dx \mid X^-_t)$ is a nonnegative function of the past $X^-$ which vanishes iff $P(dx \mid X^-) = \hat P(dx \mid X^-_t)$ $P$-almost surely. We say that the estimates $\hat P(dx \mid X^-_t)$ are consistent in information divergence for a class $\Pi$ of stationary distributions on $\mathcal X^{\mathbb Z}$ if for any $P \in \Pi$,

$$I(P_{X|X^-} \,|\, \hat P_{X|X^-_t}) \to 0 \quad P\text{-almost surely.} \qquad (64)$$

We say that $\hat P(dx \mid X^-_t)$ is consistent in expected information divergence for the class $\Pi$ if for any $P \in \Pi$,

$$E_P\{I(P_{X|X^-} \,|\, \hat P_{X|X^-_t})\} \to 0. \qquad (65)$$

Such estimates are weakly consistent for all distributions in the class $\Pi$. Indeed, if $h(x)$ is any bounded measurable function on $\mathcal X$ with norm $\|h\|_\infty = \sup_x |h(x)|$ then

$$\left| \int h(x)\, P(dx \mid X^-) - \int h(x)\, \hat P(dx \mid X^-_t) \right| \le \|h\|_\infty\, \|P_{X|X^-} - \hat P_{X|X^-_t}\|. \qquad (66)$$

Applying the Csiszár-Kemperman-Kullback inequality (62), we see that

$$\left| \int h(x)\, P(dx \mid X^-) - \int h(x)\, \hat P(dx \mid X^-_t) \right|^2 \le \frac{2 \|h\|_\infty^2}{\log e}\, I(P_{X|X^-} \,|\, \hat P_{X|X^-_t}). \qquad (67)$$

If $\hat P(dx \mid X^-_t)$ is consistent in expected information divergence for $\Pi$, then $\int h(x)\, \hat P(dx \mid X^-_t)$ converges in $L^2(P)$, and also in $L^1(P)$, to $\int h(x)\, P(dx \mid X^-)$ whenever $P \in \Pi$.

Suppose the outcomes $X_t$ are independent with identical distribution $P_X$ on $\mathcal X$.
Barron, Györfi and van der Meulen [7] have constructed estimates $\hat P(dx \mid X^-_t)$ that are consistent in information divergence and in expected information divergence when the true distribution $P_X$ has finite information divergence $I(P_X | M_X) < \infty$ relative to some known normalized reference measure $M_X$. Györfi, Páli and van der Meulen [16] assume that $\mathcal X$ is the countable set of integers and argue that for arbitrary conditional probability mass function estimates $\hat P(x \mid X^-_n)$, there exists some distribution $P_X$ with finite entropy such that

$$I(P_X \,|\, \hat P_{X|X^-_n}) = \infty \quad\text{almost surely for all } n. \qquad (68)$$

Therefore, it is impossible to construct estimates $\hat P(dx \mid X^-_t)$ that are consistent in information divergence or in expected information divergence for all independent identically distributed processes with values in an infinite space.

For stationary processes with values in a finite alphabet, the constructions of Ornstein [22] and Morvai, Yakowitz and Györfi [21] yield estimates $\hat P(x \mid X^-_t)$ such that $\log \hat P(x \mid X^-_t)$ converges almost surely to $\log P(x \mid X^-)$. It is still an open question whether these estimates are consistent in information divergence or whether modifications are needed to get such consistency. (The difficulty is that small changes in $\hat P(x \mid X^-_n)$ cause huge changes in $\log \hat P(x \mid X^-_n)$ when $\hat P(x \mid X^-_n)$ is small.) However, it is easy to construct estimates $\hat P(x \mid X^-_t)$ that are consistent in expected information divergence.

B. Consistent Estimates for Finite-alphabet Processes

Let $\{X_t\}$ be a stationary process with values in a finite set $\mathcal X$. We shall construct conditional probability mass function estimates $\hat P(x \mid X^-_n)$ that are consistent in expected information divergence for any stationary $P \in \mathcal P_s$.
Such estimates also converge to $P(x \mid X^-)$ in mean: for any stationary $P \in \mathcal P_s$ and $x \in \mathcal X$ we have

$$\hat P(x \mid X^-_n) \to P(x \mid X^-) \quad\text{in } L^1(P). \qquad (69)$$

An observation of Perez [25] implies that consistency in expected information divergence is equivalent to mean consistency of $\log \hat P(X \mid X^-_n)$.

Theorem 5. Let $\{X_t\}$ be a stationary process with values in a finite alphabet $\mathcal X$. A sequence of conditional probability mass function estimates $\hat P(x \mid X^-_n)$ is consistent in expected information divergence iff we have mean convergence

$$\log \hat P(X \mid X^-_n) \to \log P(X \mid X^-) \quad\text{in } L^1. \qquad (70)$$

Proof: Pinsker's inequality (63) for $P(x \mid X^-)$ and $\hat P(x \mid X^-_n)$ asserts that

$$I(P_{X|X^-} \,|\, \hat P_{X|X^-_n}) \le E\!\left\{ \left| \log \frac{P(X \mid X^-)}{\hat P(X \mid X^-_n)} \right| \,\middle|\, X^- \right\} \le I(P_{X|X^-} \,|\, \hat P_{X|X^-_n}) + \Gamma \sqrt{I(P_{X|X^-} \,|\, \hat P_{X|X^-_n})}. \qquad (71)$$

Taking expectations and using concavity of the square root function, we obtain

$$E\{I(P_{X|X^-} \,|\, \hat P_{X|X^-_n})\} \le E\left| \log \frac{P(X \mid X^-)}{\hat P(X \mid X^-_n)} \right| \le E\{I(P_{X|X^-} \,|\, \hat P_{X|X^-_n})\} + \Gamma \sqrt{E\{I(P_{X|X^-} \,|\, \hat P_{X|X^-_n})\}} \qquad (72)$$

by Jensen's inequality. This suffices to prove the theorem.

To construct the estimates $\hat P(x \mid X^-_n)$, we start with probability mass functions $Q(x^n)$ on the product spaces $\mathcal X^n$ such that for every stationary distribution $P$ on $\mathcal X^{\mathbb Z}$,

$$n^{-1}\, I(P_{X^n} | Q_{X^n}) \to 0 \quad\text{as } n \to \infty. \qquad (73)$$

Several methods are known for constructing such models $Q(x^n)$; see Section C below. By Pinsker's inequality, convergence of the means in (73) is equivalent to mean convergence

$$\frac{1}{n} \log \frac{P(X^n)}{Q(X^n)} \to 0 \quad\text{in } L^1(P). \qquad (74)$$

Let now $Q(x \mid x^t)$ denote a shifted copy of the conditional probability mass function $Q(x_t \mid x^t)$ that appears in the chain rule expansion $Q(x^n) = \prod_{0 \le t < n} Q(x_t \mid x^t)$.
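The divergence and variational distance in (60)-(62) are elementary to compute for finite-alphabet distributions. The following Python sketch (the helper names are our own) checks the Csiszár-Kemperman-Kullback bound (62) numerically; with natural logarithms the bound reads $\|P - Q\|^2/2 \le I(P|Q)$.

```python
import math

def kl_divergence(p, q):
    """I(P|Q) = sum_x p(x) log(p(x)/q(x)) in nats; infinite unless P << Q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf        # P not dominated by Q
            total += pi * math.log(pi / qi)
    return total

def variational_distance(p, q):
    """||P - Q|| = sum_x |p(x) - q(x)| for pmfs on a common finite alphabet."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

# Csiszar-Kemperman-Kullback bound (62), natural-log form:
#   ||P - Q||^2 / 2  <=  I(P|Q)
p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
assert variational_distance(p, q) ** 2 / 2 <= kl_divergence(p, q)
```

Barron's result quoted after (63) says the companion upper bound $E_P|\log(dP/dQ)| \le I(P|Q) + \Gamma\sqrt{I(P|Q)}$ holds with the best possible constant $\Gamma = \sqrt 2$ in the same natural-logarithm convention.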