Nonparametric inference for ergodic, stationary time series


Authors: G. Morvai, S. Yakowitz, L. Györfi

G. Morvai, S. Yakowitz, and L. Györfi: Nonparametric inference for ergodic, stationary time series. Ann. Statist. 24 (1996), no. 1, 370-379.

Abstract

The setting is a stationary, ergodic time series. The challenge is to construct a sequence of functions, each based on only finite segments of the past, which together provide a strongly consistent estimator for the conditional probability of the next observation, given the infinite past. Ornstein gave such a construction for the case that the values are from a finite set, and recently Algoet extended the scheme to time series with coordinates in a Polish space. The present study relates a different solution to the challenge. The algorithm is simple and its verification is fairly transparent. Some extensions to regression, pattern recognition, and on-line forecasting are mentioned.

1 Introduction

In this section, we give a brief overview of the situation with respect to nonparametric inference under the most lenient mixing conditions. Impetus for this line of study follows Roussas (1969) and Rosenblatt (1970), who extended ideas in the nonparametric regression literature for i.i.d. variables to give a theory adequate for showing, for example, that for $\{X_i\}$ a real Markov sequence, under Doeblin-like assumptions, the obvious kernel forecaster is an asymptotically normal estimator of the conditional expectation $E(X_0 \mid X_{-1} = x)$. In the 1980's, there was an explosion of works which showed consistency in various senses for nonparametric autoregression and density estimators under more and more general mixing assumptions (e.g., Castellana and Leadbetter (1986), Collomb (1985), Györfi (1981), and Masry (1986)). The monograph by Györfi et al. (1989) gives supplemental information about nonparametric estimation for dependent series.

Such striving for generality stems from the inconvenience of mixing conditions; satisfactory statistical tests are not available. Some recent developments have succeeded in disposing of these conditions altogether. In the Markov case, aside from some smoothness assumptions, it is enough that an invariant law exist to get the usual pointwise asymptotic normality of kernel regression (Yakowitz (1989)). In case of Harris recurrence but no invariant law, one can still attain a.s. pointwise convergence of a nearest-neighbor regression algorithm in which the neighborhood is chosen in advance and observations continue until a prescribed number of points fall into that neighborhood (Yakowitz (1993)). Pushing beyond the Markov hypothesis, by a histogram estimate (Györfi et al. (1989)) or a recursive-type estimator (Györfi and Masry (1990)), one can infer the marginal density of an ergodic stationary time series provided only that there exist an absolutely continuous transition density. Here the limit may have been attained; it is now known (Györfi et al. (1989) and Györfi and Lugosi (1992), respectively) that without the conditional density assumption, the histogram estimator and the kernel and recursive kernel estimates for the marginal density are not generally consistent.

The situation with respect to (auto-)regression is more inclusive for ergodic, stationary sequences.
In a landmark paper, following developments by Ornstein (1978) for the case that the time series values are from a finite set, Algoet (1992, §5) has provided, for time series with values in a Polish space, a data-driven distribution function construction $F_n(x \mid X_{-1}, X_{-2}, \dots)$ which a.s. converges in distribution to

$$P(X_0 \le x \mid X_{-1}, X_{-2}, \dots) = P(X_0 \le x \mid X^-),$$

where $X^- = (X_{-1}, X_{-2}, \dots)$. The goal of the present study is to relate a simpler rule whose consistency is easy to establish. In concluding sections, it is noted that as a result of these developments one has a consistent regression estimate in the bounded time-series case, and implications for problems of pattern recognition and on-line forecasting are mentioned. It is to be conceded that our algorithm, as well as those of Algoet and Ornstein, can be expected to require very large data segments for acceptable precision.

As a final general comment, we note that the assumption of ergodicity may be relaxed somewhat. Thus, in view of Sections 7.4 and 8.5 of Gray (1988), one sees that a nonergodic stationary process has an ergodic decomposition. With probability one, a realization of the time series falls into an invariant event on which the process is ergodic and stationary. Then one may apply the developments of this study to that event as though it were the process universe. Thus the analysis here also remains valid for stationary nonergodic processes. Our analysis is restricted to the case that the coordinates of the time series are real, but it is evident that the proofs extend directly to the vector-valued case. In view of Theorem 2.2 of Billingsley (1968, p. 14), it will be clear that the formulas and derivations to follow also hold if the $X_i$'s are in a Polish space.

2 Estimation of conditional distributions

Let $X = \{X_n\}$ denote a real-valued, doubly infinite, stationary ergodic time series. Let $X_{-j}^{-1} = (X_{-j}, X_{-j+1}, \dots, X_{-1})$ be notation for a data segment into the $j$-past, where $j$ may be infinite. For a Borel set $C$ one wishes to infer the conditional probability

$$P(C \mid X^-) = P(X_0 \in C \mid X_{-\infty}^{-1}).$$

The algorithm to be promoted here is iterative on an index $k = 1, 2, \dots$ For each $k$, the data-driven estimate of $P(C \mid X^-)$ requires only a segment of finite (but random) length of $X^-$. One may proceed by simply repeating the estimation process for $k = 1, 2, \dots$, until a given finite data record no longer suffices for the demands of the algorithm. The goal of the study will be to show that a.s. convergence can be attained. That is, our estimation is strongly consistent in the topology of weak convergence.

The estimation algorithm is now revealed in the simple context of binary sequences; afterwards, we show the alterations necessary for more general processes. Define the sequences $\lambda_{k-1}$ and $\tau_k$ recursively ($k = 1, 2, \dots$). Put $\lambda_0 = 1$ and let $\tau_k$ be the time between the occurrence of the pattern $B(k) = (X_{-\lambda_{k-1}}, \dots, X_{-1}) = X_{-\lambda_{k-1}}^{-1}$ at time $-1$ and the last occurrence of the same pattern prior to time $-1$. More precisely, let

$$\tau_k = \min\{t > 0 : X_{-\lambda_{k-1}-t}^{-1-t} = X_{-\lambda_{k-1}}^{-1}\}.$$

Put $\lambda_k = \tau_k + \lambda_{k-1}$. The observed vector $B(k)$ a.s. takes a value having positive probability; thus by ergodicity, with probability 1 the string $B(k)$ must appear infinitely often in the sequence $X_{-\infty}^{-2}$. One denotes the $k$th estimate of $P(C \mid X^-)$ by $P_k(C)$, and defines it to be

$$P_k(C) = \frac{1}{k} \sum_{1 \le j \le k} 1_C(X_{-\tau_j}). \qquad (1)$$

Here $1_C$ is the indicator function for $C$.
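As an illustration, here is a minimal Python sketch of the binary-case recursion and estimate (1) for the event $C = \{1\}$, computed from a finite record of the past. The function and variable names are ours, not the authors'; stopping when the record is exhausted is one reasonable reading of the prescription above.

```python
def estimate_next_prob(past, k_max):
    """Binary-case sketch of estimate (1): P_k({1}) for P(X_0 = 1 | past).

    `past` lists X_{-n}, ..., X_{-1}, so past[-1] = X_{-1}.
    Returns (P_k, lambda_k) for the largest feasible k <= k_max.
    """
    n = len(past)
    lam = 1        # lambda_0 = 1
    hits = 0       # running sum of 1_C(X_{-tau_j}) with C = {1}
    k = 0
    while k < k_max:
        pattern = past[n - lam:]                     # B(k+1) = X_{-lam}^{-1}
        # tau = min{t > 0 : X_{-lam-t}^{-1-t} = X_{-lam}^{-1}}
        tau = next((t for t in range(1, n - lam + 1)
                    if past[n - lam - t:n - t] == pattern), None)
        if tau is None:       # pattern not found again: record exhausted
            break
        k += 1
        hits += past[n - tau]     # X_{-tau_k}, a 0/1 value
        lam += tau                # lambda_k = tau_k + lambda_{k-1}
    if k == 0:
        raise ValueError("record too short for a single recurrence")
    return hits / k, lam
```

Running this on a long simulated binary Markov chain gives a quick empirical check of the consistency asserted in Theorem 1 below.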
For the general case, we use a sub-sigma-field structure motivated by Algoet (1992, Section 5.2), which is more general. Let $\mathcal{P}_k = \{A_{k,i},\ i = 1, 2, \dots, m_k\}$ be a sequence of finite partitions of the real line by (finite or infinite) right semi-closed intervals such that $\sigma(\mathcal{P}_k)$ is an increasing sequence of finite $\sigma$-algebras that asymptotically generates the Borel $\sigma$-field. Let $G_k$ denote the corresponding quantizer: $G_k(x) = A_{k,i}$ if $x \in A_{k,i}$. The role of the feature vector in (1) is now played by the discrete quantity

$$B(k) = (G_k(X_{-\lambda_{k-1}}), \dots, G_k(X_{-1})) = G_k(X_{-\lambda_{k-1}}^{-1}).$$

Now

$$\tau_k = \min\{t > 0 : G_k(X_{-\lambda_{k-1}-t}^{-1-t}) = G_k(X_{-\lambda_{k-1}}^{-1})\}.$$

Again, ergodicity implies that $B(k)$ is almost surely to be found in the sequence $G_k(X_{-\infty}^{-2})$, and with this generalization of notation, the $k$th estimate of $P(C \mid X^-)$ is still provided by formula (1).

As in Algoet's construct, the estimate $P_k$ is calculated from observations of random size. Here the random sample size is $\lambda_k$. To obtain a fixed sample size $t > 0$ version, let $\kappa_t$ be the maximum of the integers $k$ for which $\lambda_k \le t$. Put

$$\hat{P}_{-t}(C) = P_{\kappa_t}(C). \qquad (2)$$

Theorem 1. Under the stationary ergodic assumption regarding $\{X_n\}$ and under the estimator constructs (1) and (2) described above,

$$\lim_{k \to \infty} P_k(\cdot) = P(\cdot \mid X^-) \quad a.s. \qquad (3)$$

and

$$\lim_{t \to \infty} \hat{P}_{-t}(\cdot) = P(\cdot \mid X^-) \quad a.s. \qquad (4)$$

in the weak topology of distributions.

Proof. To begin with, assume that for some $m$, $C \in \sigma(\mathcal{P}_m)$. The first chore is to show that a.s., $P_k(C) \to P(C \mid X^-)$. For $k > m$ we have that

$$P_k(C) - P(C \mid X^-) = \frac{1}{k} \sum_{1 \le j \le m} \Bigl[ 1_C(X_{-\tau_j}) - P\bigl(X_{-\tau_j} \in C \mid G_{j-1}(X_{-\lambda_{j-1}}^{-1})\bigr) \Bigr] + \frac{k-m}{k} \cdot \frac{1}{k-m} \sum_{m < j \le k} [\cdots]$$

[The remainder of this proof and the opening of Section 3 are missing from the source text. Section 3 introduces, for a time series bounded by $D$, the regression estimates $R_k$ and $\hat{R}_{-t}$, together with Corollary 1 and its claims (8) and (9) that these estimates converge a.s. to $E(X_0 \mid X^-)$, and the truncation function]

$$\varphi(x) = \begin{cases} D, & \text{if } x > D, \\ x, & \text{if } -D \le x \le D, \\ -D, & \text{if } x < -D. \end{cases}$$

Then

$$R_k = \int x \, P_k(dx) = \int \varphi(x) \, P_k(dx) \to \int \varphi(x) \, P(dx \mid X^-) = \int x \, P(dx \mid X^-) = E(X_0 \mid X^-)$$

because of Theorem 1 and the fact that convergence in distribution implies the convergence of integrals of the bounded continuous function $\varphi$ with respect to the actual distributions (Billingsley (1968)). Thus the proof of (8) is complete. The proof of (9) follows in the same way; just put $\hat{P}_{-t}$ in place of $P_k$.

The estimates $\hat{R}_{-t}$ converge almost surely to $E(X_0 \mid X^-)$ and are uniformly bounded, so $|\hat{R}_{-t} - E(X_0 \mid X_{-t}^{-1})| \to 0$ also in mean. Motivated by Bailey (1976), consider the estimator $\hat{R}_t(\omega) = \hat{R}_{-t}(T^t \omega)$, which is defined in terms of $(X_0, \dots, X_{t-1})$ in the same way as $\hat{R}_{-t}(\omega)$ was defined in terms of $(X_{-t}, \dots, X_{-1})$. ($T$ denotes the left shift operator.) The estimator $\hat{R}_t$ may be viewed as an on-line predictor of $X_t$. This predictor has special significance not only because of potential applications, but additionally because Bailey (1976) proved that it is impossible to construct estimators $\hat{R}_t$ such that always $\hat{R}_t - E(X_t \mid X_0^{t-1}) \to 0$ almost surely. An immediate consequence of Corollary 1 is that convergence in probability is verified. That is, the shift transformation $T$ is measure preserving, hence convergence $\hat{R}_{-t} - E(X_0 \mid X_{-t}^{-1}) \to 0$ in $L_1$ implies convergence $\hat{R}_t - E(X_t \mid X_0^{t-1}) \to 0$ in $L_1$ and in probability.
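The fixed-sample-size construction (2) and the regression functional $R_k$ lend themselves to a short sketch. Since the defining passage for $\hat{R}_{-t}$ is missing from the source, we take the reading $\hat{R}_{-t} = \int x \, \hat{P}_{-t}(dx) = \kappa_t^{-1} \sum_{j \le \kappa_t} X_{-\tau_j}$, consistent with (1), (2) and the proof of (9). The dyadic quantizer is merely one admissible choice of the partition sequence $\mathcal{P}_k$ for data in $[-D, D]$, and all names below are ours.

```python
def quantize(x, k, D=1.0):
    """Dyadic quantizer G_k on [-D, D]: 2**k right semi-closed cells.
    One admissible choice of the partition sequence P_k."""
    return min(max(int((x + D) * 2 ** k / (2 * D)), 0), 2 ** k - 1)

def fixed_sample_forecast(past, t, D=1.0):
    """Fixed-sample-size forecast R_hat_{-t} built from X_{-t}, ..., X_{-1}.

    Grows k while the recursion stays inside the record of length t
    (so the final k is kappa_t = max{k : lambda_k <= t}), and returns
    the empirical mean (1/kappa_t) * sum_{j <= kappa_t} X_{-tau_j}.
    """
    window = list(past[-t:])
    n = len(window)
    lam, k, total = 1, 0, 0.0
    while True:
        q = [quantize(x, k + 1, D) for x in window]   # symbols under G_{k+1}
        pattern = q[n - lam:]
        # tau_{k+1}: last recurrence of the quantized pattern before time -1
        tau = next((s for s in range(1, n - lam + 1)
                    if q[n - lam - s:n - s] == pattern), None)
        if tau is None:
            break                 # lambda_{k+1} would exceed the record
        k += 1
        lam += tau                # lambda_k = tau_k + lambda_{k-1} <= t
        total += window[n - tau]  # X_{-tau_k}
    if k == 0:
        raise ValueError("record too short for a single recurrence")
    return total / k
```

The on-line predictor $\hat{R}_t$ discussed above is the same computation applied to $(X_0, \dots, X_{t-1})$, e.g. fixed_sample_forecast(x[0:t], t).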
4 Pattern recognition

Consider the 2-class pattern recognition problem with $d$-dimensional feature vector $X_0$ and binary-valued label $Y_0$. Let $D^- = (X_{-\infty}^{-1}, Y_{-\infty}^{-1})$ be the data. In conventional pattern recognition problems $(X_0, Y_0)$ and $D^-$ are independent, so the best possible decision based on $X_0$ and based on $(X_0, D^-)$ are the same. Here assume that $\{(X_i, Y_i)\}$ is a doubly infinite stationary and ergodic sequence. The classification problem is to decide on $Y_0$ for given data $(X_0, D^-)$ in order to minimize the probability of misclassification. The Bayes decision $g^*$ is the best possible one. Let $\eta(X_0, D^-)$ be the a posteriori probability of $Y_0 = 1$ (regression function):

$$\eta(X_0, D^-) = P(Y_0 = 1 \mid X_0, D^-) = E(Y_0 \mid X_0, D^-).$$

Then $g^*(X_0, D^-) = 1$ if $\eta(X_0, D^-) \ge 1/2$ and $0$ otherwise. For an arbitrary approximation $\eta_k = \eta_k(X_0, D^-)$ put $g_k = g_k(X_0, D^-) = 1$ if $\eta_k \ge 1/2$ and $0$ otherwise. Then it is easy to see (cf. Devroye and Györfi (1985), Chapter 10) that

$$0 \le P(g_k \ne Y_0 \mid X_0, D^-) - P(g^*(X_0, D^-) \ne Y_0 \mid X_0, D^-) \le 2\,|\eta_k - \eta(X_0, D^-)|. \qquad (10)$$

The estimation is a slight modification of (1). Define the sequences $\lambda_{k-1}$ and $\tau_k$ recursively ($k = 1, 2, \dots$). Put $\lambda_0 = 1$ and let $\tau_k$ be the time between the occurrence of the pattern

$$B(k) = (G_k(X_{-\lambda_{k-1}}), Y_{-\lambda_{k-1}}, \dots, G_k(X_{-1}), Y_{-1}, G_k(X_0))$$

at time $0$ and the last occurrence of the same pattern in $D^-$. More precisely,

$$\tau_k = \min\{t > 0 : G_k(X_{-\lambda_{k-1}-t}^{-t}) = G_k(X_{-\lambda_{k-1}}^{0}),\ Y_{-\lambda_{k-1}-t}^{-1-t} = Y_{-\lambda_{k-1}}^{-1}\}.$$

Put $\lambda_k = \tau_k + \lambda_{k-1}$. The observed vector $B(k)$ a.s. takes a value of positive probability; thus by ergodicity $B(k)$ has occurred with probability 1. One denotes the $k$th estimate of $\eta(X_0, D^-)$ by $\eta_k$, and defines it to be

$$\eta_k = \frac{1}{k} \sum_{1 \le j \le k} Y_{-\tau_j}. \qquad (11)$$

Corollary 2. Under the stationary ergodic assumption regarding the process $\{(X_n, Y_n)\}$ and under the estimator construct (11) described above,

$$P(g_k \ne Y_0 \mid X_0, D^-) \to P(g^*(X_0, D^-) \ne Y_0 \mid X_0, D^-) \quad a.s. \qquad (12)$$

Proof. Because of (10), we get (12) from $\eta_k \to \eta(X_0, D^-)$ a.s., the proof of which is similar to the proof of Theorem 1.

Remark. It is also possible to construct a version of this estimate with fixed sample size $t > 0$ in the same way as in (2) and (7).
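The following sketch implements estimate (11) and the plug-in rule $g_k$, again with our own naming, a scalar feature standing in for the $d$-dimensional case, and the dyadic quantize helper from the previous sketch assumed to be in scope. Patterns now interleave quantized features with labels and include the quantized current feature $G_k(X_0)$; the label at the matching position serves as the proxy observation $Y_{-\tau_k}$.

```python
def classify(x0, xs_past, ys_past, k_max, D=1.0):
    """Sketch of estimate (11) and the plug-in rule g_k = 1{eta_k >= 1/2}.

    xs_past[-j] = X_{-j} (scalar features), ys_past[-j] = Y_{-j} (0/1 labels),
    x0 = X_0. Returns (g_k, eta_k).
    """
    n = len(xs_past)
    lam, k, label_sum = 1, 0, 0
    while k < k_max:
        q = [quantize(x, k + 1, D) for x in xs_past]
        q0 = quantize(x0, k + 1, D)
        # B(k+1) = (G(X_{-lam}), Y_{-lam}, ..., G(X_{-1}), Y_{-1}, G(X_0))
        pattern = (tuple(q[n - lam:]), tuple(ys_past[n - lam:]), q0)
        tau = None
        for t in range(1, n - lam + 1):
            shifted = (tuple(q[n - lam - t:n - t]),
                       tuple(ys_past[n - lam - t:n - t]),
                       q[n - t])          # G(X_{-t}) plays the role of G(X_0)
            if shifted == pattern:
                tau = t
                break
        if tau is None:
            break
        k += 1
        label_sum += ys_past[n - tau]     # Y_{-tau_k}, the proxy label
        lam += tau                        # lambda_k = tau_k + lambda_{k-1}
    if k == 0:
        raise ValueError("record too short for a single recurrence")
    eta = label_sum / k
    return (1 if eta >= 0.5 else 0), eta
```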
5 Appendix

In the sequel, we use the notation of Section 2.

Lemma 1. Under the stationary ergodic assumption regarding $\{X_n\}$, for $j = 1, 2, \dots$,

$$P(X_{-\tau_j} \in C \mid G_{j-1}(X_{-\lambda_{j-1}}^{-1})) = P(X_0 \in C \mid G_{j-1}(X_{-\lambda_{j-1}}^{-1})).$$

Proof. First of all, note that by definition,

$$\sigma(G_{j-1}(X_{-\lambda_{j-1}}^{-1})) = \mathcal{F}_{j-1} = \sigma(\{G_{j-1}(X_{-m}^{-1}) = b_{-m}^{-1},\ \lambda_{j-1} = m\};\ b_{-m}^{-1},\ m = 1, 2, \dots),$$

where $b_{-m}^{-1}$ is an $m$-vector of sets from the finite partition $\mathcal{P}_{j-1}$. Note also that

$$B = \{G_{j-1}(X_{-m}^{-1}) = b_{-m}^{-1},\ \lambda_{j-1} = m\}$$

are the (countably many) generating atoms of $\mathcal{F}_{j-1}$, so we have to show that for any atom $B$ the following equality holds:

$$P(B \cap \{X_{-\tau_j} \in C\}) = P(B \cap \{X_0 \in C\}).$$

Since $\lambda_{j-1}$ is a stopping time and $B$ is an $m$-dimensional cylinder set, $b_{-m}^{-1}$ determines whether $\lambda_{j-1} \ne m$ (in which case $B = \emptyset$ and the statement is trivial) or $\lambda_{j-1} = m$, and then $B = \{G_{j-1}(X_{-m}^{-1}) = b_{-m}^{-1}\}$. For $j = 1, 2, \dots$ let

$$\tilde{\tau}_j = \min\{t > 0 : G_j(X_{-\lambda_{j-1}+t}^{-1+t}) = G_j(X_{-\lambda_{j-1}}^{-1})\}.$$

Now

$$T^{-l}\bigl[B \cap \{\tau_j = l,\ X_{-l} \in C\}\bigr]$$
$$= T^{-l}\bigl[\{G_{j-1}(X_{-m}^{-1}) = b_{-m}^{-1},\ G_j(X_{-m-l}^{-1-l}) = G_j(X_{-m}^{-1}),\ G_j(X_{-m-t}^{-1-t}) \ne G_j(X_{-m}^{-1}),\ 0 < t < l,\ X_{-l} \in C\}\bigr]$$
$$= \{G_{j-1}(X_{-m+l}^{-1+l}) = b_{-m}^{-1},\ G_j(X_{-m}^{-1}) = G_j(X_{-m+l}^{-1+l}),\ G_j(X_{-m-t+l}^{-1-t+l}) \ne G_j(X_{-m+l}^{-1+l}),\ 0 < t < l,\ X_0 \in C\}$$
$$= \{G_{j-1}(X_{-m+l}^{-1+l}) = b_{-m}^{-1},\ G_j(X_{-m}^{-1}) = G_j(X_{-m+l}^{-1+l}),\ G_j(X_{-m+t}^{-1+t}) \ne G_j(X_{-m+l}^{-1+l}),\ 0 < t < l,\ X_0 \in C\}$$
$$= \{G_{j-1}(X_{-m}^{-1}) = b_{-m}^{-1},\ G_j(X_{-m}^{-1}) = G_j(X_{-m+l}^{-1+l}),\ G_j(X_{-m+t}^{-1+t}) \ne G_j(X_{-m}^{-1}),\ 0 < t < l,\ X_0 \in C\}$$
$$= B \cap \{\tilde{\tau}_j = l,\ X_0 \in C\},$$

where $T$ denotes the left shift operator. By stationarity, it follows that

$$P(B \cap \{X_{-\tau_j} \in C\}) = \sum_{l=1}^{\infty} P(B \cap \{\tau_j = l,\ X_{-l} \in C\}) = \sum_{l=1}^{\infty} P\bigl(T^{-l}[B \cap \{\tau_j = l,\ X_{-l} \in C\}]\bigr) = \sum_{l=1}^{\infty} P(B \cap \{\tilde{\tau}_j = l,\ X_0 \in C\}) = P(B \cap \{X_0 \in C\}),$$

and the proof of Lemma 1 is complete.

Acknowledgements

The authors thank P. Algoet for his comments, suggestions, and encouragement. Suggestions by the referees have been helpful. The second author's work has been supported, in part, by NIH grant No. R01 A129426.

References

[1] Ash, R. (1972). Real Analysis and Probability. Academic Press.
[2] Algoet, P. (1992). Universal schemes for prediction, gambling and portfolio selection. Annals of Probability, 20, pp. 901-941.
[3] Azuma, K. (1967). Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 37, pp. 357-367.
[4] Bailey, D. (1976). Sequential Schemes for Classifying and Predicting Ergodic Processes. Ph.D. thesis, Stanford University.
[5] Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.
[6] Castellana, J. V., and Leadbetter, M. R. (1986). On smoothed probability density estimation for stationary processes. Stochastic Processes and their Applications, 21, pp. 179-193.
[7] Collomb, G. (1985). Nonparametric time series analysis and prediction: uniform almost sure convergence. Statistics, 2, pp. 197-307.
[8] Cover, T. (1975). Open Problems in Information Theory. 1975 IEEE-USSR Joint Workshop on Information Theory, pp. 35-36.
[9] Devroye, L. and Györfi, L. (1985). Nonparametric Density Estimation: The L_1 View. Wiley, New York.
[10] Gray, R. (1988). Probability, Random Processes, and Ergodic Properties. Springer-Verlag, New York.
[11] Györfi, L. (1981). Strong consistent density estimation from ergodic sample. J. Multivariate Analysis, 11, pp. 81-84.
[12] Györfi, L., Haerdle, W., Sarda, P., and Vieu, Ph. (1989). Nonparametric Curve Estimation from Time Series. Springer-Verlag, Berlin.
[13] Györfi, L. and Lugosi, G. (1992). Kernel density estimation from ergodic sample is not universally consistent. Computational Statistics and Data Analysis, 14, pp. 437-442.
[14] Györfi, L. and Masry, E. (1990). The L_1 and L_2 strong consistency of recursive kernel density estimation from time series. IEEE Trans. on Information Theory, 36, pp. 531-539.
[15] Masry, E. (1986). Recursive probability density estimation for weakly dependent stationary processes. IEEE Trans. Information Theory, IT-32, pp. 249-254.
[16] Ornstein, D. (1978). Guessing the next output of a stationary process. Israel J. of Math., 30, pp. 292-296.
[17] Rosenblatt, M. (1970). Density estimates and Markov sequences. In Nonparametric Techniques in Statistical Inference, M. Puri, Ed. Cambridge University Press, London, pp. 199-210.
[18] Roussas, G. (1969). Nonparametric estimation of the transition distribution of a Markov process. Annals of Inst. Statist. Math., 21, pp. 73-87.
[19] Stout, W. F. (1974). Almost Sure Convergence. Academic Press, New York.
[20] Yakowitz, S. (1989). Nonparametric density and regression estimation for Markov sequences without mixing assumptions. J. Multivariate Analysis, 30, pp. 124-136.
[21] Yakowitz, S. (1993). Nearest neighbor regression estimation for null-recurrent Markov time series. Stochastic Processes and their Applications, 37, pp. 311-318.
