Strongly consistent nonparametric forecasting and regression for stationary ergodic sequences
Let $\{(X_i,Y_i)\}$ be a stationary ergodic time series with $(X,Y)$ values in the product space $\R^d\bigotimes \R .$ This study offers what is believed to be the first strongly consistent (with respect to pointwise, least-squares, and uniform dista…
Authors: S. Yakowitz, L. Gyorfi, J. Kieffer
Strongly-Consistent Nonparametric Forecasting and Regression for Stationary Ergodic Sequences

Sidney Yakowitz, László Györfi, John Kieffer and Gusztáv Morvai

J. Multivariate Anal. 71 (1999), no. 1, 24–41.

Abstract

Let $\{(X_i, Y_i)\}$ be a stationary ergodic time series with $(X, Y)$ values in the product space $\mathbb{R}^d \otimes \mathbb{R}$. This study offers what is believed to be the first strongly consistent (with respect to pointwise, least-squares, and uniform distance) algorithm for inferring $m(x) = E[Y_0 \mid X_0 = x]$ under the presumption that $m(x)$ is uniformly Lipschitz continuous. Auto-regression, or forecasting, is an important special case, and as such our work extends the literature of nonparametric, nonlinear forecasting by circumventing customary mixing assumptions. The work is motivated by a time series model in stochastic finance and by perspectives of its contribution to the issues of universal time series estimation.

1 Introduction

Nonparametric regression has been applied to a variety of contexts, in particular to time series modeling and prediction. The present study contributes to the methodology by showing how a regression function can be consistently inferred from time series data under no process assumptions beyond stationarity and ergodicity. (A Lipschitz condition on the regression function itself will be imposed.)

Toward showing how our methodology can impinge on an established research area, we give one substantive application to a practical problem in stochastic finance: Many works, such as the chapter entitled "Some Recent Developments in Investment Research" of the prominent text [5], argue for the need to move beyond the Black-Scholes stochastic differential equation. This and other studies suggest the so-called ARCH and GARCH extensions as a promising direction. The review of this approach by Bollerslev et al.
[6] cites a litany of unresolved issues. Of particular relevance is the discussion of the need to account for persistency of the variance (Sections 2.6 and 3.6). (ARCH and GARCH models can be long-range dependent for certain ranges of parameters. In these cases, statistical analysis is delicate [8].) The basic idea behind the ARCH/GARCH setup is that one must allow the asset volatility (variance) to change dynamically, and perhaps (GARCH) to depend on current and past volatility values. The review [6] documents (p. 30) that several authors have applied nonparametric and semiparametric regression, with some success, to infer the ARCH functions from data. These methods can fail if fairly stringent mixing conditions are not in force.

Masry and Tjostheim [21], because of their rigorous consideration of consistency, set the stage for appreciating the potential of the present investigation. They propose that both the asset dynamics and volatility of a nonlinear ARCH series be inferred from nonparametric classes of regression functions. By imposing some fairly severe assumptions, which would be tricky to validate from data, these authors are able to assure that the ARCH process is strongly mixing (with exponentially decreasing parameter) and consequently standard kernel techniques are applicable.

On another avenue toward asset series modelling, decades ago, Mandelbrot suggested that fractal processes should be considered in this context. Fractals have been of interest to theorists and modellers alike in part because they can display persistency. In his 1999 study, "A Multifractal Walk down Wall Street," [20] Mandelbrot argues that conventional models for portfolio theory ignore soaring volatility, and that this is akin to a mariner ignoring the possibility of a typhoon on the basis of the observation that weather is moderate 95% of the time.
Such persistence as exhibited in the models of finance calls into question whether various processes of interest are actually strongly mixing, a consistency requirement for conventional nonparametric regression techniques. We mention parenthetically that telecommunications modelers are increasingly turning toward long-range-dependent processes (e.g., [28] and [37]).

As mentioned, the primary contribution of the present paper is an algorithm which is demonstrably consistent without imposition of mixing assumptions. The implication is that process assumptions such as in [21] are not required for our algorithm. The price paid for this flexibility is that convergence rates and asymptotic normality cannot be assured. This avenue is worthy of exploration, nevertheless, because the limits of process inference are clarified, and as a practical matter, future work might lead to methods which are reasonably efficient if the process does satisfy mixing assumptions, but simultaneously assure convergence when mixing fails.

The algorithm is of the series-expansion type. The foundational idea (after Kieffer [17]) is that sometimes it is possible to bound the error of ignoring the series tail, and additionally assure that the leading coefficients are consistently estimated. Specific constructs are given for a partition-type estimator (Section 2) and for a kernel series (Section 3).

We close this introduction with a survey of the literature of nonparametric estimation for stationary series without mixing hypotheses. Let $Y$ be a real-valued random variable and let $X$ be a $d$-dimensional random vector (i.e., the observation or covariate). We do not assume anything about the distribution of $X$.
As is customary in regression and forecasting, the main aim of the analysis here is to minimize the mean-squared error:

$$\min_f E\big((f(X) - Y)^2\big)$$

over some space of real-valued functions $f(\cdot)$ defined on the range of $X$. This minimum is achieved by the regression function $m(x)$, which is defined to be the conditional expectation of $Y$ given $X$:

$$m(x) = E(Y \mid X = x), \tag{1}$$

assuming the expectation is well-defined, i.e., if $E|Y| < \infty$. For each measurable function $f$ one has

$$\begin{aligned}
E\big((f(X) - Y)^2\big) &= E\big((m(X) - Y)^2\big) + E\big((m(X) - f(X))^2\big) \\
&= E\big((m(X) - Y)^2\big) + \int (m(x) - f(x))^2 \mu(dx),
\end{aligned}$$

where $\mu$ stands for the distribution of the observation $X$. The second term on the right hand side is called the excess error or integrated squared error for the function $f$, which is given the notation

$$J(f) = \int (m(x) - f(x))^2 \mu(dx). \tag{2}$$

Clearly, the mean squared error for $f$ is close to that of the optimal regression function only if the excess error $J(f)$ is close to 0.

With respect to the statistical problem of regression estimation, let $(X_1, Y_1), \dots, (X_n, Y_n), \dots$ be a stationary ergodic time series with marginal component denoted as $(X, Y)$. We study pointwise, $L_2(\mu)$, and $L_\infty$ convergence of the regression estimate $m_n$ to $m$. The estimator $m_n$ is called weakly universally consistent if $J(m_n) \to 0$ in probability for all distributions of $(X, Y)$ with $E|Y|^2 < \infty$. In the context of independent identically-distributed (i.i.d.) pairs $(X, Y)$, Stone [35] first pointed out in 1977 that there exist weakly universally consistent estimators. Similarly, $m_n$ is called strongly universally consistent if $J(m_n) \to 0$ a.s. for all distributions of $(X, Y)$ with $E|Y|^2 < \infty$.
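The decomposition of the mean-squared error into $E((m(X) - Y)^2)$ plus the excess error $J(f)$ of (2) can be checked numerically. The sketch below does so under purely illustrative assumptions that are not taken from the paper: an i.i.d. sample, $X$ uniform on $[-1, 1]$, $m(x) = \sin x$, Gaussian noise, and a hypothetical candidate predictor $f(x) = 0.8x$.

```python
import numpy as np

def excess_error_demo(n=200_000, seed=0):
    """Monte Carlo estimates of E(f(X)-Y)^2, E(m(X)-Y)^2 and J(f)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)     # assumed design distribution mu
    m = np.sin(x)                          # assumed regression function m(x)
    y = m + rng.normal(0.0, 0.5, size=n)   # Y = m(X) + noise, so E[Y | X] = m(X)
    f = 0.8 * x                            # hypothetical candidate predictor f(x)
    mse_f = np.mean((f - y) ** 2)          # mean-squared error of f
    mse_m = np.mean((m - y) ** 2)          # error of the regression function itself
    j_f = np.mean((m - f) ** 2)            # excess error J(f)
    return mse_f, mse_m, j_f

mse_f, mse_m, j_f = excess_error_demo()
# The two sides agree up to Monte Carlo error in the cross term.
print(mse_f, mse_m + j_f)
```

The gap between the two printed numbers is the sample average of the cross term $2(f(X) - m(X))(m(X) - Y)$, which has mean zero because $E[Y \mid X] = m(X)$.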
Following pioneering papers by Roussas [31] and Rosenblatt [30], a large body of literature has accumulated on consistency and asymptotic normality when the samples are correlated. In developments below, we will employ the notation $x_m^n = (x_m, \dots, x_n)$, presuming that $m \le n$.

The theory of nonparametric regression is of significance in time series analysis because, by considering samples $\{(X_{n-q}^n, X_{n+1})\}$ in place of the pairs $\{(X_{n-q}^n, Y_n)\}$, the regression problem is transformed into the forecasting (or auto-regression) problem. Thus, in forecasting, we are asking for the conditional expectation of the next observation, given the $q$-past, with $q$ a positive integer, or perhaps infinity.

As mentioned, nearly all the works on consistent statistical methods for forecasting hypothesize mixing conditions, which are assumptions about how quickly dependency attenuates as a function of time separation of the observables. Under a variety of mixing assumptions, kernel and partitioning estimators are consistent, and have attractive rate properties. The monograph by Györfi et al. [14] gives a coverage of the literature of nonparametric inference for dependent series. In that work, the partition estimator is shown to be strongly consistent, provided $|Y|$ is a.s. bounded, under $\phi$-mixing and, with some provisos, under $\alpha$-mixing.

A drawback to much of the literature on nonparametric forecasting is that mixing conditions are unverifiable by available statistical procedures. Consequently, some investigators have examined the problem:

Let $\{X_i\}$ be a real vector-valued stationary ergodic sequence. Find a forecasting algorithm which is provably consistent in some sense.
Of course, some additional hypotheses regarding smoothness of the auto-regression function and moment properties of the variables will be allowed, but additional assumptions about attenuation of dependency are ruled out. A forecasting algorithm for $m(X_{-p}^{-1}) = E[X_0 \mid X_{-p}^{-1}]$ here means a rule giving a sequence $\{m_n\}$ of numbers such that for each $n$, $m_n$ is a measurable function determined entirely by the data segment $X_{-n}^{-1}$.

For $X$ binary, Ornstein [27] provided a (complicated) strongly-consistent estimator of $E[X_0 \mid X_{-\infty}^{-1}]$. Algoet [1] extended this approach to achieve convergence over real-valued time series and in this and [2], connected the universal forecasting problem with fundamental issues in portfolio and gambling analysis as well as data compression. Morvai et al. [22] offered another algorithm achieving strong consistency in the above sense. Their algorithm is easy to describe and analyze, and such analysis shows, unfortunately, that its data requirements make it infeasible [23].

On the negative side, Bailey [4] and Ryabko [32] have proven that even over binary processes, there is no strongly consistent estimator for the dynamic problem of inferring $E[X_{n+1} \mid X_0^n]$, $n = 0, 1, 2, \dots$.

We mention that for a real vector-valued Markov series with a stationary transition law, a strongly-consistent estimator is available for inferring $m(x) = E[X_0 \mid X_{-1} = x]$ under the hypothesis that the sequence is Harris recurrent [38]. Admittedly this is a dependency condition, but the marginal (i.e., invariant) law need not exist: positive recurrence is not hypothesized. It is difficult to imagine a Markov condition weaker than Harris recurrence under which statistical inference is assured.
It is to be noted that there are weakly-consistent estimators for the moving regression problem $E[X_{n+1} \mid X_0^n]$, $n = 0, 1, 2, \dots$. It turns out that universal coding algorithms (e.g. [39]) of the information theory literature can be converted to weakly-universally consistent algorithms when the coordinate space is finite. Morvai et al. [25] have given a weakly-consistent (and potentially computationally feasible) regression estimator for the moving regression problem when $X$ takes values from the set of real numbers. That work offers a synopsis of the literature of weakly consistent estimation for stationary and ergodic time series. All the studies we have cited on consistency without mixing assumptions rely on algorithms which do not fall into any of the traditional classes (partitioning, kernel, nearest neighbor) mentioned in connection with i.i.d. regression.

From this point on, $\{(X_i, Y_i)\}$ will represent a time series with $(X, Y)$ values in $\mathbb{R}^d \otimes \mathbb{R}$ which is stationary and ergodic, and such that $E|Y_i| < \infty$. In Section 2, we establish by means of a variation on the partitioning method, that we have a.s. convergence pointwise, and, in the case of bounded support, in uniform distance, provided that the regression function $m(x) = E[Y_0 \mid X_0 = x]$ satisfies a Lipschitz condition and a bound on the Lipschitz constant is known in advance. If furthermore $|Y|$ is known to be bounded (but perhaps the bound itself is not known), then our algorithm converges in $L_2(\mu)$. Section 3 provides analogous results for a truncated kernel-type estimate.

In summary, we miss our goal of pointwise strong universal consistency only in that we must restrict attention to regression functions satisfying a uniform Lipschitz condition and the user must have a bound to the Lipschitz constant. From counter-examples in Györfi et al.
[16] one sees that some restrictions are needed.

Recently we have obtained an important preprint by Nobel et al. [26] which bears similarities with the present investigation. That study gives an algorithm for the long-standing problem of density estimation of the marginal of a stationary sequence. Somewhat analogous to our conditions, Nobel et al. require that the density function be of bounded variation. The algorithm itself is based on different principles from the present paper. In the paper [24] by G. Morvai, S. Kulkarni, and A. Nobel, the ideas in [26] were extended for regression estimation.

2 Truncated partitioning estimation

Let $(X_i, Y_i)_{i=1}^{\infty}$ be an ergodic stationary random sequence with $E|Y| < \infty$. Now we attack the problem of estimating the regression function $m(x)$ by combining partitioning estimation with a series expansion.

Let $\mathcal{P}_k = \{A_{k,i},\ i = 1, 2, \dots\}$ be a nested cubic partition of $\mathbb{R}^d$ whose cells have volume $(2^{-k-2})^d$. Define $A_k(x)$ to be the partition cell of $\mathcal{P}_k$ into which $x$ falls. Take

$$M_k(x) := E(Y \mid X \in A_k(x)). \tag{3}$$

One can show that

$$M_k(x) \to m(x) \tag{4}$$

for $\mu$-almost all $x \in \mathbb{R}^d$. (To see this, notice that $\{M_k(X), \sigma(A_k(X)),\ k = 1, 2, \dots\}$ is a martingale, $E|Y| < \infty$ implies $\sup_{k=1,2,\dots} E|M_k(X)| < \infty$, and hence the martingale convergence theorem can be applied to achieve the desired result (4), cf. Ash [3] pp. 292.) For $k \ge 2$ let

$$\Delta_k(x) = M_k(x) - M_{k-1}(x). \tag{5}$$

Our analysis is motivated by the representation

$$m(x) = M_1(x) + \sum_{k=2}^{\infty} \Delta_k(x) = \lim_{k \to \infty} M_k(x) \tag{6}$$

for $\mu$-almost all $x \in \mathbb{R}^d$. Now let $L > 0$ be an arbitrary positive number. For integer $k \ge 2$ define

$$\Delta_{k,L}(x) = \operatorname{sign}(M_k(x) - M_{k-1}(x)) \min(|M_k(x) - M_{k-1}(x)|, L 2^{-k}). \tag{7}$$

Define

$$m_L(x) := M_1(x) + \sum_{i=2}^{\infty} \Delta_{i,L}(x).$$
(8)

Notice that $|\Delta_{i,L}(x)| \le L 2^{-i}$, and hence $m_L(x)$ is well defined for all $x \in S$, where $S$ stands for the support of $\mu$ defined as

$$S := \{x \in \mathbb{R}^d : \mu(A_k(x)) > 0 \text{ for all } k \ge 1\}. \tag{9}$$

By Cover and Hart [7], $\mu(S) = 1$.

The crux of the truncated partitioning estimate is inference of the terms $M_1(x)$ and $\Delta_{i,L}(x)$ for $i = 2, 3, \dots$ in (8). Define

$$\hat M_{k,n}(x) := \frac{\sum_{j=1}^{n} Y_j 1_{\{X_j \in A_k(x)\}}}{\sum_{j=1}^{n} 1_{\{X_j \in A_k(x)\}}}. \tag{10}$$

If $\sum_{j=1}^{n} 1_{\{X_j \in A_k(x)\}} = 0$, then take $\hat M_{k,n}(x) = 0$. Now for $k \ge 2$, define

$$\hat\Delta_{k,n,L}(x) = \operatorname{sign}(\hat M_{k,n}(x) - \hat M_{k-1,n}(x)) \min(|\hat M_{k,n}(x) - \hat M_{k-1,n}(x)|, L 2^{-k}) \tag{11}$$

and for $N_n$ a non-decreasing unbounded sequence of positive integers, define the estimator

$$\hat m_{n,L}(x) = \hat M_{1,n}(x) + \sum_{k=2}^{N_n} \hat\Delta_{k,n,L}(x). \tag{12}$$

Theorem 1 Let $\{(X_i, Y_i)\}$ be a stationary ergodic time series with $E|Y_i| < \infty$. Assume $N_n \to \infty$. Then almost surely, for all $x \in S$,

$$\hat m_{n,L}(x) \to m_L(x). \tag{13}$$

If the support $S$ of $\mu$ is a bounded subset of $\mathbb{R}^d$ then almost surely

$$\sup_{x \in S} |\hat m_{n,L}(x) - m_L(x)| \to 0. \tag{14}$$

If either (i) $|Y| \le D < \infty$ almost surely ($D$ need not be known) or (ii) $\mu$ is of bounded support then

$$\int (\hat m_{n,L}(x) - m_L(x))^2 \mu(dx) \to 0. \tag{15}$$

Proof First we prove that almost surely, for all $x \in S$, and for all $k \ge 1$,

$$\lim_{n \to \infty} |\hat M_{k,n}(x) - M_k(x)| = 0. \tag{16}$$

By the ergodic theorem, as $n \to \infty$, a.s.,

$$\frac{\sum_{j=1}^{n} 1_{\{X_j \in A_{k,i}\}}}{n} \to P(X \in A_{k,i}) = \mu(A_{k,i}).$$

Similarly,

$$\frac{\sum_{j=1}^{n} 1_{\{X_j \in A_{k,i}\}} Y_j}{n} \to E(Y 1_{\{X \in A_{k,i}\}}) = \int_{A_{k,i}} m(z) \mu(dz),$$

which is finite since $E|Y|$ is finite. Since there are countably many $A_{k,i}$, almost surely, for all $A_{k,i} \in \cup_v \mathcal{P}_v$ for which $\mu(A_{k,i}) > 0$:

$$\frac{\sum_{j=1}^{n} 1_{\{X_j \in A_{k,i}\}} Y_j}{\sum_{j=1}^{n} 1_{\{X_j \in A_{k,i}\}}} \to E(Y \mid X \in A_{k,i}).$$
Since for each $x \in S$, $\mu(A_k(x)) > 0$ and for some index $i$, $A_k(x) = A_{k,i}$, we have proved (16). In particular, almost surely, for all $x \in S$, and for all $k \ge 2$,

$$\hat M_{1,n}(x) \to M_1(x) \tag{17}$$

and

$$\hat\Delta_{k,n,L}(x) \to \Delta_{k,L}(x). \tag{18}$$

Let integer $R > 1$ be arbitrary. Let $n$ be so large that $N_n > R$. For all $x \in S$,

$$\begin{aligned}
|\hat m_{n,L}(x) - m_L(x)| &\le |\hat M_{1,n}(x) - M_1(x)| + \sum_{k=2}^{N_n} |\hat\Delta_{k,n,L}(x) - \Delta_{k,L}(x)| + \sum_{k=N_n+1}^{\infty} |\Delta_{k,L}(x)| \\
&\le |\hat M_{1,n}(x) - M_1(x)| + \sum_{k=2}^{R} |\hat\Delta_{k,n,L}(x) - \Delta_{k,L}(x)| + \sum_{k=R+1}^{\infty} \big(|\hat\Delta_{k,n,L}(x)| + |\Delta_{k,L}(x)|\big) \\
&\le |\hat M_{1,n}(x) - M_1(x)| + \sum_{k=2}^{R} |\hat\Delta_{k,n,L}(x) - \Delta_{k,L}(x)| + 2L \sum_{k=R+1}^{\infty} 2^{-k} \\
&\le |\hat M_{1,n}(x) - M_1(x)| + \sum_{k=2}^{R} |\hat\Delta_{k,n,L}(x) - \Delta_{k,L}(x)| + L 2^{-(R-1)}.
\end{aligned} \tag{19}$$

By (17) and (18), almost surely, for all $x \in S$,

$$|\hat M_{1,n}(x) - M_1(x)| + \sum_{k=2}^{R} |\hat\Delta_{k,n,L}(x) - \Delta_{k,L}(x)| \to 0. \tag{20}$$

By (19), almost surely, for all $x \in S$,

$$\limsup_{n \to \infty} |\hat m_{n,L}(x) - m_L(x)| \le L 2^{-(R-1)}. \tag{21}$$

Since $R$ was arbitrary, (13) is proved.

Now we prove (14). Assume the support $S$ of $\mu$ is bounded. Let $\mathcal{A}_k$ denote the set of hyper-cubes from partition $\mathcal{P}_k$ with nonempty intersection with $S$. That is, define

$$\mathcal{A}_k = \{A \in \mathcal{P}_k : A \cap S \ne \emptyset\}. \tag{22}$$

Since $S$ is bounded, $\mathcal{A}_k$ is a finite set. For $A \in \mathcal{P}_k$ let $a(A)$ be the center of $A$. Then almost surely,

$$\begin{aligned}
&\sup_{x \in S} \Big( |\hat M_{1,n}(x) - M_1(x)| + \sum_{k=2}^{R} |\hat\Delta_{k,n,L}(x) - \Delta_{k,L}(x)| \Big) \\
&\qquad \le \max_{A \in \mathcal{A}_1} |\hat M_{1,n}(a(A)) - M_1(a(A))| + \sum_{k=2}^{R} \max_{A \in \mathcal{A}_k} |\hat\Delta_{k,n,L}(a(A)) - \Delta_{k,L}(a(A))| \qquad (23) \\
&\qquad \to 0 \qquad (24)
\end{aligned}$$

keeping in mind that only finitely many terms are involved in the maximization operation. The rest of the proof goes virtually as before.

Now we prove (15).

$$|\hat m_{n,L}(x) - m_L(x)|^2 \le 2 \Big( |\hat M_{1,n}(x) - M_1(x)|^2 + \Big| M_1(x) + \sum_{k=2}^{N_n} \hat\Delta_{k,n,L}(x) - m_L(x) \Big|^2 \Big).$$
If condition (i) holds, then for the first term we have dominated convergence,

$$|\hat M_{1,n}(x) - M_1(x)|^2 \le (2D)^2,$$

and for the second one, too:

$$\Big| M_1(x) + \sum_{k=2}^{N_n} \hat\Delta_{k,n,L}(x) - m_L(x) \Big| \le \sum_{k=2}^{\infty} \big( |\hat\Delta_{k,n,L}(x)| + |\Delta_{k,L}(x)| \big) \le L,$$

and thus (15) follows by Lebesgue's dominated convergence theorem,

$$0 = \int \lim_{n \to \infty} |\hat m_{n,L}(x) - m_L(x)|^2 \mu(dx) = \lim_{n \to \infty} \int |\hat m_{n,L}(x) - m_L(x)|^2 \mu(dx)$$

almost surely. If condition (ii) holds then (15) follows from (14). □

Corollary 1 Assume $m(x)$ is Lipschitz continuous with Lipschitz constant $C$. With the choice of $L \ge C\sqrt{d}$, for all $x \in S$, $m_L(x) = m(x)$ and Theorem 1 holds with $m_L(x)$ replaced by $m(x)$.

Proof Since $m(x)$ is Lipschitz with constant $L/\sqrt{d}$, for $x \in S$,

$$\begin{aligned}
|M_k(x) - m(x)| &= \Big| \frac{\int_{A_k(x)} m(y) \mu(dy)}{\mu(A_k(x))} - m(x) \Big| \\
&\le \frac{1}{\mu(A_k(x))} \int_{A_k(x)} |m(y) - m(x)| \mu(dy) \\
&\le \frac{1}{\mu(A_k(x))} \int_{A_k(x)} (L/\sqrt{d})(2^{-k-2}\sqrt{d}) \mu(dy) \\
&= L 2^{-k-2}
\end{aligned}$$

and $M_k(x) \to m(x)$. For $x \in S$ we get

$$|M_k(x) - M_{k-1}(x)| \le |M_k(x) - m(x)| + |m(x) - M_{k-1}(x)| \le L 2^{-k-2} + L 2^{-k-1} < L 2^{-k}.$$

Thus $m(x) = M_1(x) + \sum_{k=2}^{\infty} \Delta_k(x)$ and $\Delta_{k,L}(x) = \Delta_k(x)$ for all $x \in S$. Hence for all $x \in S$,

$$m_L(x) = M_1(x) + \sum_{k=2}^{\infty} \Delta_{k,L}(x) = M_1(x) + \sum_{k=2}^{\infty} \Delta_k(x) = m(x)$$

and Corollary 1 is proved. □

Remark 1. If there is no truncation, that is if $L = \infty$, then $\hat m_n = \hat M_{N_n,n}$. In this case, $\hat m_n$ is the standard partitioning estimate (defined, for example, in [14]). It is known that there is an ergodic process $(X_i, Y_i)$ with Lipschitz continuous $m(x)$ with constant $C = 1$ such that a classical partitioning estimate is not even weakly consistent (cf. Györfi, Morvai, Yakowitz [16]).

Remark 2. Our consistency is not universal, however, since $m$ is hypothesized to be Lipschitz continuous.

Remark 3.
$N_n$ can be data dependent, provided $N_n \to \infty$ a.s.

Remark 4. The methodology here is applicable to linear auto-regressive processes. Let $\{Z_i\}$ be i.i.d. random variables with $EZ = 0$ and $\operatorname{Var}(Z) < \infty$. Define

$$W_{n+1} = a_1 W_n + a_2 W_{n-1} + \dots + a_K W_{n-K+1} + Z_{n+1} \tag{25}$$

where $\sum_{i=1}^{K} |a_i| < 1$. Equation (25) yields a stationary ergodic solution. Assume $K \le d$. Let $Y_{n+1} = W_{n+1}$, and $X_{n+1} = (W_n, \dots, W_{n-d+1})$. Now

$$m(X_{n+1}) = E(Y_{n+1} \mid X_{n+1}) = E(W_{n+1} \mid W_n, \dots, W_{n-d+1}) = a_1 W_n + a_2 W_{n-1} + \dots + a_K W_{n-K+1}.$$

The regression function $m(x)$ is Lipschitz continuous with constant $C = 1$, since for $x = (x_1, \dots, x_d)$ and $z = (z_1, \dots, z_d)$,

$$|m(x) - m(z)| \le \sum_{i=1}^{K} |a_i| |x_i - z_i| \le \max_{1 \le i \le d} |x_i - z_i| \le \|x - z\|.$$

3 Truncated kernel estimation

Let $K(x)$ be a non-negative continuous kernel function with

$$b\, 1_{\{x \in S_{0,r}\}} \le K(x) \le 1_{\{x \in S_{0,1}\}},$$

where $0 < b \le 1$ and $0 < r < 1$. ($S_{z,r}$ denotes the closed ball around $z$ with radius $r$.) Choose $h_k = 2^{-k-2}$ and

$$M_k^*(x) = \frac{E\big(Y K(\frac{X - x}{h_k})\big)}{E\big(K(\frac{X - x}{h_k})\big)} = \frac{\int m(z) K(\frac{z - x}{h_k}) \mu(dz)}{\int K(\frac{z - x}{h_k}) \mu(dz)}. \tag{26}$$

Let

$$\Delta_k^*(x) = M_k^*(x) - M_{k-1}^*(x). \tag{27}$$

As a motivation, we note that Devroye [9] yields (4), and therefore (6), too, with $M_k^*$ in place of $M_k$. Now for $k \ge 2$, define

$$\Delta_{k,L}^*(x) = \operatorname{sign}(M_k^*(x) - M_{k-1}^*(x)) \min(|M_k^*(x) - M_{k-1}^*(x)|, L 2^{-k}). \tag{28}$$

Define

$$m_L^*(x) := M_1^*(x) + \sum_{i=2}^{\infty} \Delta_{i,L}^*(x). \tag{29}$$

Put

$$\hat M_{k,n}^*(x) := \frac{\sum_{j=1}^{n} Y_j K(\frac{X_j - x}{h_k})}{\sum_{j=1}^{n} K(\frac{X_j - x}{h_k})}$$

where we use the convention that $0/0 = 0$. Now for $k \ge 2$, introduce

$$\hat\Delta_{k,n,L}^*(x) = \operatorname{sign}(\hat M_{k,n}^*(x) - \hat M_{k-1,n}^*(x)) \min(|\hat M_{k,n}^*(x) - \hat M_{k-1,n}^*(x)|, L 2^{-k}) \tag{30}$$

and

$$\hat m_{n,L}^*(x) = \hat M_{1,n}^*(x) + \sum_{k=2}^{N_n} \hat\Delta_{k,n,L}^*(x).$$
(31)

Redefine the support $S$ of $\mu$ as

$$S := \{x \in \mathbb{R}^d : \mu(S_{x,1/k}) > 0 \text{ for all } k \ge 1\}. \tag{32}$$

By Cover and Hart [7], $\mu(S) = 1$.

Theorem 2 Let $\{(X_i, Y_i)\}$ be a stationary ergodic time series with $E|Y_i| < \infty$. Assume $N_n \to \infty$. Then almost surely, for all $x \in S$,

$$\hat m_{n,L}^*(x) \to m_L^*(x). \tag{33}$$

If the support $S$ of $\mu$ is a bounded subset of $\mathbb{R}^d$ then almost surely

$$\sup_{x \in S} |\hat m_{n,L}^*(x) - m_L^*(x)| \to 0. \tag{34}$$

If either (i) $|Y| \le D < \infty$ almost surely ($D$ need not be known) or (ii) $\mu$ is of bounded support then

$$\int (\hat m_{n,L}^*(x) - m_L^*(x))^2 \mu(dx) \to 0. \tag{35}$$

Proof We first prove that (16) holds with $\hat M_{k,n}^*$ and $M_k^*$. Let

$$g_{k,n}(x) = \frac{1}{n} \sum_{j=1}^{n} Y_j K\Big(\frac{X_j - x}{h_k}\Big) \quad \text{and} \quad g_k(x) = E\Big(Y K\Big(\frac{X - x}{h_k}\Big)\Big).$$

Similarly put

$$f_{k,n}(x) = \frac{1}{n} \sum_{j=1}^{n} K\Big(\frac{X_j - x}{h_k}\Big) \quad \text{and} \quad f_k(x) = E\Big(K\Big(\frac{X - x}{h_k}\Big)\Big).$$

We have to show that almost surely, for all $k \ge 1$, and for all $x \in S$, both $g_{k,n}(x) \to g_k(x)$ and $f_{k,n}(x) \to f_k(x)$. Consider $g_{k,n}(x)$ with $k$ fixed. Let $Q \subseteq \mathbb{R}^d$ denote the set of vectors with rational coordinates. (Note that the set $Q$ has countably many elements.) By the ergodic theorem, almost surely, for all $r \in Q$,

$$g_{k,n}(r) \to g_k(r).$$

Let $\delta > 0$ be arbitrary. Let integers $Z - 1 > M > 0$ be so large that $E|Y| 1_{\{X \notin S_{0,M}\}} < \delta$. By ergodicity, almost surely,

$$\sup_{x \notin S_{0,Z}} |g_{k,n}(x)| \le \frac{1}{n} \sum_{i=1}^{n} |Y_i| 1_{\{X_i \notin S_{0,M}\}} \to E|Y| 1_{\{X \notin S_{0,M}\}} < \delta.$$

Since $K_{h_k}(x) = K(x/h_k)$ is continuous and $K_{h_k}(x) = 0$ if $\|x\| > h_k$, the function $K_{h_k}(x)$ is uniformly continuous on $\mathbb{R}^d$. Define

$$U_k(u) = \sup_{x, z \in \mathbb{R}^d :\, \|x - z\| \le u} |K_{h_k}(x) - K_{h_k}(z)|.$$

Let $B_\delta \subseteq S_{0,Z} \cap Q$ be a finite subset of vectors with rational coordinates such that

$$\sup_{x \in S_{0,Z}} \min_{r \in B_\delta} U_k(\|x - r\|) < \delta.$$

For $x \in S_{0,Z}$, let $r(x)$ denote one of the closest rational vectors $r \in B_\delta$ to $x$.
Now

$$\begin{aligned}
\sup_{x \in S_{0,Z}} |g_{k,n}(x) - g_{k,m}(x)| &\le \sup_{x \in S_{0,Z}} |g_{k,n}(x) - g_{k,n}(r(x))| + \sup_{x \in S_{0,Z}} |g_{k,n}(r(x)) - g_{k,m}(r(x))| \\
&\qquad + \sup_{x \in S_{0,Z}} |g_{k,m}(r(x)) - g_{k,m}(x)| \\
&\le \delta \frac{1}{n} \sum_{i=1}^{n} |Y_i| + \max_{r \in B_\delta} |g_{k,m}(r) - g_{k,n}(r)| + \delta \frac{1}{m} \sum_{i=1}^{m} |Y_i|.
\end{aligned}$$

Combining the results, by the ergodic theorem, for almost all $\omega \in \Omega$, there exists $N(\omega)$ such that for all $m > N$ and $n > N$,

$$\sup_{x \in \mathbb{R}^d} |g_{k,n}(x) - g_{k,m}(x)| \le \sup_{x \in S_{0,Z}} |g_{k,n}(x) - g_{k,m}(x)| + \sup_{x \notin S_{0,Z}} |g_{k,n}(x) - g_{k,m}(x)| \le 2\delta E|Y| + 3\delta.$$

Since $\delta$ was arbitrary, for almost all $\omega \in \Omega$, for every $\epsilon > 0$, there exists an integer $N_\epsilon(\omega)$ such that for all $m > N_\epsilon(\omega)$, $n > N_\epsilon(\omega)$:

$$\sup_{x \in \mathbb{R}^d} |g_{k,n}(x) - g_{k,m}(x)| < \epsilon. \tag{36}$$

As a consequence, almost surely, the sequence of functions $\{g_{k,n}\}_{n=1}^{\infty}$ converges uniformly. Since all $g_{k,n}$ are continuous, the limit function must also be continuous. Since almost surely, for all $r \in Q$, $g_{k,n}(r) \to g_k(r)$, and by the Lebesgue dominated convergence theorem $g_k$ is continuous, the limit function must be $g_k$. Since there are countably many $k$, almost surely, for all $k \ge 1$,

$$\sup_{x \in \mathbb{R}^d} |g_{k,n}(x) - g_k(x)| \to 0.$$

The same argument implies that almost surely, for all $k \ge 1$,

$$\sup_{x \in \mathbb{R}^d} |f_{k,n}(x) - f_k(x)| \to 0.$$

We have proved (16). The rest of the proof of (33) goes as in the proof of Theorem 1.

Now we prove (34). By assumption the support is bounded, and since it is closed, it is compact. Now note that there must exist an $\epsilon > 0$ such that $\inf_{x \in S} f_k(x) > \epsilon$. (Otherwise, there would be a sequence $x_i \in S$ such that $\liminf_{i \to \infty} f_k(x_i) = 0$. Continuity on a compact set would imply that there would be an $x \in S$ such that $f_k(x) = 0$, in contradiction to the hypothesis that $x \in S$.
) By uniform convergence, for large $n$, $\inf_{x \in S} f_{k,n}(x) > \epsilon/2$. Thus, using $f_k \le 1$,

$$\begin{aligned}
\sup_{x \in S} \Big| \frac{g_{k,n}(x)}{f_{k,n}(x)} - \frac{g_k(x)}{f_k(x)} \Big| &\le \sup_{x \in S} \Big| \frac{g_{k,n}(x)\big(f_k(x)/f_{k,n}(x)\big) - g_k(x)}{f_k(x)} \Big| \\
&\le \frac{1}{\epsilon} \sup_{x \in S} \Big( \frac{f_k(x)}{f_{k,n}(x)} |g_{k,n}(x) - g_k(x)| + |g_k(x)| \Big| \frac{f_k(x)}{f_{k,n}(x)} - 1 \Big| \Big) \\
&\le \frac{2}{\epsilon^2} \sup_{x \in S} |g_{k,n}(x) - g_k(x)| + \sup_{x \in S} |g_k(x)| \, \frac{2}{\epsilon^2} \sup_{x \in S} |f_{k,n}(x) - f_k(x)| \\
&\to 0.
\end{aligned}$$

Thus almost surely, for all $k \ge 1$,

$$\sup_{x \in S} |\hat M_{k,n}^*(x) - M_k^*(x)| \to 0.$$

Almost surely, for arbitrary integer $R > 2$,

$$\sup_{x \in S} \Big( |\hat M_{1,n}^*(x) - M_1^*(x)| + \sum_{k=2}^{R} |\hat\Delta_{k,n,L}^*(x) - \Delta_{k,L}^*(x)| \Big) \to 0.$$

The rest of the proof goes exactly as in Theorem 1. □

Corollary 2 Assume $m(x)$ is Lipschitz continuous with Lipschitz constant $C$. With the choice of $L \ge C$, for all $x \in S$, $m_L^*(x) = m(x)$ and Theorem 2 holds with $m_L^*(x)$ substituted by $m(x)$.

Proof Since $m(x)$ is Lipschitz with constant $C$, for $x \in S$,

$$\begin{aligned}
|M_k^*(x) - m(x)| &= \Big| \frac{\int m(z) K(\frac{z - x}{h_k}) \mu(dz)}{\int K(\frac{z - x}{h_k}) \mu(dz)} - m(x) \Big| \\
&\le \frac{\int |m(z) - m(x)| K(\frac{z - x}{h_k}) \mu(dz)}{\int K(\frac{z - x}{h_k}) \mu(dz)} \\
&\le C h_k \le L 2^{-k-2},
\end{aligned}$$

therefore

$$|M_k^*(x) - M_{k-1}^*(x)| < L 2^{-k}.$$

The rest of the proof goes as in Corollary 1. □

4 Conclusions

This contribution is part of a long-standing endeavor of the authors to extend nonparametric forecasting methodology to the most lenient assumptions possible. The present work does push into new territory: strong consistency for finite regression under a Lipschitz assumption. The computational aspects have not been explored, but the algorithms are so close to their traditional partitioning and kernel counterparts that it is evident that they could be implemented and, in fact, might be competitive.
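To make the implementability remark concrete, here is a minimal Python sketch of the truncated estimates (12) and (31). It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the partition is taken to be the grid of cubes with side $2^{-k-2}$ anchored at the origin, the kernel is an assumed triangular kernel (which meets the bracketing condition of Section 3 with $b = r = 1/2$), and the demonstration data are an AR(1) series as in Remark 4 with $a_1 = 0.5$ and standard normal innovations.

```python
import numpy as np

def partition_mean(X, Y, x, k):
    """M-hat_{k,n}(x) of (10): mean of Y_j over the X_j lying in the cell A_k(x)."""
    h = 2.0 ** (-k - 2)                      # cell side, so the cell volume is h**d
    in_cell = np.all(np.floor(X / h) == np.floor(x / h), axis=1)
    return float(Y[in_cell].mean()) if in_cell.any() else 0.0  # empty cell -> 0

def kernel_mean(X, Y, x, k):
    """M-hat*_{k,n}(x): kernel-weighted mean with bandwidth h_k = 2^{-k-2}."""
    h = 2.0 ** (-k - 2)
    # Assumed kernel K(u) = max(0, 1 - |u|): continuous, supported on the unit ball.
    w = np.maximum(0.0, 1.0 - np.linalg.norm((X - x) / h, axis=1))
    total = w.sum()
    return float(w @ Y / total) if total > 0 else 0.0          # convention 0/0 = 0

def truncated_estimate(X, Y, x, L, N, level_mean=partition_mean):
    """Level-1 mean plus the increments of levels 2..N, each clipped to +/- L 2^-k."""
    prev = level_mean(X, Y, x, 1)
    est = prev
    for k in range(2, N + 1):
        cur = level_mean(X, Y, x, k)
        bound = L * 2.0 ** (-k)
        est += float(np.clip(cur - prev, -bound, bound))       # truncation as in (11)
        prev = cur
    return est

def simulate_ar1(n, a=0.5, seed=0):
    """Pairs (X_j, Y_j) = (W_{j-1}, W_j) from W_j = a W_{j-1} + Z_j.

    The series is started at 0, so it is only approximately stationary.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(n + 1)
    for j in range(1, n + 1):
        w[j] = a * w[j - 1] + rng.normal()
    return w[:-1].reshape(-1, 1), w[1:]

X, Y = simulate_ar1(20_000)
x0 = np.array([0.3])                          # here m(x0) = 0.5 * 0.3 = 0.15
est_partition = truncated_estimate(X, Y, x0, L=1.0, N=5)
est_kernel = truncated_estimate(X, Y, x0, L=1.0, N=5, level_mean=kernel_mean)
print(est_partition, est_kernel)
```

In this sketch `L=1.0` matches the choice $L \ge C\sqrt{d}$ of Corollary 1 for $d = 1$ and $C = 1$, and the fixed `N=5` stands in for $N_n$; how fast $N_n$ should grow with $n$ in practice is not addressed by the theorems and is left open here.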
The fundamental formula (8) leading to the truncated histogram approach was motivated by a representation used in a related but non-constructive setting by Kieffer [17]. The essence is to see that an infinite-dimensional nonparametric space may sometimes be decomposed into sums of terms in finite dimensional spaces, with tails of the summations being a priori asymptotically bounded over the regression class of interest. Through different devices, two ideas for obtaining such tail bounds for the partition and kernel methods have been presented. Our contribution has been to apply the idea with Lipschitz continuity assuring the negligibility. Thus, results here are fundamentally intertwined with the Lipschitz bounds. Perhaps other useful expansions are possible.

The interplay of finite subspaces and a priori bounded tails has proven a bit delicate. Sections 2 and 3 present different attacks to the error-bounding problem. The obvious nearest-neighbor estimator did not yield to this technique because the radii are random and do not necessarily decrease rapidly enough to assure bounded tails. The device which was successful here may find other applications. Evidently, a similar investigation could be carried out for regression classes having Fourier expansions with coefficients vanishing sufficiently quickly.

It is well-known (e.g., [33]) that universal convergence rates under the generality of mere ergodicity do not exist. An avenue which would be worth exploring is that of adapting universal algorithms, such as explored and referenced here, so that they asymptotically attain the fastest possible convergence if, unknown to the statistician, the time series happens to fall into a mixing class. The design should be such that consistency is still assured if mixing rates do not hold.

References

[1] P. H.
Algoet, Universal schemes for prediction, gambling and portfolio selection, Annals Probab., vol. 20, pp. 901–941, 1992. Correction: ibid., vol. 23, pp. 474–478, 1995.
[2] P. H. Algoet, The strong law of large numbers for sequential decisions under uncertainty, IEEE Trans. Inform. Theory, vol. 40, pp. 609–634, May 1994.
[3] R. B. Ash, Real Analysis and Probability. Academic Press, 1972.
[4] D. H. Bailey, Sequential Schemes for Classifying and Predicting Ergodic Processes. Ph. D. thesis, Stanford University, 1976.
[5] Z. Bodie, A. Kane, and A. Marcus, Investments, 3rd ed. Irwin, Chicago 1996.
[6] T. Bollerslev, R. Y. Chou, and K. F. Kroner, ARCH modeling in finance, Journal of Econometrics, 52, pp. 5–59, 1992.
[7] T. M. Cover and P. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory, IT-13, pp. 21–27, 1967.
[8] R. Davis and T. Mikosch, The sample autocorrelations of heavy-tailed processes with applications to ARCH, Annals of Statistics, 26, pp. 2049–2080, 1999.
[9] L. Devroye, On the almost everywhere convergence of nonparametric regression function estimates, Annals of Statistics, 9, pp. 1310–1319, 1981.
[10] L. Devroye and L. Györfi, Distribution-free exponential bound on the $L_1$ error of partitioning estimates of a regression function, In: Proceedings of the Fourth Pannonian Symposium on Mathematical Statistics, eds. F. Konecny, J. Mogyoródi, W. Wertz, pp. 67–76, Akadémiai Kiadó, Budapest, Hungary, 1983.
[11] L. Devroye and A. Krzyżak, An equivalence theorem for $L_1$ convergence of the kernel regression estimate, Journal of Statistical Planning and Inference, 23, pp. 71–82, 1989.
[12] L. Devroye and T. J. Wagner, Distribution-free consistency results in nonparametric discrimination and regression function estimation, Annals of Statistics, 8, pp. 231–239, 1980.
[13] W. Greblicki, A.
Krzyżak and M. Pawlak, Distribution-free pointwise consistency of kernel regression estimate, Annals of Statistics, 12, pp. 1570–1575, 1984.
[14] L. Györfi, W. Härdle, P. Sarda and Ph. Vieu, Nonparametric Curve Estimation from Time Series. Berlin: Springer-Verlag, 1989.
[15] L. Györfi, Universal consistencies of a regression estimate for unbounded regression functions, In: Nonparametric Functional Estimation, ed. G. Roussas, pp. 329–338, NATO ASI Series, Kluwer, Berlin, 1991.
[16] L. Györfi, G. Morvai and S. Yakowitz, Limits to consistent on-line forecasting for ergodic time series, IEEE Transactions on Information Theory, IT-44, pp. 886–892, 1998.
[17] J. Kieffer, Estimation of a convex real parameter of an unknown information source, The Annals of Probability, 7, pp. 882–886, 1979.
[18] S. R. Kulkarni and S. E. Posner, Rates of convergence of nearest neighbour estimation under arbitrary sampling, IEEE Transactions on Information Theory, IT-41, pp. 1028–1039, 1995.
[19] A. Krzyżak and M. Pawlak, Distribution-free consistency of a nonparametric kernel regression estimate and classification, IEEE Transactions on Information Theory, IT-30, pp. 78–81, 1984.
[20] B. B. Mandelbrot, A Multifractal Walk down Wall Street, Scientific American, XXX, pp. 70–73, 1999.
[21] E. Masry and D. Tjostheim, Nonparametric estimation and identification of nonlinear ARCH time series, Econometric Theory, 11, pp. 258–289, 1995.
[22] G. Morvai, S. Yakowitz, and L. Györfi, Nonparametric inferences for ergodic, stationary time series, Annals of Statistics, 24, pp. 370–379, 1996.
[23] G. Morvai, Estimation of Conditional Distributions for Stationary Time Series. Ph.D. Thesis, Technical University of Budapest, 1996.
[24] G. Morvai, S. Kulkarni and A. Nobel, Regression estimation from an individual stable sequence, Statistics, 33, no. 2, pp. 99–118, 1999.
[25] G.
Morvai, S. Yakowitz, and P. Algoet, Weakly convergent nonparametric forecasting of stationary time series, IEEE Trans. Inform. Theory, 43, pp. 483–498, March 1997.
[26] A. Nobel, G. Morvai and S. Kulkarni, Density estimation from an individual numerical sequence, IEEE Trans. Inform. Theory, 44, pp. 537–541, March 1998.
[27] D. Ornstein, Guessing the next output of a stationary process, Israel J. of Math., 30, pp. 292–296, 1978.
[28] S. Resnick, Heavy tail modeling and teletraffic data, Ann. Math. Statist., 25, pp. 1805–1849, 1997.
[29] P. Robinson, Time series with strong dependence, in Advances in Econometrics: Sixth World Congress, C. A. Sims, ed., Vol. I, Cambridge University Press, pp. 47–95, 1994.
[30] M. Rosenblatt, Density estimates and Markov sequences. In Nonparametric Techniques in Statistical Inference, M. Puri, Ed. London: Cambridge University, pp. 199–210, 1970.
[31] G. Roussas, Non-parametric estimation of the transition distribution of a Markov process, Annals of Inst. Statist. Math., 21, pp. 73–87, 1969.
[32] B. Ya. Ryabko, Prediction of random sequences and universal coding, Problems of Inform. Trans., 24, pp. 87–96, Apr.–June 1988.
[33] P. Shields, Universal redundancy rates don't exist, IEEE Trans. Inform. Theory, 39, pp. 520–524, 1993.
[34] C. Spiegelman and J. Sacks, Consistent window estimation in nonparametric regression, Annals of Statistics, 8, pp. 240–246, 1980.
[35] C. J. Stone, Consistent nonparametric regression, Annals of Statistics, 8, pp. 1348–1360, 1977.
[36] G. S. Watson, Smooth regression analysis, Sankhya Series A, 26, pp. 359–372, 1964.
[37] W. Willinger, M. S. Taqqu, W. E. Leland, and D. V. Wilson, Self-similarity in high-speed packet traffic-analysis and modeling on ethernet traffic measurements, Statistical Science, 10, pp. 67–85, 1995.
[38] S.
Yakowitz, Nearest neighbor regression estimation for null-recurrent Markov time series, Stochastic Processes and their Applications, 37, pp. 311–318, 1993.
[39] J. Ziv and A. Lempel, Compression of individual sequences by variable rate coding, IEEE Trans. Inform. Theory, IT-24, pp. 530–536, Sept. 1978.