Estimating $\beta$-mixing coefficients
Abstract: The literature on statistical learning for time series assumes the asymptotic independence or "mixing" of the data-generating process. These mixing assumptions are never tested, nor are there methods for estimating mixing rates from data. We give an estimator for the $\beta$-mixing rate based on a single stationary sample path and show it is $L_1$-risk consistent.
Authors: Daniel J. McDonald (Carnegie Mellon University), Cosma Rohilla Shalizi (Carnegie Mellon University and the Santa Fe Institute), Mark Schervish (Carnegie Mellon University)
Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Ft. Lauderdale, FL, USA. Copyright 2011 by the authors.

1 Introduction

Relaxing the assumption of independence is an active area of research in the statistics and machine learning literature. For time series, independence is replaced by the asymptotic independence of events far apart in time, or "mixing". Mixing conditions make the dependence of the future on the past explicit, quantifying the decay in dependence as the future moves farther from the past. There are many definitions of mixing of varying strength with matching dependence coefficients (see [8, 6, 3] for reviews), but most of the results in the learning literature focus on $\beta$-mixing or absolute regularity. Roughly speaking (see Definition 2.1 below for a precise statement), the $\beta$-mixing coefficient at lag $a$ is the total variation distance between the actual joint distribution of events separated by $a$ time steps and the product of their marginal distributions, i.e., the $L_1$ distance from independence.

Numerous results in the statistical machine learning literature rely on knowledge of the $\beta$-mixing coefficients. As Vidyasagar [24, p. 41] notes, $\beta$-mixing is "just right" for the extension of IID results to dependent data, and so recent work has consistently focused on it. Meir [14] derives generalization error bounds for nonparametric methods based on model selection via structural risk minimization. Baraud et al. [1] study the finite sample risk performance of penalized least squares regression estimators under $\beta$-mixing. Lozano et al. [12] examine regularized boosting algorithms under absolute regularity and prove consistency. Karandikar and Vidyasagar [11] consider "probably approximately correct" learning algorithms, proving that PAC algorithms for IID inputs remain PAC with $\beta$-mixing inputs under some mild conditions. Ralaivola et al. [19] derive PAC bounds for ranking statistics and classifiers using a decomposition of the dependency graph. Finally, Mohri and Rostamizadeh [15] derive stability bounds for $\beta$-mixing inputs, generalizing existing stability results for IID data.

All these results assume not just $\beta$-mixing, but known mixing coefficients. In particular, the risk bounds in [14, 15] and [19] are incalculable without knowledge of the rates. This knowledge is never available. Unless researchers are willing to assume specific values for a sequence of $\beta$-mixing coefficients, the results mentioned in the previous paragraph are generally useless when confronted with data. To illustrate this deficiency, consider Theorem 18 of [15]:

Theorem 1.1 (Briefly). Assume a learning algorithm is $\lambda$-stable.
Then, for any sample of size $n$ drawn from a stationary $\beta$-mixing distribution, and any $\epsilon > 0$,
$$P\left(|R - \widehat{R}| > \epsilon\right) \leq \Gamma(n, \lambda, \epsilon, a, b) + \beta(a)(\mu_n - 1),$$
where $n = (a+b)\mu_n$, $\Gamma$ has a particular functional form, and $R - \widehat{R}$ is the difference between the true risk and the empirical risk.

Ideally, one could use this result for model selection or to control the size of the generalization error of competing prediction algorithms (support vector machines, support vector regression, and kernel ridge regression are a few of the many algorithms known to satisfy $\lambda$-stability). However, the bound depends explicitly on the mixing coefficient $\beta(a)$. To make matters worse, there are no methods for estimating the $\beta$-mixing coefficients. According to Meir [14, p. 7], "there is no efficient practical approach known at this stage for estimation of mixing parameters." We begin to rectify this problem by deriving the first method for estimating these coefficients. We prove that our estimator is consistent for arbitrary $\beta$-mixing processes. In addition, we derive rates of convergence for Markov approximations to these processes.

Application of statistical learning results to $\beta$-mixing data is highly desirable in applied work. Many common time series models are known to be $\beta$-mixing, and the rates of decay are known given the true parameters of the process. Among the processes for which such knowledge is available are ARMA models [16], GARCH models [4], and certain Markov processes (see [8] for an overview of such results). To our knowledge, only Nobel [17] approaches a solution to the problem of estimating mixing rates, by giving a method to distinguish between different polynomial mixing rate regimes through hypothesis testing.

We present the first method for estimating the $\beta$-mixing coefficients for stationary time series data. Section 2 defines the $\beta$-mixing coefficient and states our main results on convergence rates and consistency for our estimator. Section 3 gives an intermediate result on the $L_1$ convergence of the histogram estimator with $\beta$-mixing inputs. Section 4 proves the main results from Section 2. Section 5 concludes and lays out some avenues for future research.

2 Estimation of $\beta$-mixing

In this section, we present one of many equivalent definitions of absolute regularity and state our main results, deferring proofs to Section 4. To fix notation, let $X = \{X_t\}_{t=-\infty}^{\infty}$ be a sequence of random variables where each $X_t$ is a measurable function from a probability space $(\Omega, \mathcal{F}, P)$ into a measurable space $\mathcal{X}$. A block of this random sequence will be given by $X_i^j \equiv \{X_t\}_{t=i}^{j}$, where $i$ and $j$ are integers and may be infinite. We use similar notation for the sigma fields generated by these blocks and their joint distributions. In particular, $\sigma_i^j$ will denote the sigma field generated by $X_i^j$, and the joint distribution of $X_i^j$ will be denoted $P_i^j$.

2.1 Definitions

There are many equivalent definitions of $\beta$-mixing (see for instance [8] or [3], as well as Meir [14] or Yu [27]); the most intuitive is the one given in Doukhan [8].
Definition 2.1 ($\beta$-mixing). For each positive integer $a$, the coefficient of absolute regularity, or $\beta$-mixing coefficient, $\beta(a)$, is
$$\beta(a) \equiv \sup_t \left\| P_{-\infty}^t \otimes P_{t+a}^{\infty} - P_{t,a} \right\|_{TV}, \qquad (1)$$
where $\|\cdot\|_{TV}$ is the total variation norm, and $P_{t,a}$ is the joint distribution of $(X_{-\infty}^t, X_{t+a}^{\infty})$. A stochastic process is said to be absolutely regular, or $\beta$-mixing, if $\beta(a) \to 0$ as $a \to \infty$.

Loosely speaking, Definition 2.1 says that the coefficient $\beta(a)$ measures the total variation distance between the joint distribution of random variables separated by $a$ time units and a distribution under which random variables separated by $a$ time units are independent. The supremum over $t$ is unnecessary for stationary processes $X$, which is the only case we consider here.

Definition 2.2 (Stationarity). A sequence of random variables $X$ is stationary when all its finite-dimensional distributions are invariant over time: for all $t$ and all non-negative integers $i$ and $j$, the random vectors $X_t^{t+i}$ and $X_{t+j}^{t+i+j}$ have the same distribution.

Our main result requires the method of blocking used by Yu [26, 27]. The purpose is to transform a sequence of dependent variables into subsequences of nearly IID ones. Consider a sample $X_1^n$ from a stationary $\beta$-mixing sequence with density $f$. Let $m_n$ and $\mu_n$ be non-negative integers such that $2m_n\mu_n = n$. Now divide $X_1^n$ into $2\mu_n$ blocks, each of length $m_n$. Identify the blocks as follows:
$$U_j = \{X_i : 2(j-1)m_n + 1 \leq i \leq (2j-1)m_n\},$$
$$V_j = \{X_i : (2j-1)m_n + 1 \leq i \leq 2jm_n\}.$$
Let $U$ be the entire sequence of odd blocks $U_j$, and let $V$ be the sequence of even blocks $V_j$. Finally, let $U'$ be a sequence of blocks which are independent of $X_1^n$ but such that each block has the same distribution as a block from the original sequence:
$$U'_j \stackrel{D}{=} U_j \stackrel{D}{=} U_1. \qquad (2)$$
The blocks $U'$ now form an IID block sequence, so standard results apply. (See [27] for a more rigorous analysis of blocking.)
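To make the blocking construction concrete, here is a minimal sketch in Python (our own illustration; the function name and interface are not from the paper) of the split into odd and even blocks:

```python
import numpy as np

def alternating_blocks(x, m_n):
    """Split a sample of length n = 2 * mu_n * m_n into the odd blocks U_j
    and even blocks V_j of Yu's blocking construction, each of length m_n."""
    x = np.asarray(x)
    mu_n = len(x) // (2 * m_n)
    assert 2 * mu_n * m_n == len(x), "require n = 2 * mu_n * m_n exactly"
    blocks = x.reshape(2 * mu_n, m_n)  # 2*mu_n consecutive blocks
    return blocks[0::2], blocks[1::2]  # U = odd blocks, V = even blocks

# Example: n = 12 and m_n = 2 give mu_n = 3 blocks in each of U and V.
U, V = alternating_blocks(np.arange(1, 13), m_n=2)
print(U)  # [[ 1  2] [ 5  6] [ 9 10]]
print(V)  # [[ 3  4] [ 7  8] [11 12]]
```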
With this structure, we can state our main result.

2.2 Results

Our main result emerges in two stages. First, we recognize that the distribution of a finite sample depends only on finite-dimensional distributions. This leads to an estimator of a finite-dimensional version of $\beta(a)$. Next, we let the finite dimension increase to infinity with the size of the observed sample.

For positive integers $t$, $d$, and $a$, define
$$\beta_d(a) = \left\| P_{t-d+1}^t \otimes P_{t+a}^{t+a+d-1} - P_{t,a,d} \right\|_{TV}, \qquad (3)$$
where $P_{t,a,d}$ is the joint distribution of $(X_{t-d+1}^t, X_{t+a}^{t+a+d-1})$. Also, let $\widehat{f}_d$ be the $d$-dimensional histogram estimator of the joint density of $d$ consecutive observations, and let $\widehat{f}^{2d}_a$ be the $2d$-dimensional histogram estimator of the joint density of two sets of $d$ consecutive observations separated by $a$ time points. We construct an estimator of $\beta_d(a)$ based on these two histograms. (While it is clearly possible to replace histograms with other density estimators, most notably KDEs, histograms are more convenient here both theoretically and computationally; see Section 5 for more details.) Define
$$\widehat{\beta}_d(a) = \frac{1}{2}\int \left| \widehat{f}^{2d}_a - \widehat{f}_d \otimes \widehat{f}_d \right|; \qquad (4)$$
a computational sketch of this estimator appears at the end of this section. We show that, by allowing $d = d_n$ to grow with $n$, this estimator converges on $\beta(a)$. This can be seen most clearly by bounding the $L_1$-risk of the estimator by its estimation and approximation errors:
$$\left|\widehat{\beta}_{d_n}(a) - \beta(a)\right| \leq \left|\widehat{\beta}_{d_n}(a) - \beta_{d_n}(a)\right| + \left|\beta_{d_n}(a) - \beta(a)\right|.$$
The first term is the error of estimating $\beta_d(a)$ with a random sample of data. The second term is the non-stochastic error induced by approximating the infinite-dimensional coefficient $\beta(a)$ with its $d$-dimensional counterpart $\beta_d(a)$.

Our first theorem establishes consistency of $\widehat{\beta}_{d_n}(a)$ as an estimator of $\beta(a)$ for all $\beta$-mixing processes, provided $d_n$ increases at an appropriate rate. Theorem 2.4 gives finite sample bounds on the estimation error, while measure-theoretic arguments in Section 4 show that the approximation error must go to zero as $d_n \to \infty$.

Theorem 2.3. Let $X_1^n$ be a sample from an arbitrary $\beta$-mixing process. Let $d_n = O(\exp\{W(\log n)\})$, where $W$ is the Lambert W function. (The Lambert W function is the multivalued inverse of $f(w) = w\exp\{w\}$; thus $O(\exp\{W(\log n)\})$ is bigger than $O(\log\log n)$ but smaller than $O(\log n)$. See, for example, Corless et al. [5].) Then $\widehat{\beta}_{d_n}(a) \xrightarrow{P} \beta(a)$ as $n \to \infty$.

A finite sample bound for the estimation error is the first step to establishing consistency of $\widehat{\beta}_{d_n}$. The following result gives convergence rates for estimation of the finite-dimensional mixing coefficient $\beta_d(a)$, and hence also for Markov processes of known order $d$, since in that case $\beta_d(a) = \beta(a)$.

Theorem 2.4. Consider a sample $X_1^n$ from a stationary $\beta$-mixing process. Let $\mu_n$ and $m_n$ be positive integers such that $2\mu_nm_n = n$ and $\mu_n \geq d > 0$. Then
$$P\left(\left|\widehat{\beta}_d(a) - \beta_d(a)\right| > \epsilon\right) \leq 2\exp\left\{-\frac{\mu_n\epsilon_1^2}{2}\right\} + 2\exp\left\{-\frac{\mu_n\epsilon_2^2}{2}\right\} + 4(\mu_n - 1)\beta(m_n),$$
where $\epsilon_1 = \epsilon/2 - E\left[\int|\widehat{f}_d - f_d|\right]$ and $\epsilon_2 = \epsilon - E\left[\int|\widehat{f}^{2d}_a - f^{2d}_a|\right]$.

Consistency of the estimator $\widehat{\beta}_d(a)$ is guaranteed only for certain choices of $m_n$ and $\mu_n$. Clearly, $\mu_n \to \infty$ and $\mu_n\beta(m_n) \to 0$ as $n \to \infty$ are necessary conditions. Consistency also requires convergence of the histogram estimators to the target densities. We leave the proof of this theorem for Section 4. As an example showing that this bound can go to zero with proper choices of $m_n$ and $\mu_n$, the following corollary proves consistency for first-order Markov processes; consistency for higher-order Markov processes can be proven similarly. These processes are algebraically $\beta$-mixing, as shown in, e.g., Nummelin and Tuominen [18].

Corollary 2.5. Let $X_1^n$ be a sample from a first-order Markov process with $\beta(a) = \beta_1(a) = O(a^{-r})$. Then under the conditions of Theorem 2.4, $\widehat{\beta}_1(a) \xrightarrow{P} \beta(a)$.

Proof. Recall that $n = 2\mu_nm_n$, so $\mu_n = n/(2m_n)$. Then
$$4(\mu_n - 1)\beta(m_n) \leq 4\mu_n\beta(m_n) = K_1\frac{n}{m_n}m_n^{-r} = K_1nm_n^{-(1+r)} \to 0$$
for a constant $K_1$, provided $m_n$ grows faster than $n^{1/(1+r)}$; take $m_n = n^{1/(1+r)}\log n$, say. For this choice, $\mu_n = n^{r/(1+r)}/(2\log n) \to \infty$, and both exponential terms are at most
$$\exp\left\{-K_2\frac{n^{r/(1+r)}}{\log n}\epsilon_j^2\right\}$$
for $j = 1, 2$ and a constant $K_2$, so they also go to 0 as $n \to \infty$.
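Because histogram densities are piecewise constant on a common grid, the integral in equation (4) reduces to a sum over cells of differences of bin probabilities. The following Python sketch (our own construction; `beta_hat`, its interface, and the binning choices are illustrative assumptions, not the paper's implementation) computes $\widehat{\beta}_d(a)$ for a scalar time series:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def beta_hat(x, a, d=1, n_bins=10):
    """Sketch of the estimator in equation (4): half the L1 distance between
    the 2d-dimensional histogram of pairs of d-blocks separated by lag a and
    the product of the d-dimensional marginal histogram with itself."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)  # shared per-axis grid
    blocks = sliding_window_view(x, d)                 # all length-d blocks
    # Pair each block ending at time t with the block starting at t + a.
    m = n - 2 * d - a + 2
    pairs = np.hstack([blocks[:m], blocks[d - 1 + a : d - 1 + a + m]])
    p_d, _ = np.histogramdd(blocks, bins=[edges] * d)
    p_2d, _ = np.histogramdd(pairs, bins=[edges] * (2 * d))
    p_d /= p_d.sum()
    p_2d /= p_2d.sum()
    # Equal cell volumes cancel, so the integral is a sum over cells.
    return 0.5 * np.abs(p_2d - np.multiply.outer(p_d, p_d)).sum()
```

Note that the dense $2d$-dimensional array has `n_bins ** (2 * d)` cells, so for the slowly growing $d_n$ of Theorem 2.3 a sparse count of occupied cells would be needed in practice.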
Proving Theorem 2.4 requires showing the $L_1$ convergence of the histogram density estimator with $\beta$-mixing data. We do this in the next section.

3 $L_1$ convergence of histograms

Convergence of density estimators is thoroughly studied in the statistics and machine learning literature. Early papers on the $L_\infty$ convergence of kernel density estimators (KDEs) include [25, 2, 21]; Freedman and Diaconis [9] look specifically at histogram estimators, and Yu [26] considers the $L_\infty$ convergence of KDEs for $\beta$-mixing data, showing that the optimal IID rates can be attained. Devroye and Györfi [7] argue that $L_1$ is a more appropriate metric for studying density estimation, and Tran [22] proves $L_1$ consistency of KDEs under $\alpha$- and $\beta$-mixing. As far as we are aware, ours is the first proof of $L_1$ convergence for histograms under $\beta$-mixing.

Additionally, the dimensionality of the target density is analogous to the order of the Markov approximation. Therefore, the convergence rates we give are asymptotic not only in the bandwidth $h_n$, which shrinks as $n$ increases, but also in the dimension $d$, which increases with $n$. Even under these asymptotics, histogram estimation in this sense is not a high-dimensional problem: the dimension of the target density considered here is on the order of $\exp\{W(\log n)\}$, a rate somewhere between $\log n$ and $\log\log n$.

Theorem 3.1. If $\widehat{f}$ is the histogram estimator based on a (possibly vector-valued) sample $X_1^n$ from a $\beta$-mixing sequence with stationary density $f$, then for all $\epsilon > E\left[\int|\widehat{f} - f|\right]$,
$$P\left(\int|\widehat{f} - f| > \epsilon\right) \leq 2\exp\left\{-\frac{\mu_n\epsilon_1^2}{2}\right\} + 2(\mu_n - 1)\beta(m_n), \qquad (5)$$
where $\epsilon_1 = \epsilon - E\left[\int|\widehat{f} - f|\right]$.

To prove this result, we use the blocking method of Yu [27] to transform the dependent $\beta$-mixing sequence into a sequence of nearly independent blocks. We then apply McDiarmid's inequality to the blocks to derive asymptotics in the bandwidth of the histogram as well as in the dimension of the target density. For completeness, we state Yu's blocking result and McDiarmid's inequality before proving the doubly asymptotic histogram convergence for IID data. Combining these lemmas allows us to derive rates of convergence for histograms based on $\beta$-mixing inputs.

Lemma 3.2 (Lemma 4.1 in Yu [27]). Let $\phi$ be a measurable function with respect to the block sequence $U$, uniformly bounded by $M$. Then
$$\left|E[\phi] - \widetilde{E}[\phi]\right| \leq M\beta(m_n)(\mu_n - 1), \qquad (6)$$
where the first expectation is with respect to the dependent block sequence $U$, and $\widetilde{E}$ is with respect to the independent sequence $U'$.

This lemma essentially gives a method for applying IID results to $\beta$-mixing data. Because the dependence decays as we increase the separation between blocks, widely spaced blocks are nearly independent of each other. In particular, the difference between expectations over these nearly independent blocks and expectations over blocks which are actually independent can be controlled by the $\beta$-mixing coefficient.

Lemma 3.3 (McDiarmid's inequality [13]). Let $X_1, \ldots, X_n$ be independent random variables, with $X_i$ taking values in a set $A_i$ for each $i$. Suppose that the measurable function $f: \prod A_i \to \mathbb{R}$ satisfies $|f(x) - f(x')| \leq c_i$ whenever the vectors $x$ and $x'$ differ only in the $i$th coordinate. Then for any $\epsilon > 0$,
$$P(f - Ef > \epsilon) \leq \exp\left\{-\frac{2\epsilon^2}{\sum c_i^2}\right\}.$$

Lemma 3.4. For an IID sample $X_1, \ldots, X_n$ from some density $f$ on $\mathbb{R}^d$,
$$E\int\left|\widehat{f} - E\widehat{f}\right|dx = O\left(1\Big/\sqrt{nh_n^d}\right), \qquad (7)$$
$$\int\left|E\widehat{f} - f\right|dx = O(dh_n) + O(d^2h_n^2), \qquad (8)$$
where $\widehat{f}$ is the histogram estimate using a grid with sides of length $h_n$.
Proof of Lemma 3.4. Let $p_j$ be the probability of falling into the $j$th bin $B_j$. Then
$$E\int\left|\widehat{f} - E\widehat{f}\right| = h_n^d\sum_{j=1}^J E\left|\frac{1}{nh_n^d}\sum_{i=1}^n I(X_i \in B_j) - \frac{p_j}{h_n^d}\right| \leq h_n^d\sum_{j=1}^J\frac{1}{nh_n^d}\sqrt{V\left[\sum_{i=1}^n I(X_i \in B_j)\right]} = h_n^d\sum_{j=1}^J\frac{1}{nh_n^d}\sqrt{np_j(1-p_j)} = \frac{1}{\sqrt{n}}\sum_{j=1}^J\sqrt{p_j(1-p_j)} = O(n^{-1/2})\,O(h_n^{-d/2}) = O\left(1\Big/\sqrt{nh_n^d}\right).$$

For the second claim, consider the bin $B_j$ centered at $c$. Let $I$ be the union of all bins $B_j$. Assume the following:

1. $f \in L_2$ and $f$ is absolutely continuous on $I$, with a.e. partial derivatives $f_i = \frac{\partial}{\partial y_i}f(y)$;
2. $f_i \in L_2$ and $f_i$ is absolutely continuous on $I$, with a.e. partial derivatives $f_{ik} = \frac{\partial}{\partial y_k}f_i(y)$;
3. $f_{ik} \in L_2$ for all $i$, $k$.

Using a Taylor expansion,
$$f(x) = f(c) + \sum_{i=1}^d (x_i - c_i)f_i(c) + O(d^2h_n^2),$$
where $f_i(y) = \frac{\partial}{\partial y_i}f(y)$. Therefore, $p_j$ is given by
$$p_j = \int_{B_j} f(x)\,dx = h_n^df(c) + O(d^2h_n^{d+2}),$$
since the integral of the second term over the bin is zero. This means that for the $j$th bin,
$$E\widehat{f}_n(x) - f(x) = \frac{p_j}{h_n^d} - f(x) = -\sum_{i=1}^d(x_i - c_i)f_i(c) + O(d^2h_n^2).$$
Therefore,
$$\int_{B_j}\left|E\widehat{f}_n(x) - f(x)\right| \leq \int_{B_j}\left|\sum_{i=1}^d(x_i - c_i)f_i(c)\right| + \int_{B_j}O(d^2h_n^2) = O(dh_n^{d+1}) + O(d^2h_n^{2+d}).$$
Since each bin is bounded, we can sum over all $J$ bins. The number of bins is $J = h_n^{-d}$ by definition, so
$$\int\left|E\widehat{f}_n(x) - f(x)\right|dx = O(h_n^{-d})\left[O(dh_n^{d+1}) + O(d^2h_n^{2+d})\right] = O(dh_n) + O(d^2h_n^2).$$
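As an informal check of the rates in Lemma 3.4 (our experiment, not from the paper), the following sketch estimates the $L_1$ error for IID standard normal data with $d = 1$ and the bandwidth $h_n = n^{-1/3}$ that balances (7) against (8):

```python
import numpy as np
from scipy.stats import norm

def hist_l1_error(n, rng):
    """L1 error of a histogram density estimate from n IID N(0,1) draws,
    with bandwidth h = n**(-1/3) balancing (7) and (8) for d = 1."""
    h = n ** (-1 / 3)
    edges = np.arange(-4.0, 4.0 + h, h)
    counts, _ = np.histogram(rng.standard_normal(n), bins=edges)
    f_hat = counts / (n * h)
    # Riemann-sum approximation of the integral over [-4, 4].
    grid = np.linspace(-4.0, 4.0 - 1e-9, 4000)
    idx = np.clip(np.searchsorted(edges, grid, side="right") - 1, 0, len(f_hat) - 1)
    return np.mean(np.abs(f_hat[idx] - norm.pdf(grid))) * 8.0

rng = np.random.default_rng(1)
for n in (10**3, 10**4, 10**5):
    print(n, round(hist_l1_error(n, rng), 4))
# The error shrinks roughly like n**(-1/3), consistent with Lemma 3.4.
```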
We can now prove the main result of this section.

Proof of Theorem 3.1. Let $g$ be the $L_1$ loss of the histogram estimator, $g = \int|f - \widehat{f}_n|$. Here $\widehat{f}_n(x) = \frac{1}{nh_n^d}\sum_{i=1}^n I(X_i \in B_{j(x)})$, where $B_{j(x)}$ is the bin containing $x$. Let $\widehat{f}_U$, $\widehat{f}_V$, and $\widehat{f}_{U'}$ be histograms based on the block sequences $U$, $V$, and $U'$ respectively. Clearly $\widehat{f}_n = \frac{1}{2}(\widehat{f}_U + \widehat{f}_V)$. Now,
$$P(g > \epsilon) = P\left(\int|f - \widehat{f}_n| > \epsilon\right) = P\left(\int\left|\frac{f - \widehat{f}_U}{2} + \frac{f - \widehat{f}_V}{2}\right| > \epsilon\right) \leq P\left(\frac{1}{2}\int|f - \widehat{f}_U| + \frac{1}{2}\int|f - \widehat{f}_V| > \epsilon\right) = P(g_U + g_V > 2\epsilon) \leq P(g_U > \epsilon) + P(g_V > \epsilon) = 2P(g_U - E[g_U] > \epsilon - E[g_U]) = 2P(g_U - E[g_{U'}] > \epsilon - E[g_{U'}]) = 2P(g_U - E[g_{U'}] > \epsilon_1),$$
where $\epsilon_1 = \epsilon - E[g_{U'}]$. Here,
$$E[g_{U'}] \leq \widetilde{E}\int\left|\widehat{f}_{U'} - \widetilde{E}\widehat{f}_{U'}\right|dx + \int\left|\widetilde{E}\widehat{f}_{U'} - f\right|dx,$$
so by Lemma 3.4, as long as $\mu_n \to \infty$, $h_n \downarrow 0$, and $\mu_nh_n^d \to \infty$, then for all $\epsilon$ there exists $n_0(\epsilon)$ such that for all $n > n_0(\epsilon)$, $\epsilon > E[g] = E[g_{U'}]$. Now applying Lemma 3.2 to the expectation of the indicator of the event $\{g_U - E[g_{U'}] > \epsilon_1\}$ gives
$$2P(g_U - E[g_{U'}] > \epsilon_1) \leq 2P(g_{U'} - E[g_{U'}] > \epsilon_1) + 2(\mu_n - 1)\beta(m_n),$$
where the probability on the right is for the $\sigma$-field generated by the independent block sequence $U'$. Since these blocks are independent, showing that $g_{U'}$ satisfies the bounded differences requirement allows for the application of McDiarmid's inequality (Lemma 3.3) to the blocks. For any two block sequences $u'_1, \ldots, u'_{\mu_n}$ and $\bar{u}'_1, \ldots, \bar{u}'_{\mu_n}$ with $u'_\ell = \bar{u}'_\ell$ for all $\ell \neq j$,
$$g_{U'}(u'_1, \ldots, u'_{\mu_n}) - g_{U'}(\bar{u}'_1, \ldots, \bar{u}'_{\mu_n}) = \int\left|\widehat{f}(y; u'_1, \ldots, u'_{\mu_n}) - f(y)\right|dy - \int\left|\widehat{f}(y; \bar{u}'_1, \ldots, \bar{u}'_{\mu_n}) - f(y)\right|dy \leq \int\left|\widehat{f}(y; u'_1, \ldots, u'_{\mu_n}) - \widehat{f}(y; \bar{u}'_1, \ldots, \bar{u}'_{\mu_n})\right|dy = \frac{2}{\mu_nh_n^d}h_n^d = \frac{2}{\mu_n}.$$
Therefore,
$$P(g > \epsilon) \leq 2P(g_{U'} - E[g_{U'}] > \epsilon_1) + 2(\mu_n - 1)\beta(m_n) \leq 2\exp\left\{-\frac{\mu_n\epsilon_1^2}{2}\right\} + 2(\mu_n - 1)\beta(m_n).$$

4 Proofs

The proof of Theorem 2.4 relies on the triangle inequality and the relationship between total variation distance and the $L_1$ distance between densities.

Proof of Theorem 2.4. For any probability measures $\nu$ and $\lambda$ defined on the same probability space with associated densities $f_\nu$ and $f_\lambda$ with respect to some dominating measure $\pi$,
$$\|\nu - \lambda\|_{TV} = \frac{1}{2}\int|f_\nu - f_\lambda|\,d\pi.$$
Let $P$ be the $d$-dimensional stationary distribution of the $d$th-order Markov process, i.e. $P = P_{t-d+1}^t = P_{t+a}^{t+a+d-1}$ in the notation of equation (3). Let $P_{a,d}$ be the joint distribution of the bivariate random process created by the initial process and itself separated by $a$ time steps. By the triangle inequality, we can upper bound $\beta_d(a)$ for any $d = d_n$. Let $\widehat{P}$ and $\widehat{P}_{a,d}$ be the distributions associated with the histogram estimators $\widehat{f}_d$ and $\widehat{f}^{2d}_a$ respectively. Then,
$$\beta_d(a) = \left\|P \otimes P - P_{a,d}\right\|_{TV} \leq \left\|P \otimes P - \widehat{P} \otimes \widehat{P}\right\|_{TV} + \left\|\widehat{P} \otimes \widehat{P} - \widehat{P}_{a,d}\right\|_{TV} + \left\|\widehat{P}_{a,d} - P_{a,d}\right\|_{TV} \leq 2\left\|P - \widehat{P}\right\|_{TV} + \left\|\widehat{P} \otimes \widehat{P} - \widehat{P}_{a,d}\right\|_{TV} + \left\|\widehat{P}_{a,d} - P_{a,d}\right\|_{TV} = \int|f_d - \widehat{f}_d| + \frac{1}{2}\int\left|\widehat{f}_d \otimes \widehat{f}_d - \widehat{f}^{2d}_a\right| + \frac{1}{2}\int\left|f^{2d}_a - \widehat{f}^{2d}_a\right|,$$
where $\frac{1}{2}\int|\widehat{f}_d \otimes \widehat{f}_d - \widehat{f}^{2d}_a|$ is our estimator $\widehat{\beta}_d(a)$ and the remaining terms are the $L_1$ distances between a density estimator and the target density. Thus,
$$\beta_d(a) - \widehat{\beta}_d(a) \leq \int|f_d - \widehat{f}_d| + \frac{1}{2}\int\left|f^{2d}_a - \widehat{f}^{2d}_a\right|.$$
A similar argument starting from $\widehat{\beta}_d(a)$ shows that
$$\beta_d(a) - \widehat{\beta}_d(a) \geq -\int|f_d - \widehat{f}_d| - \frac{1}{2}\int\left|f^{2d}_a - \widehat{f}^{2d}_a\right|,$$
so we have that
$$\left|\beta_d(a) - \widehat{\beta}_d(a)\right| \leq \int|f_d - \widehat{f}_d| + \frac{1}{2}\int\left|f^{2d}_a - \widehat{f}^{2d}_a\right|.$$
Therefore,
$$P\left(\left|\beta_d(a) - \widehat{\beta}_d(a)\right| > \epsilon\right) \leq P\left(\int|f_d - \widehat{f}_d| > \frac{\epsilon}{2}\right) + P\left(\frac{1}{2}\int\left|f^{2d}_a - \widehat{f}^{2d}_a\right| > \frac{\epsilon}{2}\right) \leq 2\exp\left\{-\frac{\mu_n\epsilon_1^2}{2}\right\} + 2\exp\left\{-\frac{\mu_n\epsilon_2^2}{2}\right\} + 4(\mu_n - 1)\beta(m_n),$$
where $\epsilon_1 = \epsilon/2 - E\left[\int|\widehat{f}_d - f_d|\right]$ and $\epsilon_2 = \epsilon - E\left[\int|\widehat{f}^{2d}_a - f^{2d}_a|\right]$, applying Theorem 3.1 to each term.
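To see the roles of the three terms in this bound, the following sketch (our illustration; it assumes an algebraic mixing rate $\beta(m) = m^{-r}$ and treats the expected $L_1$ histogram errors inside $\epsilon_1$ and $\epsilon_2$ as user-supplied constants) evaluates the right-hand side of Theorem 2.4 for a few block lengths:

```python
import numpy as np

def thm24_bound(n, m_n, eps, r=2.0, bias_d=0.0, bias_2d=0.0):
    """Right-hand side of Theorem 2.4 under illustrative assumptions:
    beta(m) = m**(-r), with the expected L1 histogram errors that enter
    eps_1 and eps_2 supplied by hand."""
    mu_n = n // (2 * m_n)
    eps1 = eps / 2 - bias_d
    eps2 = eps - bias_2d
    tail = 2 * np.exp(-mu_n * eps1**2 / 2) + 2 * np.exp(-mu_n * eps2**2 / 2)
    return tail + 4 * (mu_n - 1) * m_n ** (-r)

for m_n in (10, 50, 250):
    print(m_n, thm24_bound(n=100_000, m_n=m_n, eps=0.1))
# Larger m_n shrinks the blocking penalty 4(mu_n - 1)beta(m_n) but reduces
# mu_n and so weakens the exponential terms: the trade-off that the proof
# of Corollary 2.5 balances.
```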
The proof of Theorem 2.3 requires two steps, which are given in the following lemmas. The first specifies the histogram bandwidth $h_n$ and the rate at which $d_n$ (the dimensionality of the target density) goes to infinity. If the dimensionality of the target density were fixed, we could achieve rates of convergence similar to those for histograms based on IID inputs. However, we wish to allow the dimensionality to grow with $n$, so the rates are much slower, as shown in the following lemma.

Lemma 4.1. For the histogram estimator in Lemma 3.4, let $d_n \sim \exp\{W(\log n)\}$ and $h_n \sim n^{-k_n}$, with
$$k_n = \frac{W(\log n) + \frac{1}{2}\log n}{\log n\left(\frac{1}{2}\exp\{W(\log n)\} + 1\right)}.$$
These choices lead to the optimal rate of convergence.

Proof. Let $h_n = n^{-k_n}$ for some $k_n$ to be determined. Then we want $n^{-1/2}h_n^{-d_n/2} = n^{(k_nd_n - 1)/2} \to 0$, $d_nh_n = d_nn^{-k_n} \to 0$, and $d_n^2h_n^2 = d_n^2n^{-2k_n} \to 0$, all as $n \to \infty$. Call these A, B, and C. Taking A and B first gives
$$n^{(k_nd_n - 1)/2} \sim d_nn^{-k_n} \;\Rightarrow\; \tfrac{1}{2}(k_nd_n - 1)\log n \sim \log d_n - k_n\log n \;\Rightarrow\; k_n\log n\left(\tfrac{1}{2}d_n + 1\right) \sim \log d_n + \tfrac{1}{2}\log n \;\Rightarrow\; k_n \sim \frac{\log d_n + \frac{1}{2}\log n}{\log n\left(\frac{1}{2}d_n + 1\right)}. \qquad (9)$$
Similarly, combining A and C gives
$$k_n \sim \frac{2\log d_n + \frac{1}{2}\log n}{\log n\left(\frac{1}{2}d_n + 2\right)}. \qquad (10)$$
Equating (9) and (10) and solving for $d_n$ gives $d_n \sim \exp\{W(\log n)\}$, where $W(\cdot)$ is the Lambert W function. Plugging back into (9) gives $h_n = n^{-k_n}$, where
$$k_n = \frac{W(\log n) + \frac{1}{2}\log n}{\log n\left(\frac{1}{2}\exp\{W(\log n)\} + 1\right)}.$$
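Numerically, the schedule of Lemma 4.1 is easy to evaluate with the principal branch of the Lambert W function (a sketch using `scipy.special.lambertw`; the function name `schedule` is ours):

```python
import numpy as np
from scipy.special import lambertw

def schedule(n):
    """Dimension d_n and bandwidth h_n from Lemma 4.1 (principal branch W0)."""
    logn = np.log(n)
    w = lambertw(logn).real          # scipy returns a complex value
    d_n = np.exp(w)                  # d_n ~ exp{W(log n)}
    k_n = (w + 0.5 * logn) / (logn * (0.5 * d_n + 1.0))
    return d_n, n ** (-k_n)

for n in (10**3, 10**6, 10**9):
    d_n, h_n = schedule(n)
    print(f"n = {n:>10}: d_n ~ {d_n:5.2f}, h_n ~ {h_n:.3f}")
# d_n grows very slowly, between log log n and log n, as noted after Theorem 2.3.
```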
It is also necessary to show that as $d$ grows, $\beta_d(a) \to \beta(a)$. We now prove this result.

Lemma 4.2. $\beta_d(a)$ converges to $\beta(a)$ as $d \to \infty$.

Proof. By stationarity, the supremum over $t$ is unnecessary in Definition 2.1, so without loss of generality let $t = 0$. Let $P_{-\infty}^0$ be the distribution on $\sigma_{-\infty}^0 = \sigma(\ldots, X_{-1}, X_0)$, and let $P_a^{\infty}$ be the distribution on $\sigma_{a+1}^{\infty} = \sigma(X_{a+1}, X_{a+2}, \ldots)$. Let $P_a$ be the distribution on $\sigma = \sigma_{-\infty}^0 \otimes \sigma_{a+1}^{\infty}$ (the product sigma-field). Then we can rewrite Definition 2.1 using this notation as
$$\beta(a) = \sup_{C \in \sigma}\left|P_a(C) - [P_{-\infty}^0 \otimes P_a^{\infty}](C)\right|.$$
Let $\sigma_{-d+1}^0$ and $\sigma_{a+1}^{a+d}$ be the sub-$\sigma$-fields of $\sigma_{-\infty}^0$ and $\sigma_{a+1}^{\infty}$ consisting of the $d$-dimensional cylinder sets for the $d$ dimensions closest together. Let $\sigma_d$ be the product $\sigma$-field of these two. Then we can rewrite $\beta_d(a)$ as
$$\beta_d(a) = \sup_{C \in \sigma_d}\left|P_a(C) - [P_{-\infty}^0 \otimes P_a^{\infty}](C)\right|. \qquad (11)$$
As such, $\beta_d(a) \leq \beta(a)$ for all $a$ and $d$. We can rewrite (11) in terms of finite-dimensional marginals:
$$\beta_d(a) = \sup_{C \in \sigma_d}\left|P_{a,d}(C) - [P_{-d}^0 \otimes P_a^{a+d}](C)\right|,$$
where $P_{a,d}$ is the restriction of $P$ to $\sigma(X_{-d}, \ldots, X_0, X_a, \ldots, X_{a+d})$. Because of the nested nature of these sigma-fields, we have $\beta_{d_1}(a) \leq \beta_{d_2}(a) \leq \beta(a)$ for all finite $d_1 \leq d_2$. Therefore, for fixed $a$, $\{\beta_d(a)\}_{d=1}^{\infty}$ is a monotone increasing sequence which is bounded above, and it converges to some limit $L \leq \beta(a)$. To show that $L = \beta(a)$ requires some additional steps.

Let $R = P_a - [P_{-\infty}^0 \otimes P_a^{\infty}]$, which is a signed measure on $\sigma$. Let $R_d = P_{a,d} - [P_{-d}^0 \otimes P_a^{a+d}]$, which is a signed measure on $\sigma_d$. Decompose $R$ into positive and negative parts as $R = Q^+ - Q^-$, and similarly $R_d = Q_d^+ - Q_d^-$. Notice that since $R_d$ is constructed using the marginals of $P$, we have $R(E) = R_d(E)$ for all $E \in \sigma_d$. Now, since $R$ is the difference of probability measures, we must have
$$0 = R(\Omega) = Q^+(\Omega) - Q^-(\Omega) = Q^+(D) + Q^+(D^c) - Q^-(D) - Q^-(D^c) \qquad (12)$$
for all $D \in \sigma$. Define $Q = Q^+ + Q^-$. Let $\epsilon > 0$, and let $C \in \sigma$ be such that
$$Q(C) = \beta(a) = Q^+(C) = Q^-(C^c). \qquad (13)$$
Such a set $C$ is guaranteed by the Hahn decomposition theorem (letting $C^*$ be a set which attains the supremum in (11), we can throw away any subsets with negative $R$-measure) together with (12), assuming without loss of generality that $P_a(C) > [P_{-\infty}^0 \otimes P_a^{\infty}](C)$. We can use the field $\sigma_f = \bigcup_d \sigma_d$ to approximate $\sigma$ in the sense that, for all $\epsilon$, we can find $A \in \sigma_f$ such that $Q(A\,\Delta\,C) < \epsilon/2$ (see Theorem D in Halmos [10, Section 13] or Lemma A.24 in Schervish [20]). Now,
$$Q(A\,\Delta\,C) = Q(A \cap C^c) + Q(C \cap A^c) = Q^-(A \cap C^c) + Q^+(C \cap A^c)$$
by (13), since $A \cap C^c \subseteq C^c$ and $C \cap A^c \subseteq C$. Therefore, since $Q(A\,\Delta\,C) < \epsilon/2$, we have
$$Q^-(A \cap C^c) \leq \epsilon/2, \qquad Q^+(A^c \cap C) \leq \epsilon/2. \qquad (14)$$
Also,
$$Q(C) = Q(A \cap C) + Q(A^c \cap C) = Q^+(A \cap C) + Q^+(A^c \cap C) \leq Q^+(A) + \epsilon/2,$$
since $A \cap C$ and $A^c \cap C$ are contained in $C$, and $A \cap C \subseteq A$. Therefore $Q^+(A) \geq Q(C) - \epsilon/2$. Similarly,
$$Q^-(A) = Q^-(A \cap C) + Q^-(A \cap C^c) \leq 0 + \epsilon/2 = \epsilon/2,$$
since $A \cap C \subseteq C$ and $Q^-(C) = 0$, using (14). Finally,
$$Q_d^+(A) \geq Q_d^+(A) - Q_d^-(A) = R_d(A) = R(A) = Q^+(A) - Q^-(A) \geq Q(C) - \epsilon/2 - \epsilon/2 = Q(C) - \epsilon = \beta(a) - \epsilon.$$
And since $\beta_d(a) \geq Q_d^+(A)$, we have that for all $\epsilon > 0$ there exists $d$ such that for all $d_1 > d$,
$$\beta_{d_1}(a) \geq \beta_d(a) \geq Q_d^+(A) \geq \beta(a) - \epsilon.$$
Thus we must have $L = \beta(a)$, so $\beta_d(a) \to \beta(a)$ as desired.

Proof of Theorem 2.3. By the triangle inequality,
$$\left|\widehat{\beta}_{d_n}(a) - \beta(a)\right| \leq \left|\widehat{\beta}_{d_n}(a) - \beta_{d_n}(a)\right| + \left|\beta_{d_n}(a) - \beta(a)\right|.$$
The first term on the right is bounded by the result in Theorem 2.4, where we have shown that $d_n = O(\exp\{W(\log n)\})$ is slow enough for the histogram estimator to remain consistent. That $\beta_{d_n}(a) \to \beta(a)$ as $d_n \to \infty$ follows from Lemma 4.2.
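As a quick sanity check of the consistency claim (our experiment, not reported in the paper), one can compare estimates on independent and on strongly dependent data, reusing the `beta_hat` sketch from Section 2:

```python
# Estimates should sit near zero at every lag for IID data and decay in a
# for a dependent AR(1) process (which is beta-mixing).
import numpy as np

rng = np.random.default_rng(2)
iid = rng.standard_normal(50_000)
ar = np.zeros(50_000)
for t in range(1, len(ar)):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()
for a in (1, 10, 100):
    print(a, round(beta_hat(iid, a, d=1, n_bins=20), 3),
             round(beta_hat(ar, a, d=1, n_bins=20), 3))
# The IID estimates are small but nonzero: the plug-in estimator inherits
# the sampling error of the histograms, which is why Theorem 2.4 subtracts
# the expected L1 errors inside eps_1 and eps_2.
```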
5 Discussion

We have shown that our estimator of the $\beta$-mixing coefficients is consistent for the true coefficients $\beta(a)$ under some conditions on the data-generating process. There are numerous results in the statistics and machine learning literatures which assume knowledge of the $\beta$-mixing coefficients, yet as far as we know, this is the first estimator for them. An ability to estimate these coefficients will allow researchers to apply existing results to dependent data without the need to arbitrarily assume their values.

Despite the obvious utility of this estimator, as a consequence of its novelty, it comes with a number of potential extensions which warrant careful exploration, as well as some drawbacks.

The reader will note that Theorem 2.3 does not provide a convergence rate. The rate in Theorem 2.4 applies only to the difference between $\widehat{\beta}_d(a)$ and $\beta_d(a)$. In order to provide a rate in Theorem 2.3, we would need a better understanding of the non-stochastic convergence of $\beta_d(a)$ to $\beta(a)$. It is not immediately clear that this quantity can converge at any well-defined rate. In particular, it seems likely that the rate of convergence depends on the tail of the sequence $\{\beta(a)\}_{a=1}^{\infty}$.

Several other mixing and weak-dependence coefficients also have a total-variation flavor, perhaps most notably $\alpha$-mixing [8, 6, 3]. None of them have estimators, and the same trick might well work for them, too.

The use of histograms rather than kernel density estimators for the joint and marginal densities is somewhat surprising and not entirely necessary. As mentioned above, Tran [22] proved that KDEs are consistent for estimating the stationary density of a time series with $\beta$-mixing inputs, so one could simply replace the histograms in our estimator with KDEs. However, KDEs suffer from two major issues. Theoretically, we need an analogue of the doubly asymptotic results proven for histograms in Lemma 3.4; in particular, we need to estimate increasingly higher-dimensional densities as $n \to \infty$. This does not cause a problem of small-$n$-large-$d$, since $d$ is chosen as a function of $n$, but it will lead to increasingly higher-dimensional integration. For histograms, the integral is always trivial, but in the case of KDEs, the numerical accuracy of the integration algorithm becomes increasingly important. This issue could swamp any efficiency gains obtained through the use of kernels. However, this question certainly warrants further investigation.

The main drawback of an estimator based on a density estimate is its complexity. The mixing coefficients are functionals of the joint and marginal distributions derived from the stochastic process $X$; however, it is unsatisfying to estimate densities and solve integrals in order to estimate a single number. Vapnik's main principle for solving problems using a restricted amount of information is: "When solving a given problem, try to avoid solving a more general problem as an intermediate step" [23, p. 30]. This principle is clearly violated here, but perhaps our seed will precipitate a more aesthetically pleasing solution.

References

[1] Baraud, Y., Comte, F., and Viennet, G. (2001), "Adaptive estimation in autoregression or β-mixing regression via model selection," Annals of Statistics, 29, 839-875.
[2] Bickel, P. and Rosenblatt, M. (1973), "On Some Global Measures of the Deviations of Density Function Estimates," The Annals of Statistics, 1, 1071-1095.
[3] Bradley, R. C. (2005), "Basic Properties of Strong Mixing Conditions. A Survey and Some Open Questions," Probability Surveys, 2, 107-144.
[4] Carrasco, M. and Chen, X. (2002), "Mixing and Moment Properties of Various GARCH and Stochastic Volatility Models," Econometric Theory, 18, 17-39.
[5] Corless, R., Gonnet, G., Hare, D., Jeffrey, D., and Knuth, D. (1996), "On the Lambert W Function," Advances in Computational Mathematics, 5, 329-359.
[6] Dedecker, J., Doukhan, P., Lang, G., Leon R., J. R., Louhichi, S., and Prieur, C. (2007), Weak Dependence: With Examples and Applications, vol. 190 of Lecture Notes in Statistics, Springer Verlag, New York.
[7] Devroye, L. and Györfi, L. (1985), Nonparametric Density Estimation: The L1 View, Wiley, New York.
[8] Doukhan, P. (1994), Mixing: Properties and Examples, vol. 85 of Lecture Notes in Statistics, Springer Verlag, New York.
[9] Freedman, D. and Diaconis, P. (1981), "On the Maximum Deviation Between the Histogram and the Underlying Density," Probability Theory and Related Fields, 58, 139-167.
[10] Halmos, P. (1974), Measure Theory, Graduate Texts in Mathematics, Springer-Verlag, New York.
[11] Karandikar, R. L. and Vidyasagar, M. (2009), "Probably Approximately Correct Learning with Beta-Mixing Input Sequences," submitted for publication.
[12] Lozano, A., Kulkarni, S., and Schapire, R. (2006), "Convergence and Consistency of Regularized Boosting Algorithms with Stationary Beta-Mixing Observations," Advances in Neural Information Processing Systems, 18, 819.
[13] McDiarmid, C. (1989), "On the Method of Bounded Differences," in Surveys in Combinatorics, ed. J. Siemons, vol. 141 of London Mathematical Society Lecture Note Series, pp. 148-188, Cambridge University Press.
[14] Meir, R. (2000), "Nonparametric Time Series Prediction Through Adaptive Model Selection," Machine Learning, 39, 5-34.
[15] Mohri, M. and Rostamizadeh, A. (2010), "Stability Bounds for Stationary φ-mixing and β-mixing Processes," Journal of Machine Learning Research, 11, 789-814.
[16] Mokkadem, A. (1988), "Mixing properties of ARMA processes," Stochastic Processes and their Applications, 29, 309-315.
[17] Nobel, A. (2006), "Hypothesis Testing for Families of Ergodic Processes," Bernoulli, 12, 251-269.
[18] Nummelin, E. and Tuominen, P. (1982), "Geometric Ergodicity of Harris Recurrent Markov Chains with Applications to Renewal Theory," Stochastic Processes and Their Applications, 12, 187-202.
[19] Ralaivola, L., Szafranski, M., and Stempfel, G. (2010), "Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes," Journal of Machine Learning Research, 11, 1927-1956.
[20] Schervish, M. (1995), Theory of Statistics, Springer Series in Statistics, Springer Verlag, New York.
[21] Silverman, B. (1978), "Weak and Strong Uniform Consistency of the Kernel Estimate of a Density and its Derivatives," The Annals of Statistics, 6, 177-184.
[22] Tran, L. (1989), "The L1 Convergence of Kernel Density Estimates under Dependence," The Canadian Journal of Statistics/La Revue Canadienne de Statistique, 17, 197-208.
[23] Vapnik, V. (2000), The Nature of Statistical Learning Theory, Statistics for Engineering and Information Science, Springer Verlag, New York, 2nd edn.
[24] Vidyasagar, M. (1997), A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems, Springer Verlag, Berlin.
[25] Woodroofe, M. (1967), "On the Maximum Deviation of the Sample Density," The Annals of Mathematical Statistics, 38, 475-481.
[26] Yu, B. (1993), "Density Estimation in the L∞ Norm for Dependent Data with Applications to the Gibbs Sampler," Annals of Statistics, 21, 711-735.
[27] Yu, B. (1994), "Rates of Convergence for Empirical Processes of Stationary Mixing Sequences," The Annals of Probability, 22, 94-116.