Nonasymptotic bounds on the mean square error for MCMC estimates via renewal techniques

The Nummellin's split chain construction allows to decompose a Markov chain Monte Carlo (MCMC) trajectory into i.i.d. "excursions". RegenerativeMCMC algorithms based on this technique use a random number of samples. They have been proposed as a promi…

Authors: Krzysztof Latuszynski, Blazej Miasojedow, Wojciech Niemiro

Nonasymptotic bounds on t he mea n squar e erro r f or MCMC estimates via r enewal techniques Krzysztof Łatuszy ´ nski, Bła ˙ z ej Miasojedow and W ojciech Niemiro Abstract The Nummellin ’ s split chain construction allo ws to decompo se a Markov chain Monte Carlo (MCMC) trajectory into i.i.d. “excur sions”. Re generative MC MC algorithm s based on this tech nique use a random numb er of samples. They have been propo sed as a promising alternative to usual fixed length simulation [ 24, 3 2, 13]. In this note we der i ve nonasymptotic bou nds on the mean square error (MSE) of regenerative MCMC estimates via techn iques of renew al theory and sequen tial statistics. These results are applied to construct confidenc e intervals. W e then focus on tw o cases of par ticular interest: chains satisfying the Doeblin cond ition and a ge- ometric drift con dition. A vailable explicit nonasym ptotic results are comp ared for different schemes of MCMC simulation. 1 Intr oduction Consider a typ ical MCMC setting, where π is a p robability distribution on X and f : X → R a Borel measurable fun ction. The objecti ve is to compu te (estimate) the integral Krzysztof Łatuszy ´ nski Department of Statistics Univ ersity of W arwick CV4 7AL, Cov entry , UK, e-mail: latuch@gmail.com Bła ˙ zej Miasojedow Institute of Applied Mathematics and Mechanics Univ ersity of W arsaw Banacha 2, 02-097 W arszaw a, Poland, e-mail: bmia@mimuw .edu.pl W ojciech Niemiro Facu lty of Mathematics and Computer Science Nicolaus Copernicus Univ ersity Chopina 12/18, 87-100 T oru ´ n, Poland, e-mail: wniemiro@gmail.com 1 2 Krzysztof Łatuszy ´ nski, Bła ˙ zej Miasojedo w and W ojci ec h Niemiro θ : = π f = Z X π ( d x ) f ( x ) . (1) Assume that direct simu lation fro m π is intractable . Therefore one uses an ergodic Markov chain with transition kernel P and stationar y distribution π to sample ap- proxim ately fro m π . Numerous com putational p roblems from Bayesian in ference, statistical physics o r combinato rial enu meration fit into th is setting. W e refer to [31, 29, 9] for theory and application s of MCMC. Let ( X n ) n ≥ 0 be the M arkov chain in qu estion. T y pically one discar ds an initial part of the trajecto ry (called burn-in, say o f length t ) to redu ce bias, one simulates the chain for n furthe r steps and one approxim ates θ with an ergodic a verag e: ˆ θ fix t , n = 1 n t + n − 1 ∑ i = t f ( X i ) . (2) The fixed numbers t a nd n are the parameter s of the alg orithm. Asymptotic v alid ity of (2) is en sured by a Stron g Law of Large Num bers and a Centr al Limit Theore m (CL T). Under appropriate regularity con ditions [3 1, 4], it ho lds t hat √ n ( ˆ θ fix t , n − θ ) → N ( 0 , σ 2 as ( f )) , ( n → ∞ ) , (3) where σ 2 as ( f ) is called the asymp totic v ar iance. In contrast with the asymp totic the- ory , explicit nonasymptotic error bo unds for ˆ θ fix t , n appear t o be very difficult to deriv e in practically meaning ful problems. Regenerativ e simulation offers a way to get arou nd some of the difficulties. Th e split ch ain constructio n introdu ced in [2, 27] (to b e describe d in Section 2) allows for partitionin g the tra jectory ( X n ) n ≥ 0 into i.i.d. ran dom tours (excursions) between consecutive regeneration times T 0 , T 1 , T 2 , . . . . Rand om variables Ξ k ( f ) : = T k − 1 ∑ i = T k − 1 f ( X i ) (4) are i.i.d. for k = 1 , 2 , . . . ( Ξ 0 ( f ) can have a different distrib ution). Myk land et al. in [24] suggested a practically relev an t recipe for id entifying T 0 , T 1 , T 2 , . . . in simula- tions (formu la (2) in Section 2 ). This resolves the burn-in pr oblem since one can just igno re the part until the first regeneration T 0 . One can also sto p the simula tion at a regeneration time, say T r , and simulate r full i.i.d. tours, c.f. Section 4 of [32]. Thus one estimates θ by ˆ θ reg r : = 1 T r − T 0 T r − 1 ∑ i = T 0 f ( X i ) = ∑ r k = 1 Ξ k ( f ) ∑ r k = 1 τ k , (5) where τ k = T k − T k − 1 = Ξ k ( 1 ) are the len gths of excursions. The n umber of tour s r is fixed and the total simulation effort T r is rando m. Since ˆ θ reg r in volves i. i.d. rand om variables, classical tools seem to be sufficient to analyse its behaviour . Asymptoti- cally , (5) is equiv alen t to (2) because Nonasymptotic bounds for MCMC 3 √ rm ( ˆ θ reg r − θ ) → N ( 0 , σ 2 as ( f )) , ( r → ∞ ) , where m : = E τ 1 . Now r m = E ( T r − T 0 ) , the expected length of the trajectory , plays the role of n . However , our attempt at no nasymptotic analysis in Subsectio n 3.1 reveals unexpected difficulties: our bo unds inv o lve m in th e denomin ator and in most practically relev an t situations m is unkn o wn . If m is known then instead of (5 ) one can use an un biased estimator ˜ θ unb r : = 1 rm r ∑ k = 1 Ξ k ( f ) , (6) Quite unexpected ly , (6) is not equivalent to (5) , even in a weak asymp totic sense. The standard CL T for i.i.d. summands yields √ rm ( ˜ θ unb r − θ ) → N ( 0 , σ 2 unb ( f )) , ( r → ∞ ) , where σ 2 unb ( f ) : = V ar Ξ 1 ( f ) / m is in gen eral different from σ 2 as ( f ) . W e introdu ce a new regenerati ve-sequential simulation scheme, fo r which better nonasym ptotic results can be deriv ed. Namely , we fix n a nd defin e R ( n ) : = min { r : T r > T 0 + n } . The estimator is defined as ˆ θ reg-se q n : = 1 T R ( n ) − T 0 T R ( n ) − 1 ∑ i = T 0 f ( X i ) = ∑ R ( n ) k = 1 Ξ k ( f ) ∑ R ( n ) k = 1 τ k . (7) W e thus gen erate a rando m numb er of tours as well as a random number of samples. Our appro ach is based on inequalities for the mean square error, MSE : = E ( ˆ θ − θ ) 2 . Bounds on the MSE can be used to construct fixed precision confidence inte rv a ls. The goal is to obtain an estimator ˆ θ which satisfies P ( | ˆ θ − θ | ≤ ε ) ≥ 1 − α , (8) for given ε and α . W e comb ine the MSE bo unds with th e so called “median trick” [15, 26]. One ru ns MCMC rep eatedly and computes th e m edian of indep endent estimates to boost the level of confiden ce. In ou r paper, the median trick is used in conjunc tion with regenerative simulation. The organizatio n of the paper is th e following. I n Section 2 we recall the split chain constru ction. Nonasym ptotic bounds for regenerative estimators defined by (5), (6) and (7 ) are derived in Section 3. Deriv ation of more explicit bound s which in volve only comp utable quantities is deferred to Sections 5 and 6, where we co n- sider classes of ch ains particularly importan t in the MCMC con te xt. An analo gous 4 Krzysztof Łatuszy ´ nski, Bła ˙ zej Miasojedo w and W ojci ec h Niemiro analysis of the non -regenerati ve schem e (2 ) w as consider ed in [20] and (in a differ - ent setting and using different meth ods) in [33]. In Section 4 w e discuss th e median trick . The resulting confidence in terv als are compare d with asymptotic results based on the CL T . In Section 5 we consider Do eblin chains, i.e . u niformly ergod ic chains th at sat- isfy a one step m inorization condition. W e compare regene rati ve estimators (5), (6) and (7). Moreover , we also consider a perfect sampler a vailable for Doeblin chains, c.f. [35, 14]. W e show tha t confide nce intervals based o n the median trick can ou t- perfor m those obtained via exponential inequalities for a single run simulation. In Section 6 we proceed to ana lyze geom etrically ergod ic Mar ko v chain s, assum- ing a drift condition tow ards a sma ll set. W e briefly compar e regenerativ e schemes (5) and (7) in this setting (the u nbiased estimator (6) cannot b e used, becau se m is unknown). 2 Regenerativ e Simulation W e describe the setting more precisely . Let ( X n ) n ≥ 0 be a Markov chain with tran- sition kernel P on a Polish space X with stationary distribution π , i.e. π P = π . Assume P is π -irredu cible. The r e generation /split co nstruction of Num melin [27] and Athreya and Ne y [2] rests on the following assump tion. Assumption 2.1 (Small Set) Ther e exist a Bor el set J ⊆ X o f positive π measur e, a number β > 0 and a pr obability measur e ν such that P ( x , · ) ≥ β I ( x ∈ J ) ν ( · ) . Under Assump tion 2.1 we can define a b i variate Markov chain ( X n , Γ n ) o n the space X × { 0 , 1 } in the following way . V ariable Γ n − 1 depend s only on X n − 1 via P ( Γ n − 1 = 1 | X n − 1 = x ) = β I ( x ∈ J ) . The rule of transition from ( X n − 1 , Γ n − 1 ) to X n is giv en by P ( X n ∈ A | Γ n − 1 = 1 , X n − 1 = x ) = ν ( A ) , P ( X n ∈ A | Γ n − 1 = 0 , X n − 1 = x ) = Q ( x , A ) , where Q is the norm alized “residual” kernel gi ven by Q ( x , · ) : = P ( x , · ) − β I ( x ∈ J ) ν ( · ) 1 − β I ( x ∈ J ) . Whenever Γ n − 1 = 1, the chain regenerates at mo ment n . Th e regeneration epochs are T 0 : = min { n : Γ n − 1 = 1 } , T k : = min { n > T k − 1 : Γ n − 1 = 1 } . Nonasymptotic bounds for MCMC 5 The rando m tours defined by Ξ k : = ( X T k − 1 , . . . , X T k − 1 , τ k ) , where τ k = T k − T k − 1 , (9) are independen t. W ith out loss of gener ality , we assume that X 0 ∼ ν ( · ) , unless stated otherwise. Un der this a ss umption , all the tours Ξ k are i.i. d. for k > 0. W e there fore put T 0 : = 0 and simplify notation. In the s equel symbols P and E withou t su bscripts refer to the chain started at ν . If th e in itial distribution ξ is oth er than ν , it will be explicitly indicated by writing P ξ and E ξ . Notation m = E τ 1 stands through out the paper . W e assume that we a re able to identify regeneration times T k . Mykland et al. pointed out in [24] that actual sampling from Q can be a voided. W e can gene rate the chain using transition pr obabability P and then recover the regeneration ind icators via P ( Γ n − 1 = 1 | X n , X n − 1 ) = I ( X n − 1 ∈ J ) β ν ( d X n ) P ( X n − 1 , d X n ) , where ν ( d y ) / P ( x , d y ) de note the Rado n-Nikodym d eri vativ e (in practice, the ratio of densities). Myk land’ s trick has bee n established in a num ber of practically r ele vant families (e.g. h ierarchical linear m odels) and specific Markov chains im plementa- tions, such as block Gibbs samplers or variable-at-a-time chain s, see [1 7, 25]. 3 General r esults for r egenerative estimators Recall th at f : X → R is a measurable fu nction an d θ = π f . W e consider bloc k sums Ξ k ( f ) defined by (4 ). The general Kac theo rem states that the mean occupation time during one tour is propo rtional to the stationary measure (Theo rem 10 .0.1 in [23] or Equation s (3.3.4 ), (3.3.6 ), (3.4.7 ) and (3.5.1) in [28]). This yields m = 1 β π ( J ) , E Ξ 1 ( f ) = m π f = m θ . From n o w on we assume that E Ξ 1 ( f ) 2 < ∞ and E τ 2 1 < ∞ . For a discussion of these assumptions in the MCMC context, see [13]. Let ¯ f : = f − π f and defin e σ 2 as ( f ) : = E Ξ 1 ( ¯ f ) 2 m , (10) σ 2 τ : = V ar τ 1 m . (11) Remark 1. Un der Assump tion 2.1, fin iteness of E Ξ 1 ( ¯ f ) 2 is a sufficient and neces- sary co ndition fo r the CL T to hold fo r Markov chain ( X n ) n ≥ 0 and fun ction f . This fact is proved in [4] in a m ore gen eral setting. For o ur purposes it is imp ortant to note that σ 2 as ( f ) in (10 ) is indeed the asymptotic varianc e which appears in the CL T . 6 Krzysztof Łatuszy ´ nski, Bła ˙ zej Miasojedo w and W ojci ec h Niemiro 3.1 Results for ˆ θ reg r W e are to boun d the estimation error which can be expressed as follows: ˆ θ reg r − θ = ∑ r k = 1  Ξ k ( f ) − θ τ k  ∑ r k = 1 τ k = ∑ r k = 1 d k T r . (12) where d k : = Ξ k ( f ) − θ τ k = Ξ k ( ¯ f ) . The refore, for any 0 < δ < 1 , P ( | ˆ θ reg r − θ | > ε ) ≤ P     r ∑ k = 1 d k     > r m ε ( 1 − δ ) ! + P  T r < r m ( 1 − δ )  . Since d k are i.i.d. with E d 1 = 0 a nd V ar d 1 = m σ 2 as ( f ) , we can use Cheby she v in- equality to boun d the first term above: P     r ∑ k = 1 d k     > r m ε ( 1 − δ ) ! ≤ σ 2 as ( f ) rm ε 2 ( 1 − δ ) 2 . The seco nd term can be bound ed similarly . W e use the fact th at τ k are i.i.d. w ith E τ 1 = m to write P  T r < r m ( 1 − δ )  ≤ σ 2 τ rm 2 δ 2 . W e conclud e the above calculation with in following Theorem. Theorem 3.1 Und er Assumptio n 2.1 the following ho lds for every 0 < δ < 1 P ( | ˆ θ reg r − θ | > ε ) ≤ 1 rm  σ 2 as ( f ) ε 2 ( 1 − δ ) 2 + σ 2 τ m δ 2  (13) and is minimized by δ = δ ∗ : = σ 2 / 3 τ σ 2 / 3 as ( f ) ε − 2 / 3 + σ 2 / 3 τ . Obviously , E T r = r m is the expected length o f trajectory . The main drawback of Theorem 3. 1 is that the b ound on the estimatio n error depend s on m , which is typ i- cally unknown. Replacing m by 1 in (13) would be highly inefficient. This fact mo- ti vates our study o f another estimator, ˆ θ reg-se q n , for wh ich we can obtain much mor e satisfactory results. W e think that the deri vation of better nonasympto tic bo unds for ˆ θ reg r (not in volving m ) is an open pr oblem. Nonasymptotic bounds for MCMC 7 3.2 Results for ˜ θ unb r Recall that ˜ θ unb r can be used only wh en m is kn o w n and this situation is rather rar e in MCMC application s. The analysis of ˜ θ unb r is straigh tforward, b ecause it is simply a sum of i.i.d. random variables. In particular , we obtain the following. Corollary 3.2 Und er Assumption 2 .1, E ( ˜ θ unb r − θ ) 2 = σ 2 unb ( f ) rm , P ( | ˜ θ unb r − θ | > ε ) ≤ σ 2 unb ( f ) rm ε 2 . Note that σ 2 unb ( f ) = V ar Ξ 1 ( f ) / m can be expressed as σ 2 unb ( f ) = σ 2 as ( f ) + θ 2 σ 2 τ + 2 θ ρ ( ¯ f , 1 ) , (14) where ρ ( ¯ f , 1 ) : = Cov ( Ξ 1 ( ¯ f ) , Ξ 1 ( 1 )) / m . Th is follows from the simple observation that V ar Ξ 1 ( f ) = E ( Ξ 1 ( ¯ f ) + θ ( τ 1 − m )) 2 . 3.3 Results for ˆ θ reg-seq n The result below boun ds the MSE and the expected nu mber of samples used to compute the estimator . Theorem 3.3 If Assump tion 2.1 h olds then ( i ) E ( ˆ θ reg-se q n − θ ) 2 ≤ σ 2 as ( f ) n 2 E T R ( n ) and ( ii ) E T R ( n ) ≤ n + C 0 , wher e C 0 : = σ 2 τ + m . Corollary 3.4 Und er Assumption 2 .1, E ( ˆ θ reg-se q n − θ ) 2 ≤ σ 2 as ( f ) n  1 + C 0 n  , (15) P ( | ˆ θ reg-se q n − θ | > ε ) ≤ σ 2 as ( f ) n ε 2  1 + C 0 n  . (16) Remark 2. No te that the leadin g term σ 2 as ( f ) / n in (15) is “asym ptotically c orrect” in the sense tha t the standard fixed length estimator has MSE ∼ σ 2 as ( f ) / n . The regenerative-sequential scheme is “close to th e fixed length simulation”, becau se lim n → ∞ E T R ( n ) / n = 1. 8 Krzysztof Łatuszy ´ nski, Bła ˙ zej Miasojedo w and W ojci ec h Niemiro Pr oof (o f Theo r em 3.3). Just as in (12) we have ˆ θ reg-se q n − θ = ∑ R ( n ) k = 1 ( Ξ k ( f ) − θ τ k ) ∑ R ( n ) k = 1 τ k = 1 T R ( n ) R ( n ) ∑ k = 1 d k , where pairs ( d k , τ k ) are i.i.d. with E d 1 = 0 and V ar d 1 = m σ 2 as ( f ) . Since T R ( n ) > n , it follows that E ( ˆ θ reg-se q n − θ ) 2 ≤ 1 n 2 E R ( n ) ∑ k = 1 d k ! 2 . Since R ( n ) is a stopp ing tim e with respect to G k = σ (( d 1 , τ 1 ) , . . . , ( d k , τ k )) , we are in a position to apply the two W ald ’ s identities (see Append ix). The second identity yields E R ( n ) ∑ k = 1 d k ! 2 = V ar d 1 E R ( n ) = m σ 2 as ( f ) E R ( n ) . In this expression we can replac e m E R ( n ) by E T R ( n ) because o f th e first W a ld’ s identity: E T R ( n ) = E R ( n ) ∑ k = 1 τ k = E τ 1 E R ( n ) = m E R ( n ) and (i) follows. W e now focu s attention on bounding the e xpectation of t he “overshoot” ∆ ( n ) : = T R ( n ) − n . Since we assume tha t X 0 ∼ ν , the cumu lati ve sums τ 1 = T 1 < T 2 < . . . < T k < . . . form a ( nondelayed) renewal pr ocess in discrete time. Let us inv oke the following ele gant theorem of Lorden [21, Theorem 1]: E ∆ ( n ) ≤ E τ 2 1 / m . This inequ ality combined with (11) y ields immediate ly E T R ( n ) = E ( n + ∆ ( n )) ≤ n + σ 2 τ + m , i.e. (ii). 4 The median trick This ing eneous method of co nstructing fixed precision MCMC alg orithms was introdu ced in 19 86 in [15], later used in many pap ers con cerned with computa- tional com plexity and further developed in [26]. W e ru n l ind ependent copies of the Markov chain. Let ˆ θ ( j ) be an estimator computed in j th ru n. The fina l esti- mate is ˆ θ : = med ( ˆ θ ( 1 ) , . . . , ˆ θ ( l ) ) . T o en sure that ˆ θ satisfies (8 ), we requ ire that P ( | ˆ θ ( j ) − θ | > ε ) ≤ a ( j = 1 , . . . , l ) for some mo dest le vel of confid ence 1 − a < 1 − α . This is obtained via Chebyshev’ s inequality , if a bound on MSE is av ailable. The well-known Chern of f ’ s inequality gi ves for odd l , Nonasymptotic bounds for MCMC 9 P ( | ˆ θ − θ | ≥ ε ) ≤ 1 2 [ 4 a ( 1 − a )] l / 2 = 1 2 exp  l 2 ln [ 4 a ( 1 − a )]  . (17) It is poin ted out in [2 6 ] that und er some assumptio ns there is a univ ersal choice o f a , which nearly minimizes the overall nu mber of samples, a ∗ ≈ 0 . 1 1969. Let us now examine how the median trick w o rks in con junction with regenerative MCMC. W e focus on ˆ θ reg-se q n , be cause C orollary 3. 4 gi ves the b est a vailable bound on MSE. W e first choose n such that the right hand s ide of (16) is less than or equal to a ∗ . Then choose l big e nough to make the right h and side of ( 17) (with a = a ∗ ) less than or equal to α . Compute estimator ˆ θ reg-se q n repeatedly , using l indep endent runs of the chain. W e can see that (8) holds if n ≥ C 1 σ 2 as ( f ) ε 2 + C 0 , (18) l ≥ C 2 ln ( 2 α ) − 1 and l is odd , (19) where C 1 : = 1 / a ∗ ≈ 8 . 3 549 and C 2 : = 2 / ln [ 4 a ∗ ( 1 − a ∗ )] − 1 ≈ 2 . 3 147 are absolute constants. In deed, (18) entails C 1 σ 2 as ( f ) / ( ε 2 n ) ≤ 1 − C 0 / n , so C 1 σ 2 as ( f ) / ( ε 2 n )( 1 + C 0 / n ) ≤ 1 − C 2 0 / n 2 < 1. Con sequently σ 2 as ( f ) / ( ε 2 n )( 1 + C 0 / n ) < a ∗ and we are in a position to apply (16). The overall (expected) num ber of gen erated samp les is l E T R ( n ) ∼ nl as ε → 0 and n → ∞ , by Theorem 3 .3 (ii). Consequen tly for ε → 0 th e cost of the algo rithm is approx imately nl ∼ C σ 2 as ( f ) ε 2 log ( 2 α ) − 1 , (20) where C = C 1 C 2 ≈ 19 . 34 . T o see how tight is the obtained lower bound, let us com- pare (20 ) with the familiar asymptotic appr oximation, based on the CL T . Con sider an estimato r based o n on e MCMC run of len gth n , say ˆ θ n = ˆ θ fix t , n with t = 0. From (3) we infer that lim ε → 0 P ( | ˆ θ n − θ | > ε ) = α , holds for n ∼ σ 2 as ( f ) ε 2  Φ − 1 ( 1 − α / 2 )  2 , (21) where Φ − 1 is the quantile fun ction of the standar d normal distribution. T ak ing into account the fact that [ Φ − 1 ( 1 − α / 2 )] 2 ∼ 2 log ( 2 α ) − 1 for α → 0 we arrive at the following conclusion . The r ight ha nd side of (20) is bigger than (2 1) ro ughly b y a constant factor of about 10 (for small ε and α ). The important difference is that (20) is sufficient for an exact con fidence inter v al while (21) only for an asymptotic one. 10 Krzysztof Łatuszy ´ nski, Bła ˙ zej Miasojedo w and W oj ciec h Niemiro 5 Doeblin Chains Assume th at the transition kern el P satisfies th e following Doeblin condition : ther e exist β > 0 and a pro bability measure ν such that P ( x , · ) ≥ β ν ( · ) for every x ∈ X . (22) This amo unts to ta king J : = X in Assump tion 2 .1. Con dition (2 2) imp lies that the ch ain is un iformly ergo dic. W e refer to [31] and [2 3] fo r definition o f u niform ergodicity and relate d concep ts. As a consequence of the regener ation construction, in our pr esent setting τ 1 is distributed a s a geometric ra ndom variable with param eter β and therefore m = E τ 1 = 1 β and σ 2 τ = V ar τ 1 m = 1 − β β . Bounds on the asymp totic variance σ 2 as ( f ) un der (2 2) are well known. Let σ 2 = π ¯ f 2 be the stationary variance. Results in Sectio n 5 of [4] imply that σ 2 as ( f ) ≤ σ 2 1 + 2 p 1 − β 1 − p 1 − β ! ≤ 4 σ 2 β . (23) Since in [4] a more general situation is considered , which complicates the formulas, let us give a s imple derivation o f ( 23) u nder (22). By (10) and th e fo rmula ( 29) given in the Appen dix, σ 2 as ( f ) ≤ E Ξ 1 ( | ¯ f | ) 2 m = E π ¯ f ( X 0 ) 2 + 2 ∞ ∑ i = 1 E π | ¯ f ( X 0 ) ¯ f ( X i ) | I ( τ 1 > i ) . The first term above is eq ual to σ 2 . T o bound the terms of the series, use Cauchy - Schwarz and the fact th at, under (22 ) , ran dom variables X 0 and τ 1 are ind epen- dent. Therefor e E π | ¯ f ( X 0 ) ¯ f ( X i ) | I ( τ 1 > i ) ≤  E π ¯ f ( X i ) 2 E π ¯ f ( X 0 ) 2 P π ( τ 1 > i )  1 / 2 = σ 2 ( 1 − β ) i / 2 . Computing the sum of the geometric series yields (23). If th e chain is r e versible, ther e is a b etter bound th an (23). W e can use the well- known formu la for σ 2 as ( f ) in ter ms o f the spectr al deco mposition o f P (e.g . ex- pression “C” in [11]). Results of [ 30] show that the spectrum of P is a sub set of [ − 1 + β , 1 − β ] . W e c onclude that for re versible Doeblin cha ins, σ 2 as ( f ) ≤ 2 − β β σ 2 ≤ 2 σ 2 β . (24) An important c lass of reversible chains are Independ ence Metrop olis-Hastings chains (see e.g. [31]) that are known to be unifo rmly ergod ic if and only if the rejec- tion pro bability r ( x ) is unif ormly boun ded from 1 by say 1 − β . Th is is eq ui valent to the candidate distribution being bou nded b elo w by β π (c.f. [22 , 1]) and translates Nonasymptotic bounds for MCMC 11 into ( 22) with ν = π . The formu la for σ 2 as ( f ) in (23 ) and (24) d epends on β in an optimal way . Moreover (24) is sharp. T o see this consider the following e x ample. Example 5.1 Let β ≤ 1 / 2 and defin e a Marko v chain ( X n ) n ≥ 0 on X = { 0 , 1 } with stationary distribution π = [ 1 / 2 , 1 / 2 ] and tr ansition matrix P =  1 − β / 2 β / 2 β / 2 1 − β / 2  . Hence P = β π + ( 1 − β ) I 2 and P ( x , · ) ≥ β π . Note th at the residual kernel Q is in our e xample the identity matrix I 2 . Thus, before the first r e gene r a tion τ 1 the chain does n ot move. Let f ( x ) = x . Thus σ 2 = 1 / 4 . T o compu te σ 2 as ( f ) we use another well-known formula (expr ession “B” in [11]): σ 2 as ( f ) = σ 2 + 2 ∞ ∑ i = 1 Cov π { f ( X 0 ) , f ( X i ) } = σ 2 + 2 σ 2 ∞ ∑ i = 1 ( 1 − β ) i = 2 − β β σ 2 . T o compute σ 2 unb ( f ) , n ote tha t Ξ 1 ( f ) = I ( X 0 = 1 ) τ 1 . Sinc e τ 1 is ind ependent of X 0 and X 0 ∼ ν = π we obtain σ 2 unb ( f ) = β V ar Ξ 1 ( f ) = β  E V ar ( Ξ 1 ( f ) | X 0 ) + V ar E ( Ξ 1 ( f ) | X 0 )  = 1 − β 2 β + 1 4 β = 3 − 2 β β σ 2 . Inter estingly , in this example σ 2 unb ( f ) > σ 2 as ( f ) . In th e setting of this Section, we will now comp are upper b ounds on the total simulation effort need ed for different MCMC schemes to get P ( | ˆ θ − θ | > ε ) ≤ α . 5.1 Regenerative-seque ntial estima tor and the median trick Recall that this simulation schem e con sis ts of l MCMC runs, each of approx imate length n . Substitutin g either (23) or (24) in (20) we obta in that the e xpected num ber of samples is nl ∼ 19 . 34 4 σ 2 β ε 2 log ( 2 α ) − 1 and nl ∼ 19 . 34 ( 2 − β ) σ 2 β ε 2 log ( 2 α ) − 1 (25) (respectively in the general case and fo r reversible chains). Note also that in the setting of this Section we have an exact expression for the constant C 0 in Theor em 3.3. Indeed, C 0 = 2 / β − 1. 12 Krzysztof Łatuszy ´ nski, Bła ˙ zej Miasojedo w and W oj ciec h Niemiro 5.2 Standard one-run average and exponential inequalty For unifo rmly ergodic chains a d irect compar ison of ou r ap proach to exponen tial inequalities [10, 18] is possible. W e focus o n the result proved in [18] for chains on a c ountable state space. This ineq uality is tight in the sen se that it redu ces to the Hoeffding bound when specialised to the i.i.d . case. For f bou nded let k f k ∞ : = sup x ∈ X | f ( x ) | . Consider the simple a verag e over n Markov chain samples, s ay ˆ θ n = ˆ θ fix t , n with t = 0. For an arbitrary initial distribution ξ we have P ξ ( | ˆ θ n − θ | > ε ) ≤ 2 exp ( − n − 1 2  2 β k f k ∞ ε − 3 n − 1  2 ) . After iden tifying the leading terms we can see tha t to make the rig ht hand side less than α we need n ∼ k f k 2 ∞ 2 β 2 ε 2 log ( α / 2 ) − 1 ≥ 2 σ 2 β 2 ε 2 log ( α / 2 ) − 1 . (26) Comparing (25) with (26 ) yields a ratio of roug hly 40 β or 20 β respectively . This in particular indicates that t he depend ence on β in [10, 18] probably can be improved. W e note th at in examples o f practical interest β usually decay s exponen tially with the d imension of X and using the regenerative-sequential-median scheme will of- ten result in a lower total simulation cost. Mo reover , this ap proach is valid fo r an unbou nded target function f , in contra st with classical exponen tial inequalities. 5.3 P erfect sampler and the median trick For Doeblin chain s, if r e generation times can be iden tified, perf ect sampling can be p erformed easily as a version of read-on ce algor ithm [35]. This is du e to the following observation. I f condition (22) holds and X 0 ∼ ν then X T k − 1 , k = 1 , 2 , . . . are i. i.d. random variables from π (see [ 5, 2 8, 14, 4] fo r version s of this resu lt). Therefo re from each rando m tour between regeneration times one can obtain a sin- gle p erfect sample (by takin g the state of the ch ain prior to r e generation ) and use it for i.i.d. estimation. W e define ˆ θ perf r : = 1 r r ∑ k = 1 f ( X T k − 1 ) . Clearly Nonasymptotic bounds for MCMC 13 E ( ˆ θ perf r − θ ) 2 = σ 2 r and P ( | ˆ θ perf r − θ | > ε ) ≤ σ 2 r ε 2 . Note that to compute ˆ θ perf r we need to simu late n ∼ r / β steps of the M arkov chain. If we co mbine the perf ect sampler with th e m edian trick we obtain an algor ithm with the expected number of samples nl ∼ 19 . 3 4 σ 2 β ε 2 log ( 2 α ) − 1 . (27) Comparing (2 5) with (26) and (27) leads to the conclu sion that if o ne targets rigor - ous nonasymptotic results in the Doeblin chain setting, t he approach described here outperf orms other methods. 5.4 Remarks on other schemes The bound fo r ˆ θ reg r in Theorem 3.1 is clearly inferior to that for ˆ θ reg-se q n in Corollary 3.4. Therefor e we excluded the scheme based on ˆ θ reg r from our comparisons. As for ˜ θ unb r , this estimator can be used in the Doeb lin chains setting, beca use m = 1 / β is kn o wn . The b ounds for ˜ θ unb r in Subsection 3.2 in volve σ 2 unb ( f ) . Alth ough we cannot provide a rigoro us proof, we conjectur e that in most p ractical situation s we ha ve σ 2 unb ( f ) > σ 2 as ( f ) , because ρ ( ¯ f , 1 ) in (14) is often c lose to ze ro. If this is the case, then the boun d for ˜ θ unb r is inferior to that for ˆ θ reg-se q n . 6 A Geometric Drift Condition Using d rift conditions is a stand ard approach for establishing ge ometric ergodicity . W e refer to [31] or [23] for the defin ition and further details. The assump tion below is the same as in [3]. S pecifically , let J be the small set which app ears in Assumptio n 2.1. Assumption 6.1 (Drift) Ther e exist a function V : X → [ 1 , ∞ [ , constants λ < 1 and K < ∞ such that P V ( x ) : = Z X P ( x , d y ) V ( y ) ≤ ( λ V ( x ) for x 6∈ J , K for x ∈ J , In many pa pers conditio ns similar to Assum ption 6.1 have been established f or re- alistic MCMC algorithms in statistical models of pr actical relev an ce [12, 7, 8, 16, 17, 34]. This opens the possibility of computing our boun ds in these models. 14 Krzysztof Łatuszy ´ nski, Bła ˙ zej Miasojedo w and W oj ciec h Niemiro Under Assump tion 6 .1, it is possible to b ound σ 2 as ( f ) , σ 2 τ and C 0 which a ppear in Theor ems 3.1 and 3.3 , by expressions in volving only λ , β and K . The following result is a minor variation of Theorem 6.5 in [1 9 ]. Theorem 6.2 If Assumptio ns 2 .1 and 6 .1 hold and f is such that k ¯ f k V 1 / 2 : = sup x | ¯ f ( x ) | / V 1 / 2 ( x ) < ∞ , then σ 2 as ( f ) ≤ k ¯ f k 2 V 1 / 2 " 1 + λ 1 / 2 1 − λ 1 / 2 π ( V ) + 2 ( K 1 / 2 − λ 1 / 2 − β ( 2 − λ 1 / 2 )) β ( 1 − λ 1 / 2 ) π ( V 1 / 2 ) # C 0 ≤ λ 1 / 2 1 − λ 1 / 2 π ( V 1 / 2 ) + K 1 / 2 − λ 1 / 2 − β β ( 1 − λ 1 / 2 ) − 1 . T o b ound σ 2 τ we can u se the obvious ineq uality σ 2 τ = C 0 − m ≤ C 0 − 1. Moreover , one can easily co ntrol π V and π V 1 / 2 and fu rther replace k ¯ f k V 1 / 2 e.g. by k f k V 1 / 2 + ( K 1 / 2 − λ 1 / 2 ) / ( 1 − λ 1 / 2 ) , we refer to [19] for details. Let us now discuss possible app roaches to confidence estimation in the setting of this section. Perf ect samplin g is in g eneral un a vailable. For unbo unded f we cannot apply exponential ineq ualities for th e standar d o ne-run estimate. Since m is unk no wn we c annot use ˜ θ unb r . This leaves ˆ θ reg r and ˆ θ reg-se q n combined with th e median tr ick. T o analyse ˆ θ reg r we can app ly Theorem 3 .1. Upper bo unds for σ 2 as ( f ) and σ 2 τ are av ailable. Ho we ver , in Theorem 3.1 we will also need a lower boun d on m . W ithou t fu rther assump tions we can on ly write m = 1 π ( J ) β ≥ 1 β . (28) In the above analy sis (28) is par ticularly disappo inting. It multiplies the bound b y an u nexpected and substantial factor, as π ( J ) is typ ically small in application s. F or ˆ θ reg-se q n we have mu ch more satisfactory results. Theor ems 3 .3 and 6 .2 can be used to obtain boun ds which do n ot inv olve m . In m any realistic examples, the p arameters β , λ and K which app ear in Assumptio ns 2.1 (Small Set) a nd 6.1 (Drift) can b e explicitly computed, see e.g. [16, 17, 34]. W e note that n onasymptotic c onfidence inter v a ls for MCMC estimator s u nder drift conditio n hav e also been obtained in [20], where identification of regeneration times has no t been assumed. In absen ce of regen eration times a different ap proach has been used and the b ounds are typ ically weaker . For example one can comp are [20, Coro llary 3.2] ( for estimator ˆ θ fix t , n ) combined with the bound s in [3] with our Theorem s 3.3 and 6.2 (for estimator ˆ θ reg-se q n ). Nonasymptotic bounds for MCMC 15 Refer ences 1. Y .F . Atchade, F . Perro n (2007): On the geometric ergod icity of Metropolis-Hastings algo- rithms. Statistics 41, 77–84. 2. K.B. Athreya and P . Ney (1978): A new approach to the limit theory of recurrent Marko v chains, T rans . Amer . Math. Soc. 245, 493–501. 3. P . H. Baxe ndale (2005): Rene wal Theory and Computable Con vergen ce Rates for Geometri- cally Ergod ic Marko v Chains. Ann. Appl. Pro b . 15, 700-738. 4. W . Bedno rz, R. Latała and K. Łatuszy ´ nski (2008): A Regener ation Proof of the Central Limit Theorem for Uniformly Ergo dic Marko v Chains. Elect. Comm. in Pr obab . 13, 85–98. 5. L.A. Breyer and G. O. Ro berts (2001): Catalyti c pe rfect simulation. M etho dol. Comput. Appl. Pr obab . 3 161–177. 6. Y .S. Chow and H. T eicher (1 988): Pr obabilit y Theory , Ind ependenc e, Inter chang eability , Mar- tingales . Second Edition, Springer V erlag. 7. G. Fort and E. Moulines (2000): V -subgeometric er godicity for a Hastings–Metro polis algo- rithm. Statist. Pr obab . Lett. 49, 401–410. 8. G. Fo rt, E. Moulines, G.O. Roberts, and J.S. Rosenthal (2003) : On t he geometric erg odicity of hybrid samplers . J. Appl. Pr obab . 40 (1), 123-146. 9. W .R. Gilks, S. Richardson, D.J. Spiegelhalter: Markov chain Monte Carlo i n pr actice. Chap- man & Hall, 1998. 10. P .W . Glynn and D. Ormoneit (2002): Hoeffding’ s i neq uality for uniformly ergod ic Marko v chains, Statist. Pr obab . Lett. 56, 143–146. 11. O. H ¨ aggstr ¨ om J.S. Rosenthal (2007): On va riance conditions for Marko v chain CL Ts. Elect. Comm. in Pr obab . 12 , 454–464. 12. J.P . Hobert and C.J. Geyer (1998) : Geometric ergodicity of Gi bb s and block Gibbs samplers for Hierarchical Random Ef fects Model. J . Multivariate Anal. 67, 414–439. 13. J.P . Hobert, G.L. Jones , B. Presnell, and J.S. Rosenthal (2002): On the Appli ca bility of Re- generati ve Simulati on in Marko v Chain Monte Carlo. Biometrika 89, 731-74 3. 14. J.P . Hober t and C. P . Robert (2004): A mi xture representation of π with applications in Markov chain Monte Carlo and perfe ct sampling. Ann. Appl. Pr obab . 14 1295–1305. 15. M.R. Jerrum, L.G. V aliant, V . V . V azirani (1986): Random generation of combinatorial struc- tures fro, a uniform distribution. Theor etical Computer Science 43, 169–188. 16. G.L. Jones, J .P . Hobert (2 004): Suf ficient b urn-in fo r Gibbs s amplers for a hierarchical random ef fects model. Ann. Statist. 32, pp. 784–817. 17. A.A. Johnson and G.L. Jones (2010): Gibbs sampling for a Bayesian hierarchical general linear model. Electr onic J. Statist. 4, 313–3 33. 18. I. Kon toyiannis, L. Lastras-Montano, S. P . Meyn (2005): Relativ e Entropy and Exponential De viation Bounds f or Gen eral Mark ov Chains . 2005 IEEE International Symp osium on In for- mation Theor y . 19. K. Łatuszy ´ nski, B. Miasojedow annd W . Niemiro (2009): Nonasymptotic bounds on the esti- mation erro r for regenerati ve MCMC algorithms. arXiv:09 07.4915v1 20. K. Łatuszy ´ nski, W . Niemiro (2011): Rigorous confid ence bounds for MCMC under a geome t- ric drift condition. J . of Complexity 27, 23–3 8. 21. G. Lorden: On exces s over the boundar y . Ann. M ath. S tatist. 41, 520–527, 1970. 22. K.L. Mengersen, L.R. T weedie (199 6): Rates of con ver gence of the Hastings and Metropolis algorithms. Ann. Statist. 24, 1, 101–121 . 23. S.P . Meyn and R.L. T weedie: Markov Cha ins and Stochastic Stab ility . Springer -V erlag, 1993. 24. P . Mykland, L. Tierne y and B. Y u (1995): Reg eneration in Mark ov chain samplers. J . Am. Statist. Assoc.. , 90, 233–241. 25. R. Neath, G. L. Jones (2009): V ariable-at-a- time implementation of Marko v chain Monte Carlo. Preprint. arXi v:0903.0664v1 26. W . Niemiro, P . Pokaro wski (2009): Fix ed precision MCMC Estimation by Median of Products of A verages. J. Appl. Pr obab. 46 (2), 309–329 . 16 Krzysztof Łatuszy ´ nski, Bła ˙ zej Miasojedo w and W oj ciec h Niemiro 27. E. Numme lin (19 78): A splitting techn ique for Harris r ecurre nt Marko v cha ins, Z. W ahr . V erw . Geb . 43, 309–318 . 28. E. Nummelin (2002): MC’ s for M CMC’is ts, International Statistical Rev iew , 70, 215–240. 29. C.P . Robert and G. Casella: Monte Carlo Statistical Methods. Springer- V erlag, New Y ork, 2004. 30. G.O. Roberts and J.S. Rosenthal (1997): Geometric ergodicity and hybrid Markov chains. Elec. Comm. Pr ob . 2 (2). 31. G.O. Roberts and J.S. Rosenthal (2004): General state space Mark ov chains and MCMC al- gorithms. Pr obability Surve ys 1, 20–71. 32. J.S. Rosentha l (1995): Minorization cond itions and con ver gence rates for M ar ko v chains . J. Amer . Statist. Association 90, 558–566. 33. D. Rudolf (2008): Explicit error bounds for lazy re ver sible Mark ov chain Monte Carlo. J. of Complexity . 25, 11–24. 34. V . Roy , J.P . Hobert (2010): On Monte Carlo methods for Bayesian multiv ariate regr ession models with hea vy-tailed errors . J . Multivariate Anal. 101, 1190–1202 35. D.B. W ilson (2000): How to couple from the past using a read-once source of randomness . Random Structur es Algorithms 16 (1), 85–113. Ap pendix For con venience, we recall the two iden tities of Abr aham W ald, wh ich we ne ed in the pro of of Theorem 3.3. Proof s can be foun d e .g. in [6, Th eorems 1 and 3 in Section 5.3]. Assume that η 1 , . . . , η k , . . . , are i.i.d. rand om variables an d R is a stoppin g time such that E R < ∞ . I W a ld identity: If E | η 1 | < ∞ then E R ∑ k = 1 η k = E R E η 1 . II W a ld identity: If E η 1 = 0 and E η 2 1 < ∞ then E R ∑ k = 1 η k ! 2 = E R E η 2 1 . In Section 5 we used the following formula tak en from [28, Equ ation (4.1.4 )]. In the notation of our Sections 2 and 3, for ev ery g ≥ 0 we h a ve E ν Ξ 1 ( g ) 2 m = E π g ( X 0 ) 2 + 2 ∞ ∑ i = 1 E π g ( X 0 ) g ( X i ) I ( T > i ) . (2 9) In [28] th is formu la, with g = ¯ f , is used to derive an expre ss ion for the a symptotic variance σ 2 as ( f ) = E ν Ξ 1 ( ¯ f ) / m under the a ss umption that f is b ounded. For g ≥ 0, the proof is the same.

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment