A Multistep Lyapunov Approach for Finite-Time Analysis of Biased Stochastic Approximation
Authors: Gang Wang, Bingcong Li, Georgios B. Giannakis
September 2, 2020

Abstract

Motivated by the widespread use of temporal-difference (TD-) and Q-learning algorithms in reinforcement learning, this paper studies a class of biased stochastic approximation (SA) procedures under a mild "ergodic-like" assumption on the underlying stochastic noise sequence. Building upon a carefully designed multistep Lyapunov function that looks ahead to several future updates to accommodate the stochastic perturbations (for control of the gradient bias), we prove a general result on the convergence of the iterates, and use it to derive non-asymptotic bounds on the mean-square error in the case of constant stepsizes. This novel looking-ahead viewpoint renders finite-time analysis of biased SA algorithms under a large family of stochastic perturbations possible. For direct comparison with existing contributions, we also demonstrate these bounds by applying them to TD- and Q-learning with linear function approximation, under the practical Markov chain observation model. The resultant finite-time error bound for both the TD- as well as the Q-learning algorithms is the first of its kind, in the sense that it holds i) for the unmodified versions (i.e., without making any modifications to the parameter updates) using even nonlinear function approximators; as well as for Markov chains ii) under general mixing conditions and iii) starting from any initial distribution, at least one of which has to be violated for existing results to be applicable.

1 Introduction

Stochastic approximation (SA) algorithms are nowadays widely used in numerous areas, including statistical signal processing, communications, control, optimization, data science, machine learning, and (deep) reinforcement learning (RL). Ever since the seminal contribution [1], there have been a multitude of efforts on SA schemes, their applications, and theoretical developments; see, for instance, [2], [3], [4], [5]. On the theory side, conventional SA convergence analysis and error bounds are mostly asymptotic, holding only in the limit as the number of iterations increases to infinity. Nevertheless, recent research efforts have gradually shifted toward developing non-asymptotic performance guarantees, which hold even for finitely many iterations, for SA algorithms in different settings [4], [5], [6], mainly motivated by the emerging need to deal with massive data in modern large-scale optimization and statistical learning tasks.

Many stochastic control tasks can be naturally formulated as Markov decision processes (MDPs), which provide a flexible framework for modeling decision making in scenarios where outcomes are partly random and partly under the control of a decision maker. RL is a collection of techniques for solving MDPs, especially when the underlying transition mechanism is unknown [7], [8], [9], [10].

(*) The work was supported partially by NSF grants 1505970, 1711471, and 1901134. The authors are with the Digital Technology Center and the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, USA. E-mail: gangwang@umn.edu; lixx5599@umn.edu; georgios@umn.edu.
Originally introduced by [7], temporal-difference (TD) learning has become one of the most widely employed RL algorithms. Another major breakthrough in RL was the development of a TD control algorithm, known as Q-learning [8], on which much of contemporary artificial intelligence (see e.g., [11]) is built. Despite their popularity thanks to the simple updates they feature, theoretical analysis of RL (with function approximation) has proved challenging; see, for example, [12], [13], [14], [15], [10], [16]. Moreover, non-asymptotic performance guarantees appeared only recently, and they still remain limited [17], [18], [19], [20], [21], [22], [23], [24].

Targeting a deeper understanding of the statistical efficiency of basic RL algorithms (e.g., TD- and Q-learning), the objective of the present paper is to derive non-asymptotic guarantees for a certain class of (biased) SA procedures. In particular, we first characterize a set of easy-to-check conditions on the nonlinear operators used in SA updates, and introduce a mild assumption on the stochastic noise sequence that is satisfied by a broad family of discrete-time stochastic processes. We prove a general convergence result based on a carefully constructed multistep Lyapunov function, which relies on a number (as needed) of future SA updates to gain control over the gradient bias arising from instantaneous stochastic perturbations. For an introduction to Lyapunov theory, see e.g., [25], [26]. We further develop non-asymptotic bounds on the mean-square error of the iterates. Finally, for direct comparison to past contributions, we specialize the results for general SA algorithms to both TD-learning and Q-learning with linear function approximation, from data gathered along a single trajectory of a Markov chain. We thereby obtain finite-time error bounds for both TD- and Q-learning algorithms using (non-)linear function approximators in the case of constant stepsizes, under the most general assumptions to date. The merits of our bounds are that they apply to i) the unmodified TD- as well as Q-learning algorithms (in sharp contrast, e.g., a projection step is required by [19], [21], [23]); ii) nonlinear function approximators and Markov chains having general mixing rates (bounds in [21], [22], [27] were derived based on linear function approximation and geometric mixing); and iii) Markov chains starting from any initial distribution as well as from the first iteration (meaning there is no need to wait until the Markov chain gets "close" to its unique stationary distribution, as required by e.g., [21], [22], [27]).

The remainder of this paper is structured as follows. Section 2 begins with the basic background on SA and the formal problem formulation. Section 3 presents the novel multistep Lyapunov function, followed by the main results on the non-asymptotic convergence guarantees for general SA algorithms. In Section 4, we demonstrate the consequences of our main results for the problem of approximate reinforcement learning (i.e., TD(0) and Q-learning using linear function approximators), and develop finite-time error bounds. Section 5 compares our results with past work. Finally, the paper is concluded with a research outlook in Section 6, while technical proofs of the main results are provided in the Appendix.
Notation. Throughout the paper, lower- (upper-) case letters denote deterministic (random) quantities of suitable dimensions clear from the context, e.g., $\theta$ ($\Theta$). Calligraphic letters stand for sets, e.g., $\mathcal{S}$; the symbol $(\cdot)^\top$ represents the transpose of matrices or vectors; and $\|\theta\|_2$ (or simply $\|\theta\|$) denotes the Euclidean norm of vector $\theta$. We reserve notation such as $c$, $c'$, etc. for constants that do not depend on any parameters of the considered MDPs, including the discount factor $\gamma$, the size of the state and action spaces, and so on.

2 Background and problem set-up

Consider the following nonlinear recursion with a constant stepsize $\epsilon > 0$, starting from $\Theta_0 \in \mathbb{R}^d$:
\[
\Theta_{k+1} = \Theta_k + \epsilon f(\Theta_k, X_k), \quad k = 0, 1, 2, \ldots \tag{1}
\]
where $\Theta_k \in \mathbb{R}^d$ denotes the $k$-th iterate, $\{X_k \in \mathbb{R}^m\}_{k \in \mathbb{N}}$ is a stochastic noise sequence defined on a complete probability space, and $f: \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d$ is a continuous function of $(\theta, x)$.

In the simplest setting, for example, $\{X_k\}_k$ is an independent and identically distributed (i.i.d.) random sequence of vectors, and $f(\Theta_k, X_k)$ is a conditionally unbiased estimate of the gradient $\bar{f}(\Theta_k) := \mathbb{E}[f(\Theta_k, X_k) \mid \mathcal{F}_k]$. Here, $(\mathcal{F}_k)_{k \ge 0}$ is an increasing family of $\sigma$-fields, with $\Theta_0$ being $\mathcal{F}_0$-measurable and $f(\theta, X_k)$ being $\mathcal{F}_k$-measurable. For an introduction to conditional expectation and $\sigma$-fields, see e.g., [28]. Depending on whether $\mathcal{F}_0$ is a trivial $\sigma$-field, the initial guess $\Theta_0$ can be random or deterministic. With no loss of generality, the rest of this paper works with a deterministic $\Theta_0$. In a more complicated setting pertaining to e.g., MDPs, $\{X_k\}_k$ is a Markov chain assumed to have a unique stationary distribution, and $f(\Theta_k, X_k)$ can be viewed as a biased estimate of some gradient $\bar{f}(\Theta_k) = \lim_{k \to \infty} \mathbb{E}_{X_k}[f(\Theta_k, X_k)]$. In both cases, we are prompted to assume that the following limit exists for each $\theta \in \mathbb{R}^d$:
\[
\bar{f}(\theta) = \lim_{k \to \infty} \mathbb{E}[f(\theta, X_k)]. \tag{2}
\]
Taking a dynamical systems viewpoint [3], the corresponding ordinary differential equation (ODE) for (1) is given by
\[
\dot{\theta}(t) = \bar{f}(\theta(t)). \tag{3}
\]
Assume that this ODE admits an equilibrium point $\theta^*$ at the origin, i.e., $\bar{f}(0) = 0$. This assumption is made without loss of generality, as one can always shift a nonzero equilibrium point to zero by the appropriate centering $\theta \leftarrow \theta - \theta^*$. Following the terminology in [3], [29], the recursion (1) is termed (nonlinear) stochastic approximation. Our goal here is to provide a non-asymptotic convergence analysis of the iterate sequence $\{\Theta_k\}_{k \in \mathbb{N}_+}$ generated by a recursion of the form (1) to the equilibrium point $\theta^*$ of its corresponding ODE (3).

The motivating impetus for considering recursion (1) was to gain deeper insight into the classical TD- as well as Q-learning algorithms from discounted MDPs and reinforcement learning [10], [9]. Each is a (biased) SA procedure for solving a fixed-point equation defined by the so-called Bellman operator [15]. As a matter of fact, a large family of TD-based algorithms, including TD(0), TD($\lambda$), and GTD, as well as stochastic gradient descent for nonlinear least-squares estimation, can be described in this form (see e.g., [20] for a detailed discussion). Certainly, convergence guarantees of SA procedures as in (1) would not be possible without imposing assumptions on the operators $f(\theta, x)$ and $\bar{f}(\theta)$.
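To fix ideas, the following minimal Python sketch simulates one instance of the recursion (1) driven by Markovian noise. All concrete choices (the two-state chain, the toy update direction $f$, the stepsize) are illustrative assumptions for this sketch only, not quantities taken from the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-state Markov chain {X_k}; symmetric, so the stationary
# distribution is uniform and the noise term below averages to zero.
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])
states = np.array([-1.0, 1.0])   # the "living space" X of the noise process

def f(theta, x):
    # Toy continuous update direction f(theta, x); its stationary average is
    # fbar(theta) = -theta, so the associated ODE  d(theta)/dt = -theta  has
    # a globally stable equilibrium at theta* = 0.
    return -theta + 0.5 * x

eps = 0.05                # constant stepsize
theta = np.array([2.0])   # deterministic initial guess Theta_0
x = 0                     # index of the initial noise state

for k in range(2000):
    theta = theta + eps * f(theta, states[x])   # recursion (1)
    x = rng.choice(2, p=P[x])                   # Markovian (not i.i.d.) noise

print(theta)  # hovers in an O(eps)-neighborhood of theta* = 0
```

Each instantaneous direction $f(\Theta_k, X_k)$ here is a biased estimate of $\bar{f}(\Theta_k)$, since consecutive $X_k$ are correlated; only the ergodic average over a window of updates is nearly unbiased, which is precisely the structure formalized next.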
In this work, motivated by the analysis of TD-learning and related algorithms in reinforcement learning, we consider a class of SA procedures that satisfy the following properties.

Assumption 1. The function $f(\theta, x)$ satisfies the globally Lipschitz condition in $\theta$, uniformly in $x$; i.e., there exists a constant $L_1 > 0$ such that for all $\theta, \theta' \in \mathbb{R}^d$ and each $x \in \mathcal{X}$, it holds that
\[
\| f(\theta, x) - f(\theta', x) \| \le L_1 \| \theta - \theta' \|. \tag{4}
\]
Moreover, there exists a constant $L_2 > 0$ such that, for each $x \in \mathcal{X}$, it holds for all $\theta$ that
\[
\| f(\theta, x) \| \le L_2 ( \| \theta \| + 1 ) \tag{5}
\]
where $\mathcal{X} \subseteq \mathbb{R}^m$ denotes the living space of the stochastic process $\{X_k\}$.

It is worth mentioning that (5) is equivalent to assuming that $f(0, x)$ satisfying (4) is uniformly bounded for all $x \in \mathcal{X}$. To see this, suppose $\| f(0, x) \| \le \hat{f}$ holds for all $x \in \mathcal{X}$. From (4), it follows that $\| f(\theta, x) \| \le L_1 \| \theta - \theta' \| + \| f(\theta', x) \|$, in which taking $\theta' = 0$ confirms that $\| f(\theta, x) \| \le L_1 \| \theta \| + \| f(0, x) \| \le L_1 \| \theta \| + \hat{f} \le \max(L_1, \hat{f}) \cdot ( \| \theta \| + 1 )$. In this regard, by defining $L := \max\{L_1, L_2\}$, we will assume for simplicity that (4) and (5) hold with the same constant $L$.

Assumption 2. Consider the ODE (3). There exists a twice differentiable function $W(\theta)$ that satisfies globally and uniformly the following conditions for all $\theta, \theta' \in \mathbb{R}^d$:
\[
c_1 \| \theta \|^2 \le W(\theta) \le c_2 \| \theta \|^2 \tag{6a}
\]
\[
\left( \frac{\partial W}{\partial \theta} \Big|_{\theta} \right)^{\!\top} \bar{f}(\theta) \le -c_3 L \| \theta \|^2 \tag{6b}
\]
\[
\left\| \frac{\partial W}{\partial \theta} \Big|_{\theta} - \frac{\partial W}{\partial \theta} \Big|_{\theta'} \right\| \le c_4 \| \theta - \theta' \| \tag{6c}
\]
for some constants $c_1, c_2, c_3, c_4 > 0$.

Regarding these assumptions, two remarks come in order.

Remark 1. Assumption 1 is standard and widely adopted in the convergence analysis of SA algorithms; see e.g., [3, Chapter 3], [5], [12], [13], and also [15], [20], [22] in the case of linear SA (i.e., $f(\theta, x)$ is linear in $\theta$).

Remark 2. By evaluating inequality (6a) at $\theta = 0$, one confirms that $W(\theta) > W(0) = 0$ for all $\theta \ne 0$. Since $W(\theta)$ is twice differentiable, this implies that $\frac{\partial W}{\partial \theta}|_{\theta = 0} = 0$. From (6b), it holds that both $\bar{f}(\theta) \ne 0$ and $\frac{\partial W}{\partial \theta}|_{\theta} \ne 0$ at any point $\theta \ne 0$. In words, Assumption 2 states that the equilibrium point $\theta = 0$ is unique and globally asymptotically stable for the ODE (3). This also appeared in e.g., [3, A5] and [5] (strongly convex case). This is in the same spirit as requiring a Hurwitz matrix $A$ (i.e., every eigenvalue has strictly negative real part) for the ODE $\dot{\theta} = A\theta$ in linear SA by [15, Theorem 2], [19], [22].

In addition to Assumptions 1 and 2, to leverage the corresponding ODE to study convergence of SA procedures, we make an assumption on the stochastic perturbation sequence $\{X_k\}_{k \in \mathbb{N}}$.

Assumption 3. For each $\theta \in \mathbb{R}^d$, there exists a function $\sigma(T; T_0): \mathbb{N}_+ \times \mathbb{N}_+ \to \mathbb{R}_+$ monotonically decreasing to zero as either $T \to \infty$ or $T_0 \to \infty$; i.e., $\lim_{T \to \infty} \sigma(T; T_0) = 0$ for any fixed $T_0 \in \mathbb{N}_+$, and $\lim_{T_0 \to \infty} \sigma(T; T_0) = 0$ for any fixed $T \in \mathbb{N}_+$, such that
\[
\left\| \frac{1}{T} \sum_{k = T_0}^{T_0 + T - 1} \mathbb{E}\big[ f(\theta, X_k) \,\big|\, X_0 \big] - \bar{f}(\theta) \right\| \le \sigma(T; T_0) \, L ( \| \theta \| + 1 ) \tag{7}
\]
where the expectation $\mathbb{E}$ is taken over $\{X_k\}_{k = T_0}^{T_0 + T - 1}$.

Simply put, Assumption 3 requires that the bias of the 'ergodic' average of any consecutive $T$ gradient estimates $\{f(\theta, X_k)\}_{k = T_0}^{T_0 + T - 1}$ from their limit $\bar{f}(\theta)$ vanishes (at least) sublinearly in $T$.
Indeed, this is fairly mild and more general than the conditions studied by e.g., [5], [20], [21], [22], each of which imposes requirements on every single gradient estimate $f(\theta, X_k)$. In sharp contrast, our condition (7) can allow for large instantaneous biased gradient estimates $f(\Theta_k, X_k)$ of $\bar{f}(\Theta_k)$. Indeed, Assumption 3 is satisfied by a broad family of discrete-time stochastic processes, including e.g., i.i.d. random vector sequences [5], finite-state irreducible and aperiodic Markov chains [30], and Ornstein-Uhlenbeck processes [31]; whereas existing works [5], [20], [21], [22] focus solely on one type of those stochastic processes.

3 Non-asymptotic bounds on the mean-square error

In this paper, we seek to develop novel tools for proving non-asymptotic bounds on the mean-square error of the iterates $\{\Theta_k\}_{k \ge 1}$ generated by a recursion of the form (1) (to the equilibrium point $\theta^* = 0$). Before presenting the main results, we start off by introducing an instrumental result, which is the key to our novel approach to controlling the possible bias in gradient estimates of the SA procedure. Its proof is provided in Appendix A of the supplementary material.

Proposition 1. Under Assumptions 1 and 3, there exists a function $g'(k, T, \Theta_k)$ such that the next relation holds for all $T \in \mathbb{N}_+$:
\[
\Theta_{k+T} = \Theta_k + \epsilon T \bar{f}(\Theta_k) + g'(k, T, \Theta_k) \tag{8}
\]
satisfying
\[
\big\| \mathbb{E}\big[ g'(k, T, \Theta_k) \,\big|\, \Theta_k \big] \big\| \le \epsilon L T \beta_k(T, \epsilon) ( \| \Theta_k \| + 1 ) \tag{9}
\]
\[
\beta_k(T, \epsilon) := \epsilon L T (1 + \epsilon L)^{T-2} + \sigma(T; k) \tag{10}
\]
where the expectation is taken over $\{X_j\}_{j = k}^{k + T - 1}$.

Evidently, Proposition 1 offers a bound on the average gradient bias over some $T > 0$ iterations, which is indeed motivated by our Assumption 3. Based on Proposition 1, we present the following theorem, which establishes a general convergence result that applies to any stochastic sequence $\{X_k\}_{k \in \mathbb{N}}$ satisfying Assumption 3.

Theorem 1. Under Assumptions 1-3 and for any $\delta > 0$, there exist a function $W'(k, \Theta_k)$ and constants $(T^* \in \mathbb{N}_+, \epsilon_\delta)$ such that $\sigma(T^*; k) \le \delta$ and the following inequalities are globally and uniformly satisfied for all $\epsilon \in (0, \epsilon_\delta)$ and all $k \in \mathbb{N}$:
\[
c_1' \| \Theta_k \|^2 \le W'(k, \Theta_k) \le c_2' \| \Theta_k \|^2 + c_2'' (\epsilon L)^2 \tag{11}
\]
\[
\mathbb{E}\big[ W'(k+1, \Theta_{k+1}) - W'(k, \Theta_k) \,\big|\, \Theta_k \big] \le -\epsilon c_3' \| \Theta_k \|^2 + c_4' \epsilon^2 + c_5' \sigma(T^*; k) \epsilon \tag{12}
\]
where $c_1', c_2', c_2'', c_3', c_4', c_5' > 0$ are constants dependent on $c_1 \sim c_4$ of (6) but independent of $\epsilon > 0$.

The proof of Theorem 1 is relegated to Appendix B of the supplementary material due to space limitations. Our proof builds critically on the construction of the function $W'(k, \Theta_k)$ from the Lyapunov function $W(\theta)$ of the ODE (3). To be able to use the concentration bound in (7), we are motivated to introduce a candidate function that necessarily looks ahead to a number of $T$ future iterates, with parameter $T \in \mathbb{N}_+$ to be designed such that the gradient bias can be made affordable, given by
\[
W'(k, \Theta_k) = \sum_{j = k}^{k + T - 1} W(\Theta_j(k, \Theta_k)) \tag{13}
\]
where, to make the dependence of $\Theta_{j \ge k}$ as a function of $\Theta_k$ explicit, we intentionally write $\Theta_j = \Theta_j(k, \Theta_k)$, understood as the iterate of the recursion (1) at time instant $j \ge k$, with initial condition $\Theta_k$ at time $k$.
The design parameter $T \ge 1$ allows us to exploit the monotonically decreasing function $\sigma(T; k) \to 0$ in (9) to gain control over the gradient bias, therefore rendering the bounds (11)-(12) possible. When $\{X_k\}_k$ is e.g., an i.i.d. random sequence [5], or a Markov chain that has approximately arrived at its steady state (i.e., after a certain mixing time of recursions) [22], it has been shown that it suffices to choose $T = 1$, that is, $W'(\Theta_k) = W(\Theta_k)$, to validate (11)-(12). For general Markov chains, however, functions like $W'(\Theta_k) = W(\Theta_k)$ may fail to yield finite-time error bounds that hold for the entire sequence $\{\Theta_k\}_{k \ge 1}$. In a nutshell, our novel way of constructing the Lyapunov function offers an effective tool for finite-time analysis of general SA algorithms driven by a broad family of (discrete-time) stochastic processes.

We are now ready to study the drift of $W'(k, \Theta_k)$, which follows from Theorem 1, and whose proof is provided in Appendix C of the supplementary material.

Lemma 1. Under Assumptions 1-3, the following holds true for all $\epsilon \in (0, \epsilon_\delta)$ and all $k \in \mathbb{N}$:
\[
\mathbb{E}\big[ W'(k+1, \Theta_{k+1}) \big] \le \left( 1 - \frac{c_3' \epsilon}{c_2'} \right) \mathbb{E}\big[ W'(k, \Theta_k) \big] + c_4'' \epsilon^2 + c_5' \sigma(T^*; k) \epsilon \tag{14}
\]
where $c_4'' > 0$ is an appropriate constant independent of $\epsilon$, and $T^* \in \mathbb{N}_+$ is fixed in Theorem 1.

Theorem 2. Let $k_\epsilon := \min\{ k \in \mathbb{N}_+ \mid \sigma(T^*; k) \le \epsilon \}$. Under Assumptions 1-3, and choosing any stepsize $\epsilon \in (0, \epsilon_\delta)$, the following finite-time error bounds hold for all $k \in \mathbb{N}$:
\[
\mathbb{E}\| \Theta_k \|^2 \le \frac{c_2'}{c_1'} \left( 1 - \frac{c_3' \epsilon}{c_2'} \right)^{\!k} \| \Theta_0 \|^2 + \frac{c_2'' L^2}{c_1'} \epsilon^2 + \frac{c_6}{c_1'} \epsilon + \frac{c_6}{c_1'} \left( 1 - \frac{c_3' \epsilon}{c_2'} \right)^{\!\max\{k - k_\epsilon, 0\}} \delta \tag{15}
\]
where $c_6 > 0$ is some constant, and $\delta > 0$ is given in Theorem 1.

Our bound in Theorem 2 reveals a two-phase convergence behavior of biased SA schemes obeying Assumptions 1-3. At the beginning, the gradient bias characterized in terms of $\sigma(T^*; k)$ is sizable; precisely, when $\sigma(T^*; k) \gg \epsilon$, the last term in (15) is basically a constant $(c_6/c_1')\delta$ that does not shrink as $\epsilon \to 0$. That is, the biased SA procedure converges linearly fast only to a constant-size [size-$(c_6/c_1')\delta$] neighborhood of the equilibrium point $\theta^* = 0$ of the associated ODE. On the other hand, as $k$ increases, the gradient bias $\sigma(T^*; k)$ decreases. Specifically, when $k \ge k_\epsilon$, or $\sigma(T^*; k) \le \epsilon$, the size of this neighborhood can be controlled by $\epsilon$. Under this condition, we establish that the SA recursion converges linearly fast to an $\epsilon$-neighborhood of the equilibrium point $\theta^* = 0$, which can be made arbitrarily small by setting $\epsilon > 0$ small enough. The proof of Theorem 2 is postponed to Appendix D of the supplemental document.

At this point, some observations are worth making.

Remark 3. Existing non-asymptotic results have focused on either linear SA algorithms (e.g., [32], [20], [22], [21], [33]), or nonlinear SA under i.i.d. noise (e.g., [4], [5]). In sharp contrast, the finite-time error bound in Theorem 2 is applicable to a class of nonlinear SA procedures under a broad family of discrete-time stochastic perturbation processes.
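To make the two-phase behavior of Theorem 2 concrete, here is a minimal simulation sketch: a toy scalar linear SA driven by a two-state Markov chain, with every constant an illustrative assumption rather than a quantity derived from the theorem. The steady-state mean-square error shrinks with the stepsize, consistent with the last terms of (15).

```python
import numpy as np

rng = np.random.default_rng(1)

def final_mse(eps, n_iters=3000, n_runs=200):
    """Monte Carlo estimate of E||Theta_k||^2 at the last iterate."""
    errs = []
    for _ in range(n_runs):
        theta, x = 2.0, -1.0          # Theta_0 = 2; the chain starts at state -1
        for _ in range(n_iters):
            # Toy linear SA: f(theta, x) = -theta + x/2; stationary mean of x is 0.
            theta += eps * (-theta + 0.5 * x)
            if rng.random() < 0.1:    # symmetric two-state chain, flip prob. 0.1
                x = -x
        errs.append(theta ** 2)
    return float(np.mean(errs))

for eps in (0.1, 0.01):
    print(f"eps = {eps:<5}  E||Theta_k||^2 ~ {final_mse(eps):.2e}")
# Early on, the error decays geometrically but stalls at a bias-induced floor;
# for large k the floor scales with eps, matching the behavior described above.
```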
Remark 4. When the general recursion (1) is specialized to linear SA driven by Markovian noise $\{X_k\}_{k \in \mathbb{N}}$, i.e., $f(\Theta_k, X_k) = A(X_k)\Theta_k + b(X_k)$, our established bound in (15) improves upon the state-of-the-art in [22, Theorem 7] by a constant $\sim (1 - c_3'\epsilon/c_2')^{\tau}$, where $\tau \gg 1$ is the mixing time of the Markov chain $\{X_k\}$. In fact, the bound in [22, Theorem 7] becomes applicable only after a mixing time of updates (i.e., for $k \ge \tau$), once the Markov chain gets sufficiently 'close' to its stationary distribution; in sharp contrast, our bound (15) is effective from the first iteration for Markov chains starting with any initial distribution. Furthermore, our steady-state value (the last term of (15)) scales only with the stepsize $\epsilon > 0$ (removing the dependence on $\tau$ present in the bound of [22]), and it vanishes as $\epsilon \to 0$. Evidently, with the bound in (15), one can easily estimate the number of data samples (e.g., the length of a Markov chain trajectory) required for the mean-square error to be of the same order as its steady-state value.

4 Applications to approximate reinforcement learning

We now turn to the consequences of our general results for the problem of reinforcement learning with (non-)linear function approximation (a.k.a. approximate reinforcement learning), in particular TD- and Q-learning. Toward this objective, we begin by providing a brief introduction to discounted MDPs and basic reinforcement learning algorithms; interested readers can refer to standard sources (e.g., [10], [9]) for more background.

4.1 Background and set-up

Consider an MDP, defined by the quintuple $(\mathcal{S}, \mathcal{U}, \mathcal{P}, R, \gamma)$, where $\mathcal{S}$ is a finite set of possible states (a.k.a. state space), $\mathcal{U}$ is a finite set of possible actions (a.k.a. action space), $\mathcal{P} := \{ P^u \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|} \mid u \in \mathcal{U} \}$ is a collection of probability transition matrices indexed by actions $u$, $R(s, u): \mathcal{S} \times \mathcal{U} \to \mathbb{R}$ is the reward received upon executing action $u$ while in state $s$, and $\gamma \in [0, 1)$ is the discount factor. The results, along with the theoretical analysis developed in this paper, may be generalized to deal with infinite and compact state and/or action spaces, but we restrict ourselves to finite spaces here for ease of exposition.

An agent selects actions to interact with the MDP (the environment) by operating a policy. Specifically, at each time step $k \in \mathbb{N}$, the agent first observes the state $S_k = s \in \mathcal{S}$ of the environment, and takes an action $U_k = u \in \mathcal{U}$ by following a deterministic policy $\pi: \mathcal{S} \to \mathcal{U}$, or a stochastic one $U_k \sim \pi(\cdot \mid S_k)$, where $\pi(\cdot \mid s)$ is a probability distribution function supported on $\mathcal{U}$. The environment then moves to the next state $S_{k+1} = s' \in \mathcal{S}$ with probability $P^u_{ss'} = \Pr(S_{k+1} = s' \mid S_k = s, U_k = u)$, and an instantaneous reward $R_k := R(S_k, U_k)$ is revealed to the agent. This procedure generates a trajectory of states, actions, and rewards, namely $S_0, U_0, R_0, S_1, \ldots, S_T, U_T, R_T, S_{T+1}, \ldots$ over $\mathcal{S} \times \mathcal{U} \times \mathbb{R}$.

For a given policy $\pi$, the value function $V^\pi(s): \mathcal{S} \to \mathbb{R}$ measures the expected discounted reward obtained by starting the MDP in a given state, and then following the policy $\pi$ to take actions in all subsequent iterations.
More precisely, we define
\[
V^\pi(s) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k R(S_k, U_k) \,\middle|\, S_0 = s \right] \tag{16}
\]
where the conditional expectation $\mathbb{E}$ is taken over the entire trajectory of observations conditioned on the initial state $S_0 = s$. Let $R^\pi(s): \mathcal{S} \to \mathbb{R}$ with $R^\pi(s) := \sum_{s' \in \mathcal{S}} \sum_{u \in \mathcal{U}} \pi(u \mid s) P^u_{ss'} R(s, u)$, and $P^\pi(s, s'): \mathcal{S} \times \mathcal{S} \to \mathbb{R}$ with $P^\pi(s, s') := \sum_{u \in \mathcal{U}} \pi(u \mid s) P^u_{ss'}$. Postulating a canonical ordering on the elements of $\mathcal{S}$, we can view $V^\pi$ and $R^\pi$ as vectors in $\mathbb{R}^{|\mathcal{S}|}$, and $P^\pi$ as a matrix in $\mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$. It is well known that such a function $V^\pi$ exists given uniformly bounded rewards $\{ |R(s, u)| \le \bar{r} \}_{s,u}$, and it is the unique solution to the next Bellman equation [15, 29]:
\[
V^\pi(s) = R^\pi(s) + \gamma \sum_{s' \in \mathcal{S}} P^\pi(s, s') V^\pi(s'), \quad \forall s \in \mathcal{S}. \tag{17}
\]
Evidently, if all transition probabilities $\{ P^\pi(s, s') \}_{s,s'}$ were known, the vector $V^\pi$ could be found by solving the system of $|\mathcal{S}|$ linear equations dictated by (17). But in practice, either $\{ P^\pi(s, s') \}_{s,s'}$ are unknown or $|\mathcal{S}|$ is huge or even infinite, so it is impossible to evaluate $V^\pi$ exactly. Here, the goal is to efficiently estimate the value function $V^\pi(s)$ by observing a single trace of the Markov chain $\{S_0, S_1, S_2, \ldots\}$ and the associated rewards, which is dealt with in Subsection 4.2. Since the policy $\pi$ we consider in this paper will be assumed fixed, we shall drop the dependence on $\pi$ in our notation for brevity.

Likewise, we can define for control purposes the so-called action-value function (a.k.a. Q-function), which measures the quality of a given policy by the expected sum of discounted instantaneous rewards, conditioned on starting in a given state-action pair and following the policy $\pi$ to take subsequent actions; i.e.,
\[
Q(s, u) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k R(S_k, U_k) \,\middle|\, S_0 = s, U_0 = u \right], \quad \text{where } U_k \sim \pi(\cdot \mid S_k) \text{ for all } k \in \mathbb{N}_+. \tag{18}
\]
Naturally, we would like to choose the policy $\pi$ such that the values of the Q-function are optimized. Indeed, it has been established that the Q-function associated with the optimal policy $\pi^*$, yielding the optimal Q-function denoted by $Q^*$, satisfies the following Bellman equation [9, 13, 34]:
\[
Q^*(s, u) = \mathbb{E}[R(s, u)] + \gamma \, \mathbb{E}\left[ \max_{u' \in \mathcal{U}} Q^*(s', u') \,\middle|\, s, u \right]. \tag{19}
\]
After assuming a canonical ordering on the elements of $\mathcal{S} \times \mathcal{U}$, the table $Q$ can be treated as a matrix in $\mathbb{R}^{|\mathcal{S}| \times |\mathcal{U}|}$. Once $\{ Q^*(s, u) \}_{s,u}$ becomes available, an optimal policy $\pi^*$ can be recovered by setting $\pi^*(s) \in \arg\max_{u \in \mathcal{U}} Q^*(s, u)$ for all $s \in \mathcal{S}$, without any knowledge about the transition probabilities. Again, in the learning context of interest, the transition probabilities $\{ P^u_{ss'} \}_{s,u,s'}$ are unknown and the dimensions $|\mathcal{S}|$ and/or $|\mathcal{U}|$ can be huge or even infinite in practice, so it is not possible to evaluate the Bellman equation (19) exactly.

As one of the most popular solutions for finding the optimal policy, Q-learning [8] iteratively updates the estimate $Q_k$ of $Q^*$ using a single trajectory of samples $\{(S_k, U_k, S_{k+1})\}$ generated by following the policy $\pi$, according to the recursion
\[
Q_{k+1}(S_k, U_k) = Q_k(S_k, U_k) + \epsilon_k \Big[ R(S_k, U_k) + \gamma \max_{u' \in \mathcal{U}} Q_k(S_{k+1}, u') - Q_k(S_k, U_k) \Big] \tag{20}
\]
where $\{ 0 < \epsilon_k < 1 \}$ is a sequence of stepsizes to be chosen by the user.
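For concreteness, a minimal sketch of the tabular recursion (20) on a toy two-state, two-action MDP follows; the transition kernel, rewards, and uniform behavior policy are all illustrative assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, eps_k = 0.9, 0.1        # discount factor and a (here constant) stepsize

# Illustrative MDP: P[s, u] is the next-state distribution, R[s, u] the reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])

Q = np.zeros((2, 2))           # estimate Q_k of Q*
s = 0
for k in range(20000):
    u = rng.integers(2)                    # uniform behavior policy, so every
    s_next = rng.choice(2, p=P[s, u])      # (s, u) pair is visited infinitely often
    # Recursion (20): move Q_k(S_k, U_k) toward the sampled Bellman target.
    Q[s, u] += eps_k * (R[s, u] + gamma * Q[s_next].max() - Q[s, u])
    s = s_next

print(np.round(Q, 2))          # approximate Q*; a greedy policy reads off argmax_u
```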
Under standard conditions on the stepsizes, the sequence $\{Q_k\}$ converges to $Q^*$ almost surely as long as every state-action pair $(s, u) \in \mathcal{S} \times \mathcal{U}$ is visited infinitely often; see, for instance, [34, 13, 9, 10].

However, it is well known that for many important problems of practical interest, the computational requirements of exact function estimation are overwhelming, mainly because of the large number of states and actions (i.e., Bellman's "curse of dimensionality") [9]. Instead, a popular approach in the literature has been to leverage low-dimensional parametric approximations of the value function or the Q-function. Although contemporary nonlinear approximators such as neural networks [11], [35] could lead to more powerful approximations, the simplicity of reinforcement learning with linear function approximation [10] allows us to analyze these algorithms in detail.

4.2 Temporal-difference learning with linear function approximation

In this section, we target a deeper understanding of the dynamics of TD-learning algorithms with linear function approximation. Toward this objective, we assume a linear function approximator for the true value function $V(s)$, given by
\[
V(s) \approx V_\theta(s) = \phi^\top(s)\theta \tag{21}
\]
where $\theta \in \mathbb{R}^d$ is a parameter vector to be learned, typically having $d \ll |\mathcal{S}|$; and $\phi(s) \in \mathbb{R}^d$ is a fixed feature vector dictated by preselected basis functions $\phi_1, \phi_2, \ldots, \phi_d: \mathcal{S} \to \mathbb{R}$, that act on state $s$ as follows:
\[
\phi(s) := [\phi_1(s) \;\; \phi_2(s) \;\; \cdots \;\; \phi_d(s)]^\top \in \mathbb{R}^d, \quad \forall s \in \mathcal{S}. \tag{22}
\]
Without loss of generality, we assume that all feature vectors obey $\|\phi(s)\| \le 1$, $\forall s \in \mathcal{S}$ [29]. For future reference, we stack up all feature vectors $\{\phi(s)\}_{s \in \mathcal{S}}$ to form the matrix
\[
\Phi := \begin{bmatrix} \phi^\top(s_1) \\ \phi^\top(s_2) \\ \vdots \\ \phi^\top(s_{|\mathcal{S}|}) \end{bmatrix} \in \mathbb{R}^{|\mathcal{S}| \times d}.
\]
We also make the standard assumption that $\Phi$ has full column rank, i.e., any redundant or irrelevant feature vectors have been removed [15].

Consider the classical TD-learning for estimating the unknown parameter vector $\theta$ [7]. Starting with some $\Theta_0$, the simplest yet widely used TD variant, known as TD(0), updates the estimate $\Theta_k$ of $\theta$ according to
\[
\Theta_{k+1} = \Theta_k - \epsilon \Big[ \phi^\top(S_k)\Theta_k - R(S_k, U_k) - \gamma \phi^\top(S_{k+1})\Theta_k \Big] \phi(S_k) \tag{23}
\]
where $\epsilon \in (0, 1)$ is a constant stepsize. We are interested in developing finite-time error bounds for (23), where the observed tuples $\{(S_k, R(S_k, U_k), S_{k+1})\}_{k \in \mathbb{N}}$ are gathered from a single trajectory of the Markov chain $\{S_k\}_{k \in \mathbb{N}}$.

To apply our results developed for general SA algorithms to TD(0) learning here, we just need to verify our working assumptions in Section 2. Let $\mu \in [0, 1)^{|\mathcal{S}|}$ denote the stationary distribution of the Markov chain (viewed as a row vector), i.e., $\mu = \mu P$, and let $D \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$ be the diagonal matrix holding the entries of $\mu$ on its main diagonal. Upon introducing $A := -\Phi^\top D (\Phi - \gamma P\Phi)$ and $b := \Phi^\top D R$, it has been shown in [15, Theorem 2] that the TD(0) algorithm with diminishing stepsizes obeying standard conditions tracks the ODE
\[
\dot{\theta} = A\theta + b \tag{24}
\]
and converges to its unique equilibrium point $\theta^* = -A^{-1}b$. By substituting $\Theta = \tilde{\Theta} + \theta^*$ into (23), we deduce that the centered recursion
\[
\tilde{\Theta}_{k+1} = \tilde{\Theta}_k + \epsilon \phi(S_k) \Big[ \gamma \phi^\top(S_{k+1})\tilde{\Theta}_k - \phi^\top(S_k)\tilde{\Theta}_k + R(S_k, U_k) + \gamma \phi^\top(S_{k+1})\theta^* - \phi^\top(S_k)\theta^* \Big]
\]
tracks the ODE $\dot{\tilde{\theta}} = A\tilde{\theta}$, which has a unique equilibrium point at the origin $\tilde{\theta} = 0$.
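Before casting (23) as an instance of (1) and verifying the assumptions, a minimal simulation sketch of the update (23) may help fix ideas. The three-state chain, rewards, and feature vectors below are illustrative assumptions, chosen so that $\|\phi(s)\| \le 1$ and $\Phi$ has full column rank, as required above.

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, eps = 0.9, 0.05

# Illustrative three-state chain under a fixed policy, with d = 2 features.
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
r = np.array([0.0, 0.0, 1.0])        # reward R(S_k, U_k) observed in each state
Phi = np.array([[1.0, 0.0],          # rows are phi(s)^T; ||phi(s)|| <= 1 and
                [0.5, 0.5],          # Phi has full column rank, as assumed
                [0.0, 1.0]])

theta = np.zeros(2)                  # Theta_0
s = 0
for k in range(50000):
    s_next = rng.choice(3, p=P[s])
    # TD(0) update (23): step along the temporal-difference error times phi(S_k).
    td_err = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    theta += eps * td_err * Phi[s]
    s = s_next

print(Phi @ theta)   # approximate values V_theta(s) = phi(s)^T theta at the 3 states
```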
If we define
\[
X_k := [S_k \;\; S_{k+1}]^\top
\]
\[
\begin{aligned}
f(\tilde{\Theta}_k, X_k) :={}& \phi(S_k)\big[\gamma \phi(S_{k+1}) - \phi(S_k)\big]^\top \tilde{\Theta}_k + R(S_k, U_k)\,\phi(S_k) \tag{25} \\
& + \phi(S_k)\big[\gamma \phi(S_{k+1}) - \phi(S_k)\big]^\top \theta^* \tag{26}
\end{aligned}
\]
it now becomes clear that the TD(0) algorithm is a special case of the stochastic recursion (1). As has also been made clear in our discussion above, the following limit exists for each $\tilde{\theta} \in \mathbb{R}^d$:
\[
\bar{f}(\tilde{\theta}) = \lim_{k \to \infty} \mathbb{E}[f(\tilde{\theta}, X_k)] = A\tilde{\theta}, \quad \text{with } \bar{f}(0) = 0. \tag{27}
\]
Next, let us turn to verifying Assumptions 1-3.

Verifying Assumption 1. For any $\tilde{\theta}, \tilde{\theta}' \in \mathbb{R}^d$, consider $X = x = [s \;\; s']^\top$ with $s, s' \in \mathcal{S}$. We have that
\[
\begin{aligned}
\big\| f(\tilde{\theta}, X) - f(\tilde{\theta}', X) \big\| &= \big\| \phi(s)[\gamma \phi(s') - \phi(s)]^\top (\tilde{\theta} - \tilde{\theta}') \big\| \\
&\le \big\| \gamma \phi(s)\phi^\top(s') - \phi(s)\phi^\top(s) \big\| \, \big\| \tilde{\theta} - \tilde{\theta}' \big\| \\
&\le \big( \gamma \|\phi(s)\| \|\phi(s')\| + \|\phi(s)\|^2 \big) \big\| \tilde{\theta} - \tilde{\theta}' \big\| \\
&\le 2 \big\| \tilde{\theta} - \tilde{\theta}' \big\| \tag{28}
\end{aligned}
\]
for all possible $s, s' \in \mathcal{S}$, where the last inequality follows from the assumption that $\|\phi(s)\| \le 1$ for all $s \in \mathcal{S}$ and $0 \le \gamma < 1$. Clearly, inequality (4) holds with constant $L_1 = 2$. Likewise, for each $\tilde{\theta}$, it follows for all $s, s' \in \mathcal{S}$ that
\[
\begin{aligned}
\big\| f(\tilde{\theta}, X) \big\| &= \big\| \phi(s)[\gamma \phi(s') - \phi(s)]^\top \tilde{\theta} + R(s, u)\phi(s) + \phi(s)[\gamma \phi(s') - \phi(s)]^\top \theta^* \big\| \\
&\le \big\| \gamma \phi(s)\phi^\top(s') - \phi(s)\phi^\top(s) \big\| \, \|\tilde{\theta}\| + |R(s, u)| \, \|\phi(s)\| + \big\| \phi(s)[\gamma \phi(s') - \phi(s)]^\top \big\| \, \|\theta^*\| \\
&\le 2\|\tilde{\theta}\| + \bar{r} + 2\|\theta^*\| \tag{29} \\
&\le \max(2, \bar{r} + 2\|\theta^*\|)\,(\|\tilde{\theta}\| + 1) \tag{30}
\end{aligned}
\]
where (29) is due to the uniformly bounded rewards $\{ |R(s, u)| \le \bar{r} \}$, verifying inequality (5). Combining (28) and (30), Assumption 1 holds with the constant $L := \max(2, \bar{r} + 2\|\theta^*\|)$.

Verifying Assumption 2. Consider now the centered ODE $\dot{\tilde{\theta}} = \bar{f}(\tilde{\theta}) = A\tilde{\theta}$. It has been shown in [15] that $A = -\Phi^\top D(\Phi - \gamma P\Phi)$ is negative definite (but not symmetric), in the sense that $\tilde{\theta}^\top A \tilde{\theta} < 0$ for all $\tilde{\theta} \ne 0$; that is, $A$ is Hurwitz and full rank. So the equilibrium point $\tilde{\theta} = 0$ is globally asymptotically stable for this ODE. Based on linear system theory, one can construct a Lyapunov function candidate $V(\tilde{\theta})$ as follows:
\[
V(\tilde{\theta}) = \tilde{\theta}^\top P \tilde{\theta} \tag{31}
\]
for some symmetric, positive definite matrix $P \in \mathbb{R}^{d \times d}$ to be determined. It is clear that $\lambda_{\min}(P)\|\tilde{\theta}\|^2 \le V(\tilde{\theta}) \le \lambda_{\max}(P)\|\tilde{\theta}\|^2$ for all $\tilde{\theta} \in \mathbb{R}^d$, where $\lambda_{\min}(P)$ and $\lambda_{\max}(P)$ denote the smallest and largest eigenvalues of $P$, respectively, with $\lambda_{\max}(P) \ge \lambda_{\min}(P) > 0$. Hence, inequality (6a) holds with constants $c_1 = \lambda_{\min}(P)$ and $c_2 = \lambda_{\max}(P)$. Regarding (6b), we have that
\[
\left( \frac{\partial V}{\partial \tilde{\theta}} \Big|_{\tilde{\theta}} \right)^{\!\top} \bar{f}(\tilde{\theta}) = 2\tilde{\theta}^\top P A \tilde{\theta} = \tilde{\theta}^\top \big( A^\top P + P A \big) \tilde{\theta}.
\]
Consider the continuous Lyapunov equation with Hurwitz $A$:
\[
A^\top P + P A = -Q \tag{32}
\]
which is well known to have a unique symmetric, positive definite solution $P$ given any symmetric, positive definite matrix $Q$ (e.g., [9]). So, by choosing any $Q = Q^\top \succ 0$, we verify that (6b) holds with constant $c_3 = \lambda_{\min}(Q)/L > 0$ as well. As far as (6c) is concerned, observing that $\big\| \frac{\partial V}{\partial \tilde{\theta}} \big|_{\tilde{\theta}} \big\| = 2\|P\tilde{\theta}\| \le 2\lambda_{\max}(P)\|\tilde{\theta}\|$, it suffices to take $c_4 = 2\lambda_{\max}(P)$. Summarizing these three cases, we have shown that Assumption 2 is met by the TD(0) algorithm.

Verifying Assumption 3. The following property holds for any finite, irreducible, and aperiodic Markov chain [36, Theorem 4.9].
Lemma 2. For any finite-state, irreducible, and aperiodic Markov chain, there are constants $c_0 > 0$ and $\rho \in (0, 1)$ such that
\[
\max_{x \in \mathcal{X}} d_{TV}\big( \Pr(X_k \in \cdot \mid X_0 = x), \, \mu \big) \le c_0 \rho^k, \quad \forall k \in \mathbb{N} \tag{33}
\]
where $\mu$ is the steady-state distribution of $\{X_k\}_{k \in \mathbb{N}}$, and $d_{TV}(\nu, \mu)$ denotes the total variation distance between probability measures $\mu$ and $\nu$, defined as follows:
\[
d_{TV}(\nu, \mu) := \frac{1}{2} \sum_{x \in \mathcal{X}} | \nu(x) - \mu(x) |. \tag{34}
\]
Considering $X_{T_0} = x_{T_0} = [s_0 \;\; s_0']^\top$ with any $s_0, s_0' \in \mathcal{S}$, it holds for each $\tilde{\theta} \in \mathbb{R}^d$ that
\[
\begin{aligned}
& \left\| \frac{1}{T} \sum_{k=T_0+1}^{T_0+T} \mathbb{E}\big[ f(\tilde{\theta}, X_k) \mid X_0 = [s_0 \;\; s_0'] \big] - \bar{f}(\tilde{\theta}) \right\| \\
&\quad = \left\| \frac{1}{T} \sum_{k=T_0+1}^{T_0+T} \mathbb{E}\Big[ \phi(S_k)[\gamma\phi(S_{k+1}) - \phi(S_k)]^\top \tilde{\theta} + R(S_k, U_k)\phi(S_k) \right. \\
&\qquad\qquad \left. + \phi(S_k)[\gamma\phi(S_{k+1}) - \phi(S_k)]^\top \theta^* \,\Big|\, S_0 = s_0, S_1 = s_0' \Big] - \mathbb{E}_\mu[f(\tilde{\theta}, X)] \right\| \\
&\quad = \left\| \frac{1}{T} \sum_{k=T_0+1}^{T_0+T} \sum_{s \in \mathcal{S}} \big( \Pr(S_k = s \mid S_0 = s_0, S_1 = s_0') - \mu(s) \big) \Big[ \phi(s)\big( \gamma P(s, s')\phi(s') - \phi(s) \big)^\top \tilde{\theta} \right. \\
&\qquad\qquad \left. + r(s)\phi(s) + \phi(s)\big( \gamma P(s, s')\phi(s') - \phi(s) \big)^\top \theta^* \Big] \right\| \tag{35} \\
&\quad \le \max_{s, s'} \Big\| \phi(s)\big( \gamma P(s, s')\phi(s') - \phi(s) \big)^\top \tilde{\theta} + r(s)\phi(s) + \phi(s)\big( \gamma P(s, s')\phi(s') - \phi(s) \big)^\top \theta^* \Big\| \\
&\qquad\qquad \times \frac{1}{T} \sum_{k=T_0+1}^{T_0+T} \sum_{s \in \mathcal{S}} \big| \Pr(S_k = s \mid S_1 = s_0') - \mu(s) \big| \tag{36} \\
&\quad \le L(\|\tilde{\theta}\| + 1) \times \frac{1}{T} \sum_{k=T_0+1}^{T_0+T} 2c_0 \rho^{k-1} \tag{37} \\
&\quad \le \sigma(T; T_0) \, L(\|\tilde{\theta}\| + 1) \tag{38}
\end{aligned}
\]
where $\mathbb{E}_\mu[f(\tilde{\theta}, X)]$ takes the expectation over $X = [S \;\; S']^\top$ with respect to the stationary distribution $\mu$ of the Markov chain $\{S_k\}$; (35) leverages the transition probabilities $\{P_{ss'}\}$; and (37) uses the inequality (30) and the fact that finite-state, irreducible, and aperiodic Markov chains mix at a geometric rate $\rho \in (0, 1)$ from any $s_0' \in \mathcal{S}$ (see Lemma 2). Now we can conclude that Assumption 3 holds with the constant $L := \max(2, \bar{r} + 2\|\theta^*\|)$ found above and the function
\[
\sigma(T; T_0) := \frac{2 c_0 \rho^{T_0}}{(1 - \rho) T}
\]
which indeed monotonically decreases to $0$ as $T \to \infty$ or $T_0 \to \infty$.

Remark 5. It is worth remarking that although geometric mixing is used here for ease of presentation, it is clear from (36) that Assumption 3 is satisfied by Markov chains with general mixing rates as long as $\sum_{k=T_0+1}^{T_0+T} \sum_{s \in \mathcal{S}} | \Pr(S_k = s \mid S_1 = s_0') - \mu(s) |$ grows sub-linearly in $T$; that is,
\[
\lim_{T \to \infty} \frac{1}{T} \sum_{k=T_0+1}^{T_0+T} \sum_{s \in \mathcal{S}} \big| \Pr(S_k = s \mid S_{T_0+1} = s_0') - \mu(s) \big| = 0, \quad \forall s_0' \in \mathcal{S}, \; T_0 \in \mathbb{N}. \tag{39}
\]
In a nutshell, the finite-time error bound in Theorem 2 applies to TD(0) learning algorithms in a generally mixing Markov chain setting, provided the standard assumptions hold: the reward function is uniformly bounded as $\max_{s,u} |R(s, u)| \le \bar{r}$, the feature matrix $\Phi$ has full column rank, and all feature vectors $\{\phi(s)\}$ are bounded too.

Remark 6. Although the basic TD(0) learning algorithm is dealt with here, it is worth remarking that other variants of TD-learning, such as TD($\lambda$) and GTD, can also be cast in the form of (1); see related treatments in e.g., [22]. Hence, our novel multistep Lyapunov approach, as well as the associated non-asymptotic convergence analysis, is readily applicable to general TD-learning algorithms.
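Before moving on to Q-learning, note that the geometric decay (33) underlying $\sigma(T; T_0)$ is easy to probe numerically. The sketch below computes the worst-case total variation distance (34) for the illustrative three-state chain used in the TD(0) sketch; the chain itself is an assumption of the example, not one from the paper.

```python
import numpy as np

# Transition matrix of the illustrative three-state chain (finite, irreducible, aperiodic).
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])

# Stationary distribution mu: the left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
mu = np.real(vecs[:, np.argmax(np.real(vals))])
mu /= mu.sum()

dist = np.eye(3)                       # row x holds Pr(X_k = . | X_0 = x)
for k in range(1, 21):
    dist = dist @ P
    # Worst case over initial states of d_TV in (34), as on the left of (33).
    d_tv = 0.5 * np.abs(dist - mu).sum(axis=1).max()
    if k % 5 == 0:
        print(f"k = {k:2d}:  max_x d_TV = {d_tv:.2e}")
# The printed values decay like c0 * rho^k, so the resulting
# sigma(T; T0) = 2*c0*rho^T0 / ((1 - rho)*T) vanishes in both T and T0.
```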
4.3 Q-learning with linear function approximation

In this section, we provide an improved non-asymptotic analysis for Q-learning using linear function approximators. In this direction, we assume that the Q-function is parameterized by a linear function as follows:
\[
Q(s, u) \approx Q_\theta(s, u) = \psi^\top(s, u)\theta \tag{40}
\]
where we have kept the notation $\theta \in \mathbb{R}^d$ for the parameter vector to be learned (as in the previous subsection), typically of size $d \ll |\mathcal{S}| \times |\mathcal{U}|$, the number of state-action pairs; and the feature vector $\psi(s, u) \in \mathbb{R}^d$ stacks up $d$ features produced by pre-selected basis functions $\{\psi_\ell(s, u): \mathcal{S} \times \mathcal{U} \to \mathbb{R}\}_{\ell=1}^d$. Similar to TD-learning with linear function approximation, here we can also introduce the feature matrix, given by
\[
\Psi := \begin{bmatrix} \psi^\top(s_1, u_1) \\ \psi^\top(s_1, u_2) \\ \vdots \\ \psi^\top(s_{|\mathcal{S}|}, u_{|\mathcal{U}|}) \end{bmatrix} \in \mathbb{R}^{|\mathcal{S}||\mathcal{U}| \times d}
\]
which is assumed to have full column rank (that is, linearly independent columns) and to satisfy $\|\psi(s, u)\| \le 1$ for all state-action pairs $(s, u) \in \mathcal{S} \times \mathcal{U}$.

The well-known approximate Q-learning algorithm updates the parameter vector $\Theta$ according to the recursion (e.g., [9, 10])
\[
\Theta_{k+1} = \Theta_k + \epsilon \psi(S_k, U_k) \Big[ R(S_k, U_k) + \gamma \max_{u \in \mathcal{U}} \psi^\top(S_{k+1}, u)\Theta_k - \psi^\top(S_k, U_k)\Theta_k \Big] \tag{41}
\]
with some constant stepsize $\epsilon \in (0, 1)$. The objective here is to establish finite-time error guarantees for (41), when the observed data samples $\{(S_k, U_k, R(S_k, U_k), S_{k+1}, U_{k+1})\}_{k \in \mathbb{N}}$ are collected along a single path of the Markov chain $\{S_k\}_{k \in \mathbb{N}}$ by following a deterministic policy $\pi$. Considering
\[
F(\theta, X_k) = \psi(S_k, U_k) \Big[ R(S_k, U_k) + \gamma \max_{u \in \mathcal{U}} \psi^\top(S_{k+1}, u)\theta - \psi^\top(S_k, U_k)\theta \Big], \quad \text{where } X_k := (S_k, U_k, S_{k+1}) \tag{42}
\]
it becomes obvious that (41) has the form of (1). To check whether our non-asymptotic error guarantees established for nonlinear SA procedures in Section 3 can be applied to Q-learning with linear function approximation, it again suffices to show that Assumptions 1-3 are satisfied by the Q-learning updates (41).

In general, Q-learning with function approximation can diverge. This is mainly because Q-learning implements off-policy (1) sampling to collect the data, which renders the expected Q-learning update a possibly expansive mapping [37]. Under appropriate regularity conditions on the sampling policy, asymptotic convergence of Q-learning with linear function approximation was established in [16], and a finite-time analysis was recently given in [23, 27]. In the following, we also impose some regularity conditions on the sampling policy $\pi$.

Assumption 4 ([27]). Suppose that the Markov chain $\{S_k\}_{k \in \mathbb{N}}$ induced by policy $\pi$ is irreducible and aperiodic, whose unique stationary distribution is denoted by $\mu$. Assume that the equation $\bar{F}(\theta) := \mathbb{E}_\mu[F(\theta, X)] = 0$ has a unique solution $\theta^*$, and that the next inequality holds for all $\theta \in \mathbb{R}^d$:
\[
\gamma^2 \, \mathbb{E}_\mu\Big[ \max_{u' \in \mathcal{U}} \big( \psi^\top(s', u')\theta \big)^2 \Big] - \mathbb{E}_\mu\Big[ \big( \psi^\top(s, u)\theta \big)^2 \Big] \le -c \|\theta\|^2, \quad \text{where } u \sim \pi(\cdot \mid s) \tag{43}
\]
for some constant $0 < c < 1$.

(1) On-policy methods estimate the value of a policy while using it for control (namely, taking actions); in off-policy methods, the policy used to generate behavior, called the behavior/sampling policy, may be independent of the policy that is evaluated and improved, called the target/estimation policy [10].
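A minimal sketch of the recursion (41) follows, reusing the toy MDP from the tabular sketch with $d = 2$ normalized random features. The small discount factor is an illustrative choice intended to stay within the convergent regime that Assumption 4 delineates; none of the concrete values come from the paper, and divergence is possible for other choices.

```python
import numpy as np

rng = np.random.default_rng(4)
gamma, eps = 0.5, 0.05     # a small gamma helps keep the toy example stable

# Illustrative MDP as in the tabular sketch.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])

# d = 2 random features psi(s, u), normalized so that ||psi(s, u)|| <= 1.
Psi = rng.normal(size=(2, 2, 2))
Psi /= np.linalg.norm(Psi, axis=-1, keepdims=True)

theta = np.zeros(2)
s = 0
for k in range(20000):
    u = rng.integers(2)                      # uniform sampling policy
    s_next = rng.choice(2, p=P[s, u])
    # Update (41): semi-gradient step on the sampled Bellman optimality error.
    target = R[s, u] + gamma * max(Psi[s_next, a] @ theta for a in range(2))
    theta += eps * (target - Psi[s, u] @ theta) * Psi[s, u]
    s = s_next

print(theta)   # parameter estimate; Q_theta(s, u) = psi(s, u)^T theta
```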
Now let us turn to verifying Assumptions 1-3. Toward this end, we start by introducing $\tilde{\theta} := \theta - \theta^*$ and $X := (S, U, S')$. It then follows that
\[
f(\tilde{\theta}, X) := F(\tilde{\theta} + \theta^*, X) = \psi(S, U) \Big[ R(S, U) + \gamma \max_{u \in \mathcal{U}} \psi^\top(S', u)(\tilde{\theta} + \theta^*) - \psi^\top(S, U)(\tilde{\theta} + \theta^*) \Big]. \tag{44}
\]
It is then evident that $\bar{f}(\tilde{\theta}) := \mathbb{E}_\mu[F(\tilde{\theta} + \theta^*, X)] = 0$ has the unique solution $\tilde{\theta}^* = 0$. Now we can rewrite (41) as follows:
\[
\tilde{\Theta}_{k+1} = \tilde{\Theta}_k + \epsilon f(\tilde{\Theta}_k, X_k). \tag{45}
\]
Verifying Assumption 1. For any $\tilde{\theta}_1, \tilde{\theta}_2$ and $x = (s, u, s')$, we have that
\[
\begin{aligned}
\big\| f(\tilde{\theta}_1, x) - f(\tilde{\theta}_2, x) \big\| &= \Big\| \psi(s, u) \Big[ R(s, u) + \gamma \max_{u_1 \in \mathcal{U}} \psi^\top(s', u_1)(\tilde{\theta}_1 + \theta^*) - \psi^\top(s, u)(\tilde{\theta}_1 + \theta^*) \Big] \\
&\qquad - \psi(s, u) \Big[ R(s, u) + \gamma \max_{u_2 \in \mathcal{U}} \psi^\top(s', u_2)(\tilde{\theta}_2 + \theta^*) - \psi^\top(s, u)(\tilde{\theta}_2 + \theta^*) \Big] \Big\| \\
&= \Big\| \gamma \psi(s, u) \Big[ \max_{u_1 \in \mathcal{U}} \psi^\top(s', u_1)(\tilde{\theta}_1 + \theta^*) - \max_{u_2 \in \mathcal{U}} \psi^\top(s', u_2)(\tilde{\theta}_2 + \theta^*) \Big] - \psi(s, u)\psi^\top(s, u)\big(\tilde{\theta}_1 - \tilde{\theta}_2\big) \Big\| \\
&\le \gamma \Big| \max_{u_1 \in \mathcal{U}} \psi^\top(s', u_1)(\tilde{\theta}_1 + \theta^*) - \max_{u_2 \in \mathcal{U}} \psi^\top(s', u_2)(\tilde{\theta}_2 + \theta^*) \Big| + \big\| \tilde{\theta}_1 - \tilde{\theta}_2 \big\| \tag{46}
\end{aligned}
\]
where the last inequality follows from $\|\psi(s, u)\| \le 1$ for all $(s, u) \in \mathcal{S} \times \mathcal{U}$. On the one hand, suppose that $u_1^* \in \arg\max_{u_1 \in \mathcal{U}} \psi^\top(s', u_1)(\tilde{\theta}_1 + \theta^*)$; then
\[
\begin{aligned}
\max_{u_1 \in \mathcal{U}} \psi^\top(s', u_1)(\tilde{\theta}_1 + \theta^*) - \max_{u_2 \in \mathcal{U}} \psi^\top(s', u_2)(\tilde{\theta}_2 + \theta^*) &= \psi^\top(s', u_1^*)(\tilde{\theta}_1 + \theta^*) - \max_{u_2 \in \mathcal{U}} \psi^\top(s', u_2)(\tilde{\theta}_2 + \theta^*) \\
&\le \psi^\top(s', u_1^*)(\tilde{\theta}_1 + \theta^*) - \psi^\top(s', u_1^*)(\tilde{\theta}_2 + \theta^*) \\
&= \psi^\top(s', u_1^*)\big(\tilde{\theta}_1 - \tilde{\theta}_2\big) \le \big\| \tilde{\theta}_1 - \tilde{\theta}_2 \big\| \tag{47}
\end{aligned}
\]
due again to $\|\psi(s', u_1^*)\| \le 1$. On the other hand, if we let $u_2^* \in \arg\max_{u_2 \in \mathcal{U}} \psi^\top(s', u_2)(\tilde{\theta}_2 + \theta^*)$, it follows similarly that
\[
\begin{aligned}
\max_{u_1 \in \mathcal{U}} \psi^\top(s', u_1)(\tilde{\theta}_1 + \theta^*) - \max_{u_2 \in \mathcal{U}} \psi^\top(s', u_2)(\tilde{\theta}_2 + \theta^*) &= \max_{u_1 \in \mathcal{U}} \psi^\top(s', u_1)(\tilde{\theta}_1 + \theta^*) - \psi^\top(s', u_2^*)(\tilde{\theta}_2 + \theta^*) \\
&\ge \psi^\top(s', u_2^*)(\tilde{\theta}_1 + \theta^*) - \psi^\top(s', u_2^*)(\tilde{\theta}_2 + \theta^*) \\
&= \psi^\top(s', u_2^*)\big(\tilde{\theta}_1 - \tilde{\theta}_2\big) \ge -\big\| \tilde{\theta}_1 - \tilde{\theta}_2 \big\|. \tag{48}
\end{aligned}
\]
Combining (47) and (48) yields
\[
\Big| \max_{u_1 \in \mathcal{U}} \psi^\top(s', u_1)(\tilde{\theta}_1 + \theta^*) - \max_{u_2 \in \mathcal{U}} \psi^\top(s', u_2)(\tilde{\theta}_2 + \theta^*) \Big| \le \big\| \tilde{\theta}_1 - \tilde{\theta}_2 \big\| \tag{49}
\]
which, in conjunction with (46), proves that
\[
\big\| f(\tilde{\theta}_1, x) - f(\tilde{\theta}_2, x) \big\| \le (\gamma + 1) \big\| \tilde{\theta}_1 - \tilde{\theta}_2 \big\|. \tag{50}
\]
In the meanwhile, it is easy to see that
\[
\begin{aligned}
\big\| f(\tilde{\theta}, x) \big\| &= \Big\| \psi(s, u) \Big[ R(s, u) + \gamma \max_{u_1 \in \mathcal{U}} \psi^\top(s', u_1)(\tilde{\theta} + \theta^*) - \psi^\top(s, u)(\tilde{\theta} + \theta^*) \Big] \Big\| \\
&\le |R(s, u)| + \gamma \|\psi(s', u_1^*)\| \, \big\| \tilde{\theta} + \theta^* \big\| + \|\psi(s, u)\| \, \big\| \tilde{\theta} + \theta^* \big\| \\
&\le \bar{r} + (\gamma + 1)\big( \|\tilde{\theta}\| + \|\theta^*\| \big) = (\gamma + 1)\|\tilde{\theta}\| + \bar{r} + (\gamma + 1)\|\theta^*\| \tag{51}
\end{aligned}
\]
where we have used the fact that $|R(s, u)| \le \bar{r}$ for all $(s, u) \in \mathcal{S} \times \mathcal{U}$. With (50) and (51), we have proved that Assumption 1 is met with $L := \max\{ \gamma + 1, \bar{r} + (\gamma + 1)\|\theta^*\| \}$.

Verifying Assumption 2. The ODE associated with the (centered) Q-learning update (45) is
\[
\dot{\tilde{\theta}} = \bar{f}(\tilde{\theta}) = \mathbb{E}_\mu\Big\{ \psi(s, u) \Big[ R(s, u) + \gamma \max_{u' \in \mathcal{U}} \psi^\top(s', u')(\tilde{\theta} + \theta^*) - \psi^\top(s, u)(\tilde{\theta} + \theta^*) \Big] \Big\} \tag{52}
\]
for which we consider the Lyapunov candidate function $W(\tilde{\theta}) = \|\tilde{\theta}\|^2/2$. Evidently, it follows that $W(\tilde{\theta}) > 0$ for all $\tilde{\theta} \ne 0$, so (6a) holds with $c_1 = c_2 = 1/2$.
Secondly, using $\bar{f}(\tilde{\theta}^*) = 0$, we have that
\[
\begin{aligned}
\left( \frac{\partial W(\tilde{\theta})}{\partial \tilde{\theta}} \right)^{\!\top} \bar{f}(\tilde{\theta}) &= \left( \frac{\partial W(\tilde{\theta})}{\partial \tilde{\theta}} \right)^{\!\top} \big[ \bar{f}(\tilde{\theta}) - \bar{f}(\tilde{\theta}^*) \big] \\
&= \tilde{\theta}^\top \mathbb{E}_\mu\Big\{ \psi(s, u) \Big[ R(s, u) + \gamma \max_{u_1 \in \mathcal{U}} \psi^\top(s', u_1)(\tilde{\theta} + \theta^*) - \psi^\top(s, u)(\tilde{\theta} + \theta^*) \Big] \\
&\qquad\qquad - \psi(s, u) \Big[ R(s, u) + \gamma \max_{u_2 \in \mathcal{U}} \psi^\top(s', u_2)\theta^* - \psi^\top(s, u)\theta^* \Big] \Big\} \\
&= \gamma \, \mathbb{E}_\mu\Big\{ \tilde{\theta}^\top \psi(s, u) \Big[ \max_{u_1 \in \mathcal{U}} \psi^\top(s', u_1)(\tilde{\theta} + \theta^*) - \max_{u_2 \in \mathcal{U}} \psi^\top(s', u_2)\theta^* \Big] \Big\} - \mathbb{E}_\mu\Big[ \big( \psi^\top(s, u)\tilde{\theta} \big)^2 \Big] \\
&\le \gamma \sqrt{\mathbb{E}_\mu\big[ ( \psi^\top(s, u)\tilde{\theta} )^2 \big]} \sqrt{\mathbb{E}_\mu\Big[ \Big( \max_{u_1 \in \mathcal{U}} \psi^\top(s', u_1)(\tilde{\theta} + \theta^*) - \max_{u_2 \in \mathcal{U}} \psi^\top(s', u_2)\theta^* \Big)^2 \Big]} - \mathbb{E}_\mu\Big[ \big( \psi^\top(s, u)\tilde{\theta} \big)^2 \Big] \tag{53} \\
&\le \sqrt{\mathbb{E}_\mu\big[ ( \psi^\top(s, u)\tilde{\theta} )^2 \big]} \left\{ \gamma \sqrt{\mathbb{E}_\mu\Big[ \max_{u' \in \mathcal{U}} \big( \psi^\top(s', u')\tilde{\theta} \big)^2 \Big]} - \sqrt{\mathbb{E}_\mu\big[ ( \psi^\top(s, u)\tilde{\theta} )^2 \big]} \right\} \tag{54} \\
&= \sqrt{\mathbb{E}_\mu\big[ ( \psi^\top(s, u)\tilde{\theta} )^2 \big]} \; \frac{\gamma^2 \mathbb{E}_\mu\Big[ \max_{u' \in \mathcal{U}} \big( \psi^\top(s', u')\tilde{\theta} \big)^2 \Big] - \mathbb{E}_\mu\big[ ( \psi^\top(s, u)\tilde{\theta} )^2 \big]}{\gamma \sqrt{\mathbb{E}_\mu\Big[ \max_{u' \in \mathcal{U}} \big( \psi^\top(s', u')\tilde{\theta} \big)^2 \Big]} + \sqrt{\mathbb{E}_\mu\big[ ( \psi^\top(s, u)\tilde{\theta} )^2 \big]}} \tag{55} \\
&\le \frac{-c \|\tilde{\theta}\|^2}{\sqrt{\gamma^2 \mathbb{E}_\mu\Big[ \max_{u' \in \mathcal{U}} \big( \psi^\top(s', u')\tilde{\theta} \big)^2 \Big] \Big/ \mathbb{E}_\mu\big[ ( \psi^\top(s, u)\tilde{\theta} )^2 \big]} + 1} \tag{56} \\
&\le \frac{-c \|\tilde{\theta}\|^2}{2 - c} \tag{57}
\end{aligned}
\]
where (53) uses the Cauchy-Schwarz inequality, (54) bounds the difference of maxima as in (47)-(48), and (56) invokes Assumption 4 via (43). This suggests that (6b) holds with $c_3 := c/[(2 - c)L]$. On the other hand, it follows for any $\tilde{\theta}, \tilde{\theta}'$ that
\[
\left\| \frac{\partial W}{\partial \tilde{\theta}} \Big|_{\tilde{\theta}} - \frac{\partial W}{\partial \tilde{\theta}} \Big|_{\tilde{\theta}'} \right\| = \big\| \tilde{\theta} - \tilde{\theta}' \big\|
\]
validating (6c) with $c_4 = 1$.

Verifying Assumption 3. Let $P^u_{ss'}$ be the transition probability of the Markov chain $\{S_k\}_{k \in \mathbb{N}}$ from state $s$ to $s'$ after taking action $u$, and let $p^{(n)}_{ss'}$ be the $n$-step transition probability from state $s$ to $s'$ following policy $\pi$. Define then $X_k := (S_k, U_k, S_{k+1})$; it can be verified that $\{X_k\}_{k \in \mathbb{N}}$ is a Markov chain with state space $\mathcal{X} := \{ x = (s, u, s'): s \in \mathcal{S}, \pi(u \mid s) > 0, P^u_{ss'} > 0 \} \subseteq \mathcal{S} \times \mathcal{U} \times \mathcal{S}$.

Next, we show that $\{X_k\}$ is aperiodic and irreducible. Toward that end, consider two arbitrary states $x_i = (s_i, u_i, s_i'), x_j = (s_j, u_j, s_j') \in \mathcal{X}$. Since $\{S_k\}_k$ is irreducible, there exists an integer $n > 0$ such that $p^{(n)}_{s_i', s_j} > 0$. Using the definition of $\{X_k\}_k$, it follows that
\[
p^{(n+1)}_{x_i, x_j} = p^{(n)}_{s_i' s_j} \, \pi(u_j \mid s_j) \, P^{u_j}_{s_j s_j'} > 0 \tag{58}
\]
which corroborates that the Markov chain $\{X_k\}_k$ is irreducible (e.g., [36, Chapter 1.3]). To prove that $\{X_k\}_k$ is aperiodic, we assume, for the sake of contradiction, that $\{X_k\}_k$ is periodic with period $d \ge 2$. As $\{X_k\}_k$ has been shown to be irreducible, it follows readily that every state in $\mathcal{X}$ has the same period $d$. Hence, for each state $x = (s, u, s') \in \mathcal{X}$, it holds that $p^{(n+1)}_{x,x} = 0$ for all integers $n + 1 > 0$ not divisible by $d$. Further, we deduce for any positive integer $(n+1)$ not divisible by $d$ that
\[
p^{(n+1)}_{s's'} = \sum_{s \in \mathcal{S}} p^{(n)}_{s's} \, p^{(1)}_{ss'} = \sum_{s \in \mathcal{S}} p^{(n)}_{s's} \sum_{u \in \mathcal{U}} \pi(u \mid s) P^u_{ss'} = \sum_{s \in \mathcal{S}} \sum_{u \in \mathcal{U}} p^{(n+1)}_{xx} = 0 \tag{59}
\]
where the last two equalities arise from (58) and the periodicity assumption on $\{X_k\}_k$, respectively. It becomes evident from (59) that $\{S_k\}_k$ is periodic too, and its period is at least $d$. This clearly contradicts the assumption that $\{S_k\}_k$ is aperiodic. Therefore, we conclude that the Markov chain $\{X_k\}_k$ is irreducible and aperiodic provided that $\{S_k\}_k$ is irreducible and aperiodic.

Now, consider two arbitrary states $x_0 = (s_0, u_0, s_0'), x = (s, u, s') \in \mathcal{X}$.
It follows that
\[
\begin{aligned}
\left\| \frac{1}{T} \sum_{k=T_0+1}^{T_0+T} \mathbb{E}\big[ f(\tilde{\theta}, X_k) \mid X_0 = x_0 \big] - \bar{f}(\tilde{\theta}) \right\| &= \left\| \frac{1}{T} \sum_{k=T_0+1}^{T_0+T} \sum_{x \in \mathcal{X}} \big( p^k_{x_0 x} - \mu(x) \big) f(\tilde{\theta}, x) \right\| \tag{60} \\
&= \left\| \frac{1}{T} \sum_{k=T_0+1}^{T_0+T} \sum_{s, s' \in \mathcal{S}} \sum_{u \in \mathcal{U}} \Big[ p^k_{s_0' s} \, \pi(u \mid s) \, P^u_{ss'} - \mu(x) \Big] \right. \\
&\qquad \left. \times \, \psi(s, u) \Big[ R(s, u) + \gamma \max_{u' \in \mathcal{U}} \psi^\top(s', u')(\tilde{\theta} + \theta^*) - \psi^\top(s, u)(\tilde{\theta} + \theta^*) \Big] \right\| \tag{61} \\
&\le \max_{(s, u, s') \in \mathcal{X}} \Big\| \psi(s, u) \Big[ R(s, u) + \gamma \max_{u' \in \mathcal{U}} \psi^\top(s', u')(\tilde{\theta} + \theta^*) - \psi^\top(s, u)(\tilde{\theta} + \theta^*) \Big] \Big\| \\
&\qquad \times \frac{1}{T} \sum_{k=T_0+1}^{T_0+T} \sum_{x = (s, u, s') \in \mathcal{X}} \Big| p^k_{s_0' s} \, \pi(u \mid s) \, P^u_{ss'} - \mu(x) \Big| \\
&\le L\big( \|\tilde{\theta}\| + 1 \big) \times \frac{1}{T} \sum_{k=T_0+1}^{T_0+T} 2 c_0 \rho^{k-1} \tag{62} \\
&\le \sigma(T; T_0) \, L\big( \|\tilde{\theta}\| + 1 \big)
\end{aligned}
\]
where (60) is due to the definition $\bar{f}(\tilde{\theta}) = \mathbb{E}_{X \sim \mu}[f(\tilde{\theta}, X)] = \sum_{x \in \mathcal{X}} \mu(x) f(\tilde{\theta}, x)$; equality (61) uses (44) and (58); and (62) arises from the geometric mixing property (with rate $\rho \in (0,1)$) of the irreducible, aperiodic Markov chain $\{X_k\}_k$, as well as (51).

We have proved that Assumptions 1-3 are satisfied by Q-learning with linear function approximation, provided that certain conditions on the sampling policy and function approximators hold. Hence, the established finite-time error bound in Theorem 2 holds for Q-learning algorithms with linear function approximation.

5 Comparison to past work

Since the seminal work [7], there has been a large body of contributions on TD-learning in diverse settings. Here we shall focus on the subset of results that have established error bounds for discounted MDPs, which are most relevant for direct comparison to our results. General asymptotic results on the convergence of TD-learning algorithms with (non-)linear function approximation were given in [12], [15], [29], and those of Q-learning (with function approximation) in [34], [13], [38], [16]. Examples of divergence with (non-)linear approximation were provided by [14], [15]. Connections between TD-learning and SA were drawn in [12], [13], and [9].

On the other hand, non-asymptotic guarantees for RL algorithms appeared only recently and remain still limited; see [17], [18], [32], [39], [40], [33], [41], [23], [27]. Finite-time analysis of TD(0) was initially investigated by [17], which, however, was pointed out to contain serious errors in the proofs by [18] and is thus not included in our comparison. Finite-time performance of other TD-based algorithms such as GTD was studied by [32, 39], and [33]. Refined finite-time error bounds for TD(0) were provided by [19], [20], but their results apply only to i.i.d. data samples, which, however, are difficult to acquire in practice. Addressing the more realistic yet challenging Markov chain observation model, finite-time analysis of the mean-square error was recently studied by [21], [22], [23], [27], and [41]. Nevertheless, their bounds were derived for TD- and/or Q-learning using linear function approximators, and were based on strong geometric mixing conditions. In addition, [21] and [27] require including a projection step in the standard updates, whereas the bounds reported by [22] and [27] become applicable only after a certain mixing time of TD updates, that is, after the Markov chain gets sufficiently "close" to its stationary distribution.
On the other hand, drawing connections between TD-learning and Markov jump linear systems, exact convergence behaviors of the first- and second-order moments of a family of TD-learning algorithms to their steady-state values were characterized using classical control theory in [24]. In contrast, our bound in Theorem 2 not only applies to TD- and Q-learning with linear function approximation, but also to a certain class of nonlinear function approximators compliant with Assumption 1. More importantly, our non-asymptotic guarantees hold for the unmodified TD- and Q-learning (i.e., without any projection steps), as well as for Markov chains under general mixing conditions as specified in (39) and from any initial distribution.

6 Conclusions

In this paper, we provided a non-asymptotic analysis for a class of biased SA algorithms driven by a broad family of stochastic perturbations, which include as special cases e.g., i.i.d. random sequences of vectors and ergodic Markov chains. Taking a dynamical systems viewpoint, our approach has been to design a novel multistep Lyapunov function that involves future iterates to control the gradient bias. We proved a general convergence result based on this multistep Lyapunov function, and developed non-asymptotic bounds on the mean-square error of the iterates generated by the SA procedure relative to the equilibrium point of the associated ODE. Subsequently, we illustrated this general result by applying it to obtain finite-time error bounds for the unmodified TD- and Q-learning with linear function approximation, where data are gathered along a single trajectory of a Markov chain. Our bounds hold for Markov chains with general mixing rates and any initial distribution, as well as from the first iteration. Although the focus here has been on biased SA procedures with constant stepsizes, our non-asymptotic results can be extended to accommodate time-varying stepsizes as well. Our future work will also aim at generalizing this novel analysis to SARSA and distributed RL algorithms.

References

[1] H. Robbins and S. Monro, "A stochastic approximation method," Ann. Math. Stat., vol. 22, no. 3, pp. 400-407, 1951.

[2] H. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications. Springer Science & Business Media, 2003, vol. 35.

[3] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge, New York, NY, 2008, vol. 48.

[4] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, "Robust stochastic approximation approach to stochastic programming," SIAM J. Opt., vol. 19, no. 4, pp. 1574-1609, Jan. 2009.

[5] F. Bach and E. Moulines, "Non-asymptotic analysis of stochastic approximation algorithms for machine learning," in Adv. in Neural Inf. Process. Syst., 2011, pp. 451-459.

[6] B. Karimi, B. Miasojedow, E. Moulines, and H.-T. Wai, "Non-asymptotic analysis of biased stochastic approximation schemes," vol. 1, 2019, p. 30.

[7] R. S. Sutton, "Learning to predict by the methods of temporal differences," Mach. Learn., vol. 3, no. 1, pp. 9-44, May 1988.

[8] C. J. C. H. Watkins, "Learning from delayed rewards," Ph.D. dissertation, King's College, Cambridge, 1989.

[9] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996, vol. 5.
[10] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.

[11] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, May 2015.

[12] T. Jaakkola, M. I. Jordan, and S. P. Singh, "Convergence of stochastic iterative dynamic programming algorithms," in Adv. in Neural Inf. Process. Syst., 1994, pp. 703-710.

[13] J. N. Tsitsiklis, "Asynchronous stochastic approximation and Q-learning," Mach. Learn., vol. 16, no. 3, pp. 185-202, Sept. 1994.

[14] L. Baird, "Residual algorithms: Reinforcement learning with function approximation," in Intl. Conf. on Mach. Learn., 1995, pp. 30-37.

[15] J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference learning with function approximation," IEEE Trans. Autom. Contr., vol. 42, no. 5, pp. 674-690, May 1997.

[16] F. S. Melo, S. P. Meyn, and M. I. Ribeiro, "An analysis of reinforcement learning with function approximation," in Intl. Conf. on Mach. Learn., 2008, pp. 664-671.

[17] N. Korda and P. La, "On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence," in Intl. Conf. on Mach. Learn., 2015, pp. 626-634.

[18] N. L. Narayanan and C. Szepesvári, "Finite time bounds for temporal difference learning with function approximation: Problems with some "state-of-the-art" results," Tech. Rep., 2017.

[19] G. Dalal, B. Szörényi, G. Thoppe, and S. Mannor, "Finite sample analyses for TD(0) with function approximation," in AAAI Conf. on Artif. Intell., 2018, pp. 6144-6152.

[20] C. Lakshminarayanan and C. Szepesvari, "Linear stochastic approximation: How far does constant step-size and iterate averaging go?" in Intl. Conf. on Artif. Intell. and Stat., 2018, pp. 1347-1355.

[21] J. Bhandari, D. Russo, and R. Singal, "A finite time analysis of temporal difference learning with linear function approximation," in Conf. on Learn. Theory, 2019, pp. 1691-1692.

[22] R. Srikant and L. Ying, "Finite-time error bounds for linear stochastic approximation and TD learning," 2019.

[23] S. Zou, T. Xu, and Y. Liang, "Finite-sample analysis for SARSA and Q-learning with linear function approximation," in Adv. in Neural Inf. Process. Syst., 2019, pp. 8665-8675.

[24] B. Hu and U. A. Syed, "Characterizing the exact behaviors of temporal difference learning algorithms using Markov jump linear system theory," in Adv. in Neural Inf. Process. Syst., 2019, pp. 8477-8488.

[25] N. Bof, R. Carli, and L. Schenato, "Lyapunov theory for discrete time systems," arXiv:1809.05289, 2018.

[26] Y. Qin, M. Cao, and B. D. O. Anderson, "Lyapunov criterion for stochastic systems and its applications in distributed computation," IEEE Trans. Autom. Control, pp. 1-15, 2019 (to appear).

[27] Z. Chen, S. Zhang, T. T. Doan, S. T. Maguluri, and J.-P. Clarke, "Finite-sample analysis for Q-learning with linear function approximation," arXiv:1905.11425, 2019.

[28] R. Durrett, Probability: Theory and Examples. Cambridge University Press, 2019, vol. 49.

[29] S. Bhatnagar, D. Precup, D. Silver, R. S. Sutton, H. R. Maei, and C.
[30] A. Joulin and Y. Ollivier, "Curvature, concentration and error estimates for Markov chain Monte Carlo," Ann. Prob., vol. 38, no. 6, pp. 2418–2442, Sep. 2010.
[31] P. W. Glynn and R. J. Wang, "On the rate of convergence to equilibrium for two-sided reflected Brownian motion and for the Ornstein–Uhlenbeck process," Queue. Syst., vol. 91, no. 1-2, pp. 1–14, Feb. 2019.
[32] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik, "Finite-sample analysis of proximal gradient TD algorithms," in Conf. on Uncertainty in Artif. Intell., 2015, pp. 504–513.
[33] D. Lee and N. He, "Target-based temporal-difference learning," in Intl. Conf. on Mach. Learn., 2019, pp. 3713–3722.
[34] C. J. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, no. 3-4, pp. 279–292, May 1992.
[35] G. Wang, G. B. Giannakis, and J. Chen, "Learning ReLU networks on linearly separable data: Algorithm, optimality, and generalization," IEEE Trans. Signal Process., vol. 67, no. 9, pp. 2357–2370, Mar. 2019.
[36] D. A. Levin and Y. Peres, Markov Chains and Mixing Times. American Mathematical Society, 2017, vol. 107.
[37] G. J. Gordon, "Stable function approximation in dynamic programming," in Machine Learning Proceedings. Elsevier, 1995, pp. 261–268.
[38] C. Szepesvári, "The asymptotic convergence-rate of Q-learning," in Adv. in Neural Inf. Process. Syst., 1998, pp. 1064–1070.
[39] Y. Wang, W. Chen, Y. Liu, Z.-M. Ma, and T.-Y. Liu, "Finite sample analysis of the GTD policy evaluation algorithms in Markov setting," in Adv. in Neural Inf. Process. Syst., 2017, pp. 5504–5513.
[40] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar, "Finite-sample analyses for fully decentralized multi-agent reinforcement learning," arXiv:1812.02783, 2018.
[41] M. J. Wainwright, "Stochastic approximation with cone-contractive operators: Sharp ℓ∞-bounds for Q-learning," arXiv:1905.06265, 2019.

A Proof of Proposition 1

We start off the proof by introducing the following auxiliary function
$$ g(k, T, \Theta_k) := \Theta_{k+T} - \Theta_k - \epsilon \sum_{j=k}^{k+T-1} f(\Theta_k, X_j), \qquad \forall T \ge 1 \tag{63} $$
which is evidently well defined under our working Assumptions 1 and 3. Regarding the function $g(k, T, \Theta_k)$ above, we present the following useful bound, whose proof details are postponed to Appendix E for readability.

Lemma 3. For any given $\Theta_k \in \mathbb{R}^d$, the function $g(k, T, \Theta_k)$ satisfies, for all $k \ge 0$,
$$ \| g(k, T, \Theta_k) \| \le \epsilon^2 L^2 T^2 (1+\epsilon L)^{T-2} \big( \| \Theta_k \| + 1 \big), \qquad \forall T \ge 1. \tag{64} $$
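The bound of Lemma 3 can be checked numerically on a toy instance. The sketch below uses an illustrative linear map f(θ, x) = A[x] θ + b[x] satisfying Assumption 1 with L = max_x max(‖A[x]‖₂, ‖b[x]‖₂); the matrices, vectors, stepsize, and horizon are all stand-ins, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative drift satisfying Assumption 1 with the L computed below.
A = [np.array([[-1.0, 0.3], [0.0, -0.5]]),
     np.array([[-0.6, 0.0], [0.2, -0.9]])]
b = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
L = max(max(np.linalg.norm(M, 2) for M in A),
        max(np.linalg.norm(v) for v in b))

def f(theta, x):
    return A[x] @ theta + b[x]

eps, T = 0.01, 20
theta_k = rng.standard_normal(2)
xs = rng.integers(0, 2, size=T)        # i.i.d. noise is a special case

theta, window_sum = theta_k.copy(), np.zeros(2)
for j in range(T):
    window_sum += f(theta_k, xs[j])    # frozen-iterate sum in (63)
    theta = theta + eps * f(theta, xs[j])   # actual SA recursion (1)

g = theta - theta_k - eps * window_sum
bound = eps**2 * L**2 * T**2 * (1 + eps * L)**(T - 2) \
        * (np.linalg.norm(theta_k) + 1)
print(np.linalg.norm(g) <= bound)      # True for this toy instance
```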
On the other hand, note from (8) that
$$ g'(k, T, \Theta_k) = \Theta_{k+T} - \Theta_k - \epsilon T \bar{f}(\Theta_k) \tag{65} $$
which, in conjunction with (63), suggests that we can write
$$ g'(k, T, \Theta_k) = g(k, T, \Theta_k) + \epsilon \sum_{j=k}^{k+T-1} f(\Theta_k, X_j) - \epsilon T \bar{f}(\Theta_k) = g(k, T, \Theta_k) + \epsilon \sum_{j=k}^{k+T-1} \big[ f(\Theta_k, X_j) - \bar{f}(\Theta_k) \big]. \tag{66} $$
Considering any given $\Theta_k$, taking the conditional expectation of both sides of (66) yields
$$ \big\| \mathbb{E}\big[ g'(k,T,\Theta_k) \mid \Theta_k \big] \big\| \le \big\| \mathbb{E}\big[ g(k,T,\Theta_k) \mid \Theta_k \big] \big\| + \epsilon T \, \Big\| \frac{1}{T} \sum_{j=k}^{k+T-1} \mathbb{E}\big[ f(\Theta_k, X_j) - \bar{f}(\Theta_k) \mid \Theta_k \big] \Big\| \le \epsilon L T \big[ \epsilon L T (1+\epsilon L)^{T-2} + \sigma(T; k) \big] \big( \| \Theta_k \| + 1 \big) \tag{67} $$
where the last inequality follows from Lemma 3 as well as the property of the averaged operator $\bar{f}$ in (7) under our working Assumption 3. This concludes the proof.

B Proof of Theorem 1

We prove this theorem by carefully constructing a function $W'(k, \Theta_k)$ from $W(\Theta_k)$ (recall under our working Assumption 2 that $W(\Theta_k)$ exists and satisfies properties (6a)–(6c)). Toward this objective, let us start with the following candidate
$$ W'(k, \Theta_k) = \sum_{j=k}^{k+T-1} W\big( \Theta_j(k, \Theta_k) \big) \tag{68} $$
where, to make the dependence of $\Theta_{j \ge k}$ on $\Theta_k$ explicit, we maintain the notation $\Theta_j = \Theta_j(k, \Theta_k)$, understood as the state of the recursion (1) at time instant $j \ge k$ with initial condition $\Theta_k$ at time instant $k$. In the following, we show that there exists, and also determine, a value of the parameter $T \in \mathbb{N}_+$ such that the inequalities (11) and (12) are satisfied.

For ease of exposition, we start by proving the second inequality (12). To this end, observe from the definition of $W'(k, \Theta_k)$ in (68) that
$$ W'\big(k+1, \Theta_k + \epsilon f(\Theta_k, X_k)\big) - W'(k, \Theta_k) = \sum_{j=k+1}^{k+T} W\big( \Theta_j(k, \Theta_k) \big) - \sum_{j=k}^{k+T-1} W\big( \Theta_j(k, \Theta_k) \big) = W\big( \Theta_{k+T}(k, \Theta_k) \big) - W(\Theta_k) \tag{69} $$
where the last step is due to the fact that $\Theta_k(k, \Theta_k) = \Theta_k$. To upper bound the right-hand side of (69), we focus on bounding the first term $W(\Theta_{k+T}(k, \Theta_k))$. Recall from (8) that $\Theta_{k+T}(k, \Theta_k) = \Theta_k + \epsilon T \bar{f}(\Theta_k) + g'(k, T, \Theta_k)$, based on which we can form the second-order Taylor expansion of $W(\Theta_{k+T}(k, \Theta_k))$ (which is twice differentiable under Assumption 2) around $\Theta_k$, as follows
$$ W\big( \Theta_{k+T}(k, \Theta_k) \big) = W(\Theta_k) + \Big( \frac{\partial W}{\partial \theta}\Big|_{\Theta_k} \Big)^{\!\top} \big[ \epsilon T \bar{f}(\Theta_k) + g'(k,T,\Theta_k) \big] + \big[ \epsilon T \bar{f}(\Theta_k) + g'(k,T,\Theta_k) \big]^{\top} \nabla^2 W(\Theta'_k) \big[ \epsilon T \bar{f}(\Theta_k) + g'(k,T,\Theta_k) \big] \tag{70} $$
where we have employed the mean-value theorem, suggesting that (70) holds with $\Theta'_k := \Theta_k + \eta \big[ \epsilon T \bar{f}(\Theta_k) + g'(k,T,\Theta_k) \big]$ for some constant $\eta \in [0, 1]$.

Next, we pursue an upper bound for each individual term on the right-hand side of (70) by conditioning on the $\sigma$-field $\mathcal{F}_k$. Invoking (6b), we have that
$$ \mathbb{E}\Big[ \epsilon T \Big( \frac{\partial W}{\partial \theta}\Big|_{\Theta_k} \Big)^{\!\top} \bar{f}(\Theta_k) \,\Big|\, \Theta_k \Big] \le -c_3 \epsilon L T \| \Theta_k \|^2. \tag{71} $$
One can further verify the following bounds
$$ \mathbb{E}\Big[ \Big( \frac{\partial W}{\partial \theta}\Big|_{\Theta_k} \Big)^{\!\top} g'(k,T,\Theta_k) \,\Big|\, \Theta_k \Big] = \Big( \frac{\partial W}{\partial \theta}\Big|_{\Theta_k} \Big)^{\!\top} \mathbb{E}\big[ g'(k,T,\Theta_k) \mid \Theta_k \big] \le \Big\| \frac{\partial W}{\partial \theta}\Big|_{\Theta_k} \Big\| \cdot \big\| \mathbb{E}\big[ g'(k,T,\Theta_k) \mid \Theta_k \big] \big\| \tag{72} $$
$$ \le c_4 \| \Theta_k \| \cdot \epsilon L T \beta_k(T, \epsilon) \big( \| \Theta_k \| + 1 \big) \tag{73} $$
$$ \le 2 c_4 \epsilon L T \beta_k(T, \epsilon) \big( \| \Theta_k \|^2 + 1 \big). \tag{74} $$
In particular, (72) uses the Cauchy–Schwarz inequality, (73) calls for Proposition 1, and the last step follows from the inequality $\|\theta\|(\|\theta\|+1) \le 2(\|\theta\|^2 + 1)$.
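The telescoping identity (69) is the structural heart of the multistep construction, and it can be verified numerically. The sketch below uses W(θ) = ‖θ‖² and a toy linear drift as illustrative choices (not the W of Assumption 2) and checks that the one-step increment of W′ collapses to W(Θ_{k+T}) − W(Θ_k).

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[-1.0, 0.3], [0.0, -0.7]])
eps, T = 0.05, 5

def f(theta, x):
    return A @ theta + x

def trajectory(theta0, noises):
    # iterates Theta_j(k, theta0) of recursion (1) along the given noises
    traj = [theta0]
    for x in noises:
        traj.append(traj[-1] + eps * f(traj[-1], x))
    return traj

W = lambda th: float(th @ th)          # illustrative Lyapunov function
theta_k = np.array([1.0, -2.0])
xs = rng.standard_normal((T, 2))       # noises X_k, ..., X_{k+T-1}
traj = trajectory(theta_k, xs)         # Theta_k, ..., Theta_{k+T}

Wp_k  = sum(W(traj[j]) for j in range(T))          # W'(k, Theta_k)
Wp_k1 = sum(W(traj[j]) for j in range(1, T + 1))   # W'(k+1, Theta_{k+1})
print(np.isclose(Wp_k1 - Wp_k, W(traj[T]) - W(traj[0])))   # True
```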
As far as the last term of (70) is concerned, it is clear that
$$ \mathbb{E}\Big[ \big[ \epsilon T \bar{f}(\Theta_k) + g'(k,T,\Theta_k) \big]^{\top} \nabla^2 W(\Theta'_k) \big[ \epsilon T \bar{f}(\Theta_k) + g'(k,T,\Theta_k) \big] \,\Big|\, \Theta_k \Big] \le c_4 \, \mathbb{E}\Big[ \big\| \epsilon T \bar{f}(\Theta_k) + g'(k,T,\Theta_k) \big\|^2 \,\Big|\, \Theta_k \Big] \tag{75} $$
$$ \le 2 c_4 \epsilon^2 T^2 \big\| \bar{f}(\Theta_k) \big\|^2 + 2 c_4 \, \mathbb{E}\big[ \| g'(k,T,\Theta_k) \|^2 \mid \Theta_k \big] \tag{76} $$
$$ \le 2 c_4 \epsilon^2 T^2 L^2 \| \Theta_k \|^2 + 2 c_4 \, \mathbb{E}\big[ \| g'(k,T,\Theta_k) \|^2 \mid \Theta_k \big] \tag{77} $$
where (75) leverages the upper bound on the Hessian matrix of $W(\theta)$ arising from property (6c), (76) follows from the inequality $\|a+b\|^2 \le 2(\|a\|^2 + \|b\|^2)$ for any real-valued vectors $a, b \in \mathbb{R}^d$, and (77) uses the Lipschitz property of the function $\bar{f}(\theta)$, which is easily verified since $f(\theta, x)$ is Lipschitz in $\theta$. To further upper bound the last term of (77), we establish the following helpful result, whose proof is also postponed to Appendix F for readability.

Lemma 4. The following bound holds for any fixed $\Theta_k \in \mathbb{R}^d$
$$ \mathbb{E}\big[ \| g'(k,T,\Theta_k) \|^2 \mid \Theta_k \big] \le \epsilon^2 L^2 T^2 \big[ \epsilon^2 L^2 T^2 (1+\epsilon L)^{2T-4} + 12 \big] \| \Theta_k \|^2 + 8 \epsilon^2 L^2 T^2. \tag{78} $$

Plugging (78) into (77), we establish an upper bound on the last term of (70) as follows
$$ \mathbb{E}\Big[ \big[ \epsilon T \bar{f}(\Theta_k) + g'(k,T,\Theta_k) \big]^{\top} \nabla^2 W(\Theta'_k) \big[ \epsilon T \bar{f}(\Theta_k) + g'(k,T,\Theta_k) \big] \,\Big|\, \Theta_k \Big] \le 2 c_4 \epsilon^2 T^2 L^2 \big[ \epsilon^2 L^2 T^2 (1+\epsilon L)^{2T-4} + 13 \big] \| \Theta_k \|^2 + 16 c_4 \epsilon^2 L^2 T^2. \tag{79} $$

Putting together the bounds in (71), (74), and (79), it follows from (70) that
$$ \mathbb{E}\big[ W(\Theta_{k+T}(k, \Theta_k)) - W(\Theta_k) \mid \Theta_k \big] \le -\epsilon L T \Big\{ c_3 - 2 c_4 \beta_k(T,\epsilon) - 2 c_4 \epsilon L T \big[ \epsilon^2 L^2 T^2 (1+\epsilon L)^{2T-4} + 13 \big] \Big\} \| \Theta_k \|^2 + 2 c_4 \epsilon L T \beta_k(T,\epsilon) + 16 c_4 \epsilon^2 L^2 T^2 = -\epsilon L T \big[ c_3 - c_4 \rho_k(T,\epsilon) \big] \| \Theta_k \|^2 + c_4 \epsilon L T \kappa_k(T,\epsilon) \tag{80} $$
where in the last equality we have defined, for notational brevity, the following two functions
$$ \rho_k(T,\epsilon) := 2 \beta_k(T,\epsilon) + 2 \epsilon L T \big[ \epsilon^2 L^2 T^2 (1+\epsilon L)^{2T-4} + 13 \big] \tag{81} $$
$$ \kappa_k(T,\epsilon) := 2 \beta_k(T,\epsilon) + 16 \epsilon L T \tag{82} $$
both of which depend on the parameters $T \in \mathbb{N}_+$ and $\epsilon > 0$.

In the sequel, we show that there exist parameters $\epsilon > 0$ and $T \ge 1$ such that the coefficient in (80) obeys $c_3 - c_4 \rho_k(T, \epsilon) > 0$ for all $k \in \mathbb{N}_+$. Formally, such a result is summarized in Proposition 2 below, whose proof is relegated to Appendix G.

Proposition 2. Consider the functions $\beta_k(T, \epsilon)$ and $\rho_k(T, \epsilon)$ defined in (10) and (81), respectively. Then for any $\delta > 0$, there exist constants $\epsilon_\delta > 0$ and $T_\delta \ge 1$ such that the following inequality holds for each $\epsilon \in (0, \epsilon_\delta)$
$$ \sigma(T_\delta; k) < \rho_k(T_\delta, \epsilon) < \rho_0(T_\delta, \epsilon) < \rho_0(T_\delta, \epsilon_\delta) \le \delta, \qquad \forall k \ge 1. \tag{83} $$

As such, by taking any $\delta < c_3 / c_4$, feasible parameter values $T^*$ and $\epsilon_c$ can be obtained according to (137) and (139), respectively. Now, choosing
$$ T^* = T_\delta \tag{84} $$
$$ \epsilon_c = \epsilon_\delta \tag{85} $$
it follows that
$$ c'_3 := L T^* \big[ c_3 - c_4 \rho_0(T^*, \epsilon_\delta) \big] \ge L T^* ( c_3 - c_4 \delta ) > 0. \tag{86} $$
It then follows from (80) that
$$ \mathbb{E}\big[ W(\Theta_{k+T}(k, \Theta_k)) - W(\Theta_k) \mid \Theta_k \big] \le -c'_3 \epsilon \| \Theta_k \|^2 + c_4 \epsilon L T^* \kappa_k(T^*, \epsilon) = -c'_3 \epsilon \| \Theta_k \|^2 + c'_4 \epsilon^2 + c'_5 \sigma(T^*; k) \epsilon \tag{87} $$
where we have defined the constants $c'_4 := c_4 L T^* \big[ 2 L (1+\epsilon_\delta L)^{T^*-2} + 16 L T^* \big]$ and $c'_5 := 2 c_4 L T^*$.
Finally, recalling (69), we deduce that
$$ \mathbb{E}\big[ W'\big(k+1, \Theta_k + \epsilon f(\Theta_k, X_k)\big) - W'(k, \Theta_k) \mid \Theta_k \big] \le -c'_3 \epsilon \| \Theta_k \|^2 + c'_4 \epsilon^2 + c'_5 \sigma(T^*; k) \epsilon \tag{88} $$
concluding the proof of (12).

Now, we turn to showing the first inequality. It is evident from the properties of $W(\Theta_k)$ in Assumption 2 that
$$ W'(k, \Theta_k) = \sum_{j=k}^{k+T-1} W\big( \Theta_j(k, \Theta_k) \big) \ge W\big( \Theta_k(k, \Theta_k) \big) \ge c_1 \| \Theta_k(k, \Theta_k) \|^2 = c_1 \| \Theta_k \|^2 \tag{89} $$
where the second inequality follows from (6a), and the last equality from the fact that $\Theta_k(k, \Theta_k) = \Theta_k$. Therefore, by taking $c'_1 = c_1$, we have shown that the first part of inequality (11) holds true.

For the second part, it follows that
$$ \| \Theta_{j+1} \| = \| \Theta_j + \epsilon f(\Theta_j, X_j) \| \le (1+\epsilon L) \| \Theta_j \| + \epsilon L, \qquad \forall j \ge k \tag{90} $$
yielding, by telescoping,
$$ \| \Theta_j(k, \Theta_k) \| \le (1+\epsilon L)^{j-k} \| \Theta_k \| + \sum_{i=1}^{j-k} (1+\epsilon L)^{i-1} \epsilon L \le (1+\epsilon L)^{j-k} \| \Theta_k \| + (1+\epsilon L)^{j-k} - 1, \qquad \forall j \ge k. $$
Using further the inequality $(a+b)^2 \le 2(a^2 + b^2)$, we deduce that
$$ \| \Theta_j(k, \Theta_k) \|^2 \le 2 (1+\epsilon L)^{2(j-k)} \| \Theta_k \|^2 + 2 \big[ (1+\epsilon L)^{j-k} - 1 \big]^2. \tag{91} $$
Taking advantage of the properties of $W(\Theta_k)$ in Assumption 2 and (91), it follows that
$$ W'(k, \Theta_k) = \sum_{j=k}^{k+T-1} W\big( \Theta_j(k, \Theta_k) \big) \le \sum_{j=k}^{k+T-1} c_2 \| \Theta_j(k, \Theta_k) \|^2 \le 2 c_2 \sum_{j=k}^{k+T-1} (1+\epsilon L)^{2(j-k)} \| \Theta_k \|^2 + 2 c_2 \sum_{j=k}^{k+T-1} \big[ (1+\epsilon L)^{j-k} - 1 \big]^2. \tag{92} $$

Let us now examine the two coefficients of (92) more carefully. Note that
$$ \sum_{j=k}^{k+T-1} (1+\epsilon L)^{2(j-k)} = \frac{(1+\epsilon L)^{2T} - 1}{(1+\epsilon L)^2 - 1} = T \, \frac{2 + (2T-1)(1+\epsilon' L)^{2T-2} \epsilon L}{2 + \epsilon L} \tag{93} $$
$$ \sum_{j=k}^{k+T-1} \big[ (1+\epsilon L)^{j-k} - 1 \big]^2 = \sum_{j=k+1}^{k+T-1} \Big[ (j-k) \epsilon L \Big( 1 + \tfrac{1}{2} (j-k-1) \big( 1+\epsilon'_{j-k} L \big)^{j-k-2} \epsilon L \Big) \Big]^2 \tag{94} $$
$$ = (\epsilon L)^2 \sum_{j=1}^{T-1} j^2 \Big[ 1 + \tfrac{1}{2} (j-1) \big( 1+\epsilon'_j L \big)^{j-2} \epsilon L \Big]^2 \tag{95} $$
where both (93) and (94) follow from the mean-value form of the Taylor expansion
$$ (1+\epsilon L)^{j-k} = 1 + (j-k) \epsilon L + \tfrac{1}{2} (j-k)(j-k-1) \big( 1+\epsilon'_{j-k} L \big)^{j-k-2} (\epsilon L)^2 $$
for any $j - k \ge 1$ and some constants $\epsilon'_j \in [0, \epsilon]$. According to Proposition 2, or more specifically the choices (84) and (85), we see that $\epsilon'_j \le \epsilon \le \epsilon_\delta$ for all $1 \le j \le T-1$. On the other hand, it is easy to check that both terms [(93) and (95)] are monotonically increasing functions of $\epsilon > 0$. Therefore, if we define the constants
$$ c'_2 := 2 c_2 T^* \, \frac{2 + (2T^*-1)(1+\epsilon_\delta L)^{2T^*-2} \epsilon_\delta L}{2 + \epsilon_\delta L} \tag{96} $$
$$ c''_2 := 2 c_2 \sum_{j=1}^{T^*-1} j^2 \Big[ 1 + \tfrac{1}{2} (j-1) (1+\epsilon_\delta L)^{j-2} \epsilon_\delta L \Big]^2 \tag{97} $$
which are independent of $\epsilon$, then we conclude from (92), (93), and (95) that
$$ W'(k, \Theta_k) \le c'_2 \| \Theta_k \|^2 + c''_2 (\epsilon L)^2 \tag{98} $$
concluding the proof of the second part of (11).
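The trajectory growth bound behind (90)–(91) is also easy to test numerically. The sketch below uses an illustrative drift f(θ, x) = −tanh(θ) + x with unit-norm noise, which satisfies Assumption 1 with L = 1 (tanh is 1-Lipschitz and ‖f(θ, x)‖ ≤ ‖θ‖ + ‖x‖ ≤ L(‖θ‖ + 1)); these choices are stand-ins, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
L, eps, T, d = 1.0, 0.05, 30, 4

def f(theta, x):
    return -np.tanh(theta) + x

theta = rng.standard_normal(d)
norm0 = np.linalg.norm(theta)
ok = True
for m in range(1, T + 1):
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)               # unit-norm noise, so ||x|| = 1
    theta = theta + eps * f(theta, x)    # recursion (1)
    # telescoped bound (90): (1+eps*L)^m ||Theta_k|| + (1+eps*L)^m - 1
    bound = (1 + eps * L)**m * norm0 + (1 + eps * L)**m - 1
    ok &= np.linalg.norm(theta) <= bound + 1e-12
print(ok)  # True: every iterate stays below the telescoped bound
```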
C Proof of Lemma 1

Taking the expectation of both sides of (11) conditioned on $\Theta_k$ gives rise to
$$ \mathbb{E}\big[ W'(k, \Theta_k) \mid \Theta_k \big] \le c'_2 \| \Theta_k \|^2 + c''_2 (\epsilon L)^2. \tag{99} $$
On the other hand, it is evident from (12) that
$$ \mathbb{E}\big[ W'(k+1, \Theta_{k+1}) \mid \Theta_k \big] \le \mathbb{E}\big[ W'(k, \Theta_k) \mid \Theta_k \big] - c'_3 \epsilon \| \Theta_k \|^2 + c'_4 \epsilon^2 + c'_5 \sigma(T^*; k) \epsilon $$
$$ = \mathbb{E}\big[ W'(k, \Theta_k) \mid \Theta_k \big] - \frac{c'_3 \epsilon}{c'_2} \big( c'_2 \| \Theta_k \|^2 + c''_2 (\epsilon L)^2 \big) + \frac{c'_3 c''_2}{c'_2} \epsilon (\epsilon L)^2 + c'_4 \epsilon^2 + c'_5 \sigma(T^*; k) \epsilon $$
$$ \le \mathbb{E}\big[ W'(k, \Theta_k) \mid \Theta_k \big] - \frac{c'_3 \epsilon}{c'_2} \, \mathbb{E}\big[ W'(k, \Theta_k) \mid \Theta_k \big] + \frac{c'_3 c''_2}{c'_2} \epsilon_\delta (\epsilon L)^2 + c'_4 \epsilon^2 + c'_5 \sigma(T^*; k) \epsilon \tag{100} $$
$$ = \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big) \mathbb{E}\big[ W'(k, \Theta_k) \mid \Theta_k \big] + c''_4 \epsilon^2 + c'_5 \sigma(T^*; k) \epsilon \tag{101} $$
where, to obtain (100), we employed the inequality in (99) together with the fact that $\epsilon < \epsilon_\delta$; and the last equality follows upon defining $c''_4 := c'_4 + c'_3 c''_2 \epsilon_\delta L^2 / c'_2$. Finally, taking the expectation of both sides of (101) with respect to $\Theta_k$ concludes the proof.

D Proof of Theorem 2

Let us start with a basic lemma, whose proof is elementary and hence omitted.

Lemma 5. Consider the recursion $z_{t+1} = a z_t + b$, where $a \neq 1$ and $b$ are given constants. Then the following holds for all $t \ge t_0 \ge 0$
$$ z_t = a^{t-t_0} z_{t_0} + \frac{b \, \big( a^{t-t_0} - 1 \big)}{a - 1}. \tag{102} $$

The proof of Theorem 2 proceeds in two phases, depending on the value of $k$. Specifically, let us define $k_\epsilon := \min \{ k \in \mathbb{N}_+ \mid \sigma(T^*; k) \le \epsilon \}$; the first phase runs from $k = 0$ to $k_\epsilon$, while the second phase consists of all $k > k_\epsilon$.

Phase I ($k \le k_\epsilon$). We have from Proposition 2 that $\sigma(T^*; k) \le \delta$ for all $0 \le k \le k_\epsilon$. Then, fixing $t_0 = 0$, and substituting $a := 1 - c'_3 \epsilon / c'_2 > 0$ and $b := c''_4 \epsilon^2 + c'_5 \delta \epsilon$ in (102), the recursion $\{ \mathbb{E}[W'(k, \Theta_k)] \}$ in (14) can be unrolled as follows
$$ \mathbb{E}\big[ W'(k, \Theta_k) \big] \le \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big) \mathbb{E}\big[ W'(k-1, \Theta_{k-1}) \big] + c''_4 \epsilon^2 + c'_5 \delta \epsilon $$
$$ \le \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{k} \mathbb{E}\big[ W'(0, \Theta_0) \big] + \Big[ 1 - \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{k} \Big] \frac{c'_2}{c'_3} \big( c''_4 \epsilon + c'_5 \delta \big) \tag{103} $$
$$ \le \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{k} \mathbb{E}\big[ W'(0, \Theta_0) \big] + \frac{c'_2}{c'_3} \big( c''_4 \epsilon + c'_5 \delta \big) \le \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{k} \mathbb{E}\big[ W'(0, \Theta_0) \big] + \frac{c'_2}{c'_3} \big( c''_4 + c'_5 \big) \delta \tag{104} $$
$$ \le c'_2 \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{k} \| \Theta_0 \|^2 + c''_2 L^2 \epsilon^2 + c_6 \delta \tag{105} $$
where the last inequality follows from $\epsilon \le \delta$ and the fact [cf. (11)] that
$$ \mathbb{E}\big[ W'(0, \Theta_0) \big] \le c'_2 \, \mathbb{E}\big[ \| \Theta_0 \|^2 \big] + c''_2 \epsilon^2 L^2 \le c'_2 \| \Theta_0 \|^2 + c''_2 \epsilon^2 L^2 \tag{106} $$
where the initial guess $\Theta_0 \in \mathbb{R}^d$ is assumed given (deterministic) for simplicity, and $c_6 := c'_2 (c''_4 + c'_5) / c'_3$.

On the other hand, using (11), the term $\mathbb{E}[W'(k, \Theta_k)]$ can be lower bounded as follows
$$ \mathbb{E}\big[ W'(k, \Theta_k) \big] \ge c'_1 \, \mathbb{E}\big[ \| \Theta_k \|^2 \big] \tag{107} $$
which, combined with (105), yields the finite-time error bound for iterations $k \le k_\epsilon$
$$ \mathbb{E}\big[ \| \Theta_k \|^2 \big] \le \frac{c'_2}{c'_1} \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{k} \| \Theta_0 \|^2 + \frac{c''_2 L^2}{c'_1} \epsilon^2 + \frac{c_6}{c'_1} \delta. \tag{108} $$
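Lemma 5's closed form is used repeatedly in both phases and is easy to check. The sketch below unrolls the affine recursion and compares against (102); the constants a, b, z0 are arbitrary illustrative values.

```python
import numpy as np

# Sanity check of Lemma 5: z_{t+1} = a z_t + b matches the closed form (102).
a, b, z0, t0, t = 0.9, 0.5, 3.0, 0, 25

z = z0
for _ in range(t - t0):
    z = a * z + b                        # unroll the recursion

closed_form = a**(t - t0) * z0 + b * (a**(t - t0) - 1) / (a - 1)
print(np.isclose(z, closed_form))        # True
```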
Phase II ($k > k_\epsilon$). Using now the fact that $\sigma(T^*; k) \le \epsilon$, due to the definition of $k_\epsilon$, the recursion $\{ \mathbb{E}[W'(k, \Theta_k)] \}$ for all $k > k_\epsilon$ becomes
$$ \mathbb{E}\big[ W'(k+1, \Theta_{k+1}) \big] \le \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big) \mathbb{E}\big[ W'(k, \Theta_k) \big] + c''_4 \epsilon^2 + c'_5 \sigma(T^*; k) \epsilon \tag{109} $$
$$ \le \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big) \mathbb{E}\big[ W'(k, \Theta_k) \big] + \big( c''_4 + c'_5 \big) \epsilon^2. \tag{110} $$
Letting $t_0 = k_\epsilon$, and replacing $a$ and $b$ in (102) by the constants $(1 - c'_3 \epsilon / c'_2)$ and $(c''_4 + c'_5) \epsilon^2$ accordingly, we arrive at
$$ \mathbb{E}\big[ W'(k, \Theta_k) \big] \le \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{k - k_\epsilon} \mathbb{E}\big[ W'(k_\epsilon, \Theta_{k_\epsilon}) \big] + \Big[ 1 - \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{k - k_\epsilon} \Big] \frac{c'_2}{c'_3} \big( c''_4 + c'_5 \big) \epsilon $$
$$ \le \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{k - k_\epsilon} \Big[ \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{k_\epsilon} \mathbb{E}\big[ W'(0, \Theta_0) \big] + \frac{c'_2}{c'_3} \big( c''_4 \epsilon + c'_5 \delta \big) \Big] + \frac{c'_2 (c''_4 + c'_5)}{c'_3} \epsilon $$
$$ \le \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{k} \mathbb{E}\big[ W'(0, \Theta_0) \big] + \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{k - k_\epsilon} \frac{c'_2 (c''_4 + c'_5)}{c'_3} \delta + \frac{c'_2 (c''_4 + c'_5)}{c'_3} \epsilon \tag{111} $$
$$ \le c'_2 \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{k} \| \Theta_0 \|^2 + c''_2 \epsilon^2 L^2 + \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{k - k_\epsilon} c_6 \delta + c_6 \epsilon \tag{112} $$
where we have used the following bound at $k = k_\epsilon$ from Phase I in (103), along with (106),
$$ \mathbb{E}\big[ W'(k_\epsilon, \Theta_{k_\epsilon}) \big] \le \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{k_\epsilon} \mathbb{E}\big[ W'(0, \Theta_0) \big] + \Big[ 1 - \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{k_\epsilon} \Big] \frac{c'_2}{c'_3} \big( c''_4 \epsilon + c'_5 \delta \big). \tag{113} $$
Plugging (107) into (112) yields the finite-time error bound for $k \ge k_\epsilon$
$$ \mathbb{E}\big[ \| \Theta_k \|^2 \big] \le \frac{c'_2}{c'_1} \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{k} \| \Theta_0 \|^2 + \frac{c''_2 L^2}{c'_1} \epsilon^2 + \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{k - k_\epsilon} \frac{c_6}{c'_1} \delta + \frac{c_6}{c'_1} \epsilon \tag{114} $$
which converges to a small (size-$\epsilon$) neighborhood of the optimal solution $\Theta^* = 0$ at a linear rate. Combining the results of the two phases, we deduce the following error bound, valid at any $k \in \mathbb{N}_+$,
$$ \mathbb{E}\big[ \| \Theta_k \|^2 \big] \le \frac{c'_2}{c'_1} \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{k} \| \Theta_0 \|^2 + \frac{c''_2 L^2}{c'_1} \epsilon^2 + \Big( 1 - \frac{c'_3 \epsilon}{c'_2} \Big)^{\max\{k - k_\epsilon, 0\}} \frac{c_6}{c'_1} \delta + \frac{c_6}{c'_1} \epsilon \tag{115} $$
concluding the proof of Theorem 2.
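The qualitative behavior predicted by (115) (linear contraction down to an O(ǫ) floor once the bias has mixed away) can be visualized by iterating the one-step recursion (101) directly. All constants below are illustrative stand-ins for c′₂, c′₃, c′′₄, c′₅, and the mixing profile σ(T*; k) is an assumed geometric decay, not the paper's quantity.

```python
# Illustration of the two-phase bound (115) via the recursion (101).
c2p, c3p, c4pp, c5p = 2.0, 1.0, 1.0, 1.0   # stand-ins for c2', c3', c4'', c5'
eps, delta = 0.01, 0.05
sigma = lambda k: 0.5 * 0.9**k             # assumed mixing-type bias, -> 0

w = 10.0                                    # E[W'(0, Theta_0)]
for k in range(2000):
    w = (1 - c3p * eps / c2p) * w + c4pp * eps**2 \
        + c5p * min(sigma(k), delta) * eps  # Phase I caps the bias at delta

# Once sigma has vanished, w settles at the O(eps) floor c2'*c4''*eps/c3'.
print(w, c2p * c4pp * eps / c3p)
```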
E Proof of Lemma 3

When $T = 1$, for any $\Theta_k \in \mathbb{R}^d$, one can easily check that $g(k, 1, \Theta_k) = \Theta_{k+1} - \Theta_k - \epsilon f(\Theta_k, X_k) = 0$, implying $G_1 := \| g(k, 1, \Theta_k) \| = 0$. To proceed, let us introduce the function
$$ h(k, T, \Theta_k) := \sum_{j=k}^{k+T-1} f(\Theta_k, X_j) $$
which can be bounded as follows
$$ \| h(k, T, \Theta_k) \| = \Big\| \sum_{j=k}^{k+T-1} f(\Theta_k, X_j) \Big\| \le \sum_{j=k}^{k+T-1} \big\| f(\Theta_k, X_j) \big\| \le L \sum_{j=k}^{k+T-1} \big( \| \Theta_k \| + 1 \big) = T L \big( \| \Theta_k \| + 1 \big) \tag{116} $$
where the second inequality follows from (5) in Assumption 1. It is evident that
$$ g(k, T+1, \Theta_k) = \Theta_{k+T+1} - \Theta_k - \epsilon \sum_{j=k}^{k+T} f(\Theta_k, X_j) = \Theta_{k+T} + \epsilon f(\Theta_{k+T}, X_{k+T}) - \Theta_k - \epsilon \Big[ f(\Theta_k, X_{k+T}) + \sum_{j=k}^{k+T-1} f(\Theta_k, X_j) \Big] = g(k, T, \Theta_k) + \epsilon \big[ f(\Theta_{k+T}, X_{k+T}) - f(\Theta_k, X_{k+T}) \big]. \tag{117} $$
By means of the triangle inequality, it follows that
$$ G_{T+1} = \| g(k, T+1, \Theta_k) \| \le \| g(k, T, \Theta_k) \| + \epsilon \big\| f(\Theta_{k+T}, X_{k+T}) - f(\Theta_k, X_{k+T}) \big\| \le G_T + \epsilon L \| \Theta_{k+T} - \Theta_k \| \tag{118} $$
$$ \le G_T + \epsilon L \big[ \epsilon \| h(k, T, \Theta_k) \| + \| g(k, T, \Theta_k) \| \big] \tag{119} $$
$$ \le (1+\epsilon L) G_T + \epsilon^2 L^2 T \big( \| \Theta_k \| + 1 \big) \tag{120} $$
$$ \le \epsilon^2 L^2 \big( \| \Theta_k \| + 1 \big) \sum_{i=0}^{T} (1+\epsilon L)^{T-i} \, i \tag{121} $$
where (118) follows from the Lipschitz continuity of $f(\theta, x)$ in $\theta$, (119) from the fact that $\Theta_{k+T} = \Theta_k + \epsilon h(k, T, \Theta_k) + g(k, T, \Theta_k)$, (120) from (116) as well as the definition $G_T := \| g(k, T, \Theta_k) \|$, and the last inequality is obtained by telescoping and uses $G_1 = 0$.

Lemma 6. Given any positive constant $d \neq 1$, the following holds for all $T \ge 1$
$$ S_{T+1} = \sum_{i=0}^{T} i \, d^{\,i} = \frac{d \, (1 - d^{T})}{(1-d)^2} - \frac{T d^{\,T+1}}{1-d}. \tag{122} $$

Taking $d = (1+\epsilon L)^{-1}$ in (122), the bound (121) can be simplified as follows
$$ G_T \le \epsilon^2 L^2 (1+\epsilon L)^{T-1} \big( \| \Theta_k \| + 1 \big) \sum_{i=0}^{T-1} (1+\epsilon L)^{-i} \, i = \big[ (1+\epsilon L)^{T} - \epsilon L T - 1 \big] \big( \| \Theta_k \| + 1 \big). \tag{123} $$
To further simplify this bound, the Taylor expansion along with the mean-value theorem confirms that the following holds for some $\epsilon' \in (0, \epsilon)$
$$ (1+\epsilon L)^{T} = 1 + \epsilon L T + \tfrac{1}{2} T (T-1) (1+\epsilon' L)^{T-2} (\epsilon L)^2, \qquad \forall T \ge 1 \tag{124} $$
or equivalently,
$$ (1+\epsilon L)^{T} - 1 - \epsilon L T = \tfrac{1}{2} T (T-1) (1+\epsilon' L)^{T-2} (\epsilon L)^2 \tag{125} $$
$$ \le \epsilon^2 L^2 T^2 (1+\epsilon L)^{T-2} \tag{126} $$
which, combined with (123), establishes (64).

F Proof of Lemma 4

Recalling that $g'(k,T,\Theta_k) = g(k,T,\Theta_k) + \epsilon \sum_{j=k}^{k+T-1} [ f(\Theta_k, X_j) - \bar{f}(\Theta_k) ]$, we have
$$ \| g'(k,T,\Theta_k) \|^2 = \Big\| g(k,T,\Theta_k) + \epsilon \sum_{j=k}^{k+T-1} \big[ f(\Theta_k, X_j) - \bar{f}(\Theta_k) \big] \Big\|^2 \le 2 \| g(k,T,\Theta_k) \|^2 + 2 \epsilon^2 T^2 \Big\| \frac{1}{T} \sum_{j=k}^{k+T-1} \big[ f(\Theta_k, X_j) - \bar{f}(\Theta_k) \big] \Big\|^2 \tag{127} $$
$$ \le 4 \big[ (1+\epsilon L)^{T} - \epsilon L T - 1 \big]^2 \big( \| \Theta_k \|^2 + 1 \big) + 4 \epsilon^2 T^2 \Big\| \frac{1}{T} \sum_{j=k}^{k+T-1} f(\Theta_k, X_j) \Big\|^2 + 4 \epsilon^2 T^2 \big\| \bar{f}(\Theta_k) \big\|^2 \tag{128} $$
where we have used the property $\|a+b\|^2 \le 2(\|a\|^2 + \|b\|^2)$ for any real-valued vectors $a, b$ in deriving (127) and (128), as well as the bound (123).

Squaring both sides of (125) yields
$$ \big[ (1+\epsilon L)^{T} - 1 - \epsilon T L \big]^2 = \tfrac{1}{4} T^2 (T-1)^2 (\epsilon L)^4 (1+\epsilon' L)^{2T-4} \le \tfrac{1}{4} \epsilon^4 L^4 T^4 (1+\epsilon L)^{2T-4}. \tag{129} $$
Thus, the first term of (128) can be upper bounded by
$$ 4 \big[ (1+\epsilon L)^{T} - \epsilon L T - 1 \big]^2 \big( \| \Theta_k \|^2 + 1 \big) \le \epsilon^4 L^4 T^4 (1+\epsilon L)^{2T-4} \big( \| \Theta_k \|^2 + 1 \big). \tag{130} $$
Regarding the second term of (128), we have that
$$ \Big\| \frac{1}{T} \sum_{j=k}^{k+T-1} f(\Theta_k, X_j) \Big\|^2 \le \frac{1}{T} \sum_{j=k}^{k+T-1} \big\| f(\Theta_k, X_j) \big\|^2 \tag{131} $$
$$ \le \frac{1}{T} \sum_{j=k}^{k+T-1} L^2 \big( \| \Theta_k \| + 1 \big)^2 \tag{132} $$
$$ \le 2 L^2 \| \Theta_k \|^2 + 2 L^2 \tag{133} $$
where (131) and (133) follow from the inequality $\| \sum_{i=1}^{n} z_i \|^2 \le n \sum_{i=1}^{n} \| z_i \|^2$ for all real-valued vectors $\{z_i\}_{i=1}^{n}$, and (132) from our working assumption on the function $f(\theta, x)$. With regard to the last term of (128), it follows directly from the Lipschitz property of the averaged operator $\bar{f}(\theta)$ that
$$ \big\| \bar{f}(\Theta_k) \big\|^2 \le L^2 \| \Theta_k \|^2. \tag{134} $$
Substituting the bounds (130), (133), and (134) into (128), we arrive at
$$ \| g'(k,T,\Theta_k) \|^2 \le \epsilon^2 L^2 T^2 \big[ \epsilon^2 L^2 T^2 (1+\epsilon L)^{2T-4} + 12 \big] \| \Theta_k \|^2 + 8 \epsilon^2 L^2 T^2 \tag{135} $$
concluding the proof.

G Proof of Proposition 2

We prove this claim by construction. By definition, it follows that for all $k \in \mathbb{N}_+$
$$ \rho_k(T, \epsilon) \le \rho_0(T, \epsilon) = 2 \epsilon L T \big[ (1+\epsilon L)^{T-2} + 13 \big] + 2 (\epsilon L T)^3 (1+\epsilon L)^{2T-4} + 2 \sigma(T; 0). \tag{136} $$
Under the assumption that $\lim_{T \to +\infty} \sigma(T; 0) = 0$, the value $\sigma(T; 0) \ge 0$ can be made arbitrarily small by taking a sufficiently large integer $T \in \mathbb{N}_+$ in constructing the function $W'(k, \Theta_k)$. Without loss of generality, let us work with the smallest such $T$, namely
$$ T_\delta := \min \Big\{ T \in \mathbb{N}_+ \,\Big|\, \sigma(T; 0) \le \frac{\delta}{4} \Big\}. \tag{137} $$
It is clear that $T_\delta \ge 1$. Define the function
$$ \nu(\epsilon) := \epsilon L T_\delta \big[ (1+\epsilon L)^{T_\delta - 2} + 13 \big] + (\epsilon L T_\delta)^3 (1+\epsilon L)^{2 T_\delta - 4} \tag{138} $$
which is easily shown to be a monotonically increasing function of $\epsilon > 0$, attaining its minimum $\nu = 0$ at $\epsilon = 0$. Let $\epsilon_\delta$ be the unique solution to the equation
$$ \nu(\epsilon) = \frac{\delta}{4}, \qquad \epsilon > 0. \tag{139} $$
As a result, for all $\epsilon \in (0, \epsilon_\delta]$, it holds that
$$ \nu(\epsilon) \le \frac{\delta}{4}. \tag{140} $$
Combining (137) and (140) concludes the proof of Proposition 2.
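The construction in Appendix G is fully computable: given a mixing profile σ(T; 0), one can pick T_δ via (137) and solve ν(ǫ) = δ/4 in (139) by bisection, since ν is increasing. The sketch below does exactly that; the geometric profile sigma0 and the constant L are assumed illustrative inputs, not the paper's quantities.

```python
# Constructive illustration of Proposition 2: choose (T_delta, eps_delta).
L, delta = 1.0, 0.1
sigma0 = lambda T: 0.8**T                    # assumed sigma(T; 0), -> 0 in T

# Smallest T with sigma(T; 0) <= delta/4, as in (137).
T_delta = next(T for T in range(1, 10**6) if sigma0(T) <= delta / 4)

def nu(eps, T=T_delta):                      # the function defined in (138)
    return eps * L * T * ((1 + eps * L)**(T - 2) + 13) \
           + (eps * L * T)**3 * (1 + eps * L)**(2 * T - 4)

lo, hi = 0.0, 1.0
while nu(hi) < delta / 4:                    # bracket the root of (139)
    hi *= 2
for _ in range(100):                         # bisection: nu is increasing
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if nu(mid) < delta / 4 else (lo, mid)

eps_delta = lo
print(T_delta, eps_delta, nu(eps_delta) <= delta / 4)   # a feasible pair
```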