Actor-Critic Provably Finds Nash Equilibria of Linear-Quadratic Mean-Field Games

Authors: Zuyue Fu∗, Zhuoran Yang†, Yongxin Chen‡, Zhaoran Wang∗

Abstract

We study discrete-time mean-field Markov games with an infinite number of agents, where each agent aims to minimize its ergodic cost. We consider the setting where the agents have identical linear state transitions and quadratic cost functions, while the aggregated effect of the agents is captured by the population mean of their states, namely, the mean-field state. For such a game, based on the Nash certainty equivalence principle, we provide sufficient conditions for the existence and uniqueness of its Nash equilibrium. Moreover, to find the Nash equilibrium, we propose a mean-field actor-critic algorithm with linear function approximation, which does not require knowing the model of dynamics. Specifically, at each iteration of our algorithm, we use the single-agent actor-critic algorithm to approximately obtain the optimal policy of each agent given the current mean-field state, and then update the mean-field state. In particular, we prove that our algorithm converges to the Nash equilibrium at a linear rate. To the best of our knowledge, this is the first success of applying model-free reinforcement learning with function approximation to discrete-time mean-field Markov games with provable non-asymptotic global convergence guarantees.

1 Introduction

In reinforcement learning (RL) (Sutton and Barto, 2018), an agent learns to make decisions that minimize its expected total cost through sequential interactions with the environment.
∗Department of Industrial Engineering and Management Sciences, Northwestern University. †Department of Operations Research and Financial Engineering, Princeton University. ‡School of Aerospace Engineering, Georgia Institute of Technology.

Multi-agent reinforcement learning (MARL) (Shoham et al., 2003, 2007; Busoniu et al., 2008) aims to extend RL to sequential decision-making problems involving multiple agents. In a non-cooperative game, we are interested in the Nash equilibrium (Nash, 1951), which is a joint policy of all the agents such that no agent can decrease its expected total cost by unilaterally deviating from its Nash policy. The Nash equilibrium plays a critical role in understanding the social dynamics of self-interested agents (Ash, 2000; Axtell, 2002) and in constructing the optimal policy of a particular agent via fictitious self-play (Bowling and Veloso, 2000; Ganzfried and Sandholm, 2009). Following the recent developments in deep learning (LeCun et al., 2015), MARL with function approximation achieves tremendous empirical successes in applications including Go (Silver et al., 2016, 2017), Dota (OpenAI, 2018), StarCraft (Vinyals et al., 2019), Poker (Heinrich and Silver, 2016; Moravčík et al., 2017), multi-robotic systems (Yang and Gu, 2004), autonomous driving (Shalev-Shwartz et al., 2016), and solving social dilemmas (de Cote et al., 2006; Leibo et al., 2017; Hughes et al., 2018). However, since the capacity of the joint state and action spaces grows exponentially in the number of agents, such MARL approaches become computationally intractable when the number of agents is large, which is common in real-world applications (Sandholm, 2010; Calderone, 2017; Wang et al., 2017a). The mean-field game is proposed by Huang et al.
(2003, 2006); Lasry and Lions (2006a,b, 2007) with the idea of utilizing mean-field approximation to model the strategic interactions within a large population. In a mean-field game, each agent has the same cost function and state transition, which depend on the other agents only through their aggregated effect. As a result, the optimal policy of each agent depends solely on its own state and the aggregated effect of the population, and such an optimal policy is symmetric across all the agents. Moreover, if the aggregated effect of the population corresponds to the Nash equilibrium, then the optimal policies of the agents jointly constitute a Nash equilibrium. Although such a Nash equilibrium corresponds to an infinite number of agents, it well approximates the Nash equilibrium for a sufficiently large number of agents (Bensoussan et al., 2016). Also, as the aggregated effect of the population abstracts away the strategic interactions between individual agents, it circumvents the computational intractability of the MARL approaches that do not exploit symmetry. However, most existing work on mean-field games focuses on characterizing the existence and uniqueness of the Nash equilibrium rather than designing provably efficient algorithms. In particular, most existing work considers the continuous-time setting, which requires solving a pair of Hamilton-Jacobi-Bellman (HJB) and Fokker-Planck (FP) equations, whereas the discrete-time setting is more common in practice, e.g., in the aforementioned applications. Moreover, most existing approaches, including the ones based on solving the HJB and FP equations, require knowing the model of dynamics (Bardi and Priuli, 2014) or having access to a simulator that generates the next state given any state-action pair and aggregated effect of the population (Guo et al.
, 2019), which is often unavailable in practice.

To address these challenges, we develop an efficient model-free RL approach to mean-field games, which provably attains the Nash equilibrium. In particular, we focus on discrete-time mean-field games with linear state transitions and quadratic cost functions, where the aggregated effect of the population is quantified by the mean-field state. Such games capture the fundamental difficulties of general mean-field games and well approximate a variety of real-world systems such as power grids (Minciardi and Sacile, 2011), swarm robots (Fang, 2014; Araki et al., 2017; Doerr et al., 2018), and financial systems (Zhou and Li, 2000; Huang and Li, 2018). In detail, based on the Nash certainty equivalence (NCE) principle (Huang et al., 2006, 2007), we propose a mean-field actor-critic algorithm which, at each iteration, given the mean-field state µ, approximately attains the optimal policy π*_µ of each agent, and then updates the mean-field state µ assuming that all the agents follow π*_µ. We parametrize the actor and critic by linear and quadratic functions, respectively, and prove that such a parameterization encompasses the optimal policy of each agent. Specifically, we update the actor parameter using policy gradient (Sutton et al., 2000) and natural policy gradient (Kakade, 2002; Peters and Schaal, 2008; Bhatnagar et al., 2009), and update the critic parameter using primal-dual gradient temporal difference (Sutton et al., 2009a,b). In particular, we prove that, given the mean-field state µ, the sequence of policies generated by the actor converges linearly to the optimal policy π*_µ.
Moreover, when alternating between updating the policy and the mean-field state, we prove that the sequence of policies and the corresponding sequence of mean-field states converge to the unique Nash equilibrium at a linear rate. Our approach can be interpreted from both "passive" and "active" perspectives: (i) Assuming that each self-interested agent employs the single-agent actor-critic algorithm, the policy of each agent converges to the unique Nash policy, which characterizes the social dynamics of a large population of model-free RL agents. (ii) For a particular agent, our approach serves as a fictitious self-play method for it to find its Nash policy, assuming the other agents give their best responses. To the best of our knowledge, our work establishes the first efficient model-free RL approach with function approximation that provably attains the Nash equilibrium of a discrete-time mean-field game. As a byproduct, we also show that the sequence of policies generated by the single-agent actor-critic algorithm converges at a linear rate to the optimal policy of a linear-quadratic regulator (LQR) problem in the presence of drift, which may be of independent interest.

Related Work. The mean-field game is first introduced in Huang et al. (2003, 2006); Lasry and Lions (2006a,b, 2007). In the last decade, there has been growing interest in understanding continuous-time mean-field games. See, e.g., Guéant et al. (2011); Bensoussan et al. (2013); Gomes et al. (2014); Carmona and Delarue (2013, 2018) and the references therein. Due to their simple structures, continuous-time linear-quadratic mean-field games are extensively studied under various model assumptions. See Li and Zhang (2008); Bardi (2011); Wang and Zhang (2012); Bardi and Priuli (2014); Huang et al. (2016a,b); Bensoussan et al.
(2016, 2017); Caines and Kizilkale (2017); Huang and Huang (2017); Moon and Başar (2018); Huang and Zhou (2019) for examples of this line of work. Meanwhile, the literature on discrete-time linear-quadratic mean-field games remains relatively scarce. Most of this line of work focuses on characterizing the existence of a Nash equilibrium and the behavior of such a Nash equilibrium when the number of agents goes to infinity (Gomes et al., 2010; Tembine and Huang, 2011; Moon and Başar, 2014; Biswas, 2015; Saldi et al., 2018a,b, 2019). See also Yang et al. (2018a), which applies maximum entropy inverse RL (Ziebart et al., 2008) to infer the cost function and social dynamics of discrete-time mean-field games with finite state and action spaces. Our work is most related to Guo et al. (2019), which proposes a mean-field Q-learning algorithm (Watkins and Dayan, 1992) for discrete-time mean-field games with finite state and action spaces. Such an algorithm requires access to a simulator, which, given any state-action pair and mean-field state, outputs the next state. In contrast, both our state and action spaces are infinite, and we do not require such a simulator but only observations of trajectories. Correspondingly, we study the mean-field actor-critic algorithm with linear function approximation, whereas their algorithm is tailored to the tabular setting. Also, our work is closely related to Mguni et al. (2018), which focuses on a more restrictive setting where the state transition does not involve the mean-field state. In such a setting, mean-field games are potential games, which is, however, not true in more general settings (Li et al., 2017; Briani and Cardaliaguet, 2018). In comparison, we allow the state transition to depend on the mean-field state.
Meanwhile, they propose a fictitious self-play method based on the single-agent actor-critic algorithm and establish its asymptotic convergence. However, their proof of convergence relies on the assumption that the single-agent actor-critic algorithm converges to the optimal policy, which is unverified therein. Meanwhile, a model-based algorithm is proposed in uz Zaman et al. (2019) for discounted linear-quadratic mean-field games, where they only show that the algorithm converges asymptotically to the Nash equilibrium. In addition, our work is related to Jayakumar and Aditya (2019), where the proposed algorithm is only shown to converge asymptotically to a stationary point of the mean-field game. Our work also extends the line of work on finding the Nash equilibria of Markov games using MARL. Due to the computational intractability introduced by the large number of agents, such a line of work focuses on finite-agent Markov games (Littman, 1994, 2001; Hu and Wellman, 1998; Bowling, 2001; Lagoudakis and Parr, 2002; Hu and Wellman, 2003; Conitzer and Sandholm, 2007; Perolat et al., 2015; Pérolat et al., 2016b,a, 2018; Wei et al., 2017; Zhang et al., 2018; Zou et al., 2019; Casgrain et al., 2019). See also Shoham et al. (2003, 2007); Busoniu et al. (2008); Li (2018) for detailed surveys. Our work is related to Yang et al. (2018b), which combines the mean-field approximation of actions (rather than states) with Nash Q-learning (Hu and Wellman, 2003) to study general-sum Markov games with a large number of agents. However, the Nash Q-learning algorithm is only applicable to finite state and action spaces, and its convergence is established under rather strong assumptions.
Also, when the number of agents goes to infinity, their approach yields a variant of tabular Q-learning, which is different from our mean-field actor-critic algorithm. For policy optimization, based on the policy gradient theorem, Sutton et al. (2000); Konda and Tsitsiklis (2000) propose the actor-critic algorithm, which is later generalized to the natural actor-critic algorithm (Peters and Schaal, 2008; Bhatnagar et al., 2009). Most existing results on the convergence of actor-critic algorithms are based on stochastic approximation using ordinary differential equations (Bhatnagar et al., 2009; Castro and Meir, 2010; Konda and Tsitsiklis, 2000; Maei, 2018), which are asymptotic in nature. For policy evaluation, the convergence of primal-dual gradient temporal difference is studied in Liu et al. (2015); Du et al. (2017); Wang et al. (2017b); Yu (2017); Wai et al. (2018). However, this line of work assumes that the feature mapping is bounded, which is not the case in our setting. Thus, the existing convergence results are not applicable to analyzing the critic update in our setting. To handle the unbounded feature mapping, we utilize a truncation argument, which requires a more delicate analysis. Finally, our work extends the line of work that studies model-free RL for LQR. For example, Bradtke (1993); Bradtke et al. (1994) show that policy iteration converges to the optimal policy, and Tu and Recht (2017); Dean et al. (2017) study the sample complexity of least-squares temporal difference for policy evaluation. More recently, Fazel et al. (2018); Malik et al. (2018); Tu and Recht (2018) show that the policy gradient algorithm converges at a linear rate to the optimal policy. See also Hardt et al. (2016); Dean et al. (2018) for more in this line of work. Our work is also closely related to Yang et al.
(2019), where they show that the sequence of policies generated by the natural actor-critic algorithm enjoys a linear rate of convergence to the optimal policy. Compared with this work, when fixing the mean-field state, we use the actor-critic algorithm to study LQR in the presence of drift, which introduces significant difficulties in the analysis. As we show in §3, the drift causes the optimal policy to have an additional intercept, which makes the state- and action-value functions more complicated.

Notations. We denote by ‖M‖_2 the spectral norm, ρ(M) the spectral radius, σ_min(M) the minimum singular value, and σ_max(M) the maximum singular value of a matrix M. We use ‖α‖_2 to represent the ℓ_2-norm of a vector α, and (α)_i^j to denote the sub-vector (α_i, α_{i+1}, . . . , α_j)⊤, where α_k is the k-th entry of the vector α. For scalars a_1, . . . , a_n, we denote by poly(a_1, . . . , a_n) a polynomial of a_1, . . . , a_n, which may vary from line to line. We use [n] to denote the set {1, 2, . . . , n} for any n ∈ ℕ.

2 Linear-Quadratic Mean-Field Game

A linear-quadratic mean-field game involves N_a ∈ ℕ agents. Their state transitions are given by

x^i_{t+1} = A x^i_t + B u^i_t + A · (1/N_a) Σ_{j=1}^{N_a} x^j_t + d^i + ω^i_t,  ∀ t ≥ 0, i ∈ [N_a],

where x^i_t ∈ R^m and u^i_t ∈ R^k are the state and action vectors of agent i, respectively, the vector d^i ∈ R^m is a drift term, and ω^i_t ∈ R^m is an independent random noise term following the Gaussian distribution N(0, Ψ_ω). The agents are coupled through the mean-field state (1/N_a) Σ_{j=1}^{N_a} x^j_t. In the linear-quadratic mean-field game, the cost of agent i ∈ [N_a] at time t ≥ 0 is given by

c^i_t = (x^i_t)⊤ Q x^i_t + (u^i_t)⊤ R u^i_t + [(1/N_a) Σ_{j=1}^{N_a} x^j_t]⊤ Q [(1/N_a) Σ_{j=1}^{N_a} x^j_t],

where u^i_t is generated by π^i, i.e., the policy of agent i.
To measure the performance of agent i following its policy π^i under the influence of the other agents, we define the expected total cost of agent i as

J^i(π^1, π^2, . . . , π^{N_a}) = lim_{T→∞} E[(1/T) Σ_{t=0}^T c^i_t].

We are interested in finding a Nash equilibrium (π^1, π^2, . . . , π^{N_a}), which is defined by

J^i(π^1, . . . , π^{i−1}, π^i, π^{i+1}, . . . , π^{N_a}) ≤ J^i(π^1, . . . , π^{i−1}, π̃^i, π^{i+1}, . . . , π^{N_a}),  ∀ π̃^i, i ∈ [N_a].

That is, agent i cannot further decrease its expected total cost by unilaterally deviating from its Nash policy. For simplicity of discussion, we assume that the drift term d^i is identical for each agent. By the symmetry of the agents in terms of their state transitions and cost functions, we focus on a fixed agent and drop the superscript i hereafter. Further taking the infinite-population limit N_a → ∞ leads to the following formulation of the linear-quadratic mean-field game (LQ-MFG).

Problem 2.1 (LQ-MFG). We consider the following formulation,

x_{t+1} = A x_t + B u_t + A E[x*_t] + d + ω_t,
c(x_t, u_t) = x_t⊤ Q x_t + u_t⊤ R u_t + (E[x*_t])⊤ Q (E[x*_t]),
J(π) = lim_{T→∞} E[(1/T) Σ_{t=0}^T c(x_t, u_t)],

where x_t ∈ R^m is the state vector, u_t ∈ R^k is the action vector generated by the policy π, {x*_t}_{t≥0} is the trajectory generated by a Nash policy π* (assuming it exists), ω_t ∈ R^m is an independent random noise term following the Gaussian distribution N(0, Ψ_ω), and d ∈ R^m is a drift term. Here the expectation E[x*_t] is taken across all the agents. We aim to find π* such that J(π*) = inf_{π∈Π} J(π).

The formulation in Problem 2.1 is studied by Lasry and Lions (2007); Bensoussan et al. (2016); Saldi et al. (2018a,b).
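As a quick numerical illustration of the finite-agent game above, the following sketch simulates N_a agents that all follow the same linear policy u_t = −K x_t and estimates the ergodic cost by a time average. This is a hypothetical one-dimensional instance (m = k = 1); all numerical constants are illustrative assumptions, not values from our analysis.

```python
import numpy as np

# Hypothetical 1-D instance of the N-agent linear-quadratic game; every
# agent uses the shared linear policy u_t = -K x_t (no intercept, no
# exploration noise), and the ergodic cost is estimated by a tail average.
rng = np.random.default_rng(0)
A, B, Q, R, d, psi_w = 0.3, 1.0, 1.0, 1.0, 0.1, 0.01
N_a, T = 500, 4000
K = 0.15                                  # an arbitrary stabilizing gain

x = rng.normal(0.0, 1.0, size=N_a)        # initial states of the N_a agents
costs = []
for t in range(T):
    mean_field = x.mean()                 # (1/N_a) * sum_j x^j_t
    u = -K * x
    costs.append(np.mean(Q * x**2 + R * u**2) + Q * mean_field**2)
    x = A * x + B * u + A * mean_field + d \
        + rng.normal(0.0, np.sqrt(psi_w), size=N_a)

ergodic_cost = float(np.mean(costs[T // 2:]))   # tail time average of c^i_t
```

With these constants the mean-field state settles near d / (1 − 2A + BK), which matches the fixed point of the deterministic mean recursion; the tail average then approximates the expected total cost of the shared policy.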
We propose a more general formulation in Problem C.1 (see §C of the appendix for details), where an additional interaction term between the state vector x_t and the mean-field state E[x*_t] is incorporated into the cost function. According to our analysis in §C, up to minor modifications, the results in the following sections also carry over to Problem C.1. Therefore, for the sake of simplicity, we focus on Problem 2.1 in the sequel. Note that the mean-field state E[x*_t] converges to a constant vector µ* as t → ∞, which serves as a fixed mean-field state, since the Markov chain of states generated by the Nash policy π* admits a stationary distribution. As we consider the ergodic setting, it suffices to study Problem 2.1 with t sufficiently large, which motivates the following drifted LQR (D-LQR) problem, where the mean-field state acts as another drift term.

Problem 2.2 (D-LQR). Given a mean-field state µ ∈ R^m, we consider the following formulation,

x_{t+1} = A x_t + B u_t + A µ + d + ω_t,
c_µ(x_t, u_t) = x_t⊤ Q x_t + u_t⊤ R u_t + µ⊤ Q µ,
J_µ(π) = lim_{T→∞} E[(1/T) Σ_{t=0}^T c_µ(x_t, u_t)],

where x_t ∈ R^m is the state vector, u_t ∈ R^k is the action vector generated by the policy π, ω_t ∈ R^m is an independent random noise term following the Gaussian distribution N(0, Ψ_ω), and d ∈ R^m is a drift term. We aim to find an optimal policy π*_µ such that J_µ(π*_µ) = inf_{π∈Π} J_µ(π).

For the mean-field state µ = µ*, which corresponds to the Nash equilibrium, solving Problem 2.2 gives π*_{µ*}, which coincides with the Nash policy π* defined in Problem 2.1. Compared with the most studied LQR problem (Lewis et al., 2012), both the state transition and the cost function in Problem 2.2 have drift terms, which act as a mean-field "force" that drives the states away from zero.
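When the model is known, the gain of the optimal policy of Problem 2.2 can be obtained from the standard discrete-time algebraic Riccati equation, since the drift Aµ + d only shifts the optimal intercept and not the gain. A minimal model-based sanity check in a hypothetical one-dimensional instance (all constants below are illustrative assumptions) is:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Hypothetical 1-D instance of D-LQR; the drift does not change the optimal
# gain, so K* can be read off the standard Riccati solution.
A = np.array([[0.3]]); B = np.array([[1.0]])
Q = np.array([[1.0]]); R = np.array([[1.0]])

X = solve_discrete_are(A, B, Q, R)                      # Riccati solution X*
K_star = np.linalg.solve(R + B.T @ X @ B, B.T @ X @ A)  # K* = (B'X*B+R)^{-1} B'X*A
rho = max(abs(np.linalg.eigvals(A - B @ K_star)))       # closed-loop spectral radius
```

The check confirms that X* is positive definite and that the closed-loop matrix A − BK* is stable, which is what the model-free algorithm below must recover without access to A and B.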
Such a mean-field "force" introduces additional challenges when solving Problem 2.2 in the model-free setting (see §3.3 for details). On the other hand, the unique optimal policy π*_µ of Problem 2.2 admits the linear form π*_µ(x_t) = −K_{π*_µ} x_t + b_{π*_µ} (Anderson and Moore, 2007), where the matrix K_{π*_µ} ∈ R^{k×m} and the vector b_{π*_µ} ∈ R^k are the parameters of π*_µ. Motivated by such a linear form of the optimal policy, we define the class of linear-Gaussian policies as

Π = {π(x) = −Kx + b + σ·η : K ∈ R^{k×m}, b ∈ R^k},   (2.1)

where the standard Gaussian noise term η ∈ R^k is included to encourage exploration. To solve Problem 2.2, it suffices to find the optimal policy π*_µ within Π. Now, we introduce the definition of the Nash equilibrium pair (Saldi et al., 2018a,b). The Nash equilibrium pair is characterized by the NCE principle, which states that it suffices to find a pair of π* and µ* such that the policy π* is optimal for each agent when the mean-field state is µ*, while all the agents following the policy π* generate the mean-field state µ* as t → ∞. To present its formal definition, we define Λ_1(µ) as the optimal policy in Π given the mean-field state µ, and define Λ_2(µ, π) as the mean-field state generated as t → ∞ by the policy π given the current mean-field state µ.

Definition 2.3 (Nash Equilibrium Pair). The pair (µ*, π*) ∈ R^m × Π constitutes a Nash equilibrium pair of Problem 2.1 if it satisfies π* = Λ_1(µ*) and µ* = Λ_2(µ*, π*). Here µ* is called the Nash mean-field state and π* is called the Nash policy.

3 Mean-Field Actor-Critic

We first characterize the existence and uniqueness of the Nash equilibrium pair of Problem 2.1 under mild regularity conditions, and then propose a mean-field actor-critic algorithm to obtain such a Nash equilibrium.
As a building block of the mean-field actor-critic, we propose the natural actor-critic to solve Problem 2.2.

3.1 Existence and Uniqueness of Nash Equilibrium Pair

We now establish the existence and uniqueness of the Nash equilibrium pair defined in Definition 2.3. We impose the following regularity conditions.

Assumption 3.1. We assume that the following statements hold:

(i) The algebraic Riccati equation

X = A⊤XA + Q − A⊤XB(B⊤XB + R)^{−1}B⊤XA

admits a unique symmetric positive definite solution X*;

(ii) It holds for L_0 = L_1 L_3 + L_2 that L_0 < 1, where

L_1 = ‖[(I − A)Q^{−1}(I − A)⊤ + BR^{−1}B⊤]^{−1} A‖_2 · ‖K* Q^{−1}(I − A)⊤ − R^{−1}B⊤‖_2,
L_2 = [1 − ρ(A − BK*)]^{−1} · ‖A‖_2,
L_3 = [1 − ρ(A − BK*)]^{−1} · ‖B‖_2.

Here K* = (B⊤X*B + R)^{−1}B⊤X*A.

The first assumption is implied by mild regularity conditions on the matrices A, B, Q, and R. See Theorem 3.2 in De Souza et al. (1986) for details. The second assumption is standard in the literature (Bensoussan et al., 2016; Saldi et al., 2018b) and ensures the stability of the LQ-MFG. In the following proposition, we show that Problem 2.1 admits a unique Nash equilibrium pair.

Proposition 3.2 (Existence and Uniqueness of Nash Equilibrium Pair). Under Assumption 3.1, the operator Λ(·) = Λ_2(·, Λ_1(·)) is L_0-Lipschitz, where L_0 is given in Assumption 3.1. Moreover, there exists a unique Nash equilibrium pair (µ*, π*) of Problem 2.1.

Proof. See §E.1 for a detailed proof.

3.2 Mean-Field Actor-Critic for LQ-MFG

The NCE principle motivates a fixed-point approach to solving Problem 2.1, which generates a sequence of policies {π_s}_{s≥0} and mean-field states {µ_s}_{s≥0} satisfying the following two properties: (i) Given the mean-field state µ_s, the policy π_s is optimal.
(ii) The mean-field state becomes µ_{s+1} as t → ∞ if all the agents follow π_s under the current mean-field state µ_s. Here (i) requires solving Problem 2.2 given the mean-field state µ_s, while (ii) requires simulating the agents following the policy π_s given the current mean-field state µ_s. Based on such properties, we propose the mean-field actor-critic in Algorithm 1.

Algorithm 1 Mean-Field Actor-Critic for solving LQ-MFG.
1: Input:
   • Initial mean-field state µ_0 and initial policy π_0 with parameters K_0 and b_0.
   • Numbers of iterations S, {N_s}_{s∈[S]}, {H_s}_{s∈[S]}, {T̃_{s,n}, T_{s,n}}_{s∈[S],n∈[N_s]}, {T̃^b_{s,h}, T^b_{s,h}}_{s∈[S],h∈[H_s]}.
   • Stepsizes {γ_s}_{s∈[S]}, {γ^b_s}_{s∈[S]}, {γ_{s,n,t}}_{s∈[S],n∈[N_s],t∈[T_{s,n}]}, {γ^b_{s,h,t}}_{s∈[S],h∈[H_s],t∈[T^b_{s,h}]}.
2: for s = 0, 1, 2, . . . , S − 1 do
3:   Policy Update: Solve for the optimal policy π_{s+1} of Problem 2.2, with parameters K_{s+1} and b_{s+1}, via Algorithm 2 with inputs µ_s, π_s, N_s, H_s, {T̃_{s,n}, T_{s,n}}_{n∈[N_s]}, {T̃^b_{s,h}, T^b_{s,h}}_{h∈[H_s]}, γ_s, γ^b_s, {γ_{s,n,t}}_{n∈[N_s],t∈[T_{s,n}]}, and {γ^b_{s,h,t}}_{h∈[H_s],t∈[T^b_{s,h}]}, which also gives the estimated mean-field state µ̂_{K_{s+1},b_{s+1}}.
4:   Mean-Field State Update: Update the mean-field state via µ_{s+1} ← µ̂_{K_{s+1},b_{s+1}}.
5: end for
6: Output: Pair (π_S, µ_S).

Algorithm 1 requires solving Problem 2.2 at each iteration to obtain π_s = Λ_1(µ_s) and µ_{s+1} = Λ_2(µ_s, π_s). To this end, we introduce the natural actor-critic in §3.3 that solves Problem 2.2.

3.3 Natural Actor-Critic for D-LQR

We now focus on solving Problem 2.2 for a fixed mean-field state µ, and thus drop the subscript µ hereafter. We write π_{K,b}(x) = −Kx + b + σ·η to emphasize the dependence on K and b, and consequently write J(K, b) = J(π_{K,b}). Now, we propose the natural actor-critic to solve Problem 2.2.
For any policy π_{K,b} ∈ Π, by the state transition in Problem 2.2, we have

x_{t+1} = (A − BK) x_t + (Bb + Aµ + d) + ε_t,  ε_t ∼ N(0, Ψ_ε),   (3.1)

where Ψ_ε = σ² BB⊤ + Ψ_ω. It is known that if ρ(A − BK) < 1, then the Markov chain {x_t}_{t≥0} induced by (3.1) has a unique stationary distribution N(µ_{K,b}, Φ_K) (Anderson and Moore, 2007), where the mean-field state µ_{K,b} and the covariance Φ_K satisfy

µ_{K,b} = (I − A + BK)^{−1} (Bb + Aµ + d),   (3.2)
Φ_K = (A − BK) Φ_K (A − BK)⊤ + Ψ_ε.   (3.3)

Meanwhile, the Bellman equation for Problem 2.2 takes the following form,

P_K = (Q + K⊤RK) + (A − BK)⊤ P_K (A − BK).   (3.4)

Then by calculation (see Proposition B.2 in §B.1 of the appendix for details), the expected total cost J(K, b) decomposes as

J(K, b) = J_1(K) + J_2(K, b) + σ² · tr(R) + µ⊤Qµ,   (3.5)

where J_1(K) and J_2(K, b) are defined as

J_1(K) = tr[(Q + K⊤RK) Φ_K] = tr(P_K Ψ_ε),
J_2(K, b) = [µ_{K,b}; b]⊤ [ Q + K⊤RK   −K⊤R
                            −RK           R ] [µ_{K,b}; b].   (3.6)

Here [µ_{K,b}; b] denotes the column vector stacking µ_{K,b} on top of b. The term J_1(K) is the expected total cost in the most studied LQR problem (Yang et al., 2019; Fazel et al., 2018), where the state transition does not have drift terms. Meanwhile, J_2(K, b) corresponds to the expected cost induced by the drift terms. The following two propositions characterize the properties of J_2(K, b). First, we show that J_2(K, b) is strongly convex in b.

Proposition 3.3. Given any K, the function J_2(K, b) is ν_K-strongly convex in b. Here ν_K = σ_min(Y⊤_{1,K} Y_{1,K} + Y⊤_{2,K} Y_{2,K}), where Y_{1,K} = R^{1/2} K (I − A + BK)^{−1} B − R^{1/2} and Y_{2,K} = Q^{1/2} (I − A + BK)^{−1} B. Also, J_2(K, b) has ι_K-Lipschitz continuous gradient in b, where ι_K is upper bounded as

ι_K ≤ [1 − ρ(A − BK)]^{−2} · (‖B‖²_2 · ‖K‖²_2 · ‖R‖_2 + ‖B‖²_2 · ‖Q‖_2).

Proof. See §E.4 for a detailed proof.
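The identities (3.2)-(3.6) can be verified numerically. The sketch below works in a hypothetical one-dimensional instance with illustrative constants and checks, in particular, that the two expressions for J_1(K) in (3.6) agree, as both follow from the stationary covariance (3.3) and the Bellman equation (3.4).

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Hypothetical 1-D instance with illustrative constants; K, b, and mu are
# arbitrary (stabilizing) choices.
A = np.array([[0.3]]); B = np.array([[1.0]])
Q = np.array([[1.0]]); R = np.array([[1.0]])
K = np.array([[0.15]]); b = np.array([[0.05]])
mu = np.array([[0.2]]); d = np.array([[0.1]])
sigma, Psi_w = 0.1, np.array([[0.01]])

Acl = A - B @ K                                  # closed-loop matrix A - BK
Psi_eps = sigma**2 * B @ B.T + Psi_w             # noise covariance in (3.1)
I = np.eye(1)

mu_Kb = np.linalg.solve(I - A + B @ K, B @ b + A @ mu + d)   # (3.2)
Phi_K = solve_discrete_lyapunov(Acl, Psi_eps)                # (3.3)
P_K = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)        # (3.4)

J1_Phi = float(np.trace((Q + K.T @ R @ K) @ Phi_K))          # first form of J_1
J1_P = float(np.trace(P_K @ Psi_eps))                        # second form of J_1
```

The two discrete Lyapunov solves implement (3.3) and (3.4) directly, and the equality of `J1_Phi` and `J1_P` is exactly the identity tr[(Q + K⊤RK)Φ_K] = tr(P_K Ψ_ε) stated in (3.6).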
Second, we show that min_b J_2(K, b) is independent of K.

Proposition 3.4. We define b_K = argmin_b J_2(K, b), where J_2(K, b) is defined in (3.6). It holds that

b_K = [K Q^{−1}(I − A)⊤ − R^{−1}B⊤] · [(I − A)Q^{−1}(I − A)⊤ + BR^{−1}B⊤]^{−1} · (Aµ + d).

Moreover, J_2(K, b_K) takes the form

J_2(K, b_K) = (Aµ + d)⊤ [(I − A)Q^{−1}(I − A)⊤ + BR^{−1}B⊤]^{−1} · (Aµ + d),

which is independent of K.

Proof. See §E.2 for a detailed proof.

Since min_b J_2(K, b) is independent of K by Proposition 3.4, the optimal K* is the same as argmin_K J_1(K). This motivates us to minimize J(K, b) by first updating K following the gradient direction ∇_K J_1(K) toward the optimal K*, and then updating b following the gradient direction ∇_b J_2(K*, b). We now design our algorithm based on this idea. We define Υ_K, p_{K,b}, and q_{K,b} as

Υ_K = [ Q + A⊤P_K A   A⊤P_K B
        B⊤P_K A       R + B⊤P_K B ] = [ Υ^{11}_K  Υ^{12}_K
                                        Υ^{21}_K  Υ^{22}_K ],
p_{K,b} = A⊤ [P_K · (Aµ + d) + f_{K,b}],
q_{K,b} = B⊤ [P_K · (Aµ + d) + f_{K,b}],   (3.7)

where f_{K,b} = (I − A + BK)^{−⊤} [(A − BK)⊤ P_K (Bb + Aµ + d) − K⊤Rb]. By calculation (see Proposition B.3 in §B.1 of the appendix for details), the gradients of J_1(K) and J_2(K, b) take the forms

∇_K J_1(K) = 2(Υ^{22}_K K − Υ^{21}_K) · Φ_K,
∇_b J_2(K, b) = Υ^{22}_K (−K µ_{K,b} + b) + Υ^{21}_K µ_{K,b} + q_{K,b}.

Our algorithm follows the natural actor-critic method (Bhatnagar et al., 2009) and the actor-critic method (Konda and Tsitsiklis, 2000).
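The claim of Proposition 3.4 can be checked numerically in a small instance: for two different stabilizing gains K, plugging b_K into J_2(K, b) from (3.6) yields the same value, equal to the stated closed form. The sketch below uses a hypothetical one-dimensional instance with illustrative constants.

```python
import numpy as np

# Hypothetical 1-D instance; all constants are illustrative assumptions.
A, B, Q, R = 0.3, 1.0, 1.0, 1.0
mu, d = 0.2, 0.1
c = A * mu + d                                   # the combined drift A mu + d
S = (1 - A) ** 2 / Q + B ** 2 / R                # (I-A)Q^{-1}(I-A)' + B R^{-1} B'

def J2(K, b):
    # Evaluate J_2(K, b) via (3.2) and (3.6) in one dimension:
    # J_2 = (Q + K^2 R) mu_Kb^2 - 2 K R mu_Kb b + R b^2.
    mu_Kb = (B * b + c) / (1 - A + B * K)
    return (Q + K**2 * R) * mu_Kb**2 - 2 * K * R * mu_Kb * b + R * b**2

vals = []
for K in (0.1, 0.3):
    b_K = (K * (1 - A) / Q - B / R) / S * c      # b_K from Proposition 3.4
    vals.append(J2(K, b_K))
```

Both gains produce the same minimum value (Aµ + d)² / S, confirming that the b-update does not interfere with finding K* = argmin_K J_1(K).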
Specifically:

(i) To obtain the optimal K*, in the critic update step, we estimate the matrix Υ_K by Υ̂_K via a policy evaluation algorithm, e.g., Algorithm 3 or Algorithm 4 (see §B.2 and §B.3 of the appendix for details); in the actor update step, we update K via K ← K − γ · (Υ̂^{22}_K K − Υ̂^{21}_K), where the term Υ̂^{22}_K K − Υ̂^{21}_K is the estimated natural gradient.

(ii) To obtain the optimal b* given K*, in the critic update step, we estimate Υ_{K*}, q_{K*,b}, and µ_{K*,b} by Υ̂_{K*}, q̂_{K*,b}, and µ̂_{K*,b} via a policy evaluation algorithm; in the actor update step, we update b via b ← b − γ · ∇̂_b J_2(K*, b), where ∇̂_b J_2(K*, b) = Υ̂^{22}_{K*}(−K* µ̂_{K*,b} + b) + Υ̂^{21}_{K*} µ̂_{K*,b} + q̂_{K*,b} is the estimated gradient.

Combining the above procedure, we obtain the natural actor-critic for Problem 2.2, which is stated in Algorithm 2.

Algorithm 2 Natural Actor-Critic Algorithm for D-LQR.
1: Input:
   • Mean-field state µ and initial policy π_{K_0,b_0}.
   • Numbers of iterations N, H, {T̃_n, T_n}_{n∈[N]}, {T̃^b_h, T^b_h}_{h∈[H]}.
   • Stepsizes γ, γ^b, {γ_{n,t}}_{n∈[N],t∈[T_n]}, {γ^b_{h,t}}_{h∈[H],t∈[T^b_h]}.
2: for n = 0, 1, 2, . . . , N − 1 do
3:   Critic Update: Compute Υ̂_{K_n} via Algorithm 3 with π_{K_n,b_0}, µ, T̃_n, T_n, {γ_{n,t}}_{t∈[T_n]}, K_0, and b_0 as inputs.
4:   Actor Update: Update the parameter via K_{n+1} ← K_n − γ · (Υ̂^{22}_{K_n} K_n − Υ̂^{21}_{K_n}).
5: end for
6: for h = 0, 1, 2, . . . , H − 1 do
7:   Critic Update: Compute µ̂_{K_N,b_h}, Υ̂_{K_N}, and q̂_{K_N,b_h} via Algorithm 3 with π_{K_N,b_h}, µ, T̃^b_h, T^b_h, {γ^b_{h,t}}_{t∈[T^b_h]}, K_0, and b_0 as inputs.
8:   Actor Update: Update the parameter via b_{h+1} ← b_h − γ^b · [Υ̂^{22}_{K_N}(−K_N µ̂_{K_N,b_h} + b_h) + Υ̂^{21}_{K_N} µ̂_{K_N,b_h} + q̂_{K_N,b_h}].
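To isolate the actor update of Algorithm 2, the sketch below runs a model-based variant of its K-loop: the critic's estimate Υ̂_K is replaced by the exact Υ_K computed from P_K in (3.4), so only the natural-gradient recursion in step 4 is exercised. This is a hypothetical one-dimensional instance with illustrative constants, not the model-free algorithm itself.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, solve_discrete_are

# Hypothetical 1-D instance; the exact critic stands in for Algorithm 3.
A = np.array([[0.3]]); B = np.array([[1.0]])
Q = np.array([[1.0]]); R = np.array([[1.0]])
gamma, N = 0.1, 300

K = np.array([[0.5]])                         # a stabilizing initial gain
for n in range(N):
    Acl = A - B @ K
    P_K = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)   # Bellman eq. (3.4)
    Ups22 = R + B.T @ P_K @ B                 # exact Upsilon^{22}_K
    Ups21 = B.T @ P_K @ A                     # exact Upsilon^{21}_K
    K = K - gamma * (Ups22 @ K - Ups21)       # natural actor update (step 4)

# Compare with the Riccati solution, whose gain the K-loop should recover.
X = solve_discrete_are(A, B, Q, R)
K_star = np.linalg.solve(R + B.T @ X @ B, B.T @ X @ A)
```

The iterates converge to the Riccati gain K*; the b-loop behaves analogously, since J_2(K*, b) is ν_{K*}-strongly convex in b by Proposition 3.3.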
9: end for
10: Output: Policy π_{K,b} = π_{K_N, b_H}, estimated mean-field state \hat μ_{K,b} = \hat μ_{K_N, b_H}.

4 Global Convergence Results

The following theorem establishes the rate of convergence of Algorithm 1 to the Nash equilibrium pair (μ^*, π^*) of Problem 2.1.

Theorem 4.1 (Convergence of Algorithm 1). For a sufficiently small tolerance ε > 0, we set the number of iterations S in Algorithm 1 such that

S > log(‖μ_0 − μ^*‖_2 · ε^{−1}) / log(1/L_0).    (4.1)

For any s ∈ [S], we define

ε_s = min{ [1 − ρ(A − BK^*)]^4 · (‖B‖_2 + ‖A‖_2)^{−4} · (‖μ_s‖^{−2}_2 + ‖d‖^{−2}_2) · σ_min(Ψ_ε) · σ_min(R) · ε^2,
           ν_{K^*} · [1 − ρ(A − BK^*)]^4 · ‖B‖^{−2}_2 · M_b(μ_s) · ε^2,
           ε_o · 2^{−s−10} },    (4.2)

where ν_{K^*} is defined in Proposition 3.3 and

M_b(μ_s) = 4‖Q^{−1}(I − A)^⊤ · [(I − A) Q^{−1} (I − A)^⊤ + B R^{−1} B^⊤]^{−1} · (Aμ_s + d)‖_2 · [ν^{−1}_{K^*} + σ^{−1}_min(Ψ_ε) · σ^{−1}_min(R)]^{1/2}.    (4.3)

In the s-th policy update step in Line 3 of Algorithm 1, we set the inputs via Theorem B.4 such that J_{μ_s}(π_{s+1}) − J_{μ_s}(π^*_{μ_s}) < ε_s, where the expected total cost J_{μ_s}(·) is defined in Problem 2.2 and π^*_{μ_s} = Λ_1(μ_s) is the optimal policy under the mean-field state μ_s. Then it holds with probability at least 1 − ε^5 that

‖μ_S − μ^*‖_2 ≤ ε,  ‖K_S − K^*‖_F ≤ ε,  ‖b_S − b^*‖_2 ≤ (1 + L_1) · ε.

Here μ^* is the Nash mean-field state, K_S and b_S are the parameters of the policy π_S, and K^* and b^* are the parameters of the Nash policy π^*.

Proof. See §D.1 for a detailed proof.

We highlight that if the inputs of Algorithm 1 satisfy the conditions stated in Theorem B.4, it holds that J_{μ_s}(π_{s+1}) − J_{μ_s}(π^*_{μ_s}) < ε_s for any s ∈ [S]. See Theorem B.4 in §B.1 of the appendix for details. By Theorem 4.1, Algorithm 1 converges linearly to the unique Nash equilibrium pair (μ^*, π^*) of Problem 2.1.
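The two-loop structure of Algorithm 2 can be sketched as follows (illustrative only: `critic` is a placeholder for the policy evaluation in Algorithm 3, and all variable names are ours):

```python
import numpy as np

def natural_actor_critic(critic, K0, b0, N, H, gamma, gamma_b):
    """Two-loop skeleton of Algorithm 2 for D-LQR.

    `critic(K, b)` stands in for the policy-evaluation step (Algorithm 3);
    it must return estimates (Y21, Y22, mu_hat, q_hat) of Upsilon^{21}_K,
    Upsilon^{22}_K, mu_{K,b}, and q_{K,b}.
    """
    K = K0.copy()
    for _ in range(N):                       # first loop: update K
        Y21, Y22, _, _ = critic(K, b0)
        K = K - gamma * (Y22 @ K - Y21)      # estimated natural gradient
    b = b0.copy()
    for _ in range(H):                       # second loop: update b
        Y21, Y22, mu_hat, q_hat = critic(K, b)
        grad_b = Y22 @ (-K @ mu_hat + b) + Y21 @ mu_hat + q_hat
        b = b - gamma_b * grad_b             # estimated gradient of J_2
    return K, b
```

The projection and stepsize schedules of the full algorithm are omitted here; they are what the convergence analysis above relies on.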
To the best of our knowledge, this theorem is the first to establish that reinforcement learning with function approximation finds the Nash equilibrium pair in mean-field games with theoretical guarantees, which lays the theoretical foundation for applying modern reinforcement learning techniques to general mean-field games.

5 Conclusion

For discrete-time linear-quadratic mean-field games, we provide sufficient conditions for the existence and uniqueness of the Nash equilibrium pair. Moreover, we propose a mean-field actor-critic algorithm with linear function approximation that is shown to converge to the Nash equilibrium pair at a linear rate. Our algorithm can be modified to use other parametrized function classes, including deep neural networks, for solving mean-field games. For future research, we aim to extend our algorithm to other variants of mean-field games, including risk-sensitive mean-field games (Saldi et al., 2018a; Tembine et al., 2014), robust mean-field games (Bauso et al., 2016), and partially observed mean-field games (Saldi et al., 2019).

References

Alizadeh, F., Haeberly, J.-P. A. and Overton, M. L. (1998). Primal-dual interior-point methods for semidefinite programming: convergence rates, stability and numerical results. SIAM Journal on Optimization, 8 746–768.

Anderson, B. D. and Moore, J. B. (2007). Optimal control: linear quadratic methods. Courier Corporation.

Araki, B., Strang, J., Pohorecky, S., Qiu, C., Naegeli, T. and Rus, D. (2017). Multi-robot path planning for a swarm of robots that can both fly and drive. In 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE.

Ash, C. (2000). Social-self-interest. Annals of Public and Cooperative Economics, 71 261–284.

Axtell, R. L. (2002). Non-cooperative dynamics of multi-agent teams.
In Autonomous Agents and Multiagent Systems.

Bardi, M. (2011). Explicit solutions of some linear-quadratic mean field games. Networks and Heterogeneous Media, 7 243–261.

Bardi, M. and Priuli, F. S. (2014). Linear-quadratic N-person and mean-field games with ergodic cost. SIAM Journal on Control and Optimization, 52 3022–3052.

Bauso, D., Tembine, H. and Başar, T. (2016). Robust mean field games. Dynamic Games and Applications, 6 277–303.

Bensoussan, A., Chau, M., Lai, Y. and Yam, S. C. P. (2017). Linear-quadratic mean field Stackelberg games with state and control delays. SIAM Journal on Control and Optimization, 55 2748–2781.

Bensoussan, A., Frehse, J. and Yam, P. (2013). Mean field games and mean field type control theory. Springer.

Bensoussan, A., Sung, K., Yam, S. C. P. and Yung, S.-P. (2016). Linear-quadratic mean field games. Journal of Optimization Theory and Applications, 169 496–529.

Bhandari, J., Russo, D. and Singal, R. (2018). A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450.

Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M. and Lee, M. (2009). Natural actor–critic algorithms. Automatica, 45 2471–2482.

Biswas, A. (2015). Mean field games with ergodic cost for discrete time Markov processes. arXiv preprint arXiv:1510.08968.

Borkar, V. S. and Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38 447–469.

Bowling, M. (2001). Rational and convergent learning in stochastic games. In International Conference on Artificial Intelligence.

Bowling, M. and Veloso, M. (2000). An analysis of stochastic game theory for multiagent reinforcement learning. Tech. rep., Carnegie Mellon University.

Bradtke, S. J. (1993).
Reinforcement learning applied to linear quadratic regulation. In Advances in Neural Information Processing Systems.

Bradtke, S. J., Ydstie, B. E. and Barto, A. G. (1994). Adaptive linear quadratic control using policy iteration. In American Control Conference, vol. 3. IEEE.

Briani, A. and Cardaliaguet, P. (2018). Stable solutions in potential mean field game systems. Nonlinear Differential Equations and Applications, 25 1.

Busoniu, L., Babuska, R. and De Schutter, B. (2008). A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38 156–172.

Caines, P. E. and Kizilkale, A. C. (2017). ε-Nash equilibria for partially observed LQG mean field games with a major player. IEEE Transactions on Automatic Control, 62 3225–3234.

Calderone, D. J. (2017). Models of Competition for Intelligent Transportation Infrastructure: Parking, Ridesharing, and External Factors in Routing Decisions. University of California, Berkeley.

Carmona, R. and Delarue, F. (2013). Probabilistic analysis of mean-field games. SIAM Journal on Control and Optimization, 51 2705–2734.

Carmona, R. and Delarue, F. (2018). Probabilistic Theory of Mean Field Games with Applications I-II. Springer.

Casgrain, P., Ning, B. and Jaimungal, S. (2019). Deep Q-learning for Nash equilibria: Nash-DQN. arXiv preprint arXiv:1904.10554.

Castro, D. D. and Meir, R. (2010). A convergent online single time scale actor critic algorithm. Journal of Machine Learning Research, 11 367–410.

Conitzer, V. and Sandholm, T. (2007). AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 67 23–43.

de Cote, E. M., Lazaric, A. and Restelli, M. (2006). Learning to cooperate in multi-agent social dilemmas.
In International Conference on Autonomous Agents and Multiagent Systems. ACM.

De Souza, C., Gevers, M. and Goodwin, G. (1986). Riccati equations in optimal filtering of nonstabilizable systems having singular state transition matrices. IEEE Transactions on Automatic Control, 31 831–838.

Dean, S., Mania, H., Matni, N., Recht, B. and Tu, S. (2017). On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688.

Dean, S., Mania, H., Matni, N., Recht, B. and Tu, S. (2018). Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems.

Doerr, B., Linares, R., Zhu, P. and Ferrari, S. (2018). Random finite set theory and optimal control for large spacecraft swarms. arXiv preprint arXiv:1810.00696.

Du, S. S., Chen, J., Li, L., Xiao, L. and Zhou, D. (2017). Stochastic variance reduction methods for policy evaluation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org.

Fang, J. (2014). The LQR controller design of two-wheeled self-balancing robot based on the particle swarm optimization algorithm. Mathematical Problems in Engineering, 2014.

Fazel, M., Ge, R., Kakade, S. M. and Mesbahi, M. (2018). Global convergence of policy gradient methods for the linear quadratic regulator. arXiv preprint arXiv:1801.05039.

Ganzfried, S. and Sandholm, T. (2009). Computing equilibria in multiplayer stochastic games of imperfect information. In Twenty-First International Joint Conference on Artificial Intelligence.

Gomes, D. A., Mohr, J. and Souza, R. R. (2010). Discrete time, finite state space mean field games. Journal de mathématiques pures et appliquées, 93 308–328.

Gomes, D. A. et al. (2014). Mean field games models: a brief survey. Dynamic Games and Applications, 4 110–154.

Guéant, O., Lasry, J.-M.
and Lions, P.-L. (2011). Mean field games and applications. In Paris-Princeton Lectures on Mathematical Finance 2010. Springer, 205–266.

Guo, X., Hu, A., Xu, R. and Zhang, J. (2019). Learning mean-field games. arXiv preprint arXiv:1901.09585.

Hardt, M., Ma, T. and Recht, B. (2016). Gradient descent learns linear dynamical systems. arXiv preprint arXiv:1609.05191.

Heinrich, J. and Silver, D. (2016). Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121.

Hu, J. and Wellman, M. P. (1998). Multiagent reinforcement learning: Theoretical framework and an algorithm. In International Conference on Machine Learning. Morgan Kaufmann Publishers Inc.

Hu, J. and Wellman, M. P. (2003). Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4 1039–1069.

Huang, J. and Huang, M. (2017). Robust mean field linear-quadratic-Gaussian games with unknown L^2-disturbance. SIAM Journal on Control and Optimization, 55 2811–2840.

Huang, J. and Li, N. (2018). Linear–quadratic mean-field game for stochastic delayed systems. IEEE Transactions on Automatic Control, 63 2722–2729.

Huang, J., Li, X. and Wang, T. (2016a). Mean-field linear-quadratic-Gaussian (LQG) games for stochastic integral systems. IEEE Transactions on Automatic Control, 61 2670–2675.

Huang, J., Wang, S. and Wu, Z. (2016b). Backward mean-field linear-quadratic-Gaussian (LQG) games: Full and partial information. IEEE Transactions on Automatic Control, 61 3784–3796.

Huang, M., Caines, P. E. and Malhamé, R. P. (2003). Individual and mass behaviour in large population stochastic wireless power control problems: centralized and Nash equilibrium solutions. In Conference on Decision and Control. IEEE.

Huang, M., Caines, P. E. and Malhamé, R. P. (2007).
Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized ε-Nash equilibria. IEEE Transactions on Automatic Control, 52 1560–1571.

Huang, M., Malhamé, R. P., Caines, P. E. et al. (2006). Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information & Systems, 6 221–252.

Huang, M. and Zhou, M. (2019). Linear quadratic mean field games: Asymptotic solvability and relation to the fixed point approach. arXiv preprint arXiv:1903.08776.

Hughes, E., Leibo, J. Z., Phillips, M., Tuyls, K., Dueñez-Guzman, E., Castañeda, A. G., Dunning, I., Zhu, T., McKee, K., Koster, R. et al. (2018). Inequity aversion improves cooperation in intertemporal social dilemmas. In Advances in Neural Information Processing Systems.

Jayakumar, S. and Aditya, M. (2019). Reinforcement learning in stationary mean-field games. In International Conference on Autonomous Agents and Multiagent Systems.

Kakade, S. M. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems.

Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems.

Korda, N. and La, P. (2015). On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence. In International Conference on Machine Learning.

Kushner, H. and Yin, G. G. (2003). Stochastic approximation and recursive algorithms and applications. Springer Science & Business Media.

Lagoudakis, M. G. and Parr, R. (2002). Value function approximation in zero-sum Markov games. In Uncertainty in Artificial Intelligence.

Lasry, J.-M. and Lions, P.-L. (2006a). Jeux à champ moyen. I – Le cas stationnaire.
Comptes Rendus Mathématique, 343 619–625.

Lasry, J.-M. and Lions, P.-L. (2006b). Jeux à champ moyen. II – Horizon fini et contrôle optimal. Comptes Rendus Mathématique, 343 679–684.

Lasry, J.-M. and Lions, P.-L. (2007). Mean field games. Japanese Journal of Mathematics, 2 229–260.

LeCun, Y., Bengio, Y. and Hinton, G. (2015). Deep learning. Nature, 521 436–444.

Leibo, J. Z., Zambaldi, V., Lanctot, M., Marecki, J. and Graepel, T. (2017). Multi-agent reinforcement learning in sequential social dilemmas. In International Conference on Autonomous Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems.

Lewis, F. L., Vrabie, D. and Syrmos, V. L. (2012). Optimal Control. John Wiley & Sons.

Li, S., Zhang, W. and Zhao, L. (2017). Connections between mean-field game and social welfare optimization. arXiv preprint arXiv:1703.10211.

Li, T. and Zhang, J.-F. (2008). Asymptotically optimal decentralized control for large population stochastic multiagent systems. IEEE Transactions on Automatic Control, 53 1643–1660.

Li, Y. (2018). Deep reinforcement learning. arXiv preprint arXiv:1810.06339.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994. Elsevier, 157–163.

Littman, M. L. (2001). Friend-or-foe Q-learning in general-sum games. In Proceedings of the Eighteenth International Conference on Machine Learning.

Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S. and Petrik, M. (2015). Finite-sample analysis of proximal gradient TD algorithms. In Conference on Uncertainty in Artificial Intelligence.

Maei, H. R. (2018). Convergent actor-critic algorithms under off-policy training and function approximation. arXiv preprint arXiv:1802.07842.

Magnus, J. R. (1979).
The expectation of products of quadratic forms in normal variables: the practice. Statistica Neerlandica, 33 131–136.

Magnus, J. R. et al. (1978). The moments of products of quadratic forms in normal variables. Univ., Instituut voor Actuariaat en Econometrie.

Malik, D., Pananjady, A., Bhatia, K., Khamaru, K., Bartlett, P. L. and Wainwright, M. J. (2018). Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. arXiv preprint arXiv:1812.08305.

Mguni, D., Jennings, J. and de Cote, E. M. (2018). Decentralised learning in systems with many, many strategic agents. In Thirty-Second AAAI Conference on Artificial Intelligence.

Minciardi, R. and Sacile, R. (2011). Optimal control in a cooperative network of smart power grids. IEEE Systems Journal, 6 126–133.

Moon, J. and Başar, T. (2014). Discrete-time LQG mean field games with unreliable communication. In Conference on Decision and Control. IEEE.

Moon, J. and Başar, T. (2018). Linear quadratic mean field Stackelberg differential games. Automatica, 97 200–213.

Moravčík, M., Schmid, M., Burch, N., Lisý, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M. and Bowling, M. (2017). Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356 508–513.

Nash, J. (1951). Non-cooperative games. Annals of Mathematics 286–295.

OpenAI (2018). OpenAI Five. https://blog.openai.com/openai-five/.

Pérolat, J., Piot, B., Geist, M., Scherrer, B. and Pietquin, O. (2016a). Softened approximate policy iteration for Markov games. In International Conference on Machine Learning.

Pérolat, J., Piot, B. and Pietquin, O. (2018). Actor-critic fictitious play in simultaneous move multistage games. In International Conference on Artificial Intelligence and Statistics.

Pérolat, J., Piot, B., Scherrer, B.
and Pietquin, O. (2016b). On the use of non-stationary strategies for solving two-player zero-sum Markov games. In International Conference on Artificial Intelligence and Statistics.

Perolat, J., Scherrer, B., Piot, B. and Pietquin, O. (2015). Approximate dynamic programming for two-player zero-sum Markov games. In International Conference on Machine Learning (ICML 2015).

Peters, J. and Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71 1180–1190.

Rudelson, M., Vershynin, R. et al. (2013). Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18.

Saldi, N., Basar, T. and Raginsky, M. (2018a). Discrete-time risk-sensitive mean-field games. arXiv preprint arXiv:1808.03929.

Saldi, N., Basar, T. and Raginsky, M. (2018b). Markov–Nash equilibria in mean-field games with discounted cost. SIAM Journal on Control and Optimization, 56 4256–4287.

Saldi, N., Basar, T. and Raginsky, M. (2019). Approximate Nash equilibria in partially observed stochastic games with mean-field interactions. Mathematics of Operations Research.

Sandholm, W. H. (2010). Population Games and Evolutionary Dynamics. MIT Press.

Shalev-Shwartz, S., Shammah, S. and Shashua, A. (2016). Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295.

Shoham, Y., Powers, R. and Grenager, T. (2003). Multi-agent reinforcement learning: a critical survey.

Shoham, Y., Powers, R. and Grenager, T. (2007). If multi-agent learning is the answer, what is the question? Artificial Intelligence, 171 365–377.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529 484–489.

Silver, D., Schrittwieser, J.
, Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A. et al. (2017). Mastering the game of Go without human knowledge. Nature, 550 354.

Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C. and Wiewiora, E. (2009a). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM.

Sutton, R. S., Maei, H. R. and Szepesvári, C. (2009b). A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems.

Sutton, R. S., McAllester, D. A., Singh, S. P. and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.

Sznitman, A.-S. (1991). Topics in propagation of chaos. In École d'été de probabilités de Saint-Flour XIX, 1989. Springer, 165–251.

Tembine, H. and Huang, M. (2011). Mean field difference games: McKean-Vlasov dynamics. In Conference on Decision and Control and European Control Conference. IEEE.

Tembine, H., Zhu, Q. and Başar, T. (2014). Risk-sensitive mean-field games. IEEE Transactions on Automatic Control, 59 835–850.

Tu, S. and Recht, B. (2017). Least-squares temporal difference learning for the linear quadratic regulator. arXiv preprint arXiv:1712.08642.

Tu, S. and Recht, B. (2018). The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. arXiv preprint arXiv:1812.03565.

uz Zaman, M. A., Zhang, K., Miehling, E. and Basar, T. (2019).
Approximate equilibrium computation for discrete-time linear-quadratic mean-field games. Manuscript.

Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czarnecki, W., Dudzik, A., Huang, A., Georgiev, P., Powell, R. et al. (2019). AlphaStar: Mastering the real-time strategy game StarCraft II.

Wai, H.-T., Yang, Z., Wang, P. Z. and Hong, M. (2018). Multi-agent reinforcement learning via double averaging primal-dual optimization. In Advances in Neural Information Processing Systems.

Wang, B.-C. and Zhang, J.-F. (2012). Mean field games for large-population multiagent systems with Markov jump parameters. SIAM Journal on Control and Optimization, 50 2308–2334.

Wang, J., Zhang, W., Yuan, S. et al. (2017a). Display advertising with real-time bidding (RTB) and behavioural targeting. Foundations and Trends in Information Retrieval, 11 297–435.

Wang, Y., Chen, W., Liu, Y., Ma, Z.-M. and Liu, T.-Y. (2017b). Finite sample analysis of the GTD policy evaluation algorithms in Markov setting. In Advances in Neural Information Processing Systems.

Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, 8 279–292.

Wei, C.-Y., Hong, Y.-T. and Lu, C.-J. (2017). Online reinforcement learning in stochastic games. In Advances in Neural Information Processing Systems.

Yang, E. and Gu, D. (2004). Multiagent reinforcement learning for multi-robot systems: A survey. Manuscript.

Yang, J., Ye, X., Trivedi, R., Xu, H. and Zha, H. (2018a). Deep mean field games for learning optimal behavior policy of large populations. In International Conference on Learning Representations.

Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W. and Wang, J. (2018b). Mean field multi-agent reinforcement learning. arXiv preprint arXiv:1802.05438.

Yang, Z., Chen, Y., Hong, M. and Wang, Z. (2019).
On the global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. arXiv preprint arXiv:1907.06246.

Yu, H. (2017). On convergence of some gradient-based temporal-differences algorithms for off-policy learning. arXiv preprint arXiv:1712.09652.

Zhang, K., Yang, Z., Liu, H., Zhang, T. and Başar, T. (2018). Finite-sample analyses for fully decentralized multi-agent reinforcement learning. arXiv preprint arXiv:1812.02783.

Zhou, X. Y. and Li, D. (2000). Continuous-time mean-variance portfolio selection: A stochastic LQ framework. Applied Mathematics and Optimization, 42 19–33.

Ziebart, B. D., Maas, A. L., Bagnell, J. A. and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, vol. 3.

Zou, S., Xu, T. and Liang, Y. (2019). Finite-sample analysis for SARSA and Q-learning with linear function approximation. arXiv preprint arXiv:1902.02234.

A Notations in the Appendix

In the proofs, for convenience, for any invertible matrix M, we denote M^{−⊤} = (M^{−1})^⊤ = (M^⊤)^{−1}, and we write ‖M‖_F for the Frobenius norm. We denote by svec(M) the symmetric vectorization of the symmetric matrix M, which is the vectorization of the upper triangular part of M with the off-diagonal entries scaled by √2, and by smat(·) the inverse operation. For any matrices G and H, we denote by G ⊗ H the Kronecker product and by G ⊗_s H the symmetric Kronecker product, which is defined as the mapping on a vector svec(M) such that (G ⊗_s H) svec(M) = 1/2 · svec(HMG^⊤ + GMH^⊤). For notational simplicity, we write E_π(·) to emphasize that the expectation is taken following the policy π.

B Auxiliary Algorithms and Analysis

B.1 Results on D-LQR

In this section, we provide auxiliary results for analyzing Problem 2.2.
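Since the svec/smat operators of Appendix A are used throughout this appendix, here is a minimal NumPy sketch of them (our own illustration, not part of the paper). The √2 scaling is what makes svec compatible with the trace inner product: svec(A)^⊤ svec(B) = tr(AB) for symmetric A and B.

```python
import numpy as np

def svec(M):
    """Symmetric vectorization of a symmetric matrix M: the upper-triangular
    entries, with off-diagonal entries scaled by sqrt(2)."""
    iu = np.triu_indices(M.shape[0])
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    return scale * M[iu]

def smat(v):
    """Inverse of svec: rebuild the symmetric matrix from its svec."""
    n = int((np.sqrt(8 * v.size + 1) - 1) / 2)   # solve n(n+1)/2 = len(v)
    M = np.zeros((n, n))
    iu = np.triu_indices(n)
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    M[iu] = v / scale
    return M + M.T - np.diag(np.diag(M))
```

These conventions match the definition above; the symmetric Kronecker product then acts on svec-coordinates.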
First, we introduce the value functions of the Markov decision process (MDP) induced by Problem 2.2. We define the state- and action-value functions V_{K,b}(x) and Q_{K,b}(x, u) as

V_{K,b}(x) = Σ_{t=0}^∞ { E[c(x_t, u_t) | x_0 = x] − J(K, b) },    (B.1)
Q_{K,b}(x, u) = c(x, u) − J(K, b) + E[V_{K,b}(x_1) | x_0 = x, u_0 = u],    (B.2)

where x_t follows the state transition and u_t follows the policy π_{K,b} given x_t. In other words, we have u_t = −K x_t + b + σ η_t, where η_t ∼ N(0, I). The following proposition establishes the closed forms of these value functions.

Proposition B.1. The state-value function V_{K,b}(x) takes the form of

V_{K,b}(x) = x^⊤ P_K x − tr(P_K Φ_K) + 2 f^⊤_{K,b}(x − μ_{K,b}) − μ^⊤_{K,b} P_K μ_{K,b},    (B.3)

and the action-value function Q_{K,b}(x, u) takes the form of

Q_{K,b}(x, u) = (x; u)^⊤ Υ_K (x; u) + 2 (p_{K,b}; q_{K,b})^⊤ (x; u) − tr(P_K Φ_K) − σ^2 · tr(R + P_K B B^⊤) − b^⊤ R b + 2 b^⊤ R K μ_{K,b} − μ^⊤_{K,b}(Q + K^⊤ R K + P_K) μ_{K,b} + 2 f^⊤_{K,b}[(Aμ + d) − μ_{K,b}] + (Aμ + d)^⊤ P_K (Aμ + d),    (B.4)

where (x; u) denotes the column vector stacking x and u, f_{K,b} = (I − A + BK)^{−⊤}[(A − BK)^⊤ P_K (Bb + Aμ + d) − K^⊤ R b], and Υ_K, p_{K,b}, and q_{K,b} are defined in (3.7).

Proof. See §E.6 for a detailed proof.

By Proposition B.1, we know that V_{K,b}(x) is quadratic in x, while Q_{K,b}(x, u) is quadratic in (x^⊤, u^⊤)^⊤. Now, we show that (3.5) holds.

Proposition B.2. The expected total cost J(K, b) defined in Problem 2.2 takes the form of

J(K, b) = J_1(K) + J_2(K, b) + σ^2 · tr(R) + μ^⊤ Q μ,

where

J_1(K) = tr[(Q + K^⊤ R K) Φ_K] = tr(P_K Ψ_ε),
J_2(K, b) = (μ_{K,b}; b)^⊤ ( Q + K^⊤ R K   −K^⊤ R
                             −R K            R ) (μ_{K,b}; b).

Here μ_{K,b} is defined in (3.2), Φ_K is defined in (3.3), and P_K is defined in (3.4).

Proof. See §E.3 for a detailed proof.

The following proposition establishes the gradients of J_1(K) and J_2(K, b), respectively.

Proposition B.3.
The gradient of J_1(K) and the gradient of J_2(K, b) with respect to b take the forms of

∇_K J_1(K) = 2(Υ^{22}_K K − Υ^{21}_K) · Φ_K,
∇_b J_2(K, b) = 2[Υ^{22}_K (−K μ_{K,b} + b) + Υ^{21}_K μ_{K,b} + q_{K,b}],

where Υ_K and q_{K,b} are defined in (3.7).

Proof. See §E.5 for a detailed proof.

The following theorem establishes the convergence of Algorithm 2.

Theorem B.4 (Convergence of Algorithm 2). Assume that ρ(A − BK_0) < 1. Let ε > 0 be a sufficiently small tolerance. We set

γ ≤ [‖R‖_2 + ‖B‖^2_2 · J(K_0, b_0) · σ^{−1}_min(Ψ_ε)]^{−1},
N ≥ C · ‖Φ_{K^*}‖_2 · γ^{−1} · log{4[J(K_0, b_0) − J(K^*, b^*)] · ε^{−1}},
T_n ≥ poly(‖K_n‖_F, ‖b_0‖_2, ‖μ‖_2, J(K_0, b_0)) · λ^{−4}_{K_n} · [1 − ρ(A − BK_n)]^{−9} · ε^{−5},
\tilde T_n ≥ poly(‖K_n‖_F, ‖b_0‖_2, ‖μ‖_2, J(K_0, b_0)) · λ^{−2}_{K_n} · [1 − ρ(A − BK_n)]^{−12} · ε^{−12},
γ_{n,t} = γ_0 · t^{−1/2},
γ^b ≤ min{ 1 − ρ(A − BK_N), [1 − ρ(A − BK_N)]^2 · [‖B‖^2_2 · ‖K_N‖^2_2 · ‖R‖_2 + ‖B‖^2_2 · ‖Q‖_2]^{−1} },
H ≥ C_0 · ν^{−1}_{K_N} · (γ^b)^{−1} · log{4[J(K_N, b_0) − J(K_N, b_{K_N})] · ε^{−1}},
T^b_h ≥ poly(‖K_N‖_F, ‖b_h‖_2, ‖μ‖_2, J(K_N, b_0)) · λ^{−4}_{K_N} · ν^{−4}_{K_N} · [1 − ρ(A − BK_N)]^{−11} · ε^{−5},
\tilde T^b_h ≥ poly(‖K_N‖_F, ‖b_h‖_2, ‖μ‖_2, J(K_N, b_0)) · λ^{−4}_{K_N} · ν^{−2}_{K_N} · [1 − ρ(A − BK_N)]^{−17} · ε^{−8},
γ^b_{h,t} = γ_0 · t^{−1/2},

where C, C_0, and γ_0 are positive absolute constants, {K_n}_{n ∈ [N]} and {b_h}_{h ∈ [H]} are the sequences generated by Algorithm 2, λ_{K_n} is specified in Proposition B.6, and ν_{K_N} is specified in Proposition 3.3. Then it holds with probability at least 1 − ε^{10} that

J(K_N, b_H) − J(K^*, b^*) < ε,
‖b_H − b^*‖_2 ≤ M_b(μ) · ε^{1/2},
‖K_N − K^*‖_F ≤ [σ^{−1}_min(Ψ_ε) · σ^{−1}_min(R) · ε]^{1/2},
‖\hat μ_{K_N, b_H} − μ_{K^*, b^*}‖_2 ≤ ε,

where M_b(μ) is defined in (4.3).

Proof.
See §D.2 for a detailed proof.

By Theorem B.4, given any mean-field state μ, Algorithm 2 converges linearly to the optimal policy π^*_μ of Problem 2.2.

B.2 Primal-Dual Policy Evaluation Algorithm

Note that the critic update steps in Algorithm 2 are built upon estimators of the matrix Υ_K and the vector q_{K,b}. We now derive a policy evaluation algorithm that constructs the estimators of Υ_K and q_{K,b}, based on the gradient temporal difference algorithm (Sutton et al., 2009a).

We define the feature vector as

ψ(x, u) = ( φ(x, u); x − μ_{K,b}; u − (−K μ_{K,b} + b) ),    (B.5)

where

φ(x, u) = svec[ ( x − μ_{K,b}; u − (−K μ_{K,b} + b) ) ( x − μ_{K,b}; u − (−K μ_{K,b} + b) )^⊤ ].

Recall that svec(M) gives the symmetric vectorization of the symmetric matrix M. We also define

α_{K,b} = ( svec(Υ_K); Υ_K (μ_{K,b}; −K μ_{K,b} + b) + (p_{K,b}; q_{K,b}) ),    (B.6)

where Υ_K, p_{K,b}, and q_{K,b} are defined in (3.7). To estimate Υ_K and q_{K,b}, it suffices to estimate α_{K,b}. Meanwhile, we define

Θ_{K,b} = E_{π_{K,b}}{ ψ(x, u)[ψ(x, u) − ψ(x′, u′)]^⊤ },    (B.7)

where (x′, u′) is the state-action pair after (x, u) following the policy π_{K,b} and the state transition. The following proposition characterizes the connection between Θ_{K,b} and α_{K,b}.

Proposition B.5. It holds that

( 1                      0
  E_{π_{K,b}}[ψ(x, u)]   Θ_{K,b} ) ( J(K, b); α_{K,b} ) = ( J(K, b); E_{π_{K,b}}[c(x, u) ψ(x, u)] ),

where ψ(x, u) is defined in (B.5), α_{K,b} is defined in (B.6), and Θ_{K,b} is defined in (B.7).

Proof. See §E.7 for a detailed proof.

By Proposition B.5, to obtain α_{K,b}, it suffices to solve the following linear system in ζ = (ζ^1, (ζ^2)^⊤)^⊤,

\tilde Θ_{K,b} · ζ = ( J(K, b); E_{π_{K,b}}[c(x, u) ψ(x, u)] ),    (B.8)

where for notational convenience, we define

\tilde Θ_{K,b} = ( 1                      0
                   E_{π_{K,b}}[ψ(x, u)]   Θ_{K,b} ).    (B.9)

The following proposition shows that Θ_{K,b} is invertible.

Proposition B.6.
If ρ(A − BK) < 1, then the matrix Θ_{K,b} is invertible, and

‖Θ_{K,b}‖_2 ≤ 4(1 + ‖K‖^2_F)^2 · ‖Φ_K‖^2_2.

Also, σ_min(\tilde Θ_{K,b}) ≥ λ_K, where λ_K only depends on ‖K‖_2 and ρ(A − BK).

Proof. See §E.8 for a detailed proof.

By Proposition B.6, Θ_{K,b} is invertible. Therefore, (B.8) admits the unique solution ζ_{K,b} = (J(K, b), α^⊤_{K,b})^⊤. Now, we present the primal-dual gradient temporal difference algorithm.

Primal-Dual Gradient Method. Instead of solving (B.8) directly, we minimize the following loss function with respect to ζ = (ζ^1, (ζ^2)^⊤)^⊤,

[ζ^1 − J(K, b)]^2 + ‖E_{π_{K,b}}[ψ(x, u)] ζ^1 + Θ_{K,b} ζ^2 − E_{π_{K,b}}[c(x, u) ψ(x, u)]‖^2_2.    (B.10)

By Fenchel duality, the minimization of (B.10) is equivalent to the following primal-dual min-max problem,

min_{ζ ∈ V_ζ} max_{ξ ∈ V_ξ} F(ζ, ξ) = { E_{π_{K,b}}[ψ(x, u)] ζ^1 + Θ_{K,b} ζ^2 − E_{π_{K,b}}[c(x, u) ψ(x, u)] }^⊤ ξ^2 + [ζ^1 − J(K, b)] · ξ^1 − ‖ξ‖^2_2 / 2,    (B.11)

where we restrict the primal variable ζ to a compact set V_ζ and the dual variable ξ to a compact set V_ξ, which are specified in Definition B.7. It holds that

∇_{ζ^1} F = ξ^1 + E_{π_{K,b}}[ψ(x, u)]^⊤ ξ^2,    ∇_{ζ^2} F = Θ^⊤_{K,b} ξ^2,
∇_{ξ^1} F = ζ^1 − J(K, b) − ξ^1,    ∇_{ξ^2} F = E_{π_{K,b}}[ψ(x, u)] ζ^1 + Θ_{K,b} ζ^2 − E_{π_{K,b}}[c(x, u) ψ(x, u)] − ξ^2.    (B.12)

The primal-dual gradient method descends in the primal variable ζ and ascends in the dual variable ξ,

ζ^1 ← ζ^1 − γ · ∇_{ζ^1} F(ζ, ξ),    ζ^2 ← ζ^2 − γ · ∇_{ζ^2} F(ζ, ξ),
ξ^1 ← ξ^1 + γ · ∇_{ξ^1} F(ζ, ξ),    ξ^2 ← ξ^2 + γ · ∇_{ξ^2} F(ζ, ξ).    (B.13)

Estimation of the Mean-Field State μ_{K,b}. To use the primal-dual gradient method in (B.13), it remains to evaluate the feature vector ψ(x, u). Note that by (B.5), the evaluation of the feature vector ψ(x, u) requires the mean-field state μ_{K,b}.
In what follows, we construct the estimator $\hat\mu_{K,b}$ of the mean-field state $\mu_{K,b}$ by simulating the MDP under the policy $\pi_{K,b}$ for $\tilde T$ steps, and compute the estimated feature vector $\hat\psi(x,u)$ as
$$\hat\psi(x,u)=\begin{pmatrix}\hat\phi(x,u)\\ x-\hat\mu_{K,b}\\ u-(-K\hat\mu_{K,b}+b)\end{pmatrix}, \qquad (B.14)$$
where $\hat\phi(x,u)$ takes the form
$$\hat\phi(x,u)=\mathrm{svec}\Biggl[\begin{pmatrix}x-\hat\mu_{K,b}\\ u-(-K\hat\mu_{K,b}+b)\end{pmatrix}\begin{pmatrix}x-\hat\mu_{K,b}\\ u-(-K\hat\mu_{K,b}+b)\end{pmatrix}^{\!\top}\Biggr].$$
We now define the sets $\mathcal V_\zeta$ and $\mathcal V_\xi$ in (B.11).

Definition B.7. Given $K_0$ and $b_0$ such that $\rho(A-BK_0)<1$ and $J(K_0,b_0)<\infty$, we define the sets $\mathcal V_\zeta$ and $\mathcal V_\xi$ as
$$\mathcal V_\zeta=\bigl\{\zeta:\ 0\le\zeta^1\le J(K_0,b_0),\ \|\zeta^2\|_2\le M_{\zeta,1}+M_{\zeta,2}\cdot(1+\|K\|_F)\cdot\bigl(1-\rho(A-BK)\bigr)^{-1}\bigr\},$$
$$\mathcal V_\xi=\bigl\{\xi:\ |\xi^1|\le J(K_0,b_0),\ \|\xi^2\|_2\le M_\xi\cdot\bigl(1+\|K\|_F^2\bigr)^3\cdot\bigl(1-\rho(A-BK)\bigr)^{-1}\bigr\}.$$
Here $M_{\zeta,1}$, $M_{\zeta,2}$, and $M_\xi$ are constants independent of $K$ and $b$, which take the forms
$$M_{\zeta,1}=\bigl[(\|Q\|_F+\|R\|_F)+(\|A\|_F^2+\|B\|_F^2)\cdot\sqrt d\cdot J(K_0,b_0)\cdot\sigma_{\min}^{-1}(\Psi_\omega)\bigr]+(\|A\|_2+\|B\|_2)\cdot J(K_0,b_0)^2\cdot\sigma_{\min}^{-1}(\Psi_\omega)\cdot\sigma_{\min}^{-1}(Q)$$
$$\qquad+\bigl[(\|Q\|_2+\|R\|_2)+(\|A\|_2+\|B\|_2)^2\cdot J(K_0,b_0)\cdot\sigma_{\min}^{-1}(\Psi_\omega)\bigr]\cdot J(K_0,b_0)\cdot\bigl(\sigma_{\min}^{-1}(Q)+\sigma_{\min}^{-1}(R)\bigr),$$
$$M_{\zeta,2}=(\|A\|_2+\|B\|_2)\cdot(\kappa_Q+\kappa_R),\qquad M_\xi=C\cdot(M_{\zeta,1}+M_{\zeta,2})\cdot J(K_0,b_0)^2\cdot\sigma_{\min}^{-2}(Q),$$
where $C$ is a positive absolute constant, and $\kappa_Q$ and $\kappa_R$ are the condition numbers of $Q$ and $R$, respectively.

We summarize the primal-dual gradient temporal difference algorithm in Algorithm 3. Hereafter, for notational convenience, we denote by $\hat\psi_t$ the estimated feature vector $\hat\psi(x_t,u_t)$.

Algorithm 3 Primal-Dual Gradient Temporal Difference Algorithm.
1: Input: Policy $\pi_{K,b}$, mean-field state $\mu$, numbers of iterations $\tilde T$ and $T$, stepsizes $\{\gamma_t\}_{t\in[T]}$, parameters $K_0$ and $b_0$.
2: Define the sets $\mathcal V_\zeta$ and $\mathcal V_\xi$ via Definition B.7 with $K_0$ and $b_0$.
3: Initialize the parameters by $\zeta_0\in\mathcal V_\zeta$ and $\xi_0\in\mathcal V_\xi$.
4: Sample $\tilde x_0$ from the stationary distribution $N(\mu_{K,b},\Phi_K)$.
5: for $t=0,\dots,\tilde T-1$ do
6:   Given the mean-field state $\mu$, take action $\tilde u_t$ following $\pi_{K,b}$ and generate the next state $\tilde x_{t+1}$.
7: end for
8: Set $\hat\mu_{K,b}\leftarrow 1/\tilde T\cdot\sum_{t=1}^{\tilde T}\tilde x_t$ and compute the estimated feature vector $\hat\psi$ via (B.14).
9: Sample $x_0$ from the stationary distribution $N(\mu_{K,b},\Phi_K)$.
10: for $t=0,\dots,T-1$ do
11:   Given the mean-field state $\mu$, take action $u_t$ following $\pi_{K,b}$, observe the cost $c_t$, and generate the next state $x_{t+1}$.
12:   Set $\delta_{t+1}\leftarrow\zeta^1_t+(\hat\psi_t-\hat\psi_{t+1})^\top\zeta^2_t-c_t$.
13:   Update parameters via
$$\zeta^1_{t+1}\leftarrow\zeta^1_t-\gamma_{t+1}\cdot(\xi^1_t+\hat\psi_t^\top\xi^2_t),\qquad \zeta^2_{t+1}\leftarrow\zeta^2_t-\gamma_{t+1}\cdot\hat\psi_t(\hat\psi_t-\hat\psi_{t+1})^\top\xi^2_t,$$
$$\xi^1_{t+1}\leftarrow(1-\gamma_{t+1})\cdot\xi^1_t+\gamma_{t+1}\cdot(\zeta^1_t-c_t),\qquad \xi^2_{t+1}\leftarrow(1-\gamma_{t+1})\cdot\xi^2_t+\gamma_{t+1}\cdot\delta_{t+1}\cdot\hat\psi_t.$$
14:   Project $\zeta_{t+1}$ and $\xi_{t+1}$ onto $\mathcal V_\zeta$ and $\mathcal V_\xi$, respectively.
15: end for
16: Set $\hat\alpha_{K,b}\leftarrow(\sum_{t=1}^T\gamma_t)^{-1}\cdot(\sum_{t=1}^T\gamma_t\cdot\zeta^2_t)$, and
$$\hat\Upsilon_K\leftarrow\mathrm{smat}(\hat\alpha_{K,b,1}),\qquad \begin{pmatrix}\hat p_{K,b}\\ \hat q_{K,b}\end{pmatrix}\leftarrow\hat\alpha_{K,b,2}-\hat\Upsilon_K\begin{pmatrix}\hat\mu_{K,b}\\ -K\hat\mu_{K,b}+b\end{pmatrix},$$
where $\hat\alpha_{K,b,1}=(\hat\alpha_{K,b})_1^{(k+d+1)(k+d)/2}$ and $\hat\alpha_{K,b,2}=(\hat\alpha_{K,b})_{(k+d+1)(k+d)/2+1}^{(k+d+3)(k+d)/2}$.
17: Output: Estimators $\hat\mu_{K,b}$, $\hat\Upsilon_K$, and $\hat q_{K,b}$.

We now characterize the rate of convergence of Algorithm 3.

Theorem B.8 (Convergence of Algorithm 3). Given $K_0$, $b_0$, $K$, and $b$ such that $\rho(A-BK_0)<1$ and $J(K,b)\le J(K_0,b_0)$, we define the sets $\mathcal V_\zeta$ and $\mathcal V_\xi$ through Definition B.7. Let $\gamma_t=\gamma_0t^{-1/2}$, where $\gamma_0$ is a positive absolute constant, and let $\rho\in(\rho(A-BK),1)$. For $\tilde T\ge\mathrm{poly}_0(\|K\|_F,\|b\|_2,\|\mu\|_2,J(K_0,b_0))\cdot(1-\rho)^{-6}$ and a sufficiently large $T$, it holds with probability at least $1-T^{-4}-\tilde T^{-6}$ that
$$\|\hat\alpha_{K,b}-\alpha_{K,b}\|_2^2\le\lambda_K^{-2}\cdot\mathrm{poly}_1\bigl(\|K\|_F,\|b\|_2,\|\mu\|_2,J(K_0,b_0)\bigr)\cdot\Bigl(\frac{\log^6T}{T^{1/2}\cdot(1-\rho)^4}+\frac{\log\tilde T}{\tilde T^{1/4}\cdot(1-\rho)^2}\Bigr),$$
where $\lambda_K$ is defined in Proposition B.6. The same bounds hold for $\|\hat\Upsilon_K-\Upsilon_K\|_F^2$, $\|\hat p_{K,b}-p_{K,b}\|_2^2$, and $\|\hat q_{K,b}-q_{K,b}\|_2^2$. Meanwhile, it holds with probability at least $1-\tilde T^{-6}$ that
$$\|\hat\mu_{K,b}-\mu_{K,b}\|_2\le\frac{\log\tilde T}{\tilde T^{1/4}}\cdot(1-\rho)^{-2}\cdot\mathrm{poly}_2\bigl(\|\Phi_K\|_2,\|K\|_F,\|b\|_2,\|\mu\|_2,J(K_0,b_0)\bigr).$$

Proof. See § D.3 for a detailed proof.

B.3 Temporal Difference Policy Evaluation Algorithm

Besides the primal-dual gradient temporal difference algorithm, in practice we can also evaluate $\alpha_{K,b}$ by the TD(0) method (Sutton and Barto, 2018), which is presented in Algorithm 4.

Algorithm 4 Temporal Difference Policy Evaluation Algorithm.
1: Input: Policy $\pi_{K,b}$, numbers of iterations $\tilde T$ and $T$, stepsizes $\{\gamma_t\}_{t\in[T]}$.
2: Sample $\tilde x_0$ from the stationary distribution $N(\mu_{K,b},\Phi_K)$.
3: for $t=0,\dots,\tilde T-1$ do
4:   Take action $\tilde u_t$ under the policy $\pi_{K,b}$ and generate the next state $\tilde x_{t+1}$.
5: end for
6: Set $\hat\mu_{K,b}\leftarrow 1/\tilde T\cdot\sum_{t=1}^{\tilde T}\tilde x_t$.
7: Sample $x_0$ from the stationary distribution $N(\mu_{K,b},\Phi_K)$.
8: for $t=0,\dots,T$ do
9:   Given the mean-field state $\mu$, take action $u_t$ following $\pi_{K,b}$, observe the cost $c_t$, and generate the next state $x_{t+1}$.
10:   Set $\delta_{t+1}\leftarrow\zeta^1_t+(\hat\psi_t-\hat\psi_{t+1})^\top\zeta^2_t-c_t$.
11:   Update parameters via $\zeta^1_{t+1}\leftarrow(1-\gamma_{t+1})\cdot\zeta^1_t+\gamma_{t+1}\cdot c_t$ and $\zeta^2_{t+1}\leftarrow\zeta^2_t-\gamma_{t+1}\cdot\delta_{t+1}\cdot\hat\psi_t$.
12:   Project $\zeta_t$ onto $\mathcal V'_\zeta$, where $\mathcal V'_\zeta$ is a compact set.
13: end for
14: Set $\hat\alpha_{K,b}\leftarrow(\sum_{t=1}^T\gamma_t)^{-1}\cdot(\sum_{t=1}^T\gamma_t\cdot\zeta^2_t)$, and
$$\hat\Upsilon_K\leftarrow\mathrm{smat}(\hat\alpha_{K,b,1}),\qquad \begin{pmatrix}\hat p_{K,b}\\ \hat q_{K,b}\end{pmatrix}\leftarrow\hat\alpha_{K,b,2}-\hat\Upsilon_K\begin{pmatrix}\hat\mu_{K,b}\\ -K\hat\mu_{K,b}+b\end{pmatrix},$$
where $\hat\alpha_{K,b,1}=(\hat\alpha_{K,b})_1^{(k+d+1)(k+d)/2}$ and $\hat\alpha_{K,b,2}=(\hat\alpha_{K,b})_{(k+d+1)(k+d)/2+1}^{(k+d+3)(k+d)/2}$.
15: Output: Estimators $\hat\mu_{K,b}$, $\hat\Upsilon_K$, and $\hat q_{K,b}$.

Note that in the related literature (Bhandari et al., 2018; Korda and La, 2015), non-asymptotic convergence analysis of the TD(0) method with linear function approximation applies only to discounted MDPs. In our ergodic setting, the convergence of the TD(0) method is established only asymptotically (Borkar and Meyn, 2000; Kushner and Yin, 2003) via the ordinary differential equation method. Therefore, in the convergence theorems in § 3, we focus on the primal-dual gradient temporal difference method (Algorithm 3) to establish non-asymptotic convergence results.

C General Formulation

In this section, we study a general formulation of LQ-MFG. Compared with Problem 2.1, this general formulation includes an additional cross term $x_t^\top P\,\mathbb{E}x_t^*$ in the cost function. We define it as follows.

Problem C.1 (General LQ-MFG). We consider the following formulation,
$$x_{t+1}=Ax_t+Bu_t+A\,\mathbb{E}x_t^*+d+\omega_t,$$
$$\tilde c(x_t,u_t)=x_t^\top Qx_t+u_t^\top Ru_t+(\mathbb{E}x_t^*)^\top Q(\mathbb{E}x_t^*)+2x_t^\top P(\mathbb{E}x_t^*),$$
$$\tilde J(\pi)=\lim_{T\to\infty}\mathbb{E}\Bigl[\frac1T\sum_{t=0}^T\tilde c(x_t,u_t)\Bigr],$$
where $x_t\in\mathbb R^m$ is the state vector, $u_t\in\mathbb R^k$ is the action vector generated by the policy $\pi$, $\{x_t^*\}_{t\ge0}$ is the trajectory generated by a Nash policy $\pi^*$ (assuming it exists), $\omega_t\in\mathbb R^m$ is an independent random noise term following the Gaussian distribution $N(0,\Psi_\omega)$, and $d\in\mathbb R^m$ is a drift term. Here the expectation in $\mathbb{E}x_t^*$ is taken across all the agents. We aim to find $\pi^*$ such that $\tilde J(\pi^*)=\inf_{\pi\in\Pi}\tilde J(\pi)$.
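With the population term frozen at a fixed mean-field state $\mu$, the dynamics above reduce to a drifted linear system under a linear policy $u=-Kx+b$. The following sketch (all matrices synthetic, chosen only so that $A-BK$ is stable) simulates such a system and checks that the long-run average state matches the closed-form fixed point $(I-A+BK)^{-1}(Bb+A\mu+d)$, which plays the role of $\mu_{K,b}$ in (3.2).

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 3, 2
# Synthetic instance of the drifted dynamics with a frozen mean-field state mu.
A = 0.4 * np.eye(m)
B = 0.3 * rng.standard_normal((m, k))
K = 0.1 * rng.standard_normal((k, m))
b = 0.5 * rng.standard_normal(k)
mu = rng.standard_normal(m)
d = rng.standard_normal(m)

# The closed-loop matrix A - B K must be stable (spectral radius < 1).
assert max(abs(np.linalg.eigvals(A - B @ K))) < 1

# Stationary mean of x_{t+1} = (A - B K) x_t + B b + A mu + d + omega_t.
mu_Kb = np.linalg.solve(np.eye(m) - A + B @ K, B @ b + A @ mu + d)

x = np.zeros(m)
total = np.zeros(m)
T = 100000
for t in range(T):
    u = -K @ x + b                                   # linear policy (no exploration noise)
    x = A @ x + B @ u + A @ mu + d + 0.05 * rng.standard_normal(m)
    total += x
print(np.linalg.norm(total / T - mu_Kb))             # small: ergodic average matches fixed point
```

This is exactly the estimation step used for $\hat\mu_{K,b}$ in Algorithms 3 and 4: average the simulated states and compare against the fixed point of the mean dynamics.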
Following a similar analysis to § 2, it suffices to study Problem C.1 with $t$ sufficiently large, which motivates us to formulate the following general drifted LQR (general D-LQR) problem.

Problem C.2 (General D-LQR). Given a mean-field state $\mu\in\mathbb R^m$, we consider the following formulation,
$$x_{t+1}=Ax_t+Bu_t+A\mu+d+\omega_t,$$
$$\tilde c_\mu(x_t,u_t)=x_t^\top Qx_t+u_t^\top Ru_t+\mu^\top Q\mu+2x_t^\top P\mu,$$
$$\tilde J_\mu(\pi)=\lim_{T\to\infty}\mathbb{E}\Bigl[\frac1T\sum_{t=0}^T\tilde c_\mu(x_t,u_t)\Bigr],$$
where $x_t\in\mathbb R^m$ is the state vector, $u_t\in\mathbb R^k$ is the action vector generated by the policy $\pi$, $\omega_t\in\mathbb R^m$ is an independent random noise term following the Gaussian distribution $N(0,\Psi_\omega)$, and $d\in\mathbb R^m$ is a drift term. We aim to find an optimal policy $\pi^*_\mu$ such that $\tilde J_\mu(\pi^*_\mu)=\inf_{\pi\in\Pi}\tilde J_\mu(\pi)$.

In Problem C.2, the unique optimal policy $\pi^*_\mu$ admits a linear form $\pi^*_\mu(x_t)=-K_{\pi^*_\mu}x_t+b_{\pi^*_\mu}$ (Anderson and Moore, 2007), where the matrix $K_{\pi^*_\mu}\in\mathbb R^{k\times m}$ and the vector $b_{\pi^*_\mu}\in\mathbb R^k$ are the parameters of the policy. It then suffices to find the optimal policy in the class $\Pi$ introduced in (2.1). As in § 3.3, we drop the subscript $\mu$ when we focus on Problem C.2 for a fixed $\mu$. We write $\pi_{K,b}(x)=-Kx+b+\sigma\eta_t$ to emphasize the dependence on $K$ and $b$, and accordingly $\tilde J(K,b)=\tilde J(\pi_{K,b})$. We derive a closed form of the expected total cost $\tilde J(K,b)$ in the following proposition.

Proposition C.3. The expected total cost $\tilde J(K,b)$ in Problem C.2 decomposes as
$$\tilde J(K,b)=\tilde J_1(K)+\tilde J_2(K,b)+\sigma^2\cdot\mathrm{tr}(R)+\mu^\top Q\mu,$$
where $\tilde J_1(K)$ and $\tilde J_2(K,b)$ take the forms
$$\tilde J_1(K)=\mathrm{tr}\bigl((Q+K^\top RK)\Phi_K\bigr)=\mathrm{tr}(P_K\Psi_\epsilon),$$
$$\tilde J_2(K,b)=\begin{pmatrix}\mu_{K,b}\\ b\end{pmatrix}^{\!\top}\begin{pmatrix}Q+K^\top RK & -K^\top R\\ -RK & R\end{pmatrix}\begin{pmatrix}\mu_{K,b}\\ b\end{pmatrix}+2\mu^\top P\mu_{K,b}.$$
Here $\mu_{K,b}$ is defined in (3.2), $\Phi_K$ is defined in (3.3), and $P_K$ is defined in (3.4).

Proof. The proof is similar to that of Proposition B.2, and is therefore omitted.
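The identity $\mathrm{tr}((Q+K^\top RK)\Phi_K)=\mathrm{tr}(P_K\Psi)$ in Proposition C.3 can be checked numerically. The sketch below assumes, consistently with (3.3) and (3.4), that $\Phi_K$ and $P_K$ are the solutions of the discrete Lyapunov equations $\Phi=M\Phi M^\top+\Psi$ and $P=M^\top PM+Q+K^\top RK$ with $M=A-BK$, and computes both by fixed-point iteration (the matrices themselves are synthetic).

```python
import numpy as np

rng = np.random.default_rng(2)
m, k = 3, 2
A = 0.5 * np.eye(m)
B = 0.3 * rng.standard_normal((m, k))
K = 0.1 * rng.standard_normal((k, m))
Q = np.eye(m)
R = np.eye(k)
Psi = 0.1 * np.eye(m)          # noise covariance
M = A - B @ K                  # stable closed-loop matrix

def lyap_fixed_point(M, C, iters=2000):
    """Iterate X <- M X M^T + C; converges to the Lyapunov solution when rho(M) < 1."""
    X = np.zeros_like(C)
    for _ in range(iters):
        X = M @ X @ M.T + C
    return X

Phi = lyap_fixed_point(M, Psi)                    # state covariance Phi_K
P = lyap_fixed_point(M.T, Q + K.T @ R @ K)        # cost-to-go matrix P_K
J1_a = np.trace((Q + K.T @ R @ K) @ Phi)
J1_b = np.trace(P @ Psi)
print(J1_a, J1_b)                                 # the two expressions agree
```

Both traces equal $\sum_{t\ge0}\mathrm{tr}\bigl((Q+K^\top RK)M^t\Psi(M^t)^\top\bigr)$, which is why the two ways of writing $\tilde J_1(K)$ coincide.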
Compared with the form of $J(K,b)$ in (3.5), $\tilde J(K,b)$ contains the additional term $2\mu^\top P\mu_{K,b}$ in $\tilde J_2(K,b)$. Recall that $\mu_{K,b}$ is linear in $b$ by (3.2). Therefore, $2\mu^\top P\mu_{K,b}$ is linear in $b$, which shows that $\tilde J_2(K,b)$ is still strongly convex in $b$. The following proposition formally characterizes this strong convexity.

Proposition C.4. Given any $K$, the function $\tilde J_2(K,b)$ is $\nu_K$-strongly convex in $b$, where $\nu_K=\sigma_{\min}(Y_{1,K}^\top Y_{1,K}+Y_{2,K}^\top Y_{2,K})$ with $Y_{1,K}=R^{1/2}K(I-A+BK)^{-1}B-R^{1/2}$ and $Y_{2,K}=Q^{1/2}(I-A+BK)^{-1}B$. Also, $\tilde J_2(K,b)$ has a $\iota_K$-Lipschitz continuous gradient in $b$, where $\iota_K\le[1-\rho(A-BK)]^{-2}\cdot\bigl(\|B\|_2^2\cdot\|K\|_2^2\cdot\|R\|_2+\|B\|_2^2\cdot\|Q\|_2\bigr)$.

Proof. The proof is similar to that of Proposition 3.3, and is therefore omitted.

We derive an analogue of Proposition 3.4 in the sequel.

Proposition C.5. We define $\tilde b_K=\mathrm{argmin}_b\,\tilde J_2(K,b)$. It holds that
$$\tilde b_K=\bigl(KQ^{-1}(I-A)^\top-R^{-1}B^\top\bigr)\cdot S\cdot\bigl((A\mu+d)+(I-A)Q^{-1}P^\top\mu\bigr)-KQ^{-1}P^\top\mu.$$
Moreover, $\tilde J_2(K,\tilde b_K)$ takes the form
$$\tilde J_2(K,\tilde b_K)=\begin{pmatrix}A\mu+d\\ P^\top\mu\end{pmatrix}^{\!\top}\begin{pmatrix}S & S(I-A)Q^{-1}\\ Q^{-1}(I-A)^\top S & Q^{-1}(I-A)^\top S(I-A)Q^{-1}-Q^{-1}\end{pmatrix}\begin{pmatrix}A\mu+d\\ P^\top\mu\end{pmatrix},$$
which is independent of $K$. Here $S=[(I-A)Q^{-1}(I-A)^\top+BR^{-1}B^\top]^{-1}$.

Proof. The proof is similar to that of Proposition 3.4, and is therefore omitted.

Similar to Problem 2.2, we define the state- and action-value functions as
$$\tilde V_{K,b}(x)=\sum_{t=0}^\infty\bigl\{\mathbb{E}\bigl[\tilde c(x_t,u_t)\,\big|\,x_0=x\bigr]-\tilde J(K,b)\bigr\},$$
$$\tilde Q_{K,b}(x,u)=\tilde c(x,u)-\tilde J(K,b)+\mathbb{E}\bigl[\tilde V_{K,b}(x_1)\,\big|\,x_0=x,\,u_0=u\bigr],$$
where $x_t$ follows the state transition and $u_t$ follows the policy $\pi_{K,b}$ given $x_t$; in other words, $u_t=-Kx_t+b+\sigma\eta_t$ with $\eta_t\sim N(0,I)$.

Similar to Proposition B.1, the following proposition establishes the closed forms of these value functions.

Proposition C.6. The state-value function $\tilde V_{K,b}(x)$ takes the form
$$\tilde V_{K,b}(x)=x^\top P_Kx-\mathrm{tr}(P_K\Phi_K)+2\tilde f_{K,b}^\top(x-\mu_{K,b})-\mu_{K,b}^\top P_K\mu_{K,b},$$
and the action-value function $\tilde Q_{K,b}(x,u)$ takes the form
$$\tilde Q_{K,b}(x,u)=\begin{pmatrix}x\\ u\end{pmatrix}^{\!\top}\Upsilon_K\begin{pmatrix}x\\ u\end{pmatrix}+2\begin{pmatrix}\tilde p_{K,b}\\ \tilde q_{K,b}\end{pmatrix}^{\!\top}\begin{pmatrix}x\\ u\end{pmatrix}-\mathrm{tr}(P_K\Phi_K)-\sigma^2\cdot\mathrm{tr}(R+P_KBB^\top)-b^\top Rb+2b^\top RK\mu_{K,b}$$
$$\qquad-\mu_{K,b}^\top(Q+K^\top RK+P_K)\mu_{K,b}+2\tilde f_{K,b}^\top\bigl((A\mu+d)-\mu_{K,b}\bigr)+(A\mu+d)^\top P_K(A\mu+d)-2\mu^\top P\mu_{K,b}.$$
Here $\Upsilon_K$ is defined in (3.7), and $\tilde p_{K,b}$, $\tilde q_{K,b}$ are defined as
$$\tilde p_{K,b}=A^\top\bigl(P_K\cdot(A\mu+d)+\tilde f_{K,b}\bigr)+P\mu,\qquad \tilde q_{K,b}=B^\top\bigl(P_K\cdot(A\mu+d)+\tilde f_{K,b}\bigr), \qquad (C.1)$$
where $\tilde f_{K,b}=(I-A+BK)^{-\top}\bigl[(A-BK)^\top P_K(Bb+A\mu+d)-K^\top Rb+P\mu\bigr]$.

Proof. The proof is similar to that of Proposition B.1, and is therefore omitted.

The following proposition establishes the gradients of $\tilde J_1(K)$ and $\tilde J_2(K,b)$, respectively.

Proposition C.7. The gradient of $\tilde J_1(K)$ and the gradient of $\tilde J_2(K,b)$ with respect to $b$ take the forms
$$\nabla_K\tilde J_1(K)=2(\Upsilon_K^{22}K-\Upsilon_K^{21})\cdot\Phi_K,\qquad \nabla_b\tilde J_2(K,b)=2\bigl(\Upsilon_K^{22}(-K\mu_{K,b}+b)+\Upsilon_K^{21}\mu_{K,b}+\tilde q_{K,b}\bigr),$$
where $\Upsilon_K$ and $\tilde q_{K,b}$ are defined in (3.7) and (C.1), respectively.

Proof. The proof is similar to that of Proposition B.3, and is therefore omitted.

Equipped with the above results, in parallel to the analysis in § 3, slight modifications of Algorithms 1, 2, and 3 yield analogous actor-critic algorithms for solving both Problem C.1 and Problem C.2, for which all the non-asymptotic convergence results continue to hold. We omit the algorithms and the convergence results here.
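The actor step driven by $\nabla_K\tilde J_1(K)=2(\Upsilon_K^{22}K-\Upsilon_K^{21})\Phi_K$ can be illustrated in the model-based (population) regime. The sketch below assumes the standard LQR forms $\Upsilon_K^{22}=R+B^\top P_KB$ and $\Upsilon_K^{21}=B^\top P_KA$ (an assumption consistent with (3.7), which is not restated in this appendix) and runs plain gradient descent on $\tilde J_1$, whose value should decrease monotonically for a small enough stepsize. All matrices are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
m, k = 3, 2
A = 0.5 * np.eye(m)
B = 0.3 * rng.standard_normal((m, k))
Q = np.eye(m)
R = np.eye(k)
Psi = 0.1 * np.eye(m)          # noise covariance

def lyap(M, C, iters=3000):
    """Fixed-point iteration for X = M X M^T + C, valid when rho(M) < 1."""
    X = np.zeros_like(C)
    for _ in range(iters):
        X = M @ X @ M.T + C
    return X

def J1(K):
    P = lyap((A - B @ K).T, Q + K.T @ R @ K)   # P_K
    return np.trace(P @ Psi)

def grad_K(K):
    M = A - B @ K
    P = lyap(M.T, Q + K.T @ R @ K)             # P_K
    Phi = lyap(M, Psi)                         # Phi_K
    Ups22 = R + B.T @ P @ B                    # assumed form of Upsilon^{22}_K
    Ups21 = B.T @ P @ A                        # assumed form of Upsilon^{21}_K
    return 2 * (Ups22 @ K - Ups21) @ Phi       # gradient of J1 w.r.t. K

K = np.zeros((k, m))
gamma = 0.05
costs = [J1(K)]
for _ in range(200):
    K = K - gamma * grad_K(K)                  # actor (policy gradient) step
    costs.append(J1(K))
print(costs[0], costs[-1])                     # cost decreases monotonically
```

The stepsize satisfies $\gamma\le[\|R\|_2+\|B\|_2^2J_1(K_0)\sigma_{\min}^{-1}(\Psi)]^{-1}$ for this instance, which is the condition under which the descent argument in § D.2 guarantees monotone improvement.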
D Proofs of Theorems

D.1 Proof of Theorem 4.1

We define $\mu^*_{s+1}=\Lambda(\mu_s)$, the mean-field state generated by the optimal policy $\pi_{K^*(\mu_s),b^*(\mu_s)}=\Lambda_1(\mu_s)$ under the current mean-field state $\mu_s$. By Proposition 3.4, the optimal $K^*(\mu)$ is independent of the mean-field state $\mu$; we therefore write $K^*=K^*(\mu)$ hereafter for notational convenience. By (3.2),
$$\mu^*_{s+1}=(I-A+BK^*)^{-1}\cdot\bigl(Bb^*(\mu_s)+A\mu_s+d\bigr).$$
We define
$$\tilde\mu_{s+1}=(I-A+BK_s)^{-1}(Bb_s+A\mu_s+d),$$
the mean-field state generated by the policy $\pi_s$ under the current mean-field state $\mu_s$, where $K_s$ and $b_s$ are the parameters of $\pi_s$. By the triangle inequality,
$$\|\mu_{s+1}-\mu^*\|_2\le\underbrace{\|\mu_{s+1}-\tilde\mu_{s+1}\|_2}_{E_1}+\underbrace{\|\tilde\mu_{s+1}-\mu^*_{s+1}\|_2}_{E_2}+\underbrace{\|\mu^*_{s+1}-\mu^*\|_2}_{E_3}, \qquad (D.1)$$
where $\mu_{s+1}$ is generated by Algorithm 1. We upper bound $E_1$, $E_2$, and $E_3$ in the sequel.

Upper bound of $E_1$. By Theorem B.4, it holds with probability at least $1-\varepsilon^{10}$ that
$$E_1=\|\mu_{s+1}-\tilde\mu_{s+1}\|_2<\varepsilon_s\le\varepsilon/8\cdot2^{-s}, \qquad (D.2)$$
where $\varepsilon_s$ is given in (4.2).

Upper bound of $E_2$. By the triangle inequality,
$$E_2=\bigl\|(I-A+BK_s)^{-1}(Bb_s+A\mu_s+d)-(I-A+BK^*)^{-1}\cdot\bigl(Bb^*(\mu_s)+A\mu_s+d\bigr)\bigr\|_2$$
$$\le\bigl\|Bb^*(\mu_s)+A\mu_s+d\bigr\|_2\cdot\bigl\|\bigl(I-A+BK^*+B(K_s-K^*)\bigr)^{-1}-(I-A+BK^*)^{-1}\bigr\|_2+\bigl\|(I-A+BK_s)^{-1}\bigr\|_2\cdot\|B\|_2\cdot\bigl\|b_s-b^*(\mu_s)\bigr\|_2. \qquad (D.3)$$
By Taylor expansion,
$$\bigl\|\bigl(I-A+BK^*+B(K_s-K^*)\bigr)^{-1}-(I-A+BK^*)^{-1}\bigr\|_2=\bigl\|(I-A+BK^*)^{-1}\bigl[I+(I-A+BK^*)^{-1}B(K_s-K^*)\bigr]^{-1}-(I-A+BK^*)^{-1}\bigr\|_2$$
$$\le2\bigl\|(I-A+BK^*)^{-1}B(K_s-K^*)(I-A+BK^*)^{-1}\bigr\|_2. \qquad (D.4)$$
Meanwhile, by Taylor expansion, it holds with probability at least $1-\varepsilon^{10}$ that
$$\bigl\|(I-A+BK_s)^{-1}\bigr\|_2=\bigl\|(I-A+BK^*)^{-1}\bigl[I+(I-A+BK^*)^{-1}B(K_s-K^*)\bigr]^{-1}\bigr\|_2$$
$$\le\bigl(1-\rho(A-BK^*)\bigr)^{-1}\cdot\bigl(1+\bigl\|(I-A+BK^*)^{-1}B\bigr\|_2\cdot\|K^*-K_s\|_2\bigr)\le2\bigl(1-\rho(A-BK^*)\bigr)^{-2}, \qquad (D.5)$$
where the last inequality comes from Theorem B.4. Plugging (D.4) and (D.5) into (D.3), it holds with probability at least $1-\varepsilon^{10}$ that
$$E_2\le2\bigl\|Bb^*(\mu_s)+A\mu_s+d\bigr\|_2\cdot\bigl(1-\rho(A-BK^*)\bigr)^{-2}\cdot\|B\|_2\cdot\|K_s-K^*\|_2+2\bigl(1-\rho(A-BK^*)\bigr)^{-2}\cdot\|B\|_2\cdot\bigl\|b_s-b^*(\mu_s)\bigr\|_2. \qquad (D.6)$$
By Proposition 3.4,
$$\bigl\|Bb^*(\mu_s)+A\mu_s+d\bigr\|_2\le L_1\cdot\|B\|_2\cdot\|\mu_s\|_2+\|A\|_2\cdot\|\mu_s\|_2+\|d\|_2\le\bigl(L_1\cdot\|B\|_2+\|A\|_2\bigr)\cdot\|\mu_s\|_2+\|d\|_2, \qquad (D.7)$$
where the scalar $L_1$ is defined in Assumption 3.1. Meanwhile, by Theorem B.4, it holds with probability at least $1-\varepsilon^{10}$ that
$$\|K_s-K^*\|_F\le\bigl(\sigma_{\min}^{-1}(\Psi_\epsilon)\cdot\sigma_{\min}^{-1}(R)\cdot\varepsilon_s\bigr)^{1/2},\qquad \bigl\|b_s-b^*(\mu_s)\bigr\|_2\le M_b(\mu_s)\cdot\varepsilon_s^{1/2}, \qquad (D.8)$$
where $M_b(\mu_s)$ is defined in (4.3). Combining (D.6), (D.7), (D.8), and the choice of $\varepsilon_s$ in (4.2), it holds with probability at least $1-\varepsilon^{10}$ that
$$E_2\le\varepsilon/8\cdot2^{-s}. \qquad (D.9)$$

Upper bound of $E_3$. By Proposition 3.2,
$$E_3=\|\mu^*_{s+1}-\mu^*\|_2=\bigl\|\Lambda(\mu_s)-\Lambda(\mu^*)\bigr\|_2\le L_0\cdot\|\mu_s-\mu^*\|_2, \qquad (D.10)$$
where $L_0=L_1L_3+L_2$ by Assumption 3.1. Plugging (D.2), (D.9), and (D.10) into (D.1), we obtain
$$\|\mu_{s+1}-\mu^*\|_2\le L_0\cdot\|\mu_s-\mu^*\|_2+\varepsilon\cdot2^{-s-2}, \qquad (D.11)$$
which holds with probability at least $1-\varepsilon^{10}$.

Following from (D.11) and a union bound argument with $S=O(\log(1/\varepsilon))$, it holds with probability at least $1-\varepsilon^5$ that
$$\|\mu_S-\mu^*\|_2\le L_0^S\cdot\|\mu_0-\mu^*\|_2+\varepsilon/2,$$
where we use the fact that $L_0<1$ by Assumption 3.1. By the choice of $S$ in (4.1), it further holds with probability at least $1-\varepsilon^6$ that
$$\|\mu_S-\mu^*\|_2\le\varepsilon. \qquad (D.12)$$
By Theorem B.4 and the choice of $\varepsilon_s$ in (4.2), it holds with probability at least $1-\varepsilon^5$ that
$$\|K_S-K^*\|_F=\bigl\|K_S-K^*(\mu_S)\bigr\|_F\le\bigl(\sigma_{\min}^{-1}(\Psi_\epsilon)\cdot\sigma_{\min}^{-1}(R)\cdot\varepsilon_S\bigr)^{1/2}\le\varepsilon. \qquad (D.13)$$
Meanwhile, by the triangle inequality and the choice of $\varepsilon_s$ in (4.2), it holds with probability at least $1-\varepsilon^5$ that
$$\|b_S-b^*\|_2\le\bigl\|b_S-b^*(\mu_S)\bigr\|_2+\bigl\|b^*(\mu_S)-b^*\bigr\|_2\le M_b(\mu_S)\cdot\varepsilon_S^{1/2}+L_1\cdot\|\mu_S-\mu^*\|_2\le(1+L_1)\cdot\varepsilon, \qquad (D.14)$$
where the second inequality comes from Theorem B.4 and Proposition 3.4, and the last inequality comes from (D.12). By (D.12), (D.13), and (D.14), we conclude the proof of the theorem.

D.2 Proof of Theorem B.4

Proof. We first show that $J_1(K_N)-J_1(K^*)<\varepsilon/2$ with high probability, and then that $J_2(K_N,b_H)-J_2(K^*,b^*)<\varepsilon/2$ with high probability. Together these give
$$J(K_N,b_N)-J(K^*,b^*)=J_1(K_N)+J_2(K_N,b_H)-J_1(K^*)-J_2(K^*,b^*)<\varepsilon$$
with high probability, which proves Theorem B.4.

Part 1. We show that $J_1(K_N)-J_1(K^*)<\varepsilon/2$ with high probability. We first bound $J_1(K_1)-J_1(K_2)$ for any $K_1$ and $K_2$. By Proposition B.2, $J_1(K)$ takes the form
$$J_1(K)=\mathrm{tr}(P_K\Psi_\epsilon)=\mathbb{E}_{y\sim N(0,\Psi_\epsilon)}\bigl(y^\top P_Ky\bigr). \qquad (D.15)$$
The following lemma calculates $y^\top P_{K_1}y-y^\top P_{K_2}y$ for any $K_1$ and $K_2$.

Lemma D.1. Assume that $\rho(A-BK_1)<1$ and $\rho(A-BK_2)<1$.
For any state vector $y$, denote by $\{y_t\}_{t\ge0}$ the sequence generated by the state transition $y_{t+1}=(A-BK_2)y_t$ with initial state $y_0=y$. It holds that
$$y^\top P_{K_2}y-y^\top P_{K_1}y=\sum_{t\ge0}D_{K_1,K_2}(y_t),$$
where
$$D_{K_1,K_2}(y)=2y^\top(K_2-K_1)^\top(\Upsilon_{K_1}^{22}K_1-\Upsilon_{K_1}^{21})y+y^\top(K_2-K_1)^\top\Upsilon_{K_1}^{22}(K_2-K_1)y.$$
Here $\Upsilon_K$ is defined in (3.7).

Proof. See § F.1 for a detailed proof.

The following lemma shows that $J_1(K)$ is gradient dominated.

Lemma D.2. Let $K^*$ be the optimal parameter and $K$ be a parameter such that $J_1(K)<\infty$. Then it holds that
$$J_1(K)-J_1(K^*)\le\sigma_{\min}^{-1}(R)\cdot\|\Phi_{K^*}\|_2\cdot\mathrm{tr}\bigl((\Upsilon_K^{22}K-\Upsilon_K^{21})^\top(\Upsilon_K^{22}K-\Upsilon_K^{21})\bigr), \qquad (D.16)$$
$$J_1(K)-J_1(K^*)\ge\sigma_{\min}(\Psi_\omega)\cdot\|\Upsilon_K^{22}\|_2^{-1}\cdot\mathrm{tr}\bigl((\Upsilon_K^{22}K-\Upsilon_K^{21})^\top(\Upsilon_K^{22}K-\Upsilon_K^{21})\bigr). \qquad (D.17)$$

Proof. See § F.2 for a detailed proof.

Recall that in Algorithm 2 the parameter $K$ is updated via
$$K_{n+1}=K_n-\gamma\cdot(\hat\Upsilon_{K_n}^{22}K_n-\hat\Upsilon_{K_n}^{21}), \qquad (D.18)$$
where $\hat\Upsilon_{K_n}$ is the output of Algorithm 3. We upper bound $|J_1(K_{n+1})-J_1(K^*)|$ in the sequel. First, we show that if $J_1(K_n)-J_1(K^*)\ge\varepsilon/2$ holds for all $n\le N$, then
$$J_1(K_N)\le J_1(K_{N-1})\le\cdots\le J_1(K_0), \qquad (D.19)$$
which holds with probability at least $1-\varepsilon^{13}$. We prove (D.19) by mathematical induction. Suppose that
$$J_1(K_n)\le J_1(K_{n-1})\le\cdots\le J_1(K_0), \qquad (D.20)$$
which holds trivially for $n=0$. In what follows, we define $\tilde K_{n+1}$ as
$$\tilde K_{n+1}=K_n-\gamma\cdot(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21}), \qquad (D.21)$$
where $\Upsilon_{K_n}$ is given in (3.7). By (D.21),
$$J_1(\tilde K_{n+1})-J_1(K_n)=\mathbb{E}_{y\sim N(0,\Psi_\epsilon)}\bigl[y^\top(P_{\tilde K_{n+1}}-P_{K_n})y\bigr]$$
$$=-2\gamma\cdot\mathrm{tr}\bigl(\Phi_{\tilde K_{n+1}}\cdot(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})^\top(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})\bigr)+\gamma^2\cdot\mathrm{tr}\bigl(\Phi_{\tilde K_{n+1}}\cdot(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})^\top\Upsilon_{K_n}^{22}(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})\bigr)$$
$$\le-2\gamma\cdot\mathrm{tr}\bigl(\Phi_{\tilde K_{n+1}}\cdot(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})^\top(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})\bigr)+\gamma^2\cdot\|\Upsilon_{K_n}^{22}\|_2\cdot\mathrm{tr}\bigl(\Phi_{\tilde K_{n+1}}\cdot(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})^\top(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})\bigr), \qquad (D.22)$$
where the first equality comes from (D.15), the second equality comes from Lemma D.1, and the last inequality comes from the trace inequality. By the definition of $\Upsilon_K$ in (3.7), we obtain
$$\|\Upsilon_{K_n}^{22}\|_2\le\|R\|_2+\|B\|_2^2\cdot\|P_{K_n}\|_2\le\|R\|_2+\|B\|_2^2\cdot J_1(K_n)\cdot\sigma_{\min}^{-1}(\Psi_\epsilon)\le\|R\|_2+\|B\|_2^2\cdot J_1(K_0)\cdot\sigma_{\min}^{-1}(\Psi_\epsilon), \qquad (D.23)$$
where the second inequality comes from Proposition B.2. Plugging (D.23) and the choice of stepsize $\gamma\le[\|R\|_2+\|B\|_2^2\cdot J_1(K_0)\cdot\sigma_{\min}^{-1}(\Psi_\epsilon)]^{-1}$ into (D.22), we obtain
$$J_1(\tilde K_{n+1})-J_1(K_n)\le-\gamma\cdot\mathrm{tr}\bigl(\Phi_{\tilde K_{n+1}}\cdot(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})^\top(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})\bigr)$$
$$\le-\gamma\cdot\sigma_{\min}(\Psi_\epsilon)\cdot\mathrm{tr}\bigl((\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})^\top(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})\bigr)\le-\gamma\cdot\sigma_{\min}(\Psi_\epsilon)\cdot\sigma_{\min}(R)\cdot\|\Phi_{K^*}\|_2^{-1}\cdot\bigl(J_1(K_n)-J_1(K^*)\bigr)<0, \qquad (D.24)$$
where the last inequality comes from Lemma D.2. The following lemma upper bounds $|J_1(\tilde K_{n+1})-J_1(K_{n+1})|$.

Lemma D.3. Assume that $J_1(K_n)\le J_1(K_0)$. It holds with probability at least $1-\varepsilon^{15}$ that
$$\bigl|J_1(\tilde K_{n+1})-J_1(K_{n+1})\bigr|\le\gamma\cdot\sigma_{\min}(\Psi_\epsilon)\cdot\sigma_{\min}(R)\cdot\|\Phi_{K^*}\|_2^{-1}\cdot\varepsilon/4,$$
where $K_{n+1}$ and $\tilde K_{n+1}$ are defined in (D.18) and (D.21), respectively.

Proof. See § F.3 for a detailed proof.

Combining (D.24) and Lemma D.3, if $J_1(K_n)-J_1(K^*)\ge\varepsilon/2$, it holds with probability at least $1-\varepsilon^{15}$ that
$$J_1(K_{n+1})-J_1(K_n)\le J_1(\tilde K_{n+1})-J_1(K_n)+\bigl|J_1(\tilde K_{n+1})-J_1(K_{n+1})\bigr|\le-\gamma\cdot\sigma_{\min}(\Psi_\epsilon)\cdot\sigma_{\min}(R)\cdot\|\Phi_{K^*}\|_2^{-1}\cdot\varepsilon/4<0. \qquad (D.25)$$
Combining (D.20) and (D.25), it holds with probability at least $1-\varepsilon^{15}$ that
$$J_1(K_{n+1})\le J_1(K_n)\le\cdots\le J_1(K_0).$$
Finally, by a union bound argument and the choice of $N$ in Theorem B.4, if $J_1(K_n)-J_1(K^*)\ge\varepsilon/2$ holds for all $n\le N$, then $J_1(K_N)\le J_1(K_{N-1})\le\cdots\le J_1(K_0)$ holds with probability at least $1-\varepsilon^{13}$. This completes the proof of (D.19).

Combining (D.24) and (D.25), for $J_1(K_n)-J_1(K^*)\ge\varepsilon/2$ we have
$$J_1(K_{n+1})-J_1(K^*)\le\bigl(1-\gamma\cdot\sigma_{\min}(\Psi_\epsilon)\cdot\sigma_{\min}(R)\cdot\|\Phi_{K^*}\|_2^{-1}\bigr)\cdot\bigl(J_1(K_n)-J_1(K^*)\bigr),$$
which holds with probability at least $1-\varepsilon^{13}$. Meanwhile, by a union bound argument and the choice of $N$ in Theorem B.4, it holds with probability at least $1-\varepsilon^{11}$ that
$$J_1(K_N)-J_1(K^*)\le\varepsilon/2. \qquad (D.26)$$
The following lemma upper bounds $\|K_N-K^*\|_F$.

Lemma D.4. For any $K$, we have
$$\|K-K^*\|_F^2\le\sigma_{\min}^{-1}(\Psi_\epsilon)\cdot\sigma_{\min}^{-1}(R)\cdot\bigl(J_1(K)-J_1(K^*)\bigr).$$

Proof. See § F.4 for a detailed proof.

Combining (D.26) and Lemma D.4, we have
$$\|K_N-K^*\|_F\le\bigl(\sigma_{\min}^{-1}(\Psi_\epsilon)\cdot\sigma_{\min}^{-1}(R)\cdot\varepsilon/2\bigr)^{1/2}, \qquad (D.27)$$
which holds with probability at least $1-\varepsilon^{11}$.

Part 2. We show that $J_2(K_N,b_H)-J_2(K^*,b^*)<\varepsilon/2$ with high probability. By Proposition 3.4, $J_2(K^*,b^*)=J_2(K_N,b_{K_N})$. Therefore, it suffices to show that $J_2(K_N,b_H)-J_2(K_N,b_{K_N})<\varepsilon/2$.
First, we show that if $J_2(K_N,b_h)-J_2(K_N,b_{K_N})\ge\varepsilon/2$ holds for all $h\le H$, then
$$J_2(K_N,b_H)\le J_2(K_N,b_{H-1})\le\cdots\le J_2(K_N,b_1)\le J_2(K_N,b_0), \qquad (D.28)$$
which holds with probability at least $1-\varepsilon^{13}$. We prove (D.28) by mathematical induction. Suppose that
$$J_2(K_N,b_h)\le J_2(K_N,b_{h-1})\le\cdots\le J_2(K_N,b_0). \qquad (D.29)$$
Recall that in Algorithm 2 the parameter $b$ is updated via
$$b_{h+1}=b_h-\gamma_b\cdot\hat\nabla_bJ_2(K_N,b_h). \qquad (D.30)$$
Here
$$\hat\nabla_bJ_2(K_N,b_h)=\hat\Upsilon_{K_N}^{22}(-K_N\hat\mu_{K_N,b_h}+b_h)+\hat\Upsilon_{K_N}^{21}\hat\mu_{K_N,b_h}+\hat q_{K_N,b_h}, \qquad (D.31)$$
where $\hat\Upsilon_{K_N}$ and $\hat q_{K_N,b_h}$ are the outputs of Algorithm 3. We define $\tilde b_{h+1}$ as
$$\tilde b_{h+1}=b_h-\gamma_b\cdot\nabla_bJ_2(K_N,b_h). \qquad (D.32)$$
Here
$$\nabla_bJ_2(K_N,b_h)=\Upsilon_{K_N}^{22}(-K_N\mu_{K_N,b_h}+b_h)+\Upsilon_{K_N}^{21}\mu_{K_N,b_h}+q_{K_N,b_h}, \qquad (D.33)$$
where $\Upsilon_{K_N}$ and $q_{K_N,b_h}$ are defined in (3.7). We upper bound $J_2(K_N,b_{h+1})-J_2(K_N,b_{K_N})$ in the sequel. By (D.32) and Proposition 3.3,
$$J_2(K_N,\tilde b_{h+1})-J_2(K_N,b_h)\le-\gamma_b/2\cdot\bigl\|\nabla_bJ_2(K_N,b_h)\bigr\|_2^2\le-\nu_{K_N}\cdot\gamma_b\cdot\bigl(J_2(K_N,b_h)-J_2(K_N,b_{K_N})\bigr)\le-\nu_{K_N}\cdot\gamma_b\cdot\varepsilon<0, \qquad (D.34)$$
where $\nu_{K_N}$ is specified in Proposition 3.3. The following lemma upper bounds $|J_2(K_N,b_{h+1})-J_2(K_N,\tilde b_{h+1})|$.

Lemma D.5. Assume that $J_2(K_N,b_h)\le J_2(K_N,b_0)$. It holds with probability at least $1-\varepsilon^{15}$ that
$$\bigl|J_2(K_N,b_{h+1})-J_2(K_N,\tilde b_{h+1})\bigr|\le\nu_{K_N}\cdot\gamma_b\cdot\varepsilon/2,$$
where $b_{h+1}$ and $\tilde b_{h+1}$ are defined in (D.30) and (D.32), respectively.

Proof. See § F.5 for a detailed proof.

Combining (D.34) and Lemma D.5, if $J_2(K_N,b_h)-J_2(K_N,b_{K_N})\ge\varepsilon$, it holds with probability at least $1-\varepsilon^{15}$ that
$$J_2(K_N,b_{h+1})-J_2(K_N,b_h)\le J_2(K_N,\tilde b_{h+1})-J_2(K_N,b_h)+\bigl|J_2(K_N,b_{h+1})-J_2(K_N,\tilde b_{h+1})\bigr|\le-\nu_{K_N}\cdot\gamma_b\cdot\varepsilon/2<0. \qquad (D.35)$$
Combining (D.29) and (D.35), it holds with probability at least $1-\varepsilon^{15}$ that
$$J_2(K_N,b_{h+1})\le J_2(K_N,b_h)\le\cdots\le J_2(K_N,b_0).$$
By a union bound argument and the choice of $H$ in Theorem B.4, if $J_2(K_N,b_h)-J_2(K_N,b_{K_N})\ge\varepsilon$ holds for all $h\le H$, then (D.28) holds with probability at least $1-\varepsilon^{13}$, which finishes the proof of (D.28).

Combining (D.34) and Lemma D.5, for $J_2(K_N,b_h)-J_2(K_N,b_{K_N})\ge\varepsilon/2$ we have
$$J_2(K_N,b_{h+1})-J_2(K_N,b_{K_N})\le(1-\nu_{K_N}\cdot\gamma_b)\cdot\bigl(J_2(K_N,b_h)-J_2(K_N,b_{K_N})\bigr),$$
which holds with probability at least $1-\varepsilon^{13}$. Meanwhile, by a union bound argument and the choice of $H$ in Theorem B.4, it holds with probability at least $1-\varepsilon^{11}$ that
$$J_2(K_N,b_H)-J_2(K_N,b_{K_N})\le\varepsilon/2. \qquad (D.36)$$
By Proposition 3.3 and (D.36), it holds with probability at least $1-\varepsilon^{11}$ that
$$\|b_H-b_{K_N}\|_2\le(2\varepsilon/\nu_{K^*})^{1/2}. \qquad (D.37)$$
Following from Proposition 3.4, we have
$$b_{K_N}-b^*=(K_N-K^*)Q^{-1}(I-A)^\top\cdot\bigl[(I-A)Q^{-1}(I-A)^\top+BR^{-1}B^\top\bigr]^{-1}\cdot(A\mu+d). \qquad (D.38)$$
Combining (D.27), (D.37), and (D.38), it holds with probability at least $1-\varepsilon^{10}$ that
$$\|b_H-b^*\|_2\le M_b\cdot\varepsilon^{1/2},$$
where
$$M_b(\mu)=4\Bigl\|Q^{-1}(I-A)^\top\cdot\bigl[(I-A)Q^{-1}(I-A)^\top+BR^{-1}B^\top\bigr]^{-1}\cdot(A\mu+d)\Bigr\|_2\cdot\bigl(\nu_{K^*}^{-1}+\sigma_{\min}^{-1}(\Psi_\epsilon)\cdot\sigma_{\min}^{-1}(R)\bigr)^{1/2}.$$
This finishes the proof of the theorem.

D.3 Proof of Theorem B.8

Proof.
We follow the proof of Theorem 4.2 in Yang et al. (2019), which considers LQR without drift terms; since our setting requires a more delicate analysis, we present the proof here.

Part 1. We denote by $\hat\zeta$ and $\hat\xi$ the primal and dual variables generated by Algorithm 3. We define the primal-dual gap of (B.11) as
$$\mathrm{gap}(\hat\zeta,\hat\xi)=\max_{\xi\in\mathcal V_\xi}F(\hat\zeta,\xi)-\min_{\zeta\in\mathcal V_\zeta}F(\zeta,\hat\xi). \qquad (D.39)$$
In the sequel, we upper bound $\|\hat\alpha_{K,b}-\alpha_{K,b}\|_2$ using (D.39). We define $\zeta_{K,b}$ and $\xi(\zeta)$ as
$$\zeta_{K,b}=\bigl(J(K,b),\alpha_{K,b}^\top\bigr)^\top,\qquad \xi(\zeta)=\mathrm{argmax}_\xi\,F(\zeta,\xi). \qquad (D.40)$$
Following from (B.12), we know that
$$\xi^1(\zeta)=\zeta^1-J(K,b),\qquad \xi^2(\zeta)=\mathbb{E}_{\pi_{K,b}}[\psi(x,u)]\zeta^1+\Theta_{K,b}\zeta^2-\mathbb{E}_{\pi_{K,b}}[c(x,u)\psi(x,u)]. \qquad (D.41)$$
The following lemma shows that $\zeta_{K,b}\in\mathcal V_\zeta$ and $\xi(\zeta)\in\mathcal V_\xi$ for any $\zeta\in\mathcal V_\zeta$.

Lemma D.6. Under the assumptions of Theorem B.8, it holds that $\zeta_{K,b}=(J(K,b),\alpha_{K,b}^\top)^\top\in\mathcal V_\zeta$. Also, for any $\zeta\in\mathcal V_\zeta$, the vector $\xi(\zeta)$ defined in (D.40) satisfies $\xi(\zeta)\in\mathcal V_\xi$.

Proof. See § F.6 for a detailed proof.

By (B.12), we know that $\nabla_\zeta F(\zeta_{K,b},0)=0$ and $\nabla_\xi F(\zeta_{K,b},0)=0$. Combined with Lemma D.6, this shows that $(\zeta_{K,b},0)$ is a saddle point of the function $F(\zeta,\xi)$ defined in (B.11). Following from (D.39),
$$\bigl\|\mathbb{E}_{\pi_{K,b}}[\psi(x,u)]\hat\zeta^1+\Theta_{K,b}\hat\zeta^2-\mathbb{E}_{\pi_{K,b}}[c(x,u)\psi(x,u)]\bigr\|_2^2+\bigl(\hat\zeta^1-J(K,b)\bigr)^2=F\bigl(\hat\zeta,\xi(\hat\zeta)\bigr)=\max_{\xi\in\mathcal V_\xi}F(\hat\zeta,\xi)=\mathrm{gap}(\hat\zeta,\hat\xi)+\min_{\zeta\in\mathcal V_\zeta}F(\zeta,\hat\xi), \qquad (D.42)$$
where the first equality comes from (D.41), and the second equality comes from the fact that $\xi(\hat\zeta)=\mathrm{argmax}_{\xi\in\mathcal V_\xi}F(\hat\zeta,\xi)$ by (D.40) and Lemma D.6. We upper bound the right-hand side of (D.42) and lower bound the left-hand side of (D.42) in the sequel.

As for the right-hand side of (D.42), it holds for any $\xi\in\mathcal V_\xi$ that
$$\min_{\zeta\in\mathcal V_\zeta}F(\zeta,\xi)\le\min_{\zeta\in\mathcal V_\zeta}\max_{\xi\in\mathcal V_\xi}F(\zeta,\xi)=\min_{\zeta\in\mathcal V_\zeta}F\bigl(\zeta,\xi(\zeta)\bigr)=\frac12\min_{\zeta\in\mathcal V_\zeta}\Bigl\{\bigl\|\mathbb{E}_{\pi_{K,b}}[\psi(x,u)]\zeta^1+\Theta_{K,b}\zeta^2-\mathbb{E}_{\pi_{K,b}}[c(x,u)\psi(x,u)]\bigr\|_2^2+\bigl(\zeta^1-J(K,b)\bigr)^2\Bigr\}=0, \qquad (D.43)$$
where the first equality comes from the fact that $\xi(\zeta)=\mathrm{argmax}_{\xi\in\mathcal V_\xi}F(\zeta,\xi)$ by (D.40) and Lemma D.6, the second equality comes from (D.41), and the last equality holds by taking $\zeta=\zeta_{K,b}\in\mathcal V_\zeta$. Meanwhile, we lower bound the left-hand side of (D.42) as
$$\bigl\|\mathbb{E}_{\pi_{K,b}}[\psi(x,u)]\hat\zeta^1+\Theta_{K,b}\hat\zeta^2-\mathbb{E}_{\pi_{K,b}}[c(x,u)\psi(x,u)]\bigr\|_2^2+\bigl(\hat\zeta^1-J(K,b)\bigr)^2=\bigl\|\widetilde\Theta_{K,b}(\hat\zeta-\zeta_{K,b})\bigr\|_2^2\ge\lambda_K^2\cdot\|\hat\zeta-\zeta_{K,b}\|_2^2\ge\lambda_K^2\cdot\|\hat\alpha_{K,b}-\alpha_{K,b}\|_2^2, \qquad (D.44)$$
where the first equality comes from the definition of $\widetilde\Theta_{K,b}$ in (B.9), and the first inequality comes from Proposition B.6. Here $\lambda_K$ is defined in Proposition B.6. Combining (D.42), (D.43), and (D.44), it holds that
$$\|\hat\alpha_{K,b}-\alpha_{K,b}\|_2^2\le\lambda_K^{-2}\cdot\mathrm{gap}(\hat\zeta,\hat\xi), \qquad (D.45)$$
which finishes the proof of this part.

Part 2. We now upper bound $\mathrm{gap}(\hat\zeta,\hat\xi)$. We write $\tilde z_t=(\tilde x_t^\top,\tilde u_t^\top)^\top$ for $t\in[\tilde T]$, where $\tilde x_t$ and $\tilde u_t$ are generated in Line 6 of Algorithm 3. Following from the state transition in Problem 2.1 and the form of the linear policy, $\{\tilde z_t\}_{t\in[\tilde T]}$ follows the transition
$$\tilde z_{t+1}=L\tilde z_t+\nu+\delta_t, \qquad (D.46)$$
where
$$\nu=\begin{pmatrix}A\mu+d\\ -K(A\mu+d)+b\end{pmatrix},\qquad \delta_t=\begin{pmatrix}\omega_t\\ -K\omega_t+\sigma\eta\end{pmatrix},\qquad L=\begin{pmatrix}A & B\\ -KA & -KB\end{pmatrix}.$$
Note that we have
$$L=\begin{pmatrix}A & B\\ -KA & -KB\end{pmatrix}=\begin{pmatrix}I\\ -K\end{pmatrix}\begin{pmatrix}A & B\end{pmatrix}.$$
Then, by the property of the spectral radius, it holds that
$$\rho(L)=\rho\Bigl(\begin{pmatrix}A & B\end{pmatrix}\begin{pmatrix}I\\ -K\end{pmatrix}\Bigr)=\rho(A-BK)<1.$$
Thus, the Markov chain generated by (D.46) admits a unique stationary distribution $N(\mu_z,\Sigma_z)$, where
$$\mu_z=(I-L)^{-1}\nu,\qquad \Sigma_z=L\Sigma_zL^\top+\begin{pmatrix}\Psi_\omega & -\Psi_\omega K^\top\\ -K\Psi_\omega & K\Psi_\omega K^\top+\sigma^2I\end{pmatrix}. \qquad (D.47)$$
(D.47) The follo wing lemma c hara cterizes the av erage b µ z = 1 / e T · e T X t =1 e z t . (D.48) Lemma D.7. It holds that b µ z ∼ N  µ z + 1 e T µ e T , 1 e T e Σ e T  , where k µ e T k 2 ≤ M µ · (1 − ρ ) − 2 · k µ z k 2 and k e Σ e T k F ≤ M Σ · (1 − ρ ) − 1 · k Σ z k F . Here M µ and M Σ are p ositiv e absolute constan ts. Moreo ver, it holds with probabilit y at least 1 − e T − 6 that k b µ z − µ z k 2 ≤ log e T e T 1 / 4 · (1 − ρ ) − 2 · p o ly  k Φ K k 2 , k K k F , k b k 2 , k µ k 2  . Pr o of. See § F.7 for a detailed pro of . Lemma D.7 gives that k b µ K,b − µ K,b k 2 ≤ log e T e T 1 / 4 · (1 − ρ ) − 2 · p oly  k Φ K k 2 , k K k F , k b k 2 , k µ k 2  , whic h holds with probability at least 1 − e T − 6 . 47 W e now apply a truncation argumen t to sho w that gap( b ζ , b ξ ) is upp er b ounded. W e define the ev en t E in the sequel. F ollowing from Lemma D.7 , it holds for an y z ∼ N ( µ z , Σ z ) that z − b µ z + 1 / e T · µ e T ∼ N (0 , Σ z + 1 / e T · e Σ e T ) . By Lemma G.3 , there exists a p ositiv e a bsolute constan t C 0 suc h that P h   k z − b µ z + 1 / e T · µ e T k 2 2 − tr( e Σ z )   > τ i ≤ 2 exp h − C 0 · min  τ 2 k e Σ z k − 2 F , τ k e Σ z k − 1 2  i , (D .49) where w e write e Σ z = Σ z + 1 / e T · e Σ e T for notational con v enience. By taking τ = C 1 · log T · k e Σ z k F in ( D.49 ) fo r a sufficien tly large p ositiv e absolute constan t C 1 , it holds that P h   k z − b µ z + 1 / e T · µ e T k 2 2 − tr( e Σ z )   > C 1 · log T · k e Σ z k F i ≤ T − 6 . (D.50) W e define t he ev ent E t, 1 for an y t ∈ [ T ] as E t, 1 = n   k z t − b µ z + 1 / e T · µ e T k 2 2 − tr( e Σ z )   ≤ C 1 · log T · k e Σ z k F o . Then b y ( D.50 ), it holds for an y t ∈ [ T ] that P ( E t, 1 ) ≥ 1 − T − 6 . (D.51) Also, w e define E 1 = \ t ∈ [ T ] E t, 1 . (D.52) F ollow ing from a union b ound argumen t and ( D.51 ), it holds that P ( E 1 ) ≥ 1 − T − 5 . 
(D.53)

Also, conditioning on $\mathcal{E}_1$, it holds for sufficiently large $\widetilde{T}$ that
\[
\max_{t \in [T]} \|z_t - \widehat{\mu}_z\|_2^2 \le C_1 \cdot \log T \cdot \|\widetilde{\Sigma}_z\|_F + \mathrm{tr}(\widetilde{\Sigma}_z) + \|1/\widetilde{T} \cdot \mu_{\widetilde{T}}\|_2^2 \le 2\widetilde{C}_1 \cdot \bigl( 1 + M_\Sigma(1-\rho)^{-1}/\widetilde{T}^2 \bigr) \cdot \log T \cdot \|\Sigma_z\|_2 + M_\mu(1-\rho)^{-2}/\widetilde{T}^2 \cdot \|\mu_z\|_2^2 \le C_2 \cdot \log T \cdot \bigl( 1 + \|K\|_F^2 \bigr) \cdot \|\Phi_K\|_2 \cdot (1-\rho)^{-1} + C_3 \cdot \bigl( \|b\|_2^2 + \|\mu\|_2^2 \bigr) \cdot (1-\rho)^{-4} \cdot \widetilde{T}^{-2} \le 2C_2 \cdot \log T \cdot \bigl( 1 + \|K\|_F^2 \bigr) \cdot \|\Phi_K\|_2 \cdot (1-\rho)^{-1}, \tag{D.54}
\]
where $\widetilde{C}_1$, $C_2$, and $C_3$ are positive absolute constants. Here, the first inequality comes from the definition of $\mathcal{E}_1$ in (D.52), the second inequality comes from Lemma D.7, and the third inequality comes from (D.47). Also, we define the event
\[
\mathcal{E}_2 = \bigl\{ \|\widehat{\mu}_z - \mu_z + 1/\widetilde{T} \cdot \mu_{\widetilde{T}}\|_2 \le C_1 \bigr\}. \tag{D.55}
\]
Then by Lemma D.7, we know that
\[
\mathbb{P}(\mathcal{E}_2) \ge 1 - \widetilde{T}^{-6} \tag{D.56}
\]
for $\widetilde{T}$ sufficiently large. We define the event $\mathcal{E}$ as $\mathcal{E} = \mathcal{E}_1 \cap \mathcal{E}_2$. Then following from (D.53), (D.56), and a union bound argument, we know that
\[
\mathbb{P}(\mathcal{E}) \ge 1 - T^{-5} - \widetilde{T}^{-6}.
\]
Now, we define the truncated feature vector $\widetilde{\psi}(x,u) = \widehat{\psi}(x,u)\mathbb{1}_{\mathcal{E}}$, the truncated cost function $\widetilde{c}(x,u) = c(x,u)\mathbb{1}_{\mathcal{E}}$, and the truncated objective function
\[
\widetilde{F}(\zeta, \xi) = \Bigl\{ \mathbb{E}(\widetilde{\psi})\zeta_1 + \mathbb{E}\bigl[ (\widetilde{\psi} - \widetilde{\psi}')\widetilde{\psi}^\top \bigr]\zeta_2 - \mathbb{E}(\widetilde{c}\,\widetilde{\psi}) \Bigr\}^\top \xi_2 + \bigl( \zeta_1 - \mathbb{E}(\widetilde{c}) \bigr) \cdot \xi_1 - \|\xi\|_2^2/2, \tag{D.57}
\]
where we write $\widetilde{\psi} = \widetilde{\psi}(x,u)$ and $\widetilde{c} = \widetilde{c}(x,u)$ for notational convenience. Here the expectation is taken following the policy $\pi_{K,b}$ and the state transition. The following lemma establishes an upper bound of $|F(\zeta,\xi) - \widetilde{F}(\zeta,\xi)|$, where $F(\zeta,\xi)$ and $\widetilde{F}(\zeta,\xi)$ are defined in (B.11) and (D.57), respectively.

Lemma D.8.
It holds with probability at least $1 - \widetilde{T}^{-6}$ that
\[
\bigl| F(\zeta,\xi) - \widetilde{F}(\zeta,\xi) \bigr| \le \Bigl( \frac{1}{2T} + \frac{\log \widetilde{T}}{\widetilde{T}^{1/4}} \Bigr) \cdot (1-\rho)^{-2} \cdot \mathrm{poly}\bigl( \|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0) \bigr).
\]

Proof. See §F.8 for a detailed proof.

Following from (D.39) and Lemma D.8, it holds with probability at least $1 - \widetilde{T}^{-6}$ that
\[
\bigl| \mathrm{gap}(\widehat{\zeta}, \widehat{\xi}) - \widetilde{\mathrm{gap}}(\widehat{\zeta}, \widehat{\xi}) \bigr| \le \Bigl( \frac{1}{2T} + \frac{\log \widetilde{T}}{\widetilde{T}^{1/4}} \Bigr) \cdot (1-\rho)^{-2} \cdot \mathrm{poly}\bigl( \|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0) \bigr), \tag{D.58}
\]
where we define
\[
\widetilde{\mathrm{gap}}(\widehat{\zeta}, \widehat{\xi}) = \max_{\xi \in \mathcal{V}_\xi} \widetilde{F}(\widehat{\zeta}, \xi) - \min_{\zeta \in \mathcal{V}_\zeta} \widetilde{F}(\zeta, \widehat{\xi}).
\]
Therefore, to upper bound $\mathrm{gap}(\widehat{\zeta}, \widehat{\xi})$, we only need to upper bound $\widetilde{\mathrm{gap}}(\widehat{\zeta}, \widehat{\xi})$.

Part 3. We upper bound $\widetilde{\mathrm{gap}}(\widehat{\zeta}, \widehat{\xi})$ in the sequel. We first show that the trajectory generated by the policy $\pi_{K,b}$ and the state transition in Problem 2.2 is $\beta$-mixing.

Lemma D.9. Consider a linear system $y_{t+1} = Dy_t + \vartheta + \upsilon_t$, where $\{y_t\}_{t \ge 0} \subset \mathbb{R}^m$, the matrix $D \in \mathbb{R}^{m \times m}$ satisfies $\rho(D) < 1$, the vector $\vartheta \in \mathbb{R}^m$, and $\upsilon_t \sim N(0, \Sigma)$ is Gaussian noise. We denote by $\varrho_t$ the marginal distribution of $y_t$ for any $t \ge 0$. Meanwhile, assume that the stationary distribution of $\{y_t\}_{t \ge 0}$ is a Gaussian distribution $N((I-D)^{-1}\vartheta, \Sigma_\infty)$, where $\Sigma_\infty$ is the covariance matrix. We define the $\beta$-mixing coefficients for any $n \ge 1$ as
\[
\beta(n) = \sup_{t \ge 0} \mathbb{E}_{y \sim \varrho_t}\Bigl[ \bigl\| \mathbb{P}_{y_n}(\cdot \,|\, y_0 = y) - \mathbb{P}_{N((I-D)^{-1}\vartheta, \Sigma_\infty)}(\cdot) \bigr\|_{\mathrm{TV}} \Bigr].
\]
Then, for any $\rho \in (\rho(D), 1)$, the $\beta$-mixing coefficients satisfy
\[
\beta(n) \le C_{\rho, D, \vartheta} \cdot \bigl( \mathrm{tr}(\Sigma_\infty) + m \cdot (1-\rho)^{-2} \bigr)^{1/2} \cdot \rho^n,
\]
where $C_{\rho, D, \vartheta}$ is a constant that only depends on $\rho$, $D$, and $\vartheta$. We say that the sequence $\{y_t\}_{t \ge 0}$ is $\beta$-mixing with parameter $\rho$.

Proof. See Proposition 3.1 in Tu and Recht (2017) for details.
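The geometric decay promised by Lemma D.9 can be illustrated on the marginal means of such a linear system. The sketch below uses hypothetical matrices $D$, $\vartheta$ and dimension $m$; it tracks only the means (which satisfy $m_{t+1} = Dm_t + \vartheta$, so $m_t - m_\infty = D^t(m_0 - m_\infty)$), not the total-variation distance in the lemma:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 3
D = 0.6 * np.eye(m) + 0.05 * rng.standard_normal((m, m))  # hypothetical stable matrix
theta = rng.standard_normal(m)
assert max(abs(np.linalg.eigvals(D))) < 1                 # rho(D) < 1

# the stationary mean of y_{t+1} = D y_t + theta + upsilon_t is (I - D)^{-1} theta
m_inf = np.linalg.solve(np.eye(m) - D, theta)

# the marginal mean obeys m_{t+1} = D m_t + theta, hence m_t - m_inf = D^t (m_0 - m_inf)
m_t = np.zeros(m)
errs = []
for _ in range(60):
    errs.append(np.linalg.norm(m_t - m_inf))
    m_t = D @ m_t + theta

# geometric decay with rate ||D||_2 < 1 (a crude stand-in for the rho of the lemma)
rate = np.linalg.norm(D, 2)
assert rate < 1
assert all(errs[t + 1] <= rate * errs[t] + 1e-12 for t in range(59))
assert errs[50] < 1e-4 * errs[0]
```

The full lemma controls the whole Gaussian marginal, not just its mean, but the mean recursion already exhibits the $\rho^n$ rate.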
Recall that by (3.1), the sequence $\{x_t\}_{t \ge 0}$ follows
\[
x_{t+1} = (A - BK)x_t + (Bb + A\mu + d) + \epsilon_t, \qquad \epsilon_t \sim N(0, \Psi_\epsilon),
\]
where the matrix $A - BK$ satisfies $\rho(A - BK) < 1$. Therefore, by Lemma D.9, the sequence $\{z_t\}_{t \ge 0}$ is $\beta$-mixing with parameter $\rho \in (\rho(A - BK), 1)$, where $z_t = (x_t^\top, u_t^\top)^\top$. The following lemma upper bounds the primal-dual gap for a convex-concave problem.

Lemma D.10. Let $\mathcal{X}$ and $\mathcal{Y}$ be two compact and convex sets such that $\|x - x'\|_2 \le M$ and $\|y - y'\|_2 \le M$ for any $x, x' \in \mathcal{X}$ and $y, y' \in \mathcal{Y}$. We consider solving the minimax problem
\[
\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} H(x, y) = \mathbb{E}_{\epsilon \sim \varrho_\epsilon}\bigl[ G(x, y; \epsilon) \bigr],
\]
where the objective function $H(x, y)$ is convex in $x$ and concave in $y$. In addition, we assume that the distribution $\varrho_\epsilon$ is $\beta$-mixing with $\beta(n) \le C_\epsilon \cdot \rho^n$, where $C_\epsilon$ is a constant. Meanwhile, we assume that almost surely $G(x, y; \epsilon)$ is $\widetilde{L}_0$-Lipschitz in both $x$ and $y$, the gradient $\nabla_x G(x, y; \epsilon)$ is $\widetilde{L}_1$-Lipschitz in $x$ for any $y \in \mathcal{Y}$, and the gradient $\nabla_y G(x, y; \epsilon)$ is $\widetilde{L}_1$-Lipschitz in $y$ for any $x \in \mathcal{X}$, where $C_\epsilon, \widetilde{L}_0, \widetilde{L}_1 > 1$. Each step of our gradient-based method takes the form
\[
x_{t+1} = \Gamma_{\mathcal{X}}\bigl( x_t - \gamma_{t+1} \cdot \nabla_x G(x_t, y_t; \epsilon_t) \bigr), \qquad y_{t+1} = \Gamma_{\mathcal{Y}}\bigl( y_t + \gamma_{t+1} \cdot \nabla_y G(x_t, y_t; \epsilon_t) \bigr),
\]
where the operators $\Gamma_{\mathcal{X}}$ and $\Gamma_{\mathcal{Y}}$ project the variables back onto $\mathcal{X}$ and $\mathcal{Y}$, respectively, and the stepsizes take the form $\gamma_t = \gamma_0 \cdot t^{-1/2}$ for a constant $\gamma_0 > 0$.
Moreover, let $\widehat{x} = (\sum_{t=1}^T \gamma_t)^{-1}(\sum_{t=1}^T \gamma_t x_t)$ and $\widehat{y} = (\sum_{t=1}^T \gamma_t)^{-1}(\sum_{t=1}^T \gamma_t y_t)$ be the final output of the gradient method after $T$ iterations. Then there exists a positive absolute constant $C$ such that for any $\delta \in (0, 1)$, the primal-dual gap of the minimax problem is upper bounded as
\[
\max_{y \in \mathcal{Y}} H(\widehat{x}, y) - \min_{x \in \mathcal{X}} H(x, \widehat{y}) \le C \cdot \frac{M^2 + \widetilde{L}_0^2 + \widetilde{L}_0\widetilde{L}_1 M}{\log(1/\rho)} \cdot \frac{\log^2 T + \log(1/\delta)}{\sqrt{T}} + C \cdot \frac{C_\epsilon \widetilde{L}_0 M}{T},
\]
which holds with probability at least $1 - \delta$.

Proof. See Theorem 5.4 in Yang et al. (2019) for details.

To use Lemma D.10, we define the function $G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}')$ as
\[
G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}') = \bigl( \widetilde{\psi}\zeta_1 + (\widetilde{\psi} - \widetilde{\psi}')\widetilde{\psi}^\top \zeta_2 - \widetilde{c}\,\widetilde{\psi} \bigr)^\top \xi_2 + (\zeta_1 - \widetilde{c}) \cdot \xi_1 - 1/2 \cdot \|\xi\|_2^2,
\]
where $\widetilde{\psi} = \widetilde{\psi}(x, u)$ and $\widetilde{\psi}' = \widetilde{\psi}(x', u')$. Note that the gradients of $G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}')$ take the form
\[
\nabla_\zeta G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}') = \begin{pmatrix} \widetilde{\psi}^\top \xi_2 + \xi_1 \\ \widetilde{\psi}(\widetilde{\psi} - \widetilde{\psi}')^\top \xi_2 \end{pmatrix}, \qquad \nabla_\xi G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}') = \begin{pmatrix} \zeta_1 - \widetilde{c} - \xi_1 \\ \widetilde{\psi}\zeta_1 + (\widetilde{\psi} - \widetilde{\psi}')\widetilde{\psi}^\top \zeta_2 - \widetilde{c}\,\widetilde{\psi} - \xi_2 \end{pmatrix}.
\]
By Definition B.7 and Lemma D.6, we know that
\[
\bigl\| \nabla_\zeta G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}') \bigr\|_2 \le \mathrm{poly}\bigl( \|K\|_F, J(K_0, b_0) \bigr) \cdot \log^2 T \cdot (1-\rho)^{-2}, \qquad \bigl\| \nabla_\xi G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}') \bigr\|_2 \le \mathrm{poly}\bigl( \|K\|_F, \|\mu\|_2, J(K_0, b_0) \bigr) \cdot \log^2 T \cdot (1-\rho)^{-2}. \tag{D.59}
\]
This gives the Lipschitz constant $\widetilde{L}_0$ in Lemma D.10 for $G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}')$. Also, the Hessians of $G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}')$ take the forms
\[
\nabla^2_{\zeta\zeta} G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}') = 0, \qquad \nabla^2_{\xi\xi} G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}') = -I,
\]
which implies that
\[
\bigl\| \nabla^2_{\zeta\zeta} G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}') \bigr\|_2 = 0, \qquad \bigl\| \nabla^2_{\xi\xi} G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}') \bigr\|_2 = 1. \tag{D.60}
\]
This gives the Lipschitz constant $\widetilde{L}_1$ in Lemma D.10 for $\nabla_\zeta G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}')$ and $\nabla_\xi G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}')$.
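The projected primal-dual scheme of Lemma D.10, with stepsizes $\gamma_t = \gamma_0 t^{-1/2}$ and stepsize-weighted averaging, can be illustrated on a toy convex-concave objective. The objective, the sets, the noise level, and all constants below are hypothetical stand-ins, not the paper's $F$ or $G$:

```python
import numpy as np

rng = np.random.default_rng(2)

# toy convex-concave objective H(x, y) = x^2 + x*y - y^2 on X = Y = [-2, 2];
# the unique saddle point is (0, 0)
def grad_x(x, y): return 2 * x + y
def grad_y(x, y): return x - 2 * y

proj = lambda v: float(np.clip(v, -2.0, 2.0))   # projection onto the box

x, y = 1.5, -1.5
gamma0, T = 0.5, 20000
sx = sy = wsum = 0.0
for t in range(1, T + 1):
    g = gamma0 / np.sqrt(t)                      # gamma_t = gamma_0 * t^{-1/2}
    nx, ny = 0.3 * rng.standard_normal(2)        # noisy (stochastic) gradients
    x_new = proj(x - g * (grad_x(x, y) + nx))    # descent in the min variable
    y_new = proj(y + g * (grad_y(x, y) + ny))    # ascent in the max variable
    x, y = x_new, y_new
    sx += g * x; sy += g * y; wsum += g

x_hat, y_hat = sx / wsum, sy / wsum              # stepsize-weighted averages
assert abs(x_hat) < 0.2 and abs(y_hat) < 0.2     # averages approach the saddle
```

As in the lemma, it is the averaged iterates, not the last iterate, that carry the $O(1/\sqrt{T})$ primal-dual-gap guarantee.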
Moreover, note that (D.54) provides an upper bound of $M$. Combining (D.59), (D.60), and Lemma D.10, it holds with probability at least $1 - T^{-5}$ that
\[
\widetilde{\mathrm{gap}}(\widehat{\zeta}, \widehat{\xi}) \le \mathrm{poly}\bigl( \|K\|_F, \|\mu\|_2, J(K_0, b_0) \bigr) \cdot \frac{\log^6 T}{(1-\rho)^4 \cdot \sqrt{T}}. \tag{D.61}
\]
Combining (D.45), (D.58), and (D.61), we know that
\[
\|\widehat{\alpha}_{K,b} - \alpha_{K,b}\|_2^2 \le \lambda_K^{-2} \cdot \mathrm{poly}_1\bigl( \|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0) \bigr) \cdot \Bigl( \frac{\log^6 T}{T^{1/2} \cdot (1-\rho)^4} + \frac{\log \widetilde{T}}{\widetilde{T}^{1/4} \cdot (1-\rho)^2} \Bigr).
\]
The same bounds hold for $\|\widehat{\Upsilon}_K - \Upsilon_K\|_F^2$, $\|\widehat{p}_{K,b} - p_{K,b}\|_2^2$, and $\|\widehat{q}_{K,b} - q_{K,b}\|_2^2$. We finish the proof of the theorem.

E Proofs of Propositions

E.1 Proof of Proposition 3.2

Proof. We follow proofs similar to those of Theorem 1.1 in Sznitman (1991) and Theorem 3.2 in Bensoussan et al. (2016). Note that for any policy $\pi_{K,b} \in \Pi$, the parameters $K$ and $b$ uniquely determine the policy. We define the following metric on $\Pi$.

Definition E.1. For any $\pi_{K_1, b_1}, \pi_{K_2, b_2} \in \Pi$, we define the metric
\[
\|\pi_{K_1, b_1} - \pi_{K_2, b_2}\|_2 = c_1 \cdot \|K_1 - K_2\|_2 + c_2 \cdot \|b_1 - b_2\|_2,
\]
where $c_1$ and $c_2$ are positive constants.

One can verify that Definition E.1 satisfies the requirements of a metric. We first evaluate the forms of the operators $\Lambda_1(\cdot)$ and $\Lambda_2(\cdot, \cdot)$.

Forms of the operators $\Lambda_1(\cdot)$ and $\Lambda_2(\cdot, \cdot)$. By the definition of $\Lambda_1(\mu)$, which gives the optimal policy under the mean-field state $\mu$, it holds that $\Lambda_1(\mu) = \pi^*_\mu$, where $\pi^*_\mu$ solves Problem 2.2. This gives the form of $\Lambda_1(\cdot)$. We now turn to $\Lambda_2(\mu, \pi)$, which gives the mean-field state $\mu_{\mathrm{new}}$ generated by the policy $\pi$ under the current mean-field state $\mu$. In Problem 2.2, the sequence of states $\{x_t\}_{t \ge 0}$ constitutes a Markov chain, which admits a unique stationary distribution.
Thus, by the state transition in Problem 2.2 and the form of the linear-Gaussian policy, we have
\[
\mu_{\mathrm{new}} = (A - BK^\pi)\mu_{\mathrm{new}} + (Bb^\pi + A\mu + d), \tag{E.1}
\]
where $K^\pi$ and $b^\pi$ are the parameters of the policy $\pi$. By solving (E.1) for $\mu_{\mathrm{new}}$, it holds that
\[
\Lambda_2(\mu, \pi) = \mu_{\mathrm{new}} = (I - A + BK^\pi)^{-1}(Bb^\pi + A\mu + d).
\]
This gives the form of $\Lambda_2(\cdot, \cdot)$. Next, we compute the Lipschitz constants for $\Lambda_1(\cdot)$ and $\Lambda_2(\cdot, \cdot)$.

Lipschitz constant for $\Lambda_1(\cdot)$. By Proposition 3.4, for any $\mu_1, \mu_2 \in \mathbb{R}^m$, the optimal $K^*$ is fixed for Problem 2.2. Therefore, by the form of the optimal $b_K$ given in Proposition 3.4, it holds that
\[
\bigl\| \Lambda_1(\mu_1) - \Lambda_1(\mu_2) \bigr\|_2 \le c_2 \cdot \Bigl\| \bigl( (I - A)Q^{-1}(I - A)^\top + BR^{-1}B^\top \bigr)^{-1} A \Bigr\|_2 \cdot \bigl\| K^* Q^{-1}(I - A)^\top - R^{-1}B^\top \bigr\|_2 \cdot \|\mu_1 - \mu_2\|_2 = c_2 L_1 \cdot \|\mu_1 - \mu_2\|_2, \tag{E.2}
\]
where $L_1$ is defined in Assumption 3.1.

Lipschitz constants for $\Lambda_2(\cdot, \cdot)$. By Proposition 3.4, for any $\mu_1, \mu_2 \in \mathbb{R}^m$, the optimal $K^*$ is fixed for Problem 2.2. Thus, for any $\pi \in \Pi$ such that $\pi$ is an optimal policy under some $\mu \in \mathbb{R}^m$, it holds that
\[
\bigl\| \Lambda_2(\mu_1, \pi) - \Lambda_2(\mu_2, \pi) \bigr\|_2 = \bigl\| (I - A + BK^\pi)^{-1} \cdot A \cdot (\mu_1 - \mu_2) \bigr\|_2 \le \bigl( 1 - \rho(A - BK^*) \bigr)^{-1} \cdot \|A\|_2 \cdot \|\mu_1 - \mu_2\|_2 = L_2 \cdot \|\mu_1 - \mu_2\|_2, \tag{E.3}
\]
where $L_2$ is defined in Assumption 3.1, and $K^\pi = K^*$ is the parameter of the policy $\pi$. Meanwhile, for any mean-field state $\mu \in \mathbb{R}^m$, and any policies $\pi_1, \pi_2 \in \Pi$ that are optimal under some mean-field states $\mu_1, \mu_2$, respectively, we have
\[
\bigl\| \Lambda_2(\mu, \pi_1) - \Lambda_2(\mu, \pi_2) \bigr\|_2 = \bigl\| (I - A + BK^*)^{-1} B \cdot (b^{\pi_1} - b^{\pi_2}) \bigr\|_2 \le \bigl( 1 - \rho(A - BK^*) \bigr)^{-1} \cdot \|B\|_2 \cdot \|b^{\pi_1} - b^{\pi_2}\|_2 = c_2^{-1} L_3 \cdot \|\pi_1 - \pi_2\|_2, \tag{E.4}
\]
where in the last equality, we use the fact that $K^{\pi_1} = K^{\pi_2} = K^*$ by Proposition 3.4.
Here $L_3$ is defined in Assumption 3.1, and $b^{\pi_1}$ and $b^{\pi_2}$ are the parameters of the policies $\pi_1$ and $\pi_2$. Now we show that the operator $\Lambda(\cdot)$ is a contraction. For any $\mu_1, \mu_2 \in \mathbb{R}^m$, it holds that
\[
\bigl\| \Lambda(\mu_1) - \Lambda(\mu_2) \bigr\|_2 = \bigl\| \Lambda_2\bigl(\mu_1, \Lambda_1(\mu_1)\bigr) - \Lambda_2\bigl(\mu_2, \Lambda_1(\mu_2)\bigr) \bigr\|_2 \le \bigl\| \Lambda_2\bigl(\mu_1, \Lambda_1(\mu_1)\bigr) - \Lambda_2\bigl(\mu_1, \Lambda_1(\mu_2)\bigr) \bigr\|_2 + \bigl\| \Lambda_2\bigl(\mu_1, \Lambda_1(\mu_2)\bigr) - \Lambda_2\bigl(\mu_2, \Lambda_1(\mu_2)\bigr) \bigr\|_2 \le c_2^{-1} L_3 \cdot \bigl\| \Lambda_1(\mu_1) - \Lambda_1(\mu_2) \bigr\|_2 + L_2 \cdot \|\mu_1 - \mu_2\|_2 \le c_2^{-1} L_3 \cdot c_2 L_1 \cdot \|\mu_1 - \mu_2\|_2 + L_2 \cdot \|\mu_1 - \mu_2\|_2 = (L_1 L_3 + L_2) \cdot \|\mu_1 - \mu_2\|_2,
\]
where the first inequality comes from the triangle inequality, the second inequality comes from (E.3) and (E.4), and the last inequality comes from (E.2). By Assumption 3.1, we know that $L_0 = L_1 L_3 + L_2 < 1$, which shows that the operator $\Lambda(\cdot)$ is a contraction. Moreover, by the Banach fixed-point theorem, we obtain that $\Lambda(\cdot)$ has a unique fixed point, which gives the unique equilibrium pair of Problem 2.1. We finish the proof of the proposition.

E.2 Proof of Proposition 3.4

Proof. By the definition of $J_2(K, b)$ in (3.6) and the definition of $\mu_{K,b}$ in (3.2), the problem $\min_b J_2(K, b)$ is equivalent to the following constrained problem,
\[
\min_{\mu, b} \; \begin{pmatrix} \mu \\ b \end{pmatrix}^\top \begin{pmatrix} Q + K^\top RK & -K^\top R \\ -RK & R \end{pmatrix} \begin{pmatrix} \mu \\ b \end{pmatrix} \quad \text{s.t.} \quad (I - A + BK)\mu - (Bb + A\mu + d) = 0. \tag{E.5}
\]
Following from the KKT conditions of (E.5), it holds that
\[
2M_K \begin{pmatrix} \mu \\ b \end{pmatrix} + N_K \lambda = 0, \qquad N_K^\top \begin{pmatrix} \mu \\ b \end{pmatrix} + A\mu + d = 0, \tag{E.6}
\]
where
\[
M_K = \begin{pmatrix} Q + K^\top RK & -K^\top R \\ -RK & R \end{pmatrix}, \qquad N_K = \begin{pmatrix} -(I - A + BK)^\top \\ B^\top \end{pmatrix}.
\]
By solving (E.6), the minimizer of (E.5) takes the form
\[
\begin{pmatrix} \mu_{K, b_K} \\ b_K \end{pmatrix} = -M_K^{-1} N_K (N_K^\top M_K^{-1} N_K)^{-1} (A\mu + d). \tag{E.7}
\]
By substituting (E.7) into the definition of $J_2(K, b)$ in (3.6), we have
\[
J_2(K, b_K) = (A\mu + d)^\top (N_K^\top M_K^{-1} N_K)^{-1} (A\mu + d).
\]
(E.8)

Meanwhile, by calculation, we have
\[
M_K^{-1} = \begin{pmatrix} Q^{-1} & Q^{-1}K^\top \\ KQ^{-1} & KQ^{-1}K^\top + R^{-1} \end{pmatrix}.
\]
Therefore, the term $N_K^\top M_K^{-1} N_K$ in (E.8) takes the form
\[
N_K^\top M_K^{-1} N_K = (I - A)Q^{-1}(I - A)^\top + BR^{-1}B^\top. \tag{E.9}
\]
By plugging (E.9) into (E.8), we have
\[
J_2(K, b_K) = (A\mu + d)^\top \bigl( (I - A)Q^{-1}(I - A)^\top + BR^{-1}B^\top \bigr)^{-1} (A\mu + d).
\]
Also, by plugging (E.9) into (E.7), we have
\[
\begin{pmatrix} \mu_{K, b_K} \\ b_K \end{pmatrix} = \begin{pmatrix} Q^{-1}(I - A)^\top \\ KQ^{-1}(I - A)^\top - R^{-1}B^\top \end{pmatrix} \bigl( (I - A)Q^{-1}(I - A)^\top + BR^{-1}B^\top \bigr)^{-1} (A\mu + d).
\]
We finish the proof of the proposition.

E.3 Proof of Proposition B.2

Proof. By the definition of the cost function $c(x, u)$ in Problem 2.2 (recall that we drop the subscript $\mu$ when we focus on Problem 2.2), we have
\[
\mathbb{E} c_t = \mathbb{E}(x_t^\top Q x_t + u_t^\top R u_t + \mu^\top Q \mu) = \mathbb{E}(x_t^\top Q x_t + x_t^\top K^\top RK x_t - 2b^\top RK x_t + b^\top Rb + \sigma^2 \eta_t^\top R \eta_t + \mu^\top Q \mu) = \mathbb{E}\bigl[ x_t^\top (Q + K^\top RK) x_t - 2b^\top RK x_t \bigr] + b^\top Rb + \sigma^2 \cdot \mathrm{tr}(R) + \mu^\top Q \mu, \tag{E.10}
\]
where we write $c_t = c(x_t, u_t)$ for notational convenience. Here in the second equality we use $u_t = \pi_{K,b}(x_t) = -Kx_t + b + \sigma\eta_t$. Therefore, combining (E.10) and the definition of $J(K, b)$ in Problem 2.2, we have
\[
J(K, b) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T} \Bigl\{ \mathbb{E}\bigl[ x_t^\top (Q + K^\top RK) x_t - 2b^\top RK x_t \bigr] + b^\top Rb + \sigma^2 \cdot \mathrm{tr}(R) + \mu^\top Q \mu \Bigr\} = \mathbb{E}_{x \sim N(\mu_{K,b}, \Phi_K)}\bigl[ x^\top (Q + K^\top RK) x - 2b^\top RK x \bigr] + b^\top Rb + \sigma^2 \cdot \mathrm{tr}(R) + \mu^\top Q \mu = \mathrm{tr}\bigl( (Q + K^\top RK)\Phi_K \bigr) + \mu_{K,b}^\top (Q + K^\top RK)\mu_{K,b} - 2b^\top RK \mu_{K,b} + b^\top Rb + \sigma^2 \cdot \mathrm{tr}(R) + \mu^\top Q \mu. \tag{E.11}
\]
Now, by iteratively applying (3.3) and (3.4), we have
\[
\mathrm{tr}\bigl( (Q + K^\top RK)\Phi_K \bigr) = \mathrm{tr}(P_K \Psi_\epsilon), \tag{E.12}
\]
where $P_K$ is given in (3.4). Combining (E.11) and (E.12), we know that
\[
J(K, b) = J_1(K) + J_2(K, b) + \sigma^2 \cdot \mathrm{tr}(R) + \mu^\top Q \mu,
\]
where
\[
J_1(K) = \mathrm{tr}\bigl( (Q + K^\top RK)\Phi_K \bigr) = \mathrm{tr}(P_K \Psi_\epsilon), \qquad J_2(K, b) = \begin{pmatrix} \mu_{K,b} \\ b \end{pmatrix}^\top \begin{pmatrix} Q + K^\top RK & -K^\top R \\ -RK & R \end{pmatrix} \begin{pmatrix} \mu_{K,b} \\ b \end{pmatrix}.
\]
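The block inverse of $M_K$ and the identity (E.9) above are straightforward to check numerically. The sketch below uses hypothetical positive definite $Q$, $R$ and random $A$, $B$, $K$ of small, arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(5)
m, k = 3, 2
Qh = rng.standard_normal((m, m)); Q = Qh @ Qh.T + np.eye(m)   # Q > 0 (hypothetical)
Rh = rng.standard_normal((k, k)); R = Rh @ Rh.T + np.eye(k)   # R > 0 (hypothetical)
K = rng.standard_normal((k, m))
A = 0.5 * rng.standard_normal((m, m))
B = rng.standard_normal((m, k))

M_K = np.block([[Q + K.T @ R @ K, -K.T @ R], [-R @ K, R]])
Qi, Ri = np.linalg.inv(Q), np.linalg.inv(R)
# claimed block inverse of M_K
M_inv = np.block([[Qi, Qi @ K.T], [K @ Qi, K @ Qi @ K.T + Ri]])
assert np.allclose(M_K @ M_inv, np.eye(m + k), atol=1e-8)

# identity (E.9): N_K^T M_K^{-1} N_K = (I - A) Q^{-1} (I - A)^T + B R^{-1} B^T
N_K = np.vstack([-(np.eye(m) - A + B @ K).T, B.T])
lhs = N_K.T @ M_inv @ N_K
rhs = (np.eye(m) - A) @ Qi @ (np.eye(m) - A).T + B @ Ri @ B.T
assert np.allclose(lhs, rhs, atol=1e-8)
```

Note how the $BK$ terms inside $N_K$ cancel against the $K$-dependent blocks of $M_K^{-1}$, leaving an expression independent of $K$.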
We finish the proof of the proposition.

E.4 Proof of Proposition 3.3

Proof. By calculating the Hessian matrix of $J_2(K, b)$, we have
\[
\nabla^2_{bb} J_2(K, b) = B^\top (I - A + BK)^{-\top}(Q + K^\top RK)(I - A + BK)^{-1}B - \bigl( RK(I - A + BK)^{-1}B + B^\top (I - A + BK)^{-\top}K^\top R \bigr) + R = \bigl( R^{1/2}K(I - A + BK)^{-1}B - R^{1/2} \bigr)^\top \bigl( R^{1/2}K(I - A + BK)^{-1}B - R^{1/2} \bigr) + B^\top (I - A + BK)^{-\top} Q (I - A + BK)^{-1}B,
\]
which is a positive definite matrix independent of $b$. We denote its minimum singular value by $\nu_K$. Also, note that $\|\nabla^2_{bb} J_2(K, b)\|_2$ is upper bounded as
\[
\bigl\| \nabla^2_{bb} J_2(K, b) \bigr\|_2 \le \bigl( 1 - \rho(A - BK) \bigr)^{-2} \cdot \bigl( \|B\|_2^2 \cdot \|K\|_2^2 \cdot \|R\|_2 + \|B\|_2^2 \cdot \|Q\|_2 \bigr).
\]
Therefore, it holds that
\[
\iota_K \le \bigl( 1 - \rho(A - BK) \bigr)^{-2} \cdot \bigl( \|B\|_2^2 \cdot \|K\|_2^2 \cdot \|R\|_2 + \|B\|_2^2 \cdot \|Q\|_2 \bigr),
\]
where $\iota_K$ is the maximum singular value of $\nabla^2_{bb} J_2(K, b)$. We finish the proof of the proposition.

E.5 Proof of Proposition B.3

Proof. Following from Proposition B.2, it holds that
\[
J_1(K) = \mathrm{tr}(P_K \Psi_\epsilon) = \mathbb{E}_{y \sim N(0, \Psi_\epsilon)}(y^\top P_K y) = \mathbb{E}_{y \sim N(0, \Psi_\epsilon)}\bigl[ f_K(y) \bigr], \tag{E.13}
\]
where $f_K(y) = y^\top P_K y$. By the definition of $P_K$ in (3.4), we obtain
\[
\nabla_K f_K(y) = \nabla_K \Bigl\{ y^\top (Q + K^\top RK) y + \bigl( (A - BK)y \bigr)^\top P_K \bigl( (A - BK)y \bigr) \Bigr\} = 2RKyy^\top + \nabla_K \Bigl[ f_K\bigl( (A - BK)y \bigr) \Bigr]. \tag{E.14}
\]
Also, we have
\[
\nabla_K \Bigl[ f_K\bigl( (A - BK)y \bigr) \Bigr] = \nabla_K f_K\bigl( (A - BK)y \bigr) - 2B^\top P_K (A - BK) yy^\top. \tag{E.15}
\]
By plugging (E.15) into (E.14), we have
\[
\nabla_K f_K(y) = 2\bigl( (R + B^\top P_K B)K - B^\top P_K A \bigr) yy^\top + \nabla_K f_K\bigl( (A - BK)y \bigr). \tag{E.16}
\]
By iteratively applying (E.16), it holds that
\[
\nabla_K f_K(y) = 2\bigl( (R + B^\top P_K B)K - B^\top P_K A \bigr) \cdot \sum_{t=0}^{\infty} y_t y_t^\top, \tag{E.17}
\]
where $y_{t+1} = (A - BK)y_t$ with $y_0 = y$.
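Taking the expectation of (E.17) over $y \sim N(0, \Psi_\epsilon)$ yields the familiar LQR policy-gradient expression $\nabla_K J_1(K) = 2\bigl((R + B^\top P_K B)K - B^\top P_K A\bigr)\Phi_K$, which admits a finite-difference sanity check. The matrices below are hypothetical, and the Lyapunov equations are solved by a naive fixed-point iteration rather than a production solver:

```python
import numpy as np

def lyap(M, W, iters=2000):
    # fixed-point iteration for X = M X M^T + W (converges when rho(M) < 1)
    X = np.zeros_like(W)
    for _ in range(iters):
        X = M @ X @ M.T + W
    return X

rng = np.random.default_rng(7)
m, k = 3, 2
A = 0.4 * np.eye(m)                       # hypothetical stable dynamics
B = rng.standard_normal((m, k))
K = 0.05 * rng.standard_normal((k, m))
Q, R, Psi = np.eye(m), np.eye(k), np.eye(m)

def J1(K):
    Acl = A - B @ K
    P = lyap(Acl.T, Q + K.T @ R @ K)      # P = Q + K^T R K + Acl^T P Acl
    return np.trace(P @ Psi)

Acl = A - B @ K
P = lyap(Acl.T, Q + K.T @ R @ K)
Phi = lyap(Acl, Psi)                      # Phi = Acl Phi Acl^T + Psi
grad = 2 * ((R + B.T @ P @ B) @ K - B.T @ P @ A) @ Phi

# central finite differences of J1 entry by entry
eps = 1e-6
fd = np.zeros_like(K)
for i in range(k):
    for j in range(m):
        E = np.zeros_like(K); E[i, j] = eps
        fd[i, j] = (J1(K + E) - J1(K - E)) / (2 * eps)
assert np.allclose(grad, fd, atol=1e-4)
```

The closed-form and finite-difference gradients agree, reflecting that the series $\sum_t y_t y_t^\top$ in (E.17) averages to $\Phi_K$.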
Now, combining (E.13) and (E.17), it holds that
\[
\nabla_K J_1(K) = 2\bigl( (R + B^\top P_K B)K - B^\top P_K A \bigr)\Phi_K = 2(\Upsilon^{22}_K K - \Upsilon^{21}_K) \cdot \Phi_K,
\]
where $\Upsilon_K$ is defined in (3.7). Meanwhile, combining the form of $\mu_{K,b}$ in (3.2), it holds by calculation that
\[
\nabla_b J_2(K, b) = 2\bigl( \Upsilon^{22}_K(-K\mu_{K,b} + b) + \Upsilon^{21}_K \mu_{K,b} + q_{K,b} \bigr),
\]
where $q_{K,b}$ is defined in (3.7). We finish the proof of the proposition.

E.6 Proof of Proposition B.1

Proof. From the definition of $V_{K,b}(x)$ in (B.1) and the definition of the cost function $c(x, u)$ in Problem 2.2, it holds that
\[
V_{K,b}(x) = \sum_{t=0}^{\infty} \Bigl\{ \mathbb{E}\bigl[ x_t^\top (Q + K^\top RK)x_t - 2b^\top RKx_t + b^\top Rb + \sigma^2 \eta_t^\top R \eta_t + \mu^\top Q\mu \,\big|\, x_0 = x \bigr] - J(K, b) \Bigr\}.
\]
Combining (3.1), we know that $V_{K,b}(x)$ is a quadratic function taking the form $V_{K,b}(x) = x^\top Gx + r^\top x + h$, where $G$, $r$, and $h$ are functions of $K$ and $b$. Note that $V_{K,b}(x)$ satisfies
\[
V_{K,b}(x) = c(x, -Kx + b) - J(K, b) + \mathbb{E}\bigl[ V_{K,b}(x') \,\big|\, x \bigr]. \tag{E.18}
\]
By substituting the form of $c(x, -Kx + b)$ in Problem 2.2 and $J(K, b)$ in (3.5) into (E.18), we obtain
\[
x^\top Gx + r^\top x + h = x^\top (Q + K^\top RK)x - 2b^\top RKx + b^\top Rb + \mu^\top Q\mu - \bigl( \mathrm{tr}(P_K \Psi_\epsilon) + \mu_{K,b}^\top (Q + K^\top RK)\mu_{K,b} - 2b^\top RK\mu_{K,b} + \mu^\top Q\mu + b^\top Rb \bigr) + \bigl( (A - BK)x + (Bb + A\mu + d) \bigr)^\top G \bigl( (A - BK)x + (Bb + A\mu + d) \bigr) + \mathrm{tr}(G\Psi_\epsilon) + r^\top \bigl( (A - BK)x + (Bb + A\mu + d) \bigr) + h - \sigma^2 \cdot \mathrm{tr}(R). \tag{E.19}
\]
By comparing the quadratic terms and linear terms on both the LHS and RHS in (E.19), we obtain $G = P_K$ and $r = 2f_{K,b}$, where $f_{K,b} = (I - A + BK)^{-\top}\bigl[ (A - BK)^\top P_K(Bb + A\mu + d) - K^\top Rb \bigr]$. Also, by the definition of $V_{K,b}(x)$ in (B.1), we know that $\mathbb{E}[V_{K,b}(x)] = 0$, where the expectation is taken following the stationary distribution generated by the policy $\pi_{K,b}$ and the state transition.
Therefore, we have
\[
h = -2f_{K,b}^\top \mu_{K,b} - \mu_{K,b}^\top P_K \mu_{K,b} - \mathrm{tr}(P_K \Phi_K),
\]
which shows that
\[
V_{K,b}(x) = x^\top P_K x - \mathrm{tr}(P_K \Phi_K) + 2f_{K,b}^\top(x - \mu_{K,b}) - \mu_{K,b}^\top P_K \mu_{K,b}. \tag{E.20}
\]
For the action-value function $Q_{K,b}(x, u)$, by plugging (E.20) into (B.2), we obtain
\[
Q_{K,b}(x, u) = \begin{pmatrix} x \\ u \end{pmatrix}^\top \Upsilon_K \begin{pmatrix} x \\ u \end{pmatrix} + 2\begin{pmatrix} p_{K,b} \\ q_{K,b} \end{pmatrix}^\top \begin{pmatrix} x \\ u \end{pmatrix} - \mathrm{tr}(P_K \Phi_K) - \sigma^2 \cdot \mathrm{tr}(R + P_K BB^\top) - b^\top Rb + 2b^\top RK\mu_{K,b} - \mu_{K,b}^\top (Q + K^\top RK + P_K)\mu_{K,b} + 2f_{K,b}^\top\bigl( (A\mu + d) - \mu_{K,b} \bigr) + (A\mu + d)^\top P_K (A\mu + d).
\]
We finish the proof of the proposition.

E.7 Proof of Proposition B.5

Proof. By Proposition B.1, it holds that $Q_{K,b}$ takes the linear form
\[
Q_{K,b}(x, u) = \psi(x, u)^\top \alpha_{K,b} + \beta_{K,b}, \tag{E.21}
\]
where $\beta_{K,b}$ is a scalar independent of $x$ and $u$. Note that $Q_{K,b}(x, u)$ satisfies
\[
Q_{K,b}(x, u) = c(x, u) - J(K, b) + \mathbb{E}_{\pi_{K,b}}\bigl[ Q_{K,b}(x', u') \,\big|\, x, u \bigr], \tag{E.22}
\]
where $(x', u')$ is the state-action pair after $(x, u)$ following the policy $\pi_{K,b}$ and the state transition. Combining (E.21) and (E.22), we obtain
\[
\psi(x, u)^\top \alpha_{K,b} = c(x, u) - J(K, b) + \mathbb{E}_{\pi_{K,b}}\bigl[ \psi(x', u') \,\big|\, x, u \bigr]^\top \alpha_{K,b}. \tag{E.23}
\]
By left multiplying both sides of (E.23) by $\psi(x, u)$ and taking the expectation, we have
\[
\mathbb{E}_{\pi_{K,b}}\Bigl\{ \psi(x, u)\bigl( \psi(x, u) - \psi(x', u') \bigr)^\top \Bigr\} \cdot \alpha_{K,b} + \mathbb{E}_{\pi_{K,b}}\bigl[ \psi(x, u) \bigr] \cdot J(K, b) = \mathbb{E}_{\pi_{K,b}}\bigl[ c(x, u)\psi(x, u) \bigr].
\]
Combining the definition of the matrix $\Theta_{K,b}$ in (B.7), we have
\[
\begin{pmatrix} 1 & 0 \\ \mathbb{E}_{\pi_{K,b}}[\psi(x, u)] & \Theta_{K,b} \end{pmatrix} \begin{pmatrix} J(K, b) \\ \alpha_{K,b} \end{pmatrix} = \begin{pmatrix} J(K, b) \\ \mathbb{E}_{\pi_{K,b}}[c(x, u)\psi(x, u)] \end{pmatrix},
\]
which concludes the proof of the proposition.

E.8 Proof of Proposition B.6

Proof. Invertibility and upper bound. We denote by $z_t = (x_t^\top, u_t^\top)^\top$ for any $t \ge 0$.
Then following from the state transition and the policy $\pi_{K,b}$, the transition of $\{z_t\}_{t \ge 0}$ takes the form
\[
z_{t+1} = Lz_t + \nu + \delta_t, \tag{E.24}
\]
where $L$, $\nu$, and $\delta_t$ are defined as
\[
L = \begin{pmatrix} A & B \\ -KA & -KB \end{pmatrix}, \qquad \nu = \begin{pmatrix} A\mu + d \\ -K(A\mu + d) + b \end{pmatrix}, \qquad \delta_t = \begin{pmatrix} \omega_t \\ -K\omega_t + \sigma\eta_t \end{pmatrix}.
\]
Note that $L$ also takes the form
\[
L = \begin{pmatrix} I \\ -K \end{pmatrix} \begin{pmatrix} A & B \end{pmatrix}.
\]
Combining the fact that $\rho(UV) = \rho(VU)$ for any matrices $U$ and $V$, we know that $\rho(L) = \rho(A - BK) < 1$, which verifies the stability of (E.24). Following from the stability of (E.24), we know that the Markov chain generated by (E.24) admits a unique stationary distribution $N(\mu_z, \Sigma_z)$, where $\mu_z$ and $\Sigma_z$ satisfy
\[
\mu_z = L\mu_z + \nu, \qquad \Sigma_z = L\Sigma_z L^\top + \Psi_\delta, \qquad \text{where } \Psi_\delta = \begin{pmatrix} \Psi_\omega & -\Psi_\omega K^\top \\ -K\Psi_\omega & K\Psi_\omega K^\top + \sigma^2 I \end{pmatrix}.
\]
Also, we know that $\Sigma_z$ takes the form
\[
\Sigma_z = \mathrm{Cov}\begin{pmatrix} x \\ u \end{pmatrix} = \begin{pmatrix} \Phi_K & -\Phi_K K^\top \\ -K\Phi_K & K\Phi_K K^\top + \sigma^2 I \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & \sigma^2 I \end{pmatrix} + \begin{pmatrix} I \\ -K \end{pmatrix} \Phi_K \begin{pmatrix} I \\ -K \end{pmatrix}^\top, \tag{E.25}
\]
where $\Phi_K$ is defined in (3.3). The following lemma establishes the form of $\Theta_{K,b}$.

Lemma E.2. The matrix $\Theta_{K,b}$ in (B.7) takes the form
\[
\Theta_{K,b} = \begin{pmatrix} 2(\Sigma_z \otimes_s \Sigma_z)(I - L \otimes_s L)^\top & 0 \\ 0 & \Sigma_z(I - L)^\top \end{pmatrix}.
\]
Proof. See §F.9 for a detailed proof.

Note that since $\rho(L) < 1$, both $I - L \otimes_s L$ and $I - L$ are invertible. Therefore, by Lemma E.2, the matrix $\Theta_{K,b}$ is invertible. This finishes the proof of the invertibility of $\Theta_{K,b}$. Moreover, by (E.25) and Lemma E.2, we upper bound the spectral norm of $\Theta_{K,b}$ as
\[
\|\Theta_{K,b}\|_2 \le 2\max\Bigl\{ \|\Sigma_z\|_2^2 \cdot \bigl( 1 + \|L\|_2^2 \bigr), \; \|\Sigma_z\|_2 \cdot \bigl( 1 + \|L\|_2 \bigr) \Bigr\} \le 4\bigl( 1 + \|K\|_F^2 \bigr)^2 \cdot \|\Phi_K\|_2^2,
\]
which proves the upper bound of $\|\Theta_{K,b}\|_2$.

Minimum singular value. To lower bound $\sigma_{\min}(\widetilde{\Theta}_{K,b})$, we only need to upper bound $\sigma_{\max}(\widetilde{\Theta}_{K,b}^{-1}) = \|\widetilde{\Theta}_{K,b}^{-1}\|_2$. We first calculate $\widetilde{\Theta}_{K,b}^{-1}$. Recall that the matrix $\widetilde{\Theta}_{K,b}$ in (B.8) takes the form
\[
\widetilde{\Theta}_{K,b} = \begin{pmatrix} 1 & 0 \\ \mathbb{E}_{\pi_{K,b}}[\psi(x, u)] & \Theta_{K,b} \end{pmatrix}.
\]
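Three facts used here are easy to check numerically: the spectral-radius identity $\rho(L) = \rho(A - BK)$, the stationary fixed-point equations for $\mu_z$ and $\Sigma_z$, and the inverse pattern of a block lower-triangular matrix of the same shape as $\widetilde{\Theta}_{K,b}$. All matrices and dimensions below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(8)
m, k = 3, 2
A = 0.5 * np.eye(m)                        # hypothetical stable dynamics
B = rng.standard_normal((m, k))
K = 0.02 * rng.standard_normal((k, m))

L = np.block([[A, B], [-K @ A, -K @ B]])   # L = [I; -K] [A  B]
# rho(UV) = rho(VU) gives rho(L) = rho(A - BK)
assert abs(max(abs(np.linalg.eigvals(L)))
           - max(abs(np.linalg.eigvals(A - B @ K)))) < 1e-8

# stationary mean solves mu_z = L mu_z + nu
nu = rng.standard_normal(m + k)
mu_z = np.linalg.solve(np.eye(m + k) - L, nu)
assert np.allclose(mu_z, L @ mu_z + nu)

# stationary covariance solves Sigma_z = L Sigma_z L^T + Psi_delta
sigma, Psi_w = 0.3, np.eye(m)
Psi_delta = np.block([[Psi_w, -Psi_w @ K.T],
                      [-K @ Psi_w, K @ Psi_w @ K.T + sigma**2 * np.eye(k)]])
Sigma_z = np.zeros_like(L)
for _ in range(500):                       # fixed-point iteration, rho(L) < 1
    Sigma_z = L @ Sigma_z @ L.T + Psi_delta
assert np.allclose(Sigma_z, L @ Sigma_z @ L.T + Psi_delta, atol=1e-10)

# a matrix [[1, 0], [v, Theta]] with invertible Theta has inverse
# [[1, 0], [-Theta^{-1} v, Theta^{-1}]], the pattern used for Theta_tilde
p = m + k
Theta = rng.standard_normal((p, p)) + 3 * np.eye(p)   # generic invertible block
v = rng.standard_normal((p, 1))
Tt = np.block([[np.ones((1, 1)), np.zeros((1, p))], [v, Theta]])
Ti = np.linalg.inv(Theta)
Tt_inv = np.block([[np.ones((1, 1)), np.zeros((1, p))], [-Ti @ v, Ti]])
assert np.allclose(Tt @ Tt_inv, np.eye(p + 1), atol=1e-8)
```

The last check mirrors the closed-form inverse of $\widetilde{\Theta}_{K,b}$ derived below.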
By the definition of the feature vector $\psi(x, u)$ in (B.5), the vector $\widetilde{\sigma}_z = \mathbb{E}_{\pi_{K,b}}[\psi(x, u)]$ takes the form
\[
\widetilde{\sigma}_z = \mathbb{E}_{\pi_{K,b}}\bigl[ \psi(x, u) \bigr] = \begin{pmatrix} \mathrm{svec}(\Sigma_z) \\ 0_{k+m} \end{pmatrix},
\]
where $0_{k+m}$ denotes the all-zero column vector of dimension $k + m$. Also, since $\Theta_{K,b}$ is invertible, the matrix $\widetilde{\Theta}_{K,b}$ is also invertible, whose inverse takes the form
\[
\widetilde{\Theta}_{K,b}^{-1} = \begin{pmatrix} 1 & 0 \\ -\Theta_{K,b}^{-1} \cdot \widetilde{\sigma}_z & \Theta_{K,b}^{-1} \end{pmatrix}.
\]
The following lemma upper bounds the spectral norm of $\widetilde{\Theta}_{K,b}^{-1}$.

Lemma E.3. The spectral norm of the matrix $\widetilde{\Theta}_{K,b}^{-1}$ is upper bounded by a positive constant $\widetilde{\lambda}_K$, where $\widetilde{\lambda}_K$ only depends on $\|K\|_2$ and $\rho(A - BK)$.

Proof. See §F.10 for a detailed proof.

By Lemma E.3, we know that $\sigma_{\min}(\widetilde{\Theta}_{K,b})$ is lower bounded by a positive constant $\lambda_K = 1/\widetilde{\lambda}_K$, which only depends on $\|K\|_2$ and $\rho(A - BK)$. This concludes the proof of the proposition.

F Proofs of Lemmas

F.1 Proof of Lemma D.1

Proof. Following from (3.4), it holds that
\[
y^\top P_{K_2} y = \sum_{t \ge 0} y^\top \bigl( (A - BK_2)^t \bigr)^\top (Q + K_2^\top RK_2)(A - BK_2)^t y. \tag{F.1}
\]
Meanwhile, by the state transition $y_{t+1} = (A - BK_2)y_t$, we know that
\[
y_t = (A - BK_2)^t y_0 = (A - BK_2)^t y. \tag{F.2}
\]
By plugging (F.2) into (F.1), it holds that
\[
y^\top P_{K_2} y = \sum_{t \ge 0} y_t^\top (Q + K_2^\top RK_2) y_t = \sum_{t \ge 0} (y_t^\top Q y_t + y_t^\top K_2^\top RK_2 y_t). \tag{F.3}
\]
Also, since $y_t \to 0$ as $t \to \infty$, the telescoping sum gives
\[
y^\top P_{K_1} y = -\sum_{t \ge 0} (y_{t+1}^\top P_{K_1} y_{t+1} - y_t^\top P_{K_1} y_t). \tag{F.4}
\]
Combining (F.3) and (F.4), we have
\[
y^\top P_{K_2} y - y^\top P_{K_1} y = \sum_{t \ge 0} (y_t^\top Q y_t + y_t^\top K_2^\top RK_2 y_t + y_{t+1}^\top P_{K_1} y_{t+1} - y_t^\top P_{K_1} y_t).
\]
(F.5)

Also, by the state transition $y_{t+1} = (A - BK_2)y_t$, it holds for any $t \ge 0$ that
\[
y_t^\top Q y_t + y_t^\top K_2^\top RK_2 y_t + y_{t+1}^\top P_{K_1} y_{t+1} - y_t^\top P_{K_1} y_t = y_t^\top \bigl( Q + (K_2 - K_1 + K_1)^\top R(K_2 - K_1 + K_1) \bigr) y_t + y_t^\top \bigl( A - BK_1 - B(K_2 - K_1) \bigr)^\top P_{K_1} \bigl( A - BK_1 - B(K_2 - K_1) \bigr) y_t - y_t^\top P_{K_1} y_t = 2y_t^\top (K_2 - K_1)^\top \bigl( (R + B^\top P_{K_1} B)K_1 - B^\top P_{K_1} A \bigr) y_t + y_t^\top (K_2 - K_1)^\top (R + B^\top P_{K_1} B)(K_2 - K_1) y_t = 2y_t^\top (K_2 - K_1)^\top (\Upsilon^{22}_{K_1} K_1 - \Upsilon^{21}_{K_1}) y_t + y_t^\top (K_2 - K_1)^\top \Upsilon^{22}_{K_1}(K_2 - K_1) y_t, \tag{F.6}
\]
where the matrix $\Upsilon_{K_1}$ is defined in (3.7). By plugging (F.6) into (F.5), we have
\[
y^\top P_{K_2} y - y^\top P_{K_1} y = \sum_{t \ge 0} \Bigl[ 2y_t^\top (K_2 - K_1)^\top (\Upsilon^{22}_{K_1} K_1 - \Upsilon^{21}_{K_1}) y_t + y_t^\top (K_2 - K_1)^\top \Upsilon^{22}_{K_1}(K_2 - K_1) y_t \Bigr] = \sum_{t \ge 0} D_{K_1, K_2}(y_t),
\]
where
\[
D_{K_1, K_2}(y) = 2y^\top (K_2 - K_1)^\top (\Upsilon^{22}_{K_1} K_1 - \Upsilon^{21}_{K_1}) y + y^\top (K_2 - K_1)^\top \Upsilon^{22}_{K_1}(K_2 - K_1) y.
\]
We finish the proof of the lemma.

F.2 Proof of Lemma D.2

Proof. We prove (D.16) and (D.17) separately in the sequel.

Proof of (D.16). From the definition of $J_1(K)$ in (3.6), we have
\[
J_1(K) - J_1(K^*) = \mathrm{tr}(P_K \Psi_\epsilon - P_{K^*}\Psi_\epsilon) = \mathbb{E}_{y \sim N(0, \Psi_\epsilon)}(y^\top P_K y - y^\top P_{K^*} y) = -\mathbb{E}\Bigl[ \sum_{t \ge 0} D_{K, K^*}(y_t) \Bigr], \tag{F.7}
\]
where in the last equality, we apply Lemma D.1 and the expectation is taken following the transition $y_{t+1} = (A - BK^*)y_t$ with initial state $y_0 \sim N(0, \Psi_\epsilon)$. Here we denote
\[
D_{K, K^*}(y) = 2y^\top (K^* - K)^\top (\Upsilon^{22}_K K - \Upsilon^{21}_K) y + y^\top (K^* - K)^\top \Upsilon^{22}_K (K^* - K) y.
\]
Also, by completing the square, we write $D_{K, K^*}(y)$ as
\[
D_{K, K^*}(y) = y^\top \bigl( K^* - K + (\Upsilon^{22}_K)^{-1}(\Upsilon^{22}_K K - \Upsilon^{21}_K) \bigr)^\top \Upsilon^{22}_K \bigl( K^* - K + (\Upsilon^{22}_K)^{-1}(\Upsilon^{22}_K K - \Upsilon^{21}_K) \bigr) y - y^\top (\Upsilon^{22}_K K - \Upsilon^{21}_K)^\top (\Upsilon^{22}_K)^{-1}(\Upsilon^{22}_K K - \Upsilon^{21}_K) y. \tag{F.8}
\]
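The cost-difference decomposition of Lemma D.1, $y^\top P_{K_2} y - y^\top P_{K_1} y = \sum_{t \ge 0} D_{K_1, K_2}(y_t)$ along the $K_2$-controlled trajectory, can be verified numerically. The matrices below are hypothetical, and the Riccati/Lyapunov recursions are solved by naive fixed-point iteration:

```python
import numpy as np

def lyap_T(Acl, W, iters=2000):
    # solves P = W + Acl^T P Acl by fixed-point iteration (rho(Acl) < 1)
    P = np.zeros_like(W)
    for _ in range(iters):
        P = W + Acl.T @ P @ Acl
    return P

rng = np.random.default_rng(9)
m, k = 3, 2
A = 0.4 * np.eye(m)                           # hypothetical stable dynamics
B = rng.standard_normal((m, k))
K1 = 0.05 * rng.standard_normal((k, m))
K2 = K1 + 0.05 * rng.standard_normal((k, m))
Q, R = np.eye(m), np.eye(k)

P1 = lyap_T(A - B @ K1, Q + K1.T @ R @ K1)
P2 = lyap_T(A - B @ K2, Q + K2.T @ R @ K2)

Ups22 = R + B.T @ P1 @ B                      # the (2,2) block of Upsilon_{K_1}
Ups21 = B.T @ P1 @ A                          # the (2,1) block
dK = K2 - K1

def D(y):
    return (2 * y @ dK.T @ (Ups22 @ K1 - Ups21) @ y
            + y @ dK.T @ Ups22 @ dK @ y)

y = rng.standard_normal(m)
lhs = y @ P2 @ y - y @ P1 @ y
rhs, yt = 0.0, y.copy()
for _ in range(500):                          # sum D along the K2-controlled trajectory
    rhs += D(yt)
    yt = (A - B @ K2) @ yt
assert abs(lhs - rhs) < 1e-8
```

The identity is exact; only the truncation of the infinite sum and the Lyapunov iterations introduce (negligible) error here.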
Note that the first term on the RHS of (F.8) is nonnegative, since it is a quadratic form with a positive definite matrix. We therefore lower bound $D_{K, K^*}(y)$ as
\[
D_{K, K^*}(y) \ge -y^\top (\Upsilon^{22}_K K - \Upsilon^{21}_K)^\top (\Upsilon^{22}_K)^{-1}(\Upsilon^{22}_K K - \Upsilon^{21}_K) y. \tag{F.9}
\]
Combining (F.7) and (F.9), it holds that
\[
J_1(K) - J_1(K^*) \le \Bigl\| \mathbb{E}\Bigl[ \sum_{t \ge 0} y_t y_t^\top \Bigr] \Bigr\|_2 \cdot \mathrm{tr}\bigl( (\Upsilon^{22}_K K - \Upsilon^{21}_K)^\top (\Upsilon^{22}_K)^{-1}(\Upsilon^{22}_K K - \Upsilon^{21}_K) \bigr) = \|\Phi_{K^*}\|_2 \cdot \mathrm{tr}\bigl( (\Upsilon^{22}_K K - \Upsilon^{21}_K)^\top (\Upsilon^{22}_K)^{-1}(\Upsilon^{22}_K K - \Upsilon^{21}_K) \bigr) \le \bigl\| (\Upsilon^{22}_K)^{-1} \bigr\|_2 \cdot \|\Phi_{K^*}\|_2 \cdot \mathrm{tr}\bigl( (\Upsilon^{22}_K K - \Upsilon^{21}_K)^\top (\Upsilon^{22}_K K - \Upsilon^{21}_K) \bigr) \le \sigma_{\min}^{-1}(R) \cdot \|\Phi_{K^*}\|_2 \cdot \mathrm{tr}\bigl( (\Upsilon^{22}_K K - \Upsilon^{21}_K)^\top (\Upsilon^{22}_K K - \Upsilon^{21}_K) \bigr),
\]
where the last line comes from the fact that $\Upsilon^{22}_K = R + B^\top P_K B \succeq R$. This completes the proof of (D.16).

Proof of (D.17). Note that for any $\widetilde{K}$, it holds by the optimality of $K^*$ that
\[
J_1(K) - J_1(K^*) \ge J_1(K) - J_1(\widetilde{K}) = -\mathbb{E}\Bigl[ \sum_{t \ge 0} D_{K, \widetilde{K}}(y_t) \Bigr], \tag{F.10}
\]
where the expectation is taken following the transition $y_{t+1} = (A - B\widetilde{K})y_t$ with initial state $y_0 \sim N(0, \Psi_\epsilon)$. By taking $\widetilde{K} = K - (\Upsilon^{22}_K)^{-1}(\Upsilon^{22}_K K - \Upsilon^{21}_K)$ and following from a calculation similar to (F.8), the function $D_{K, \widetilde{K}}(y)$ takes the form
\[
D_{K, \widetilde{K}}(y) = -y^\top (\Upsilon^{22}_K K - \Upsilon^{21}_K)^\top (\Upsilon^{22}_K)^{-1}(\Upsilon^{22}_K K - \Upsilon^{21}_K) y. \tag{F.11}
\]
Combining (F.10) and (F.11), it holds that
\[
J_1(K) - J_1(K^*) \ge \mathrm{tr}\bigl( \Phi_{\widetilde{K}}(\Upsilon^{22}_K K - \Upsilon^{21}_K)^\top (\Upsilon^{22}_K)^{-1}(\Upsilon^{22}_K K - \Upsilon^{21}_K) \bigr) \ge \sigma_{\min}(\Psi_\epsilon) \cdot \|\Upsilon^{22}_K\|_2^{-1} \cdot \mathrm{tr}\bigl( (\Upsilon^{22}_K K - \Upsilon^{21}_K)^\top (\Upsilon^{22}_K K - \Upsilon^{21}_K) \bigr),
\]
where we use the fact that $\Phi_{\widetilde{K}} = (A - B\widetilde{K})\Phi_{\widetilde{K}}(A - B\widetilde{K})^\top + \Psi_\epsilon \succeq \Psi_\epsilon$ in the last line. This finishes the proof of (D.17).

F.3 Proof of Lemma D.3

Proof. By Proposition B.2, we have
\[
\bigl| J_1(\widetilde{K}_{n+1}) - J_1(K_{n+1}) \bigr| = \bigl| \mathrm{tr}\bigl( (P_{\widetilde{K}_{n+1}} - P_{K_{n+1}})\Psi_\epsilon \bigr) \bigr| \le \|P_{\widetilde{K}_{n+1}} - P_{K_{n+1}}\|_2 \cdot \|\Psi_\epsilon\|_F.
\]
(F.12)

The following lemma upper bounds the term $\|P_{\widetilde{K}_{n+1}} - P_{K_{n+1}}\|_2$.

Lemma F.1. Suppose that the parameters $K$ and $\widetilde{K}$ satisfy
\[
\|\widetilde{K} - K\|_2 \cdot \bigl( \|A - BK\|_2 + 1 \bigr) \cdot \|\Phi_K\|_2 \le \sigma_{\min}(\Psi_\omega)/4 \cdot \|B\|_2^{-1}. \tag{F.13}
\]
Then it holds that
\[
\|P_{\widetilde{K}} - P_K\|_2 \le 6 \cdot \sigma_{\min}^{-1}(\Psi_\omega) \cdot \|\Phi_K\|_2 \cdot \|K\|_2 \cdot \|R\|_2 \cdot \|\widetilde{K} - K\|_2 \cdot \bigl( \|B\|_2 \cdot \|K\|_2 \cdot \|A - BK\|_2 + \|B\|_2 \cdot \|K\|_2 + 1 \bigr). \tag{F.14}
\]
Proof. See Lemma 5.7 in Yang et al. (2019) for a detailed proof.

To use Lemma F.1, it suffices to verify that $\widetilde{K}_{n+1}$ and $K_{n+1}$ satisfy (F.13). Note that from the definitions of $K_{n+1}$ and $\widetilde{K}_{n+1}$ in (D.18) and (D.21), respectively, we have
\[
\|\widetilde{K}_{n+1} - K_{n+1}\|_2 \cdot \bigl( \|A - B\widetilde{K}_{n+1}\|_2 + 1 \bigr) \cdot \|\Phi_{\widetilde{K}_{n+1}}\|_2 \le \gamma \cdot \|\widehat{\Upsilon}_{K_n} - \Upsilon_{K_n}\|_F \cdot \bigl( 1 + \|K_n\|_2 \bigr) \cdot \bigl( \|A - B\widetilde{K}_{n+1}\|_2 + 1 \bigr) \cdot \|\Phi_{\widetilde{K}_{n+1}}\|_2. \tag{F.15}
\]
Now, we upper bound the RHS of (F.15). For the term $\|A - B\widetilde{K}_{n+1}\|_2$, it holds by the definition of $\widetilde{K}_{n+1}$ in (D.21) that
\[
\|A - B\widetilde{K}_{n+1}\|_2 \le \|A - BK_n\|_2 + \gamma \cdot \|B\|_2 \cdot \|\Upsilon^{22}_{K_n} K_n - \Upsilon^{21}_{K_n}\|_2 \le \|A - BK_n\|_2 + \gamma \cdot \|B\|_2 \cdot \|\Upsilon_{K_n}\|_2 \cdot \bigl( 1 + \|K_n\|_2 \bigr). \tag{F.16}
\]
By the definition of $\Upsilon_{K_n}$ in (3.7), we upper bound $\|\Upsilon_{K_n}\|_2$ as
\[
\|\Upsilon_{K_n}\|_2 \le \|Q\|_2 + \|R\|_2 + \bigl( \|A\|_F + \|B\|_F \bigr)^2 \cdot \|P_{K_n}\|_2 \le \|Q\|_2 + \|R\|_2 + \bigl( \|A\|_F + \|B\|_F \bigr)^2 \cdot J_1(K_0) \cdot \sigma_{\min}^{-1}(\Psi_\epsilon), \tag{F.17}
\]
where the last line comes from the fact that
\[
J_1(K_0) \ge J_1(K_n) = \mathrm{tr}\bigl( (Q + K_n^\top RK_n)\Phi_{K_n} \bigr) = \mathrm{tr}(P_{K_n}\Psi_\epsilon) \ge \|P_{K_n}\|_2 \cdot \sigma_{\min}(\Psi_\epsilon).
\]
As for the term $\|\Phi_{\widetilde{K}_{n+1}}\|_2$ in (F.15), from the fact that
\[
J_1(K_0) \ge J_1(\widetilde{K}_{n+1}) = \mathrm{tr}\bigl( (Q + \widetilde{K}_{n+1}^\top R\widetilde{K}_{n+1})\Phi_{\widetilde{K}_{n+1}} \bigr) \ge \|\Phi_{\widetilde{K}_{n+1}}\|_2 \cdot \sigma_{\min}(Q),
\]
it holds that
\[
\|\Phi_{\widetilde{K}_{n+1}}\|_2 \le J_1(K_0) \cdot \sigma_{\min}^{-1}(Q).
\]
(F.18)

Therefore, combining (F.15), (F.16), (F.17), and (F.18), we know that
\[
\|\widetilde{K}_{n+1} - K_{n+1}\|_2 \cdot \bigl( \|A - B\widetilde{K}_{n+1}\|_2 + 1 \bigr) \cdot \|\Phi_{\widetilde{K}_{n+1}}\|_2 \le \mathrm{poly}_1\bigl( \|K_n\|_2 \bigr) \cdot \|\widehat{\Upsilon}_{K_n} - \Upsilon_{K_n}\|_F. \tag{F.19}
\]
From Theorem B.8, it holds with probability at least $1 - T_n^{-4} - \widetilde{T}_n^{-6}$ that
\[
\|\widehat{\Upsilon}_{K_n} - \Upsilon_{K_n}\|_F \le \frac{\mathrm{poly}_3\bigl( \|K_n\|_F, \|\mu\|_2 \bigr)}{\lambda_{K_n} \cdot (1-\rho)^2} \cdot \frac{\log^3 T_n}{T_n^{1/4}} + \frac{\mathrm{poly}_4\bigl( \|K_n\|_F, \|b_0\|_2, \|\mu\|_2 \bigr)}{\lambda_{K_n}} \cdot \frac{\log^{1/2}\widetilde{T}_n}{\widetilde{T}_n^{1/8} \cdot (1-\rho)}, \tag{F.20}
\]
which holds for any $\rho \in (\rho(A - BK_n), 1)$. Note that from the choice of $T_n$ and $\widetilde{T}_n$ in the statement of Theorem B.4, namely
\[
T_n \ge \mathrm{poly}_5\bigl( \|K_n\|_F, \|b_0\|_2, \|\mu\|_2 \bigr) \cdot \lambda_{K_n}^{-4} \cdot \bigl( 1 - \rho(A - BK_n) \bigr)^{-9} \cdot \varepsilon^{-5}, \qquad \widetilde{T}_n \ge \mathrm{poly}_6\bigl( \|K_n\|_F, \|b_0\|_2, \|\mu\|_2 \bigr) \cdot \lambda_{K_n}^{-2} \cdot \bigl( 1 - \rho(A - BK_n) \bigr)^{-12} \cdot \varepsilon^{-12},
\]
it holds that
\[
\frac{\mathrm{poly}_3\bigl( \|K_n\|_F, \|\mu\|_2 \bigr)}{\lambda_{K_n} \cdot (1-\rho)^2} \cdot \frac{\log^3 T_n}{T_n^{1/4}} + \frac{\mathrm{poly}_4\bigl( \|K_n\|_F, \|b_0\|_2, \|\mu\|_2 \bigr)}{\lambda_{K_n}} \cdot \frac{\log^{1/2}\widetilde{T}_n}{\widetilde{T}_n^{1/8} \cdot (1-\rho)} \le \min\Bigl\{ \bigl[ \mathrm{poly}_1\bigl( \|K_n\|_2 \bigr) \bigr]^{-1} \cdot \sigma_{\min}(\Psi_\omega)/4 \cdot \|B\|_2^{-1}, \; \bigl[ \mathrm{poly}_2\bigl( \|K_n\|_2 \bigr) \bigr]^{-1} \cdot \varepsilon/8 \cdot \gamma \cdot \sigma_{\min}(\Psi_\epsilon) \cdot \sigma_{\min}(R) \cdot \|\Phi_{K^*}\|_2^{-1} \cdot \|\Psi_\epsilon\|_F^{-1} \Bigr\}. \tag{F.21}
\]
Combining (F.19), (F.20), and (F.21), we know that (F.13) holds with probability at least $1 - \varepsilon^{15}$ for sufficiently small $\varepsilon > 0$. Meanwhile, by (F.16), (F.17), and (F.18), the RHS of (F.14) is upper bounded as
\[
6 \cdot \sigma_{\min}^{-1}(\Psi_\omega) \cdot \|\Phi_{\widetilde{K}_{n+1}}\|_2 \cdot \|\widetilde{K}_{n+1}\|_2 \cdot \|R\|_2 \cdot \|\widetilde{K}_{n+1} - K_{n+1}\|_2 \cdot \bigl( \|B\|_2 \cdot \|\widetilde{K}_{n+1}\|_2 \cdot \|A - B\widetilde{K}_{n+1}\|_2 + \|B\|_2 \cdot \|\widetilde{K}_{n+1}\|_2 + 1 \bigr) \le \mathrm{poly}_2\bigl( \|K_n\|_2 \bigr) \cdot \|\widehat{\Upsilon}_{K_n} - \Upsilon_{K_n}\|_F.
\]
(F.22)

Now, by Lemma F.1, it holds with probability at least $1 - \varepsilon^{15}$ that
\[
\|P_{\widetilde{K}_{n+1}} - P_{K_{n+1}}\|_2 \le 6 \cdot \sigma_{\min}^{-1}(\Psi_\omega) \cdot \|\Phi_{\widetilde{K}_{n+1}}\|_2 \cdot \|\widetilde{K}_{n+1}\|_2 \cdot \|R\|_2 \cdot \|\widetilde{K}_{n+1} - K_{n+1}\|_2 \cdot \bigl( \|B\|_2 \cdot \|\widetilde{K}_{n+1}\|_2 \cdot \|A - B\widetilde{K}_{n+1}\|_2 + \|B\|_2 \cdot \|\widetilde{K}_{n+1}\|_2 + 1 \bigr) \le \mathrm{poly}_2\bigl( \|K_n\|_2 \bigr) \cdot \|\widehat{\Upsilon}_{K_n} - \Upsilon_{K_n}\|_F \le \varepsilon/8 \cdot \gamma \cdot \sigma_{\min}(\Psi_\epsilon) \cdot \sigma_{\min}(R) \cdot \|\Phi_{K^*}\|_2^{-1} \cdot \|\Psi_\epsilon\|_F^{-1}, \tag{F.23}
\]
where the second inequality comes from (F.22), and the last inequality comes from (F.20) and (F.21). Combining (F.12) and (F.23), it holds with probability at least $1 - \varepsilon^{15}$ that
\[
\bigl| J_1(\widetilde{K}_{n+1}) - J_1(K_{n+1}) \bigr| \le \gamma \cdot \sigma_{\min}(\Psi_\epsilon) \cdot \sigma_{\min}(R) \cdot \|\Phi_{K^*}\|_2^{-1} \cdot \varepsilon/4,
\]
which concludes the proof of the lemma.

F.4 Proof of Lemma D.4

Proof. Note that $\Upsilon^{22}_{K^*} K^* - \Upsilon^{21}_{K^*}$ is the natural gradient of $J_1$ at the minimizer $K^*$, which implies that
\[
\Upsilon^{22}_{K^*} K^* - \Upsilon^{21}_{K^*} = 0. \tag{F.24}
\]
By Lemma D.1, it holds that
\[
J_1(K) - J_1(K^*) = \mathrm{tr}(P_K \Psi_\epsilon - P_{K^*}\Psi_\epsilon) = \mathbb{E}_{y \sim N(0, \Psi_\epsilon)}(y^\top P_K y - y^\top P_{K^*} y) = \mathbb{E}\Bigl[ \sum_{t \ge 0} \bigl( 2y_t^\top (K - K^*)^\top (\Upsilon^{22}_{K^*} K^* - \Upsilon^{21}_{K^*}) y_t + y_t^\top (K - K^*)^\top \Upsilon^{22}_{K^*}(K - K^*) y_t \bigr) \Bigr] = \mathbb{E}\Bigl[ \sum_{t \ge 0} y_t^\top (K - K^*)^\top \Upsilon^{22}_{K^*}(K - K^*) y_t \Bigr], \tag{F.25}
\]
where we use (F.24) in the last equality. Here the expectations are taken following the transition $y_{t+1} = (A - BK)y_t$ with initial state $y_0 \sim N(0, \Psi_\epsilon)$. Also, we have
\[
\mathbb{E}\Bigl[ \sum_{t \ge 0} y_t^\top (K - K^*)^\top \Upsilon^{22}_{K^*}(K - K^*) y_t \Bigr] = \mathrm{tr}\bigl( \Phi_K(K - K^*)^\top \Upsilon^{22}_{K^*}(K - K^*) \bigr) \ge \sigma_{\min}(\Phi_K) \cdot \sigma_{\min}(\Upsilon^{22}_{K^*}) \cdot \mathrm{tr}\bigl( (K - K^*)^\top (K - K^*) \bigr) \ge \sigma_{\min}(\Psi_\epsilon) \cdot \sigma_{\min}(R) \cdot \|K - K^*\|_F^2, \tag{F.26}
\]
where we use the facts that $\Phi_K = (A - BK)\Phi_K(A - BK)^\top + \Psi_\epsilon \succeq \Psi_\epsilon$ and $\Upsilon^{22}_{K^*} = R + B^\top P_{K^*} B \succeq R$ in the last line. Combining (F.25) and (F.26), we have
\[
J_1(K) - J_1(K^*) \ge \sigma_{\min}(\Psi_\epsilon) \cdot \sigma_{\min}(R) \cdot \|K - K^*\|_F^2.
\]
We conclude the proof of the lemma.

F.5 Proof of Lemma D.5

Proof.
Following from Proposition 3.3, we have
\[
J_2(K_N, b_{h+1}) - J_2(K_N, \widetilde b_{h+1}) \le \gamma_b\cdot\nabla_bJ_2(K_N, \widetilde b_{h+1})^\top\bigl(\nabla_bJ_2(K_N, b_h) - \widehat\nabla_bJ_2(K_N, b_h)\bigr) + (\gamma_b)^2\cdot\nu_{K_N}/2\cdot\bigl\|\nabla_bJ_2(K_N, b_h) - \widehat\nabla_bJ_2(K_N, b_h)\bigr\|_2^2,
\]
\[
J_2(K_N, \widetilde b_{h+1}) - J_2(K_N, b_{h+1}) \le -\gamma_b\cdot\nabla_bJ_2(K_N, \widetilde b_{h+1})^\top\bigl(\nabla_bJ_2(K_N, b_h) - \widehat\nabla_bJ_2(K_N, b_h)\bigr) - (\gamma_b)^2\cdot\iota_{K_N}/2\cdot\bigl\|\nabla_bJ_2(K_N, b_h) - \widehat\nabla_bJ_2(K_N, b_h)\bigr\|_2^2, \tag{F.27}
\]
where $\nu_{K_N}$ and $\iota_{K_N}$ are defined in Proposition 3.3. Also, following from Proposition B.3, it holds that
\[
\bigl\|\nabla_bJ_2(K_N, \widetilde b_{h+1})\bigr\|_2 \le \mathrm{poly}_1\bigl(\|K_N\|_F, \|b_h\|_2, \|\mu\|_2, J(K_N, b_0)\bigr)\cdot\bigl(1 - \rho(A - BK_N)\bigr)^{-1}. \tag{F.28}
\]
Combining (F.27), (F.28), and the fact that $\nu_{K_N} \le \iota_{K_N} \le [1 - \rho(A - BK_N)]^{-2}\cdot\mathrm{poly}_2(\|K_N\|_2)$, we know that
\[
\bigl|J_2(K_N, b_{h+1}) - J_2(K_N, \widetilde b_{h+1})\bigr| \le (\gamma_b)^2\cdot\mathrm{poly}_2\bigl(\|K_N\|_2\bigr)\cdot\bigl\|\nabla_bJ_2(K_N, b_h) - \widehat\nabla_bJ_2(K_N, b_h)\bigr\|_2^2\cdot\bigl(1 - \rho(A - BK_N)\bigr)^{-2} + \gamma_b\cdot\mathrm{poly}_1\bigl(\|K_N\|_F, \|b_h\|_2, \|\mu\|_2, J(K_N, b_0)\bigr)\cdot\bigl\|\nabla_bJ_2(K_N, b_h) - \widehat\nabla_bJ_2(K_N, b_h)\bigr\|_2\cdot\bigl(1 - \rho(A - BK_N)\bigr)^{-1}. \tag{F.29}
\]
Note that from the definitions of $\widehat\nabla_bJ_2(K_N, b_h)$ and $\nabla_bJ_2(K_N, b_h)$ in (D.31) and (D.33), respectively, it holds by the triangle inequality that
\[
\bigl\|\nabla_bJ_2(K_N, b_h) - \widehat\nabla_bJ_2(K_N, b_h)\bigr\|_2 \le \|\widehat\Upsilon^{22}_{K_N} - \Upsilon^{22}_{K_N}\|_2\cdot\|K_N\|_2\cdot\|\widehat\mu_{K_N,b_h}\|_2 + \|\Upsilon^{22}_{K_N}\|_2\cdot\|K_N\|_2\cdot\|\widehat\mu_{K_N,b_h} - \mu_{K_N,b_h}\|_2 + \|\widehat\Upsilon^{22}_{K_N} - \Upsilon^{22}_{K_N}\|_2\cdot\|b_h\|_2 + \|\widehat\Upsilon^{21}_{K_N} - \Upsilon^{21}_{K_N}\|_2\cdot\|\widehat\mu_{K_N,b_h}\|_2 + \|\Upsilon^{21}_{K_N}\|_2\cdot\|\widehat\mu_{K_N,b_h} - \mu_{K_N,b_h}\|_2 + \|\widehat q_{K_N,b_h} - q_{K_N,b_h}\|_2.
\]
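The two inequalities in (F.27) are smoothness (descent-lemma) bounds comparing the exact update $\widetilde b_{h+1} = b_h - \gamma_b\nabla_bJ_2$ with the inexact update $b_{h+1} = b_h - \gamma_b\widehat\nabla_bJ_2$. The mechanism is easy to check numerically for a quadratic objective standing in for $J_2(K_N,\cdot)$; the following is a minimal sketch, with all matrices and the gradient error chosen arbitrarily for illustration:

```python
import numpy as np

# Quadratic stand-in for J_2(K_N, .): J(b) = 0.5 b'Hb + g'b,
# whose curvature is bounded between nu and iota.
H = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -0.5])
J = lambda b: 0.5 * b @ H @ b + g @ b
grad = lambda b: H @ b + g

nu, iota = np.linalg.eigvalsh(H)      # smallest / largest curvature
gamma = 0.1
b_h = np.array([0.5, 0.5])
err = np.array([0.02, -0.01])          # gradient estimation error: grad - grad_hat

b_exact = b_h - gamma * grad(b_h)               # step with the true gradient
b_inexact = b_h - gamma * (grad(b_h) - err)     # step with the estimated gradient

# Smoothness bound in the spirit of (F.27): the cost gap between the two updates
# is controlled by a first-order term in the error plus a curvature * ||err||^2 term.
gap = J(b_inexact) - J(b_exact)
bound = gamma * abs(grad(b_exact) @ err) + gamma**2 * iota / 2 * (err @ err)
assert gap <= bound + 1e-12
```

Since $b_{h+1} = \widetilde b_{h+1} + \gamma_b\cdot\mathrm{err}$ exactly, a Taylor expansion of the quadratic around $\widetilde b_{h+1}$ reproduces the first-order and second-order terms appearing in (F.27).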
By Theorem B.8, combining the facts that $J_2(K_N, b_h) \le J_2(K_N, b_0)$ and $\|\mu_{K_N,b}\|_2 \le J(K_N, b_0)/\sigma_{\min}(Q)$, we know that with probability at least $1 - (T^b_n)^{-4} - (\widetilde T^b_n)^{-6}$, it holds for any $\rho \in (\rho(A - BK_N), 1)$ that
\[
\bigl\|\nabla_bJ_2(K_N, b_h) - \widehat\nabla_bJ_2(K_N, b_h)\bigr\|_2 \le \lambda_{K_N}^{-1}\cdot\mathrm{poly}_3\bigl(\|K_N\|_F, \|b_h\|_2, \|\mu\|_2, J_2(K_N, b_0)\bigr)\cdot\Bigl(\frac{\log^3 T^b_n}{(T^b_n)^{1/4}(1-\rho)^2} + \frac{\log^{1/2}\widetilde T^b_n}{(\widetilde T^b_n)^{1/8}\cdot(1-\rho)}\Bigr). \tag{F.30}
\]
Following from the choices of $\gamma_b$, $T^b_n$, and $\widetilde T^b_n$ in the statement of Theorem B.4, it holds that
\[
\gamma_b\cdot\mathrm{poly}_1\bigl(\|K_N\|_F, \|b_h\|_2, \|\mu\|_2, J(K_N, b_0)\bigr)\cdot\lambda_{K_N}^{-1}\cdot\mathrm{poly}_3\bigl(\|K_N\|_F, \|b_h\|_2, \|\mu\|_2, J_2(K_N, b_0)\bigr)\cdot\Bigl(\frac{\log^3 T^b_n}{(T^b_n)^{1/4}(1-\rho)^2} + \frac{\log^{1/2}\widetilde T^b_n}{(\widetilde T^b_n)^{1/8}\cdot(1-\rho)}\Bigr)\cdot\bigl(1 - \rho(A - BK_N)\bigr)^{-1}
+ \bigl(1 - \rho(A - BK_N)\bigr)^{-2}\cdot\mathrm{poly}_3\bigl(\|K_N\|_F, \|b_h\|_2, \|\mu\|_2, J_2(K_N, b_0)\bigr)\cdot\Bigl(\frac{\log^6 T^b_n}{(T^b_n)^{1/2}(1-\rho)^4} + \frac{\log\widetilde T^b_n}{(\widetilde T^b_n)^{1/4}\cdot(1-\rho)^2}\Bigr)\cdot(\gamma_b)^2\cdot\mathrm{poly}_2\bigl(\|K_N\|_2\bigr)\cdot\lambda_{K_N}^{-1} \le \nu_{K_N}\cdot\gamma_b\cdot\varepsilon/2.
\]
Further combining (F.29) and (F.30), it holds with probability at least $1 - \varepsilon^{15}$ that
\[
\bigl|J_2(K_N, b_{h+1}) - J_2(K_N, \widetilde b_{h+1})\bigr| \le \nu_{K_N}\cdot\gamma_b\cdot\varepsilon/2.
\]
We then finish the proof of the lemma.

F.6 Proof of Lemma D.6

Proof. We show that $\zeta_{K,b} \in \mathcal V_\zeta$ and $\xi(\zeta) \in \mathcal V_\xi$ for any $\zeta \in \mathcal V_\zeta$ separately.

Part 1. First we show that $\zeta_{K,b} \in \mathcal V_\zeta$. Note that from Definition B.7, we know that $\zeta^1_{K,b} = J(K, b)$ satisfies $0 \le \zeta^1_{K,b} \le J(K_0, b_0)$. It remains to show that $\zeta^2_{K,b} = \alpha_{K,b}$ satisfies $\|\zeta^2_{K,b}\|_2 \le M_\zeta$.
By the definition of $\alpha_{K,b}$ in (B.6), we know that
\[
\|\alpha_{K,b}\|_2^2 \le \|\Upsilon_K\|_F^2 + \|\Upsilon_K\|_2^2\cdot\bigl(\|\mu_{K,b}\|_2^2 + \|\mu^u_{K,b}\|_2^2\bigr) + \bigl(\|A\|_2 + \|B\|_2\bigr)^2\cdot\bigl(\|P_K\|_2\cdot\|A\mu + d\|_2 + \|f_{K,b}\|_2\bigr)^2, \tag{F.31}
\]
where $f_{K,b} = (I - A + BK)^{-\top}\bigl[(A - BK)^\top P_K(Bb + A\mu + d) - K^\top Rb\bigr]$ and, for notational simplicity, we denote $\mu^u_{K,b} = -K\mu_{K,b} + b$. We only need to bound $\Upsilon_K$, $\mu_{K,b}$, $\mu^u_{K,b}$, $P_K$, and $f_{K,b}$.

Note that by Proposition B.2, the expected total cost $J(K, b)$ takes the form
\[
J(K, b) = \mathrm{tr}(P_K\Psi_\epsilon) + \mu_{K,b}^\top Q\mu_{K,b} + (\mu^u_{K,b})^\top R\mu^u_{K,b} + \sigma^2\cdot\mathrm{tr}(R) + \mu^\top Q\mu.
\]
Thus, we have
\[
J(K_0, b_0) \ge J(K, b) \ge \sigma_{\min}(\Psi_\omega)\cdot\mathrm{tr}(P_K) \ge \sigma_{\min}(\Psi_\omega)\cdot\|P_K\|_2,\qquad
J(K_0, b_0) \ge J(K, b) \ge \mu_{K,b}^\top Q\mu_{K,b} \ge \sigma_{\min}(Q)\cdot\|\mu_{K,b}\|_2,\qquad
J(K_0, b_0) \ge J(K, b) \ge (\mu^u_{K,b})^\top R\mu^u_{K,b} \ge \sigma_{\min}(R)\cdot\|\mu^u_{K,b}\|_2,
\]
which imply that
\[
\|P_K\|_2 \le J(K_0, b_0)/\sigma_{\min}(\Psi_\omega),\qquad \|\mu_{K,b}\|_2 \le J(K_0, b_0)/\sigma_{\min}(Q),\qquad \|\mu^u_{K,b}\|_2 \le J(K_0, b_0)/\sigma_{\min}(R). \tag{F.32}
\]
For $\Upsilon_K$, it holds that
\[
\Upsilon_K = \begin{pmatrix} Q & 0 \\ 0 & R \end{pmatrix} + \begin{pmatrix} A^\top \\ B^\top \end{pmatrix} P_K \begin{pmatrix} A & B \end{pmatrix},
\]
which gives
\[
\|\Upsilon_K\|_F \le \bigl(\|Q\|_F + \|R\|_F\bigr) + \bigl(\|A\|_F^2 + \|B\|_F^2\bigr)\cdot\|P_K\|_F,\qquad
\|\Upsilon_K\|_2 \le \bigl(\|Q\|_2 + \|R\|_2\bigr) + \bigl(\|A\|_2 + \|B\|_2\bigr)^2\cdot\|P_K\|_2.
\]
Combining (F.32) and the fact that $\|P_K\|_F \le \sqrt m\cdot\|P_K\|_2$, we know that
\[
\|\Upsilon_K\|_F \le \bigl(\|Q\|_F + \|R\|_F\bigr) + \bigl(\|A\|_F^2 + \|B\|_F^2\bigr)\cdot\sqrt m\cdot J(K_0, b_0)/\sigma_{\min}(\Psi_\omega),\qquad
\|\Upsilon_K\|_2 \le \bigl(\|Q\|_2 + \|R\|_2\bigr) + \bigl(\|A\|_2 + \|B\|_2\bigr)^2\cdot J(K_0, b_0)/\sigma_{\min}(\Psi_\omega). \tag{F.33}
\]
Now we upper bound the vector $f_{K,b}$. Note that by algebra, the vector $f_{K,b}$ takes the form
\[
f_{K,b} = -P_K\mu_{K,b} + (I - A + BK)^{-\top}\bigl(Q\mu_{K,b} - K^\top R\mu^u_{K,b}\bigr).
\]
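The bounds feeding into (F.32) all instantiate the same elementary fact: for positive semidefinite $P$ and positive definite $\Psi$, $\mathrm{tr}(P\Psi) \ge \sigma_{\min}(\Psi)\cdot\mathrm{tr}(P) \ge \sigma_{\min}(\Psi)\cdot\|P\|_2$. A quick numerical sanity check (a sketch with arbitrary random matrices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random positive semidefinite P and positive definite Psi.
M = rng.standard_normal((4, 4))
P = M @ M.T
N = rng.standard_normal((4, 4))
Psi = N @ N.T + 0.5 * np.eye(4)

sigma_min = np.linalg.eigvalsh(Psi)[0]       # smallest eigenvalue of Psi
spec_norm_P = np.linalg.eigvalsh(P)[-1]      # ||P||_2 for symmetric PSD P

# tr(P Psi) >= sigma_min(Psi) tr(P): follows from tr(P Psi) = tr(P^{1/2} Psi P^{1/2})
# and Psi >= sigma_min(Psi) I.
assert np.trace(P @ Psi) >= sigma_min * np.trace(P) - 1e-10
# tr(P) >= ||P||_2 since all eigenvalues of P are nonnegative.
assert np.trace(P) >= spec_norm_P - 1e-10
```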
Therefore, we upper bound $f_{K,b}$ as
\[
\|f_{K,b}\|_2 \le J(K_0, b_0)^2\cdot\sigma_{\min}^{-1}(\Psi_\omega)\cdot\sigma_{\min}^{-1}(Q) + \bigl(1 - \rho(A - BK)\bigr)^{-1}\cdot\bigl(\kappa_Q + \kappa_R\cdot\|K\|_F\bigr). \tag{F.34}
\]
Combining (F.31), (F.32), (F.33), and (F.34), it holds that
\[
\|\zeta^2_{K,b}\|_2 = \|\alpha_{K,b}\|_2 \le M_{\zeta,1} + M_{\zeta,2}\cdot\bigl(1 + \|K\|_F\bigr)\cdot\bigl[1 - \rho(A - BK)\bigr]^{-1}.
\]
Therefore, it holds that $\zeta_{K,b} \in \mathcal V_\zeta$.

Part 2. Now we show that for any $\zeta \in \mathcal V_\zeta$, we have $\xi(\zeta) \in \mathcal V_\xi$. Recall from (D.41) that
\[
\xi^1(\zeta) = \zeta^1 - J(K, b),\qquad \xi^2(\zeta) = \mathbb E_{\pi_{K,b}}\bigl[\psi(x, u)\bigr]\cdot\zeta^1 + \Theta_{K,b}\,\zeta^2 - \mathbb E_{\pi_{K,b}}\bigl[c(x, u)\psi(x, u)\bigr]. \tag{F.35}
\]
Then we have
\[
\bigl|\xi^1(\zeta)\bigr| = \bigl|\zeta^1 - J(K, b)\bigr| \le J(K_0, b_0), \tag{F.36}
\]
where we use the fact that, since $\zeta \in \mathcal V_\zeta$, we have $0 \le \zeta^1 \le J(K_0, b_0)$ by Definition B.7. Also, by (F.35), we have
\[
\bigl\|\xi^2(\zeta)\bigr\|_2 \le \underbrace{\bigl\|\mathbb E_{\pi_{K,b}}[\psi(x, u)]\,\zeta^1\bigr\|_2}_{B_1} + \underbrace{\|\Theta_{K,b}\|_2\cdot\|\zeta^2\|_2}_{B_2} + \underbrace{\bigl\|\mathbb E_{\pi_{K,b}}[c(x, u)\psi(x, u)]\bigr\|_2}_{B_3}. \tag{F.37}
\]
Note that we upper bound $B_1$ as
\[
B_1 \le J(K_0, b_0)\cdot\bigl\|\mathbb E_{\pi_{K,b}}[\psi(x, u)]\bigr\|_2. \tag{F.38}
\]
Following from the definition of $\psi(x, u)$ in (B.5), we know that
\[
\bigl\|\mathbb E_{\pi_{K,b}}[\psi(x, u)]\bigr\|_2 \le \|\Sigma_z\|_F, \tag{F.39}
\]
where $\Sigma_z$ is defined as
\[
\Sigma_z = \mathrm{Cov}\begin{pmatrix} x \\ u \end{pmatrix} = \begin{pmatrix} \Phi_K & -\Phi_KK^\top \\ -K\Phi_K & K\Phi_KK^\top + \sigma^2I \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & \sigma^2I \end{pmatrix} + \begin{pmatrix} I \\ -K \end{pmatrix}\Phi_K\begin{pmatrix} I \\ -K \end{pmatrix}^\top.
\]
Combining (F.38) and (F.39), we have
\[
B_1 \le J(K_0, b_0)\cdot\|\Sigma_z\|_F. \tag{F.40}
\]
By Proposition B.6, we upper bound $B_2$ as
\[
B_2 \le 4\bigl(1 + \|K\|_F^2\bigr)^3\cdot\|\Phi_K\|_2^2\cdot\bigl(M_{\zeta,1} + M_{\zeta,2}\bigr)\cdot\bigl(1 - \rho(A - BK)\bigr)^{-1}, \tag{F.41}
\]
where we use the fact that $\zeta \in \mathcal V_\zeta$ and Definition B.7. As for the term $B_3$ in (F.37), we utilize the following lemma to provide an upper bound.

Lemma F.2. The vector $\mathbb E_{\pi_{K,b}}[c(x, u)\psi(x, u)]$ takes the following form,
\[
\mathbb E_{\pi_{K,b}}\bigl[c(x, u)\psi(x, u)\bigr] = \begin{pmatrix} 2\,\mathrm{svec}\bigl(\Sigma_z\,\mathrm{diag}(Q, R)\,\Sigma_z + \bigl\langle\Sigma_z, \mathrm{diag}(Q, R)\bigr\rangle\,\Sigma_z\bigr) \\[2pt] \Sigma_z\begin{pmatrix} 2Q\mu_{K,b} \\ 2R\mu^u_{K,b} \end{pmatrix} \end{pmatrix}
\]
   +  µ ⊤ K,b Qµ K,b + ( µ u K,b ) ⊤ Rµ u K,b + µ ⊤ Qµ     sv ec(Σ z ) 0 m 0 k    . Here the matrix Σ z tak es the form of Σ z = Φ K − Φ K K ⊤ − K Φ K K Φ K K ⊤ + σ 2 · I ! . Pr o of. See § F.11 for a detailed pro o f. F rom Lemma F.2 a nd ( F.3 2 ), it holds that B 3 ≤ 3  k Q k F + k R k F + J ( K 0 , b 0 ) · k Q k 2 /σ min ( Q ) (F.42) + J ( K 0 , b 0 ) · k R k 2 /σ min ( R )  · k Σ z k 2 2 . Moreo ve r, by the definition of Σ z in ( E.25 ), com bining the triangle inequalit y , w e hav e the follo wing b ounds for k Σ z k F and k Σ z k 2 , k Σ z k F ≤ 2( d + k K k 2 F ) · k Φ K k 2 , k Σ z k 2 ≤ 2(1 + k K k 2 F ) · k Φ K k 2 . (F.43) Also, w e ha v e J ( K 0 , b 0 ) ≥ J ( K , b ) ≥ tr  ( Q + K ⊤ RK )Φ K  ≥ k Φ K k 2 · σ min ( Q ) , 71 whic h give s the upp er b ound for Φ K as follo ws, k Φ K k 2 ≤ J ( K 0 , b 0 ) /σ min ( Q ) . (F.44) Therefore, com bining ( F.37 ), ( F.40 ), ( F .41 ), ( F.4 2 ), ( F .43 ), and ( F.44 ), w e kno w that   ξ 2 ( ζ )   2 ≤ C · ( M ζ , 1 + M ζ , 2 ) · J ( K 0 , b 0 ) 2 /σ 2 min ( Q ) (F.45) ·  1 + k K k 2 F  3 ·  1 − ρ ( A − B K )  − 1 . By ( F.36 ) and ( F.45 ), we kno w that ξ ( ζ ) ∈ V ξ for any ζ ∈ V ζ . W e conclude the pro of of the lemma. F.7 Pro of of Lemma D.7 Pr o of. Assume that e z 0 ∼ N ( µ † , Σ † ). F ollo wing from t he fact that e z t +1 = L e z t + ν + δ t , it holds t ha t e z t ∼ N L t µ † + t − 1 X i =0 L i · ν, ( L ⊤ ) t Σ † L t + t − 1 X i =0 ( L ⊤ ) i Ψ δ L i ! , ( F .46) where Ψ δ = Ψ ω K Ψ ω K Ψ ω K Ψ ω K ⊤ + σ 2 I ! . F rom ( D.47 ), w e kno w that µ z tak es the form of µ z = ( I − L ) − 1 ν = ∞ X j =0 L j ν. (F.47) Therefore, com bining ( F.46 ) and ( F.4 7 ), we hav e E ( b µ z ) = µ z + 1 e T e T X t =1 L t µ † − 1 e T e T X t =1 ∞ X i = t L i ν. (F.48) W e denote by µ e T = e T X t =1 L t µ † − e T X t =1 ∞ X i = t L i ν. 
Meanwhile, it holds that
\[
\Bigl\|\sum_{t=1}^{\widetilde T}L^t\mu^\dagger - \sum_{t=1}^{\widetilde T}\sum_{i=t}^\infty L^i\nu\Bigr\|_2 \le \sum_{t=1}^{\widetilde T}\rho(L)^t\cdot\|\mu^\dagger\|_2 + \sum_{t=1}^{\widetilde T}\sum_{i=t}^\infty\rho(L)^i\cdot\|\nu\|_2 \le \bigl(1 - \rho(L)\bigr)^{-1}\cdot\|\mu^\dagger\|_2 + \bigl(1 - \rho(L)\bigr)^{-2}\cdot\|\nu\|_2 \le M_\mu\cdot(1-\rho)^{-2}\cdot\|\mu_z\|_2, \tag{F.49}
\]
where $M_\mu$ is a positive absolute constant. For the covariance, note that for any random variables $X \sim N(\mu_1, \Sigma_1)$ and $Y \sim N(\mu_2, \Sigma_2)$, we know that $Z = X + Y \sim N(\mu_1 + \mu_2, \Sigma)$, where $\|\Sigma\|_F \le 2\|\Sigma_1\|_F + 2\|\Sigma_2\|_F$. Combining (F.46), we know that $\widehat\mu_z \sim N(\mathbb E\widehat\mu_z, \widetilde\Sigma_{\widetilde T}/\widetilde T)$, where $\widetilde\Sigma_{\widetilde T}$ satisfies
\[
\widetilde T/2\cdot\|\widetilde\Sigma_{\widetilde T}\|_F \le \sum_{t=1}^{\widetilde T}\rho(L)^{2t}\cdot\|\Sigma^\dagger\|_F + \sum_{t=1}^{\widetilde T}\sum_{i=0}^{t-1}\rho(L)^{2i}\cdot\|\Psi_\delta\|_F \le \bigl(1 - \rho(L)^2\bigr)^{-1}\cdot\|\Sigma^\dagger\|_F + \widetilde T\cdot\bigl(1 - \rho(L)^2\bigr)^{-1}\cdot\|\Psi_\delta\|_F,
\]
which implies that
\[
\|\widetilde\Sigma_{\widetilde T}\|_F \le M_\Sigma\cdot(1-\rho)^{-1}\cdot\|\Sigma_z\|_F, \tag{F.50}
\]
where $M_\Sigma$ is a positive absolute constant. Combining (F.48), (F.49), and (F.50), we obtain that
\[
\widehat\mu_z \sim N\Bigl(\mu_z + \frac1{\widetilde T}\mu_{\widetilde T},\ \frac1{\widetilde T}\widetilde\Sigma_{\widetilde T}\Bigr),
\]
where $\|\mu_{\widetilde T}\|_2 \le M_\mu\cdot(1-\rho)^{-2}\cdot\|\mu_z\|_2$ and $\|\widetilde\Sigma_{\widetilde T}\|_F \le M_\Sigma\cdot(1-\rho)^{-1}\cdot\|\Sigma_z\|_F$. Moreover, by the Gaussian tail inequality, it holds with probability at least $1 - \widetilde T^{-6}$ that
\[
\|\widehat\mu_z - \mu_z\|_2 \le \frac{\log\widetilde T}{\widetilde T^{1/4}}\cdot(1-\rho)^{-2}\cdot\mathrm{poly}\bigl(\|\Phi_K\|_2, \|K\|_F, \|b\|_2, \|\mu\|_2\bigr).
\]
Then we finish the proof of the lemma.

F.8 Proof of Lemma D.8

Proof. We continue using the notation given in § D.3. We define
\[
\widehat F(\zeta, \xi) = \Bigl\{\mathbb E(\widehat\psi)\,\zeta^1 + \mathbb E\bigl[(\widehat\psi - \widehat\psi')\widehat\psi^\top\bigr]\zeta^2 - \mathbb E(c\,\widehat\psi)\Bigr\}^\top\xi^2 + \bigl(\zeta^1 - \mathbb E(c)\bigr)\cdot\xi^1 - 1/2\cdot\|\xi\|_2^2,
\]
where $\widehat\psi = \widehat\psi(x, u)$ is the estimated feature vector. Here the expectation is only taken over the trajectory generated by the state transition and the policy $\pi_{K,b}$, conditioning on the randomness induced when calculating the estimated feature vectors. Thus, the function $\widehat F(\zeta, \xi)$ is still random, where the randomness comes from the estimated feature vectors. Note that
\[
|F(\zeta, \xi) - \widetilde F(\zeta, \xi)| \le |F(\zeta, \xi) - \widehat F(\zeta, \xi)| + |\widehat F(\zeta, \xi) - \widetilde F(\zeta, \xi)|.
\]
Thus, we only need to upper bound $|F(\zeta, \xi) - \widehat F(\zeta, \xi)|$ and $|\widehat F(\zeta, \xi) - \widetilde F(\zeta, \xi)|$.

Part 1. First we upper bound $|F(\zeta, \xi) - \widehat F(\zeta, \xi)|$. Note that by algebra, we have
\[
\bigl|F(\zeta, \xi) - \widehat F(\zeta, \xi)\bigr| = \Bigl|\Bigl\{\mathbb E(\psi - \widehat\psi)\,\zeta^1 + \mathbb E\bigl[(\psi - \psi')\psi^\top - (\widehat\psi - \widehat\psi')\widehat\psi^\top\bigr]\zeta^2 - \mathbb E\bigl[c(\psi - \widehat\psi)\bigr]\Bigr\}^\top\xi^2\Bigr|
\le \mathbb E\bigl(\|\psi - \widehat\psi\|_2\bigr)\cdot\Bigl[|\zeta^1| + \mathbb E\bigl(\|\psi - \psi'\|_2 + 2\|\widehat\psi\|_2\bigr)\cdot\|\zeta^2\|_2 + \mathbb E(c)\Bigr]\cdot\|\xi^2\|_2, \tag{F.51}
\]
where the expectation is only taken over the trajectory generated by the state transition and the policy $\pi_{K,b}$. From Lemma D.7, it holds that
\[
\mathbb P\bigl(\|\widehat\mu_z - \mu_z + 1/\widetilde T\cdot\mu_{\widetilde T}\|_2 \le C_1\bigr) \ge 1 - \widetilde T^{-6}. \tag{F.52}
\]
Therefore, combining (F.52), it holds with probability at least $1 - \widetilde T^{-6}$ that
\[
\mathbb E\bigl(\|\psi - \psi'\|_2 + 2\|\widehat\psi\|_2\bigr) \le \mathrm{poly}\bigl(\|\Phi_K\|_2, \|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0)\bigr), \tag{F.53}
\]
where the expectation is conditioned on the randomness induced when calculating the estimated feature vectors. Also, we know that
\[
\mathbb E(c) \le \mathrm{poly}\bigl(\|\Phi_K\|_2, \|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0)\bigr). \tag{F.54}
\]
Therefore, combining (F.51), (F.53), (F.54), and Definition B.7, it holds with probability at least $1 - \widetilde T^{-6}$ that
\[
\bigl|F(\zeta, \xi) - \widehat F(\zeta, \xi)\bigr| \le \mathbb E\bigl(\|\psi - \widehat\psi\|_2\bigr)\cdot\mathrm{poly}\bigl(\|\Phi_K\|_2, \|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0)\bigr). \tag{F.55}
\]
Following from the definitions of $\psi(x, u)$ in (B.5) and $\widehat\psi(x, u)$ in (B.14), we upper bound $\|\psi(x, u) - \widehat\psi(x, u)\|_2$ for any $x$ and $u$ as
\[
\|\psi(x, u) - \widehat\psi(x, u)\|_2^2 = \|\widehat\mu_z - \mu_z\|_2^2 + \bigl\|z(\widehat\mu_z - \mu_z)^\top + (\widehat\mu_z - \mu_z)z^\top\bigr\|_F^2 + \|\mu_z\mu_z^\top - \widehat\mu_z\widehat\mu_z^\top\|_F^2 \le \mathrm{poly}\bigl(\|\Phi_K\|_2, \|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0)\bigr)\cdot\|\widehat\mu_z - \mu_z\|_2^2, \tag{F.56}
\]
where $\mu_z$ is defined in (D.47), $\widehat\mu_z$ is defined in (D.48), and $z = (x^\top, u^\top)^\top$. Also, by Lemma D.7, we know that
\[
\|\widehat\mu_z - \mu_z\|_2 \le \frac{\log\widetilde T}{\widetilde T^{1/4}}\cdot(1-\rho)^{-2}\cdot\mathrm{poly}\bigl(\|\Phi_K\|_2, \|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0)\bigr), \tag{F.57}
\]
which holds with probability at least $1 - \widetilde T^{-6}$. Combining (F.55), (F.56), and (F.57), it holds with probability at least $1 - \widetilde T^{-6}$ that
\[
\bigl|F(\zeta, \xi) - \widehat F(\zeta, \xi)\bigr| \le \frac{\log\widetilde T}{\widetilde T^{1/4}}\cdot(1-\rho)^{-2}\cdot\mathrm{poly}\bigl(\|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0)\bigr). \tag{F.58}
\]
Part 2. We now upper bound $|\widehat F(\zeta, \xi) - \widetilde F(\zeta, \xi)|$ in the sequel. By definitions, we have
\[
\bigl|\widetilde F(\zeta, \xi) - \widehat F(\zeta, \xi)\bigr| = \Bigl|\Bigl\{\mathbb E(\widetilde\psi - \widehat\psi)\,\zeta^1 + \mathbb E\bigl[(\widetilde\psi - \widetilde\psi')\widetilde\psi^\top - (\widehat\psi - \widehat\psi')\widehat\psi^\top\bigr]\zeta^2 - \mathbb E(\widetilde c\,\widetilde\psi - \widehat c\,\widehat\psi)\Bigr\}^\top\xi^2 + \mathbb E(\widehat c - \widetilde c)\,\xi^1\Bigr|
\le \Bigl|\Bigl\{\mathbb E(\widehat\psi)\,\zeta^1 + \mathbb E(\widehat\psi\widehat\psi^\top)\zeta^2 - \mathbb E(\widehat c\,\widehat\psi)\Bigr\}^\top\xi^2 + \mathbb E(\widehat c)\,\xi^1\Bigr|\cdot\mathbf 1_{\mathcal E^c} + \Bigl|\bigl(\mathbb E(\widehat\psi'\widehat\psi^\top)\zeta^2\bigr)^\top\xi^2\Bigr|\cdot\mathbf 1_{(\mathcal E'\cap\mathcal E)^c}, \tag{F.59}
\]
where we define the event $\mathcal E'$ as
\[
\mathcal E' = \Bigl\{\bigcap_{t\in[T]}\Bigl\{\bigl|\|z_t' - \mu_z + 1/\widetilde T\cdot\mu_{\widetilde T}\|_2^2 - \mathrm{tr}(\widetilde\Sigma_z)\bigr| \le C_1\cdot\log T\cdot\|\widetilde\Sigma_z\|_2\Bigr\}\Bigr\}\bigcap\mathcal E_2,
\]
where $\mathcal E_2$ is defined in (D.55). Combining the fact that $\mathbb P(\mathcal E_2) \ge 1 - \widetilde T^{-6}$ and Lemma G.3, it holds that $\mathbb P(\mathcal E') \ge 1 - T^{-5} - \widetilde T^{-6}$. Following a similar argument as in Part 1, it holds from (F.59) that
\[
\bigl|\widetilde F(\zeta, \xi) - \widehat F(\zeta, \xi)\bigr| \le \Bigl(\frac1T + \frac1{\widetilde T^{1/4}}\Bigr)\cdot\mathrm{poly}\bigl(\|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0)\bigr) \tag{F.60}
\]
for sufficiently large $T$ and $\widetilde T$. Now, combining (F.58) and (F.60), by the triangle inequality, it holds with probability at least $1 - \widetilde T^{-6}$ that
\[
\bigl|F(\zeta, \xi) - \widetilde F(\zeta, \xi)\bigr| \le \Bigl(\frac1{2T} + \frac{\log\widetilde T}{\widetilde T^{1/4}}\Bigr)\cdot(1-\rho)^{-2}\cdot\mathrm{poly}\bigl(\|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0)\bigr).
\]
We finish the proof of the lemma.

F.9 Proof of Lemma E.2

Proof. Recall that the feature vector $\psi(x, u)$ takes the following form,
\[
\psi(x, u) = \begin{pmatrix} \mathrm{svec}\bigl((z - \mu_z)(z - \mu_z)^\top\bigr) \\ z - \mu_z \end{pmatrix}.
\]
We then have
\[
\psi(x, u) - \psi(x', u') = \begin{pmatrix} \mathrm{svec}\bigl(yy^\top - (Ly + \delta)(Ly + \delta)^\top\bigr) \\ y - (Ly + \delta) \end{pmatrix}, \tag{F.61}
\]
where we denote $y = z - \mu_z$, and $(x', u')$ is the state-action pair after $(x, u)$ following the state transition and the policy $\pi_{K,b}$. Therefore, for any symmetric matrices $M$, $N$ and any vectors $m$, $n$, it holds from (B.7) and (F.61) that
\[
\begin{pmatrix} \mathrm{svec}(M) \\ m \end{pmatrix}^\top\Theta_{K,b}\begin{pmatrix} \mathrm{svec}(N) \\ n \end{pmatrix} = \mathbb E_{y,\delta}\Biggl\{\begin{pmatrix} \mathrm{svec}(M) \\ m \end{pmatrix}^\top\begin{pmatrix} \mathrm{svec}(yy^\top) \\ y \end{pmatrix}\begin{pmatrix} \mathrm{svec}\bigl(yy^\top - (Ly + \delta)(Ly + \delta)^\top\bigr) \\ y - (Ly + \delta) \end{pmatrix}^\top\begin{pmatrix} \mathrm{svec}(N) \\ n \end{pmatrix}\Biggr\}
= \mathbb E_{y,\delta}\Bigl\{\bigl(\langle M, yy^\top\rangle + m^\top y\bigr)\cdot\bigl(\langle N, yy^\top - (Ly + \delta)(Ly + \delta)^\top\rangle + n^\top(y - Ly - \delta)\bigr)\Bigr\}
= \underbrace{\mathbb E_y\bigl[\langle yy^\top, M\rangle\cdot\langle yy^\top - Lyy^\top L^\top - \Psi_\delta, N\rangle\bigr]}_{A_1} + \underbrace{\mathbb E_y\bigl[\langle yy^\top, M\rangle\cdot n^\top(y - Ly)\bigr]}_{A_2} + \underbrace{\mathbb E_y\bigl[m^\top y\cdot\langle yy^\top - Lyy^\top L^\top - \Psi_\delta, N\rangle\bigr]}_{A_3} + \underbrace{\mathbb E_y\bigl[m^\top y\cdot n^\top(y - Ly)\bigr]}_{A_4}, \tag{F.62}
\]
where the expectations are taken over $y \sim N(0, \Sigma_z)$ and $\delta \sim N(0, \Psi_\delta)$. We evaluate the terms $A_1$, $A_2$, $A_3$, and $A_4$ in the sequel. For the terms $A_2$ and $A_3$ in (F.62), by the fact that $y = z - \mu_z \sim N(0, \Sigma_z)$, we know that these two terms vanish, since they involve only odd moments of a centered Gaussian. For $A_4$, it holds that
\[
A_4 = \mathbb E_y\bigl[m^\top y\cdot(y - Ly)^\top n\bigr] = \mathbb E_y\bigl[m^\top yy^\top(I - L)^\top n\bigr] = m^\top\Sigma_z(I - L)^\top n.
\]
(F.63)

For $A_1$, by algebra, we have
\[
A_1 = \mathbb E_y\bigl[\langle yy^\top, M\rangle\cdot\langle yy^\top - Lyy^\top L^\top - \Psi_\delta, N\rangle\bigr] = \mathbb E_y\bigl[\langle yy^\top, M\rangle\cdot\langle yy^\top - Lyy^\top L^\top, N\rangle\bigr] - \mathbb E_y\bigl[\langle yy^\top, M\rangle\bigr]\cdot\langle\Psi_\delta, N\rangle
= \mathbb E_y\bigl[y^\top My\cdot y^\top(N - L^\top NL)y\bigr] - \langle\Sigma_z, M\rangle\cdot\langle\Psi_\delta, N\rangle
= \mathbb E_{u\sim N(0,I)}\bigl[u^\top\Sigma_z^{1/2}M\Sigma_z^{1/2}u\cdot u^\top\Sigma_z^{1/2}(N - L^\top NL)\Sigma_z^{1/2}u\bigr] - \langle\Sigma_z, M\rangle\cdot\langle\Psi_\delta, N\rangle. \tag{F.64}
\]
Now, by applying Lemma G.1 to the first term on the RHS of (F.64), we know that
\[
A_1 = 2\,\mathrm{tr}\bigl(\Sigma_z^{1/2}M\Sigma_z^{1/2}\cdot\Sigma_z^{1/2}(N - L^\top NL)\Sigma_z^{1/2}\bigr) + \mathrm{tr}\bigl(\Sigma_z^{1/2}M\Sigma_z^{1/2}\bigr)\cdot\mathrm{tr}\bigl(\Sigma_z^{1/2}(N - L^\top NL)\Sigma_z^{1/2}\bigr) - \langle\Sigma_z, M\rangle\cdot\langle\Psi_\delta, N\rangle
= 2\bigl\langle M, \Sigma_z(N - L^\top NL)\Sigma_z\bigr\rangle + \langle\Sigma_z, M\rangle\cdot\bigl\langle\Sigma_z - L\Sigma_zL^\top - \Psi_\delta, N\bigr\rangle = 2\bigl\langle M, \Sigma_z(N - L^\top NL)\Sigma_z\bigr\rangle,
\]
where we use the fact that $\Sigma_z = L\Sigma_zL^\top + \Psi_\delta$ in the last equality. By using the property of the operator $\mathrm{svec}(\cdot)$ and the definition of the symmetric Kronecker product, we obtain that
\[
A_1 = 2\,\mathrm{svec}(M)^\top\mathrm{svec}\bigl(\Sigma_z(N - L^\top NL)\Sigma_z\bigr) = 2\,\mathrm{svec}(M)^\top\bigl(\Sigma_z\otimes_s\Sigma_z - (\Sigma_zL^\top)\otimes_s(\Sigma_zL^\top)\bigr)\mathrm{svec}(N) = 2\,\mathrm{svec}(M)^\top\bigl((\Sigma_z\otimes_s\Sigma_z)(I - L\otimes_sL)^\top\bigr)\mathrm{svec}(N). \tag{F.65}
\]
Combining (F.62), (F.63), and (F.65), we obtain that
\[
\begin{pmatrix} \mathrm{svec}(M) \\ m \end{pmatrix}^\top\Theta_{K,b}\begin{pmatrix} \mathrm{svec}(N) \\ n \end{pmatrix} = \mathrm{svec}(M)^\top\bigl(2(\Sigma_z\otimes_s\Sigma_z)(I - L\otimes_sL)^\top\bigr)\mathrm{svec}(N) + m^\top\Sigma_z(I - L)^\top n = \begin{pmatrix} \mathrm{svec}(M) \\ m \end{pmatrix}^\top\begin{pmatrix} 2(\Sigma_z\otimes_s\Sigma_z)(I - L\otimes_sL)^\top & 0 \\ 0 & \Sigma_z(I - L)^\top \end{pmatrix}\begin{pmatrix} \mathrm{svec}(N) \\ n \end{pmatrix}.
\]
Thus, the matrix $\Theta_{K,b}$ takes the following form,
\[
\Theta_{K,b} = \begin{pmatrix} 2(\Sigma_z\otimes_s\Sigma_z)(I - L\otimes_sL)^\top & 0 \\ 0 & \Sigma_z(I - L)^\top \end{pmatrix},
\]
which concludes the proof of the lemma.

F.10 Proof of Lemma E.3

Proof. From the definition of $\widetilde\Theta_{K,b}$ in (B.9), it holds that
\[
\|\widetilde\Theta_{K,b}^{-1}\|_2^2 \le 1 + \|\Theta_{K,b}^{-1}\|_2^2 + \|\Theta_{K,b}^{-1}\widetilde\sigma_z\|_2^2, \tag{F.66}
\]
where $\widetilde\sigma_z$ is defined as
\[
\widetilde\sigma_z = \mathbb E_{\pi_{K,b}}\bigl[\psi(x, u)\bigr] = \begin{pmatrix} \mathrm{svec}(\Sigma_z) \\ 0_{k+m} \end{pmatrix}.
\]
We bound the RHS of (F.66) in the sequel.
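The manipulations in the proofs of Lemmas E.2 and E.3 rely on the $\mathrm{svec}(\cdot)$ operator and the symmetric Kronecker product $\otimes_s$. Both are easy to implement directly from their definitions, and the identities used above can then be verified numerically. The sketch below is an ad hoc implementation (not from the paper); `svec` scales off-diagonal entries by $\sqrt2$ so that $\mathrm{svec}(M)^\top\mathrm{svec}(N) = \langle M, N\rangle$:

```python
import numpy as np
from itertools import combinations_with_replacement

def svec(S):
    """Stack the upper-triangular entries of symmetric S, off-diagonals scaled
    by sqrt(2), so that svec(M) @ svec(N) == <M, N> = tr(M N)."""
    n = S.shape[0]
    return np.array([S[i, j] * (1.0 if i == j else np.sqrt(2.0))
                     for i, j in combinations_with_replacement(range(n), 2)])

def skron(A, B):
    """Symmetric Kronecker product: (A o_s B) svec(S) = svec((B S A^T + A S B^T)/2)."""
    n = A.shape[0]
    cols = []
    for i, j in combinations_with_replacement(range(n), 2):
        E = np.zeros((n, n))
        E[i, j] = E[j, i] = 1.0 if i == j else 1.0 / np.sqrt(2.0)
        cols.append(svec(0.5 * (B @ E @ A.T + A @ E @ B.T)))
    return np.column_stack(cols)

rng = np.random.default_rng(0)
S = rng.standard_normal((3, 3)); Sigma = S @ S.T + np.eye(3)   # positive definite
L = 0.4 * rng.standard_normal((3, 3))
N = rng.standard_normal((3, 3)); N = N + N.T                    # symmetric

# Inner-product property of svec.
assert np.isclose(svec(Sigma) @ svec(N), np.trace(Sigma @ N))
# Identity behind (F.65): svec(Sigma (N - L^T N L) Sigma)
# equals (Sigma o_s Sigma - (Sigma L^T) o_s (Sigma L^T)) svec(N).
lhs = svec(Sigma @ (N - L.T @ N @ L) @ Sigma)
rhs = (skron(Sigma, Sigma) - skron(Sigma @ L.T, Sigma @ L.T)) @ svec(N)
assert np.allclose(lhs, rhs)
# Inverse property used in (F.67): (Sigma o_s Sigma)^{-1} = Sigma^{-1} o_s Sigma^{-1}.
assert np.allclose(np.linalg.inv(skron(Sigma, Sigma)),
                   skron(np.linalg.inv(Sigma), np.linalg.inv(Sigma)))
```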
For the term $\Theta_{K,b}^{-1}\widetilde\sigma_z$, combining Lemma E.2, we have
\[
\Theta_{K,b}^{-1}\widetilde\sigma_z = 1/2\cdot\begin{pmatrix} (I - L\otimes_sL)^{-\top}(\Sigma_z\otimes_s\Sigma_z)^{-1}\,\mathrm{svec}(\Sigma_z) \\ 0_{k+m} \end{pmatrix} = 1/2\cdot\begin{pmatrix} (I - L\otimes_sL)^{-\top}(\Sigma_z^{-1}\otimes_s\Sigma_z^{-1})\,\mathrm{svec}(\Sigma_z) \\ 0_{k+m} \end{pmatrix} = 1/2\cdot\begin{pmatrix} (I - L\otimes_sL)^{-\top}\,\mathrm{svec}(\Sigma_z^{-1}) \\ 0_{k+m} \end{pmatrix}, \tag{F.67}
\]
where we use the properties of the symmetric Kronecker product in the second and last equalities. By taking the spectral norm on both sides of (F.67), it holds that
\[
\|\Theta_{K,b}^{-1}\widetilde\sigma_z\|_2 = 1/2\cdot\bigl\|(I - L\otimes_sL)^{-\top}\,\mathrm{svec}(\Sigma_z^{-1})\bigr\|_2 \le 1/2\cdot\bigl\|(I - L\otimes_sL)^{-\top}\bigr\|_2\cdot\bigl\|\mathrm{svec}(\Sigma_z^{-1})\bigr\|_2 \le 1/2\cdot\bigl(1 - \rho^2(L)\bigr)^{-1}\cdot\|\Sigma_z^{-1}\|_F \le 1/2\cdot\sqrt{k + m}\cdot\bigl(1 - \rho^2(L)\bigr)^{-1}\cdot\|\Sigma_z^{-1}\|_2 = 1/2\cdot\sqrt{k + m}\cdot\bigl(1 - \rho^2(L)\bigr)^{-1}\cdot\sigma_{\min}^{-1}(\Sigma_z), \tag{F.68}
\]
where in the second inequality we apply Lemma G.2 to the matrix $L\otimes_sL$. Similarly, we upper bound $\|\Theta_{K,b}^{-1}\|_2$ as
\[
\|\Theta_{K,b}^{-1}\|_2 \le \min\Bigl\{1/2\cdot\bigl(1 - \rho^2(L)\bigr)^{-1}\sigma_{\min}^{-2}(\Sigma_z),\ \bigl(1 - \rho(L)\bigr)^{-1}\sigma_{\min}^{-1}(\Sigma_z)\Bigr\}. \tag{F.69}
\]
Thus, combining (F.66), (F.68), and (F.69), we obtain that
\[
\|\widetilde\Theta_{K,b}^{-1}\|_2^2 \le 1 + 1/2\cdot\sqrt{k + m}\cdot\bigl(1 - \rho^2(L)\bigr)^{-1}\cdot\sigma_{\min}^{-1}(\Sigma_z) + \min\Bigl\{1/2\cdot\bigl(1 - \rho^2(L)\bigr)^{-1}\sigma_{\min}^{-2}(\Sigma_z),\ \bigl(1 - \rho(L)\bigr)^{-1}\sigma_{\min}^{-1}(\Sigma_z)\Bigr\}. \tag{F.70}
\]
Now it remains to characterize $\sigma_{\min}(\Sigma_z)$. For any vectors $s \in \mathbb R^m$ and $r \in \mathbb R^k$, we have
\[
\begin{pmatrix} s \\ r \end{pmatrix}^\top\Sigma_z\begin{pmatrix} s \\ r \end{pmatrix} = \mathbb E_{x\sim N(\mu_{K,b},\Phi_K),\,u\sim\pi_{K,b}(\cdot\mid x)}\Bigl[\bigl(s^\top(x - \mu_{K,b}) + r^\top(u + K\mu_{K,b} - b)\bigr)^2\Bigr] = \mathbb E_{x\sim N(\mu_{K,b},\Phi_K),\,\eta\sim N(0,I)}\Bigl[\bigl((s - K^\top r)^\top(x - \mu_{K,b}) + \sigma r^\top\eta\bigr)^2\Bigr] = \mathbb E_{x\sim N(\mu_{K,b},\Phi_K)}\Bigl[\bigl((s - K^\top r)^\top(x - \mu_{K,b})\bigr)^2\Bigr] + \mathbb E_{\eta\sim N(0,I)}\bigl[(\sigma r^\top\eta)^2\bigr]. \tag{F.71}
\]
The first term on the RHS of (F.71) is lower bounded as
\[
\mathbb E_{x\sim N(\mu_{K,b},\Phi_K)}\Bigl[\bigl((s - K^\top r)^\top(x - \mu_{K,b})\bigr)^2\Bigr] = (s - K^\top r)^\top\Phi_K(s - K^\top r) \ge \|s - K^\top r\|_2^2\cdot\sigma_{\min}(\Phi_K) \ge \|s - K^\top r\|_2^2\cdot\sigma_{\min}(\Psi_\omega), \tag{F.72}
\]
where the last inequality comes from the fact that $\sigma_{\min}(\Phi_K) \ge \sigma_{\min}(\Psi_\omega)$ by (3.3). The second term on the RHS of (F.71) takes the form
\[
\mathbb E_{\eta\sim N(0,I)}\bigl[(\sigma r^\top\eta)^2\bigr] = \sigma^2\|r\|_2^2. \tag{F.73}
\]
Therefore, combining (F.71), (F.72), and (F.73), we have
\[
\begin{pmatrix} s \\ r \end{pmatrix}^\top\Sigma_z\begin{pmatrix} s \\ r \end{pmatrix} \ge \|s - K^\top r\|_2^2\cdot\sigma_{\min}(\Psi_\omega) + \sigma^2\|r\|_2^2 \ge \sigma_{\min}(\Psi_\omega)\cdot\|s\|_2^2 + \bigl(\sigma^2 - \|K\|_2^2\cdot\sigma_{\min}(\Psi_\omega)\bigr)\cdot\|r\|_2^2.
\]
From this, we know that
\[
\sigma_{\min}(\Sigma_z) \ge \min\bigl\{\sigma_{\min}(\Psi_\omega),\ \sigma^2 - \|K\|_2^2\cdot\sigma_{\min}(\Psi_\omega)\bigr\}. \tag{F.74}
\]
Thus, combining (F.70) and (F.74), we know that $\|\widetilde\Theta_{K,b}^{-1}\|_2$ is upper bounded by a constant $\widetilde\lambda_K$, where $\widetilde\lambda_K$ only depends on $\|K\|_2$ and $\rho(L) = \rho(A - BK)$. This finishes the proof of the lemma.

F.11 Proof of Lemma F.2

Proof. First, note that the cost function $c(x, u)$ takes the following form,
\[
c(x, u) = \psi(x, u)^\top\begin{pmatrix} \mathrm{svec}\bigl(\mathrm{diag}(Q, R)\bigr) \\ 2Q\mu_{K,b} \\ 2R\mu^u_{K,b} \end{pmatrix} + \bigl(\mu_{K,b}^\top Q\mu_{K,b} + (\mu^u_{K,b})^\top R\mu^u_{K,b} + \mu^\top Q\mu\bigr).
\]
For any matrix $V$ and vectors $v_x$, $v_u$, it holds that
\[
\mathbb E_{\pi_{K,b}}\bigl[c(x, u)\psi(x, u)\bigr]^\top\begin{pmatrix} \mathrm{svec}(V) \\ v_x \\ v_u \end{pmatrix} = \underbrace{\mathbb E_{\pi_{K,b}}\Biggl[\psi(x, u)^\top\begin{pmatrix} \mathrm{svec}(\mathrm{diag}(Q, R)) \\ 2Q\mu_{K,b} \\ 2R\mu^u_{K,b} \end{pmatrix}\cdot\psi(x, u)^\top\begin{pmatrix} \mathrm{svec}(V) \\ v_x \\ v_u \end{pmatrix}\Biggr]}_{D_1} + \underbrace{\mathbb E_{\pi_{K,b}}\Biggl[\psi(x, u)^\top\bigl(\mu_{K,b}^\top Q\mu_{K,b} + (\mu^u_{K,b})^\top R\mu^u_{K,b} + \mu^\top Q\mu\bigr)\begin{pmatrix} \mathrm{svec}(V) \\ v_x \\ v_u \end{pmatrix}\Biggr]}_{D_2}. \tag{F.75}
\]
In the sequel, we calculate $D_1$ and $D_2$ respectively.

Calculation of $D_1$. Note that by the definition of $\psi(x, u)$ in (B.5), it holds that
\[
D_1 = \mathbb E_{\pi_{K,b}}\Biggl[\Biggl((z - \mu_z)^\top\mathrm{diag}(Q, R)(z - \mu_z) + (z - \mu_z)^\top\begin{pmatrix} 2Q\mu_{K,b} \\ 2R\mu^u_{K,b} \end{pmatrix}\Biggr)\cdot\Biggl((z - \mu_z)^\top V(z - \mu_z) + (z - \mu_z)^\top\begin{pmatrix} v_x \\ v_u \end{pmatrix}\Biggr)\Biggr]
= \mathbb E_{\pi_{K,b}}\bigl[(z - \mu_z)^\top\mathrm{diag}(Q, R)(z - \mu_z)\cdot(z - \mu_z)^\top V(z - \mu_z)\bigr] + \mathbb E_{\pi_{K,b}}\Biggl[\begin{pmatrix} 2Q\mu_{K,b} \\ 2R\mu^u_{K,b} \end{pmatrix}^\top(z - \mu_z)(z - \mu_z)^\top\begin{pmatrix} v_x \\ v_u \end{pmatrix}\Biggr]. \tag{F.76}
\]
Here $z = (x^\top, u^\top)^\top$ and $\mu_z = \mathbb E_{\pi_{K,b}}(z)$. For the first term on the RHS of (F.76), note that $z - \mu_z \sim N(0, \Sigma_z)$. Therefore, by Lemma G.1, we obtain that
\[
\mathbb E_{\pi_{K,b}}\bigl[(z - \mu_z)^\top\mathrm{diag}(Q, R)(z - \mu_z)\cdot(z - \mu_z)^\top V(z - \mu_z)\bigr] = 2\bigl\langle\Sigma_z\,\mathrm{diag}(Q, R)\,\Sigma_z, V\bigr\rangle + \bigl\langle\Sigma_z, \mathrm{diag}(Q, R)\bigr\rangle\cdot\langle\Sigma_z, V\rangle = \mathrm{svec}\Bigl[2\Sigma_z\,\mathrm{diag}(Q, R)\,\Sigma_z + \bigl\langle\Sigma_z, \mathrm{diag}(Q, R)\bigr\rangle\cdot\Sigma_z\Bigr]^\top\mathrm{svec}(V). \tag{F.77}
\]
Meanwhile, the second term on the RHS of (F.76) takes the form
\[
\mathbb E_{\pi_{K,b}}\Biggl[\begin{pmatrix} 2Q\mu_{K,b} \\ 2R\mu^u_{K,b} \end{pmatrix}^\top(z - \mu_z)(z - \mu_z)^\top\begin{pmatrix} v_x \\ v_u \end{pmatrix}\Biggr] = \Biggl[\Sigma_z\begin{pmatrix} 2Q\mu_{K,b} \\ 2R\mu^u_{K,b} \end{pmatrix}\Biggr]^\top\begin{pmatrix} v_x \\ v_u \end{pmatrix}. \tag{F.78}
\]
Combining (F.76), (F.77), and (F.78), we obtain that
\[
D_1 = \begin{pmatrix} 2\,\mathrm{svec}\bigl(\Sigma_z\,\mathrm{diag}(Q, R)\,\Sigma_z + \langle\Sigma_z, \mathrm{diag}(Q, R)\rangle\,\Sigma_z\bigr) \\[2pt] \Sigma_z\begin{pmatrix} 2Q\mu_{K,b} \\ 2R\mu^u_{K,b} \end{pmatrix} \end{pmatrix}^\top\begin{pmatrix} \mathrm{svec}(V) \\ v_x \\ v_u \end{pmatrix}. \tag{F.79}
\]
Calculation of $D_2$. By the definition of the feature vector $\psi(x, u)$ in (B.5), we know that
\[
D_2 = \bigl(\mu_{K,b}^\top Q\mu_{K,b} + (\mu^u_{K,b})^\top R\mu^u_{K,b} + \mu^\top Q\mu\bigr)\begin{pmatrix} \mathrm{svec}(\Sigma_z) \\ 0_m \\ 0_k \end{pmatrix}^\top\begin{pmatrix} \mathrm{svec}(V) \\ v_x \\ v_u \end{pmatrix}. \tag{F.80}
\]
Now, combining (F.75), (F.79), and (F.80), it holds that
\[
\mathbb E_{\pi_{K,b}}\bigl[c(x, u)\psi(x, u)\bigr] = \begin{pmatrix} 2\,\mathrm{svec}\bigl(\Sigma_z\,\mathrm{diag}(Q, R)\,\Sigma_z + \langle\Sigma_z, \mathrm{diag}(Q, R)\rangle\,\Sigma_z\bigr) \\[2pt] \Sigma_z\begin{pmatrix} 2Q\mu_{K,b} \\ 2R\mu^u_{K,b} \end{pmatrix} \end{pmatrix} + \bigl(\mu_{K,b}^\top Q\mu_{K,b} + (\mu^u_{K,b})^\top R\mu^u_{K,b} + \mu^\top Q\mu\bigr)\begin{pmatrix} \mathrm{svec}(\Sigma_z) \\ 0_m \\ 0_k \end{pmatrix},
\]
which concludes the proof of the lemma.

G Auxiliary Results

Lemma G.1.
Assume that the random variable $w \sim N(0, I)$, and let $U$ and $V$ be two symmetric matrices. Then it holds that
\[
\mathbb E\bigl(w^\top Uw\cdot w^\top Vw\bigr) = 2\,\mathrm{tr}(UV) + \mathrm{tr}(U)\cdot\mathrm{tr}(V).
\]
Proof. See Magnus et al. (1978) and Magnus (1979) for a detailed proof.

Lemma G.2. Let $M$, $N$ be commuting symmetric matrices, and let $\alpha_1, \ldots, \alpha_n$ and $\beta_1, \ldots, \beta_n$ denote their eigenvalues, with $v_1, \ldots, v_n$ a common basis of orthogonal eigenvectors. Then the $n(n+1)/2$ eigenvalues of $M\otimes_sN$ are given by $(\alpha_i\beta_j + \alpha_j\beta_i)/2$, where $1 \le i \le j \le n$.

Proof. See Lemma 2 in Alizadeh et al. (1998) for a detailed proof.

Lemma G.3. For any integer $m > 0$, let $A \in \mathbb R^{m\times m}$ and $\eta \sim N(0, I_m)$. Then there exists some absolute constant $C > 0$ such that for any $t \ge 0$, we have
\[
\mathbb P\Bigl[\bigl|\eta^\top A\eta - \mathbb E(\eta^\top A\eta)\bigr| > t\Bigr] \le 2\cdot\exp\Bigl[-C\cdot\min\bigl(t^2\|A\|_F^{-2},\ t\|A\|_2^{-1}\bigr)\Bigr].
\]
Proof. See Rudelson et al. (2013) for a detailed proof.
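The Gaussian fourth-moment identity of Lemma G.1 is easy to validate by simulation. The sketch below uses arbitrary positive semidefinite instances of $U$ and $V$ (so the right-hand side is comfortably positive), and the tolerance reflects Monte Carlo error:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Two arbitrary symmetric PSD matrices.
A = rng.standard_normal((n, n)); U = A.T @ A
B = rng.standard_normal((n, n)); V = B.T @ B

# Monte Carlo estimate of E[w'Uw * w'Vw] for w ~ N(0, I).
W = rng.standard_normal((500_000, n))
lhs = np.mean(np.einsum('ti,ij,tj->t', W, U, W) *
              np.einsum('ti,ij,tj->t', W, V, W))
rhs = 2 * np.trace(U @ V) + np.trace(U) * np.trace(V)

# The estimate should agree with 2 tr(UV) + tr(U) tr(V) up to sampling error.
assert rhs > 0
assert abs(lhs - rhs) / rhs < 0.05
```

The identity follows from Isserlis' theorem for Gaussian fourth moments, $\mathbb E[w_aw_bw_cw_d] = \delta_{ab}\delta_{cd} + \delta_{ac}\delta_{bd} + \delta_{ad}\delta_{bc}$, contracted against $U_{ab}V_{cd}$.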
