Actor-Critic Provably Finds Nash Equilibria of Linear-Quadratic Mean-Field Games

Authors: Zuyue Fu∗, Zhuoran Yang†, Yongxin Chen‡, Zhaoran Wang∗

Abstract

We study discrete-time mean-field Markov games with an infinite number of agents, where each agent aims to minimize its ergodic cost. We consider the setting where the agents have identical linear state transitions and quadratic cost functions, while the aggregated effect of the agents is captured by the population mean of their states, namely, the mean-field state. For such a game, based on the Nash certainty equivalence principle, we provide sufficient conditions for the existence and uniqueness of its Nash equilibrium. Moreover, to find the Nash equilibrium, we propose a mean-field actor-critic algorithm with linear function approximation, which does not require knowing the model of dynamics. Specifically, at each iteration of our algorithm, we use the single-agent actor-critic algorithm to approximately obtain the optimal policy of each agent given the current mean-field state, and then update the mean-field state. In particular, we prove that our algorithm converges to the Nash equilibrium at a linear rate. To the best of our knowledge, this is the first success of applying model-free reinforcement learning with function approximation to discrete-time mean-field Markov games with provable non-asymptotic global convergence guarantees.

1 Introduction

In reinforcement learning (RL) (Sutton and Barto, 2018), an agent learns to make decisions that minimize its expected total cost through sequential interactions with the environment.
∗Department of Industrial Engineering and Management Sciences, Northwestern University. †Department of Operations Research and Financial Engineering, Princeton University. ‡School of Aerospace Engineering, Georgia Institute of Technology.

Multi-agent reinforcement learning (MARL) (Shoham et al., 2003, 2007; Busoniu et al., 2008) aims to extend RL to sequential decision-making problems involving multiple agents. In a non-cooperative game, we are interested in the Nash equilibrium (Nash, 1951), which is a joint policy of all the agents such that no agent can decrease its expected total cost by unilaterally deviating from its Nash policy. The Nash equilibrium plays a critical role in understanding the social dynamics of self-interested agents (Ash, 2000; Axtell, 2002) and in constructing the optimal policy of a particular agent via fictitious self-play (Bowling and Veloso, 2000; Ganzfried and Sandholm, 2009). Following the recent developments in deep learning (LeCun et al., 2015), MARL with function approximation achieves tremendous empirical successes in applications including Go (Silver et al., 2016, 2017), Dota (OpenAI, 2018), StarCraft (Vinyals et al., 2019), Poker (Heinrich and Silver, 2016; Moravčík et al., 2017), multi-robotic systems (Yang and Gu, 2004), autonomous driving (Shalev-Shwartz et al., 2016), and solving social dilemmas (de Cote et al., 2006; Leibo et al., 2017; Hughes et al., 2018). However, since the capacity of the joint state and action spaces grows exponentially in the number of agents, such MARL approaches become computationally intractable when the number of agents is large, which is common in real-world applications (Sandholm, 2010; Calderone, 2017; Wang et al., 2017a). The mean-field game is proposed by Huang et al.
(2003, 2006); Lasry and Lions (2006a,b, 2007) with the idea of utilizing mean-field approximation to model the strategic interactions within a large population. In a mean-field game, each agent has the same cost function and state transition, which depend on the other agents only through their aggregated effect. As a result, the optimal policy of each agent depends solely on its own state and the aggregated effect of the population, and such an optimal policy is symmetric across all the agents. Moreover, if the aggregated effect of the population corresponds to the Nash equilibrium, then the optimal policies of the agents jointly constitute a Nash equilibrium. Although such a Nash equilibrium corresponds to an infinite number of agents, it well approximates the Nash equilibrium for a sufficiently large number of agents (Bensoussan et al., 2016). Also, as the aggregated effect of the population abstracts away the strategic interactions between individual agents, it circumvents the computational intractability of the MARL approaches that do not exploit symmetry. However, most existing work on mean-field games focuses on characterizing the existence and uniqueness of the Nash equilibrium rather than designing provably efficient algorithms. In particular, most existing work considers the continuous-time setting, which requires solving a pair of Hamilton-Jacobi-Bellman (HJB) and Fokker-Planck (FP) equations, whereas the discrete-time setting is more common in practice, e.g., in the aforementioned applications. Moreover, most existing approaches, including the ones based on solving the HJB and FP equations, require knowing the model of dynamics (Bardi and Priuli, 2014) or having access to a simulator that generates the next state given any state-action pair and aggregated effect of the population (Guo et al.
, 2019), which is often unavailable in practice.

To address these challenges, we develop an efficient model-free RL approach to mean-field games, which provably attains the Nash equilibrium. In particular, we focus on discrete-time mean-field games with linear state transitions and quadratic cost functions, where the aggregated effect of the population is quantified by the mean-field state. Such games capture the fundamental difficulties of general mean-field games and well approximate a variety of real-world systems such as power grids (Minciardi and Sacile, 2011), swarm robots (Fang, 2014; Araki et al., 2017; Doerr et al., 2018), and financial systems (Zhou and Li, 2000; Huang and Li, 2018). In detail, based on the Nash certainty equivalence (NCE) principle (Huang et al., 2006, 2007), we propose a mean-field actor-critic algorithm which, at each iteration, given the mean-field state µ, approximately attains the optimal policy π*_µ of each agent, and then updates the mean-field state µ assuming that all the agents follow π*_µ. We parametrize the actor and critic by linear and quadratic functions, respectively, and prove that such a parameterization encompasses the optimal policy of each agent. Specifically, we update the actor parameter using policy gradient (Sutton et al., 2000) and natural policy gradient (Kakade, 2002; Peters and Schaal, 2008; Bhatnagar et al., 2009), and update the critic parameter using primal-dual gradient temporal difference (Sutton et al., 2009a,b). In particular, we prove that, given the mean-field state µ, the sequence of policies generated by the actor converges linearly to the optimal policy π*_µ.
Moreover, when alternating between updating the policy and the mean-field state, we prove that the sequence of policies and the corresponding sequence of mean-field states converge to the unique Nash equilibrium at a linear rate. Our approach can be interpreted from both "passive" and "active" perspectives: (i) Assuming that each self-interested agent employs the single-agent actor-critic algorithm, the policy of each agent converges to the unique Nash policy, which characterizes the social dynamics of a large population of model-free RL agents. (ii) For a particular agent, our approach serves as a fictitious self-play method for it to find its Nash policy, assuming the other agents give their best responses. To the best of our knowledge, our work establishes the first efficient model-free RL approach with function approximation that provably attains the Nash equilibrium of a discrete-time mean-field game. As a byproduct, we also show that the sequence of policies generated by the single-agent actor-critic algorithm converges at a linear rate to the optimal policy of a linear-quadratic regulator (LQR) problem in the presence of drift, which may be of independent interest.

Related Work. The mean-field game is first introduced in Huang et al. (2003, 2006); Lasry and Lions (2006a,b, 2007). In the last decade, there has been growing interest in understanding continuous-time mean-field games. See, e.g., Guéant et al. (2011); Bensoussan et al. (2013); Gomes et al. (2014); Carmona and Delarue (2013, 2018) and the references therein. Due to their simple structures, continuous-time linear-quadratic mean-field games are extensively studied under various model assumptions. See Li and Zhang (2008); Bardi (2011); Wang and Zhang (2012); Bardi and Priuli (2014); Huang et al. (2016a,b); Bensoussan et al.
(2016, 2017); Caines and Kizilkale (2017); Huang and Huang (2017); Moon and Başar (2018); Huang and Zhou (2019) for examples of this line of work. Meanwhile, the literature on discrete-time linear-quadratic mean-field games remains relatively scarce. Most of this line of work focuses on characterizing the existence of a Nash equilibrium and the behavior of such a Nash equilibrium when the number of agents goes to infinity (Gomes et al., 2010; Tembine and Huang, 2011; Moon and Başar, 2014; Biswas, 2015; Saldi et al., 2018a,b, 2019). See also Yang et al. (2018a), which applies maximum entropy inverse RL (Ziebart et al., 2008) to infer the cost function and social dynamics of discrete-time mean-field games with finite state and action spaces. Our work is most related to Guo et al. (2019), which proposes a mean-field Q-learning algorithm (Watkins and Dayan, 1992) for discrete-time mean-field games with finite state and action spaces. Such an algorithm requires access to a simulator, which, given any state-action pair and mean-field state, outputs the next state. In contrast, both our state and action spaces are infinite, and we do not require such a simulator but only observations of trajectories. Correspondingly, we study the mean-field actor-critic algorithm with linear function approximation, whereas their algorithm is tailored to the tabular setting. Also, our work is closely related to Mguni et al. (2018), which focuses on a more restrictive setting where the state transition does not involve the mean-field state. In such a setting, mean-field games are potential games, which is, however, not true in more general settings (Li et al., 2017; Briani and Cardaliaguet, 2018). In comparison, we allow the state transition to depend on the mean-field state.
Meanwhile, they propose a fictitious self-play method based on the single-agent actor-critic algorithm and establish its asymptotic convergence. However, their proof of convergence relies on the assumption that the single-agent actor-critic algorithm converges to the optimal policy, which is unverified therein. Meanwhile, a model-based algorithm is proposed in uz Zaman et al. (2019) for discounted linear-quadratic mean-field games, where they only show that the algorithm converges asymptotically to the Nash equilibrium. In addition, our work is related to Jayakumar and Aditya (2019), where the proposed algorithm is only shown to converge asymptotically to a stationary point of the mean-field game. Our work also extends the line of work on finding the Nash equilibria of Markov games using MARL. Due to the computational intractability introduced by the large number of agents, such a line of work focuses on finite-agent Markov games (Littman, 1994, 2001; Hu and Wellman, 1998; Bowling, 2001; Lagoudakis and Parr, 2002; Hu and Wellman, 2003; Conitzer and Sandholm, 2007; Perolat et al., 2015; Pérolat et al., 2016b,a, 2018; Wei et al., 2017; Zhang et al., 2018; Zou et al., 2019; Casgrain et al., 2019). See also Shoham et al. (2003, 2007); Busoniu et al. (2008); Li (2018) for detailed surveys. Our work is related to Yang et al. (2018b), which combines the mean-field approximation of actions (rather than states) with Nash Q-learning (Hu and Wellman, 2003) to study general-sum Markov games with a large number of agents. However, the Nash Q-learning algorithm is only applicable to finite state and action spaces, and its convergence is established under rather strong assumptions.
Also, when the number of agents goes to infinity, their approach yields a variant of tabular Q-learning, which is different from our mean-field actor-critic algorithm. For policy optimization, based on the policy gradient theorem, Sutton et al. (2000); Konda and Tsitsiklis (2000) propose the actor-critic algorithm, which is later generalized to the natural actor-critic algorithm (Peters and Schaal, 2008; Bhatnagar et al., 2009). Most existing results on the convergence of actor-critic algorithms are based on stochastic approximation using ordinary differential equations (Bhatnagar et al., 2009; Castro and Meir, 2010; Konda and Tsitsiklis, 2000; Maei, 2018), which are asymptotic in nature. For policy evaluation, the convergence of primal-dual gradient temporal difference is studied in Liu et al. (2015); Du et al. (2017); Wang et al. (2017b); Yu (2017); Wai et al. (2018). However, this line of work assumes that the feature mapping is bounded, which is not the case in our setting. Thus, the existing convergence results are not applicable to analyzing the critic update in our setting. To handle the unbounded feature mapping, we utilize a truncation argument, which requires a more delicate analysis. Finally, our work extends the line of work that studies model-free RL for LQR. For example, Bradtke (1993); Bradtke et al. (1994) show that policy iteration converges to the optimal policy, and Tu and Recht (2017); Dean et al. (2017) study the sample complexity of least-squares temporal difference for policy evaluation. More recently, Fazel et al. (2018); Malik et al. (2018); Tu and Recht (2018) show that the policy gradient algorithm converges at a linear rate to the optimal policy. See also Hardt et al. (2016); Dean et al. (2018) for more in this line of work. Our work is also closely related to Yang et al.
(2019), where they show that the sequence of policies generated by the natural actor-critic algorithm enjoys a linear rate of convergence to the optimal policy. Compared with this work, when fixing the mean-field state, we use the actor-critic algorithm to study LQR in the presence of drift, which introduces significant difficulties in the analysis. As we show in §3, the drift causes the optimal policy to have an additional intercept, which makes the state- and action-value functions more complicated.

Notations. We denote by ‖M‖_2 the spectral norm, ρ(M) the spectral radius, σ_min(M) the minimum singular value, and σ_max(M) the maximum singular value of a matrix M. We use ‖α‖_2 to represent the ℓ_2-norm of a vector α, and (α)_i^j to denote the sub-vector (α_i, α_{i+1}, . . . , α_j)⊤, where α_k is the k-th entry of the vector α. For scalars a_1, . . . , a_n, we denote by poly(a_1, . . . , a_n) a polynomial of a_1, . . . , a_n, which may vary from line to line. We use [n] to denote the set {1, 2, . . . , n} for any n ∈ ℕ.

2 Linear-Quadratic Mean-Field Game

A linear-quadratic mean-field game involves N_a ∈ ℕ agents. Their state transitions are given by

x^i_{t+1} = A x^i_t + B u^i_t + A · (1/N_a) Σ_{j=1}^{N_a} x^j_t + d^i + ω^i_t,  ∀ t ≥ 0, i ∈ [N_a],

where x^i_t ∈ R^m and u^i_t ∈ R^k are the state and action vectors of agent i, respectively, the vector d^i ∈ R^m is a drift term, and ω^i_t ∈ R^m is an independent random noise term following the Gaussian distribution N(0, Ψ_ω). The agents are coupled through the mean-field state (1/N_a) Σ_{j=1}^{N_a} x^j_t. In the linear-quadratic mean-field game, the cost of agent i ∈ [N_a] at time t ≥ 0 is given by

c^i_t = (x^i_t)⊤ Q x^i_t + (u^i_t)⊤ R u^i_t + [(1/N_a) Σ_{j=1}^{N_a} x^j_t]⊤ Q [(1/N_a) Σ_{j=1}^{N_a} x^j_t],

where u^i_t is generated by π^i, i.e., the policy of agent i.
To measure the performance of agent i following its policy π^i under the influence of the other agents, we define the expected total cost of agent i as

J^i(π^1, π^2, . . . , π^{N_a}) = lim_{T→∞} E[(1/T) Σ_{t=0}^T c^i_t].

We are interested in finding a Nash equilibrium (π^1, π^2, . . . , π^{N_a}), which is defined by

J^i(π^1, . . . , π^{i−1}, π^i, π^{i+1}, . . . , π^{N_a}) ≤ J^i(π^1, . . . , π^{i−1}, π̃^i, π^{i+1}, . . . , π^{N_a}),  ∀ π̃^i, i ∈ [N_a].

That is, agent i cannot further decrease its expected total cost by unilaterally deviating from its Nash policy. For simplicity of discussion, we assume that the drift term d^i is identical for each agent. By the symmetry of the agents in terms of their state transitions and cost functions, we focus on a fixed agent and drop the superscript i hereafter. Further taking the infinite-population limit N_a → ∞ leads to the following formulation of the linear-quadratic mean-field game (LQ-MFG).

Problem 2.1 (LQ-MFG). We consider the following formulation,

x_{t+1} = A x_t + B u_t + A E[x*_t] + d + ω_t,
c(x_t, u_t) = x_t⊤ Q x_t + u_t⊤ R u_t + (E[x*_t])⊤ Q (E[x*_t]),
J(π) = lim_{T→∞} E[(1/T) Σ_{t=0}^T c(x_t, u_t)],

where x_t ∈ R^m is the state vector, u_t ∈ R^k is the action vector generated by the policy π, {x*_t}_{t≥0} is the trajectory generated by a Nash policy π* (assuming it exists), ω_t ∈ R^m is an independent random noise term following the Gaussian distribution N(0, Ψ_ω), and d ∈ R^m is a drift term. Here the expectation E[x*_t] is taken across all the agents. We aim to find π* such that J(π*) = inf_{π∈Π} J(π).

The formulation in Problem 2.1 is studied by Lasry and Lions (2007); Bensoussan et al. (2016); Saldi et al. (2018a,b).
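As a quick numerical illustration of the finite-agent game above, the following sketch simulates N_a agents that all follow the same linear policy u_t = −K x_t and estimates the ergodic cost by a time average. This is a hypothetical one-dimensional instance (m = k = 1); all numerical constants are illustrative assumptions, not values from our analysis.

```python
import numpy as np

# Hypothetical 1-D instance of the N-agent linear-quadratic game; every
# agent uses the shared linear policy u_t = -K x_t (no intercept, no
# exploration noise), and the ergodic cost is estimated by a tail average.
rng = np.random.default_rng(0)
A, B, Q, R, d, psi_w = 0.3, 1.0, 1.0, 1.0, 0.1, 0.01
N_a, T = 500, 4000
K = 0.15                                  # an arbitrary stabilizing gain

x = rng.normal(0.0, 1.0, size=N_a)        # initial states of the N_a agents
costs = []
for t in range(T):
    mean_field = x.mean()                 # (1/N_a) * sum_j x^j_t
    u = -K * x
    costs.append(np.mean(Q * x**2 + R * u**2) + Q * mean_field**2)
    x = A * x + B * u + A * mean_field + d \
        + rng.normal(0.0, np.sqrt(psi_w), size=N_a)

ergodic_cost = float(np.mean(costs[T // 2:]))   # tail time average of c^i_t
```

With these constants the mean-field state settles near d / (1 − 2A + BK), which matches the fixed point of the deterministic mean recursion; the tail average then approximates the expected total cost of the shared policy.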
We propose a more general formulation in Problem C.1 (see §C of the appendix for details), where an additional interaction term between the state vector x_t and the mean-field state E[x*_t] is incorporated into the cost function. According to our analysis in §C, up to minor modifications, the results in the following sections also carry over to Problem C.1. Therefore, for the sake of simplicity, we focus on Problem 2.1 in the sequel. Note that the mean-field state E[x*_t] converges to a constant vector µ* as t → ∞, which serves as a fixed mean-field state, since the Markov chain of states generated by the Nash policy π* admits a stationary distribution. As we consider the ergodic setting, it suffices to study Problem 2.1 with t sufficiently large, which motivates the following drifted LQR (D-LQR) problem, where the mean-field state acts as another drift term.

Problem 2.2 (D-LQR). Given a mean-field state µ ∈ R^m, we consider the following formulation,

x_{t+1} = A x_t + B u_t + A µ + d + ω_t,
c_µ(x_t, u_t) = x_t⊤ Q x_t + u_t⊤ R u_t + µ⊤ Q µ,
J_µ(π) = lim_{T→∞} E[(1/T) Σ_{t=0}^T c_µ(x_t, u_t)],

where x_t ∈ R^m is the state vector, u_t ∈ R^k is the action vector generated by the policy π, ω_t ∈ R^m is an independent random noise term following the Gaussian distribution N(0, Ψ_ω), and d ∈ R^m is a drift term. We aim to find an optimal policy π*_µ such that J_µ(π*_µ) = inf_{π∈Π} J_µ(π).

For the mean-field state µ = µ*, which corresponds to the Nash equilibrium, solving Problem 2.2 gives π*_{µ*}, which coincides with the Nash policy π* defined in Problem 2.1. Compared with the most studied LQR problem (Lewis et al., 2012), both the state transition and the cost function in Problem 2.2 have drift terms, which act as a mean-field "force" that drives the states away from zero.
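When the model is known, the gain of the optimal policy of Problem 2.2 can be obtained from the standard discrete-time algebraic Riccati equation, since the drift Aµ + d only shifts the optimal intercept and not the gain. A minimal model-based sanity check in a hypothetical one-dimensional instance (all constants below are illustrative assumptions) is:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Hypothetical 1-D instance of D-LQR; the drift does not change the optimal
# gain, so K* can be read off the standard Riccati solution.
A = np.array([[0.3]]); B = np.array([[1.0]])
Q = np.array([[1.0]]); R = np.array([[1.0]])

X = solve_discrete_are(A, B, Q, R)                      # Riccati solution X*
K_star = np.linalg.solve(R + B.T @ X @ B, B.T @ X @ A)  # K* = (B'X*B+R)^{-1} B'X*A
rho = max(abs(np.linalg.eigvals(A - B @ K_star)))       # closed-loop spectral radius
```

The check confirms that X* is positive definite and that the closed-loop matrix A − BK* is stable, which is what the model-free algorithm below must recover without access to A and B.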
Such a mean-field "force" introduces additional challenges when solving Problem 2.2 in the model-free setting (see §3.3 for details). On the other hand, the unique optimal policy π*_µ of Problem 2.2 admits the linear form π*_µ(x_t) = −K_{π*_µ} x_t + b_{π*_µ} (Anderson and Moore, 2007), where the matrix K_{π*_µ} ∈ R^{k×m} and the vector b_{π*_µ} ∈ R^k are the parameters of π*_µ. Motivated by such a linear form of the optimal policy, we define the class of linear-Gaussian policies as

Π = {π(x) = −Kx + b + σ·η : K ∈ R^{k×m}, b ∈ R^k},   (2.1)

where the standard Gaussian noise term η ∈ R^k is included to encourage exploration. To solve Problem 2.2, it suffices to find the optimal policy π*_µ within Π. Now, we introduce the definition of the Nash equilibrium pair (Saldi et al., 2018a,b). The Nash equilibrium pair is characterized by the NCE principle, which states that it suffices to find a pair of π* and µ* such that the policy π* is optimal for each agent when the mean-field state is µ*, while all the agents following the policy π* generate the mean-field state µ* as t → ∞. To present its formal definition, we define Λ_1(µ) as the optimal policy in Π given the mean-field state µ, and define Λ_2(µ, π) as the mean-field state generated as t → ∞ by the policy π given the current mean-field state µ.

Definition 2.3 (Nash Equilibrium Pair). The pair (µ*, π*) ∈ R^m × Π constitutes a Nash equilibrium pair of Problem 2.1 if it satisfies π* = Λ_1(µ*) and µ* = Λ_2(µ*, π*). Here µ* is called the Nash mean-field state and π* is called the Nash policy.

3 Mean-Field Actor-Critic

We first characterize the existence and uniqueness of the Nash equilibrium pair of Problem 2.1 under mild regularity conditions, and then propose a mean-field actor-critic algorithm to obtain such a Nash equilibrium.
As a building block of the mean-field actor-critic, we propose the natural actor-critic to solve Problem 2.2.

3.1 Existence and Uniqueness of Nash Equilibrium Pair

We now establish the existence and uniqueness of the Nash equilibrium pair defined in Definition 2.3. We impose the following regularity conditions.

Assumption 3.1. We assume that the following statements hold:

(i) The algebraic Riccati equation

X = A⊤XA + Q − A⊤XB(B⊤XB + R)^{−1}B⊤XA

admits a unique symmetric positive definite solution X*;

(ii) It holds for L_0 = L_1 L_3 + L_2 that L_0 < 1, where

L_1 = ‖[(I − A)Q^{−1}(I − A)⊤ + BR^{−1}B⊤]^{−1} A‖_2 · ‖K* Q^{−1}(I − A)⊤ − R^{−1}B⊤‖_2,
L_2 = [1 − ρ(A − BK*)]^{−1} · ‖A‖_2,
L_3 = [1 − ρ(A − BK*)]^{−1} · ‖B‖_2.

Here K* = (B⊤X*B + R)^{−1}B⊤X*A.

The first assumption is implied by mild regularity conditions on the matrices A, B, Q, and R. See Theorem 3.2 in De Souza et al. (1986) for details. The second assumption is standard in the literature (Bensoussan et al., 2016; Saldi et al., 2018b) and ensures the stability of the LQ-MFG. In the following proposition, we show that Problem 2.1 admits a unique Nash equilibrium pair.

Proposition 3.2 (Existence and Uniqueness of Nash Equilibrium Pair). Under Assumption 3.1, the operator Λ(·) = Λ_2(·, Λ_1(·)) is L_0-Lipschitz, where L_0 is given in Assumption 3.1. Moreover, there exists a unique Nash equilibrium pair (µ*, π*) of Problem 2.1.

Proof. See §E.1 for a detailed proof.

3.2 Mean-Field Actor-Critic for LQ-MFG

The NCE principle motivates a fixed-point approach to solving Problem 2.1, which generates a sequence of policies {π_s}_{s≥0} and mean-field states {µ_s}_{s≥0} satisfying the following two properties: (i) Given the mean-field state µ_s, the policy π_s is optimal.
(ii) The mean-field state becomes µ_{s+1} as t → ∞ if all the agents follow π_s under the current mean-field state µ_s. Here (i) requires solving Problem 2.2 given the mean-field state µ_s, while (ii) requires simulating the agents following the policy π_s given the current mean-field state µ_s. Based on such properties, we propose the mean-field actor-critic in Algorithm 1.

Algorithm 1 Mean-Field Actor-Critic for solving LQ-MFG.
1: Input:
   • Initial mean-field state µ_0 and initial policy π_0 with parameters K_0 and b_0.
   • Numbers of iterations S, {N_s}_{s∈[S]}, {H_s}_{s∈[S]}, {T̃_{s,n}, T_{s,n}}_{s∈[S],n∈[N_s]}, {T̃^b_{s,h}, T^b_{s,h}}_{s∈[S],h∈[H_s]}.
   • Stepsizes {γ_s}_{s∈[S]}, {γ^b_s}_{s∈[S]}, {γ_{s,n,t}}_{s∈[S],n∈[N_s],t∈[T_{s,n}]}, {γ^b_{s,h,t}}_{s∈[S],h∈[H_s],t∈[T^b_{s,h}]}.
2: for s = 0, 1, 2, . . . , S − 1 do
3:   Policy Update: Solve for the optimal policy π_{s+1} of Problem 2.2, with parameters K_{s+1} and b_{s+1}, via Algorithm 2 with inputs µ_s, π_s, N_s, H_s, {T̃_{s,n}, T_{s,n}}_{n∈[N_s]}, {T̃^b_{s,h}, T^b_{s,h}}_{h∈[H_s]}, γ_s, γ^b_s, {γ_{s,n,t}}_{n∈[N_s],t∈[T_{s,n}]}, and {γ^b_{s,h,t}}_{h∈[H_s],t∈[T^b_{s,h}]}, which also gives the estimated mean-field state µ̂_{K_{s+1},b_{s+1}}.
4:   Mean-Field State Update: Update the mean-field state via µ_{s+1} ← µ̂_{K_{s+1},b_{s+1}}.
5: end for
6: Output: Pair (π_S, µ_S).

Algorithm 1 requires solving Problem 2.2 at each iteration to obtain π_s = Λ_1(µ_s) and µ_{s+1} = Λ_2(µ_s, π_s). To this end, we introduce the natural actor-critic in §3.3 that solves Problem 2.2.

3.3 Natural Actor-Critic for D-LQR

We now focus on solving Problem 2.2 for a fixed mean-field state µ, and thus drop the subscript µ hereafter. We write π_{K,b}(x) = −Kx + b + σ·η to emphasize the dependence on K and b, and consequently write J(K, b) = J(π_{K,b}). Now, we propose the natural actor-critic to solve Problem 2.2.
For any policy π_{K,b} ∈ Π, by the state transition in Problem 2.2, we have

x_{t+1} = (A − BK) x_t + (Bb + Aµ + d) + ε_t,  ε_t ∼ N(0, Ψ_ε),   (3.1)

where Ψ_ε = σ² BB⊤ + Ψ_ω. It is known that if ρ(A − BK) < 1, then the Markov chain {x_t}_{t≥0} induced by (3.1) has a unique stationary distribution N(µ_{K,b}, Φ_K) (Anderson and Moore, 2007), where the mean-field state µ_{K,b} and the covariance Φ_K satisfy

µ_{K,b} = (I − A + BK)^{−1} (Bb + Aµ + d),   (3.2)
Φ_K = (A − BK) Φ_K (A − BK)⊤ + Ψ_ε.   (3.3)

Meanwhile, the Bellman equation for Problem 2.2 takes the following form,

P_K = (Q + K⊤RK) + (A − BK)⊤ P_K (A − BK).   (3.4)

Then by calculation (see Proposition B.2 in §B.1 of the appendix for details), the expected total cost J(K, b) decomposes as

J(K, b) = J_1(K) + J_2(K, b) + σ² · tr(R) + µ⊤Qµ,   (3.5)

where J_1(K) and J_2(K, b) are defined as

J_1(K) = tr[(Q + K⊤RK) Φ_K] = tr(P_K Ψ_ε),
J_2(K, b) = [µ_{K,b}; b]⊤ [ Q + K⊤RK   −K⊤R
                            −RK           R ] [µ_{K,b}; b].   (3.6)

Here [µ_{K,b}; b] denotes the column vector stacking µ_{K,b} on top of b. The term J_1(K) is the expected total cost in the most studied LQR problem (Yang et al., 2019; Fazel et al., 2018), where the state transition does not have drift terms. Meanwhile, J_2(K, b) corresponds to the expected cost induced by the drift terms. The following two propositions characterize the properties of J_2(K, b). First, we show that J_2(K, b) is strongly convex in b.

Proposition 3.3. Given any K, the function J_2(K, b) is ν_K-strongly convex in b. Here ν_K = σ_min(Y⊤_{1,K} Y_{1,K} + Y⊤_{2,K} Y_{2,K}), where Y_{1,K} = R^{1/2} K (I − A + BK)^{−1} B − R^{1/2} and Y_{2,K} = Q^{1/2} (I − A + BK)^{−1} B. Also, J_2(K, b) has ι_K-Lipschitz continuous gradient in b, where ι_K is upper bounded as

ι_K ≤ [1 − ρ(A − BK)]^{−2} · (‖B‖²_2 · ‖K‖²_2 · ‖R‖_2 + ‖B‖²_2 · ‖Q‖_2).

Proof. See §E.4 for a detailed proof.
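The identities (3.2)-(3.6) can be verified numerically. The sketch below works in a hypothetical one-dimensional instance with illustrative constants and checks, in particular, that the two expressions for J_1(K) in (3.6) agree, as both follow from the stationary covariance (3.3) and the Bellman equation (3.4).

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Hypothetical 1-D instance with illustrative constants; K, b, and mu are
# arbitrary (stabilizing) choices.
A = np.array([[0.3]]); B = np.array([[1.0]])
Q = np.array([[1.0]]); R = np.array([[1.0]])
K = np.array([[0.15]]); b = np.array([[0.05]])
mu = np.array([[0.2]]); d = np.array([[0.1]])
sigma, Psi_w = 0.1, np.array([[0.01]])

Acl = A - B @ K                                  # closed-loop matrix A - BK
Psi_eps = sigma**2 * B @ B.T + Psi_w             # noise covariance in (3.1)
I = np.eye(1)

mu_Kb = np.linalg.solve(I - A + B @ K, B @ b + A @ mu + d)   # (3.2)
Phi_K = solve_discrete_lyapunov(Acl, Psi_eps)                # (3.3)
P_K = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)        # (3.4)

J1_Phi = float(np.trace((Q + K.T @ R @ K) @ Phi_K))          # first form of J_1
J1_P = float(np.trace(P_K @ Psi_eps))                        # second form of J_1
```

The two discrete Lyapunov solves implement (3.3) and (3.4) directly, and the equality of `J1_Phi` and `J1_P` is exactly the identity tr[(Q + K⊤RK)Φ_K] = tr(P_K Ψ_ε) stated in (3.6).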
Second, we show that min_b J_2(K, b) is independent of K.

Proposition 3.4. We define b_K = argmin_b J_2(K, b), where J_2(K, b) is defined in (3.6). It holds that

b_K = [K Q^{−1}(I − A)⊤ − R^{−1}B⊤] · [(I − A)Q^{−1}(I − A)⊤ + BR^{−1}B⊤]^{−1} · (Aµ + d).

Moreover, J_2(K, b_K) takes the form

J_2(K, b_K) = (Aµ + d)⊤ [(I − A)Q^{−1}(I − A)⊤ + BR^{−1}B⊤]^{−1} · (Aµ + d),

which is independent of K.

Proof. See §E.2 for a detailed proof.

Since min_b J_2(K, b) is independent of K by Proposition 3.4, the optimal K* is the same as argmin_K J_1(K). This motivates us to minimize J(K, b) by first updating K following the gradient direction ∇_K J_1(K) toward the optimal K*, and then updating b following the gradient direction ∇_b J_2(K*, b). We now design our algorithm based on this idea. We define Υ_K, p_{K,b}, and q_{K,b} as

Υ_K = [ Q + A⊤P_K A   A⊤P_K B
        B⊤P_K A       R + B⊤P_K B ] = [ Υ^{11}_K  Υ^{12}_K
                                        Υ^{21}_K  Υ^{22}_K ],
p_{K,b} = A⊤ [P_K · (Aµ + d) + f_{K,b}],
q_{K,b} = B⊤ [P_K · (Aµ + d) + f_{K,b}],   (3.7)

where f_{K,b} = (I − A + BK)^{−⊤} [(A − BK)⊤ P_K (Bb + Aµ + d) − K⊤Rb]. By calculation (see Proposition B.3 in §B.1 of the appendix for details), the gradients of J_1(K) and J_2(K, b) take the forms

∇_K J_1(K) = 2(Υ^{22}_K K − Υ^{21}_K) · Φ_K,
∇_b J_2(K, b) = Υ^{22}_K (−K µ_{K,b} + b) + Υ^{21}_K µ_{K,b} + q_{K,b}.

Our algorithm follows the natural actor-critic method (Bhatnagar et al., 2009) and the actor-critic method (Konda and Tsitsiklis, 2000).
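The claim of Proposition 3.4 can be checked numerically in a small instance: for two different stabilizing gains K, plugging b_K into J_2(K, b) from (3.6) yields the same value, equal to the stated closed form. The sketch below uses a hypothetical one-dimensional instance with illustrative constants.

```python
import numpy as np

# Hypothetical 1-D instance; all constants are illustrative assumptions.
A, B, Q, R = 0.3, 1.0, 1.0, 1.0
mu, d = 0.2, 0.1
c = A * mu + d                                   # the combined drift A mu + d
S = (1 - A) ** 2 / Q + B ** 2 / R                # (I-A)Q^{-1}(I-A)' + B R^{-1} B'

def J2(K, b):
    # Evaluate J_2(K, b) via (3.2) and (3.6) in one dimension:
    # J_2 = (Q + K^2 R) mu_Kb^2 - 2 K R mu_Kb b + R b^2.
    mu_Kb = (B * b + c) / (1 - A + B * K)
    return (Q + K**2 * R) * mu_Kb**2 - 2 * K * R * mu_Kb * b + R * b**2

vals = []
for K in (0.1, 0.3):
    b_K = (K * (1 - A) / Q - B / R) / S * c      # b_K from Proposition 3.4
    vals.append(J2(K, b_K))
```

Both gains produce the same minimum value (Aµ + d)² / S, confirming that the b-update does not interfere with finding K* = argmin_K J_1(K).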
Specifically:

(i) To obtain the optimal K*, in the critic update step, we estimate the matrix Υ_K by Υ̂_K via a policy evaluation algorithm, e.g., Algorithm 3 or Algorithm 4 (see §B.2 and §B.3 of the appendix for details); in the actor update step, we update K via K ← K − γ · (Υ̂^{22}_K K − Υ̂^{21}_K), where the term Υ̂^{22}_K K − Υ̂^{21}_K is the estimated natural gradient.

(ii) To obtain the optimal b* given K*, in the critic update step, we estimate Υ_{K*}, q_{K*,b}, and µ_{K*,b} by Υ̂_{K*}, q̂_{K*,b}, and µ̂_{K*,b} via a policy evaluation algorithm; in the actor update step, we update b via b ← b − γ · ∇̂_b J_2(K*, b), where ∇̂_b J_2(K*, b) = Υ̂^{22}_{K*}(−K* µ̂_{K*,b} + b) + Υ̂^{21}_{K*} µ̂_{K*,b} + q̂_{K*,b} is the estimated gradient.

Combining the above procedure, we obtain the natural actor-critic for Problem 2.2, which is stated in Algorithm 2.

Algorithm 2 Natural Actor-Critic Algorithm for D-LQR.
1: Input:
   • Mean-field state µ and initial policy π_{K_0,b_0}.
   • Numbers of iterations N, H, {T̃_n, T_n}_{n∈[N]}, {T̃^b_h, T^b_h}_{h∈[H]}.
   • Stepsizes γ, γ^b, {γ_{n,t}}_{n∈[N],t∈[T_n]}, {γ^b_{h,t}}_{h∈[H],t∈[T^b_h]}.
2: for n = 0, 1, 2, . . . , N − 1 do
3:   Critic Update: Compute Υ̂_{K_n} via Algorithm 3 with π_{K_n,b_0}, µ, T̃_n, T_n, {γ_{n,t}}_{t∈[T_n]}, K_0, and b_0 as inputs.
4:   Actor Update: Update the parameter via K_{n+1} ← K_n − γ · (Υ̂^{22}_{K_n} K_n − Υ̂^{21}_{K_n}).
5: end for
6: for h = 0, 1, 2, . . . , H − 1 do
7:   Critic Update: Compute µ̂_{K_N,b_h}, Υ̂_{K_N}, and q̂_{K_N,b_h} via Algorithm 3 with π_{K_N,b_h}, µ, T̃^b_h, T^b_h, {γ^b_{h,t}}_{t∈[T^b_h]}, K_0, and b_0 as inputs.
8:   Actor Update: Update the parameter via b_{h+1} ← b_h − γ^b · [Υ̂^{22}_{K_N}(−K_N µ̂_{K_N,b_h} + b_h) + Υ̂^{21}_{K_N} µ̂_{K_N,b_h} + q̂_{K_N,b_h}].
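To isolate the actor update of Algorithm 2, the sketch below runs a model-based variant of its K-loop: the critic's estimate Υ̂_K is replaced by the exact Υ_K computed from P_K in (3.4), so only the natural-gradient recursion in step 4 is exercised. This is a hypothetical one-dimensional instance with illustrative constants, not the model-free algorithm itself.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, solve_discrete_are

# Hypothetical 1-D instance; the exact critic stands in for Algorithm 3.
A = np.array([[0.3]]); B = np.array([[1.0]])
Q = np.array([[1.0]]); R = np.array([[1.0]])
gamma, N = 0.1, 300

K = np.array([[0.5]])                         # a stabilizing initial gain
for n in range(N):
    Acl = A - B @ K
    P_K = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)   # Bellman eq. (3.4)
    Ups22 = R + B.T @ P_K @ B                 # exact Upsilon^{22}_K
    Ups21 = B.T @ P_K @ A                     # exact Upsilon^{21}_K
    K = K - gamma * (Ups22 @ K - Ups21)       # natural actor update (step 4)

# Compare with the Riccati solution, whose gain the K-loop should recover.
X = solve_discrete_are(A, B, Q, R)
K_star = np.linalg.solve(R + B.T @ X @ B, B.T @ X @ A)
```

The iterates converge to the Riccati gain K*; the b-loop behaves analogously, since J_2(K*, b) is ν_{K*}-strongly convex in b by Proposition 3.3.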
9: end for
10: Output: Policy π_{K,b} = π_{K_N, b_H}, estimated mean-field state \hat μ_{K,b} = \hat μ_{K_N, b_H}.

4 Global Convergence Results

The following theorem establishes the rate of convergence of Algorithm 1 to the Nash equilibrium pair (μ^*, π^*) of Problem 2.1.

Theorem 4.1 (Convergence of Algorithm 1). For a sufficiently small tolerance ε > 0, we set the number of iterations S in Algorithm 1 such that

S > log(‖μ_0 − μ^*‖_2 · ε^{−1}) / log(1/L_0).    (4.1)

For any s ∈ [S], we define

ε_s = min{ [1 − ρ(A − BK^*)]^4 · (‖B‖_2 + ‖A‖_2)^{−4} · (‖μ_s‖^{−2}_2 + ‖d‖^{−2}_2) · σ_min(Ψ_ε) · σ_min(R) · ε^2,
           ν_{K^*} · [1 − ρ(A − BK^*)]^4 · ‖B‖^{−2}_2 · M_b(μ_s) · ε^2,
           ε_o · 2^{−s−10} },    (4.2)

where ν_{K^*} is defined in Proposition 3.3 and

M_b(μ_s) = 4‖Q^{−1}(I − A)^⊤ · [(I − A) Q^{−1} (I − A)^⊤ + B R^{−1} B^⊤]^{−1} · (Aμ_s + d)‖_2 · [ν^{−1}_{K^*} + σ^{−1}_min(Ψ_ε) · σ^{−1}_min(R)]^{1/2}.    (4.3)

In the s-th policy update step in Line 3 of Algorithm 1, we set the inputs via Theorem B.4 such that J_{μ_s}(π_{s+1}) − J_{μ_s}(π^*_{μ_s}) < ε_s, where the expected total cost J_{μ_s}(·) is defined in Problem 2.2 and π^*_{μ_s} = Λ_1(μ_s) is the optimal policy under the mean-field state μ_s. Then it holds with probability at least 1 − ε^5 that

‖μ_S − μ^*‖_2 ≤ ε,  ‖K_S − K^*‖_F ≤ ε,  ‖b_S − b^*‖_2 ≤ (1 + L_1) · ε.

Here μ^* is the Nash mean-field state, K_S and b_S are the parameters of the policy π_S, and K^* and b^* are the parameters of the Nash policy π^*.

Proof. See §D.1 for a detailed proof.

We highlight that if the inputs of Algorithm 1 satisfy the conditions stated in Theorem B.4, it holds that J_{μ_s}(π_{s+1}) − J_{μ_s}(π^*_{μ_s}) < ε_s for any s ∈ [S]. See Theorem B.4 in §B.1 of the appendix for details. By Theorem 4.1, Algorithm 1 converges linearly to the unique Nash equilibrium pair (μ^*, π^*) of Problem 2.1.
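The two-loop structure of Algorithm 2 can be sketched as follows (illustrative only: `critic` is a placeholder for the policy evaluation in Algorithm 3, and all variable names are ours):

```python
import numpy as np

def natural_actor_critic(critic, K0, b0, N, H, gamma, gamma_b):
    """Two-loop skeleton of Algorithm 2 for D-LQR.

    `critic(K, b)` stands in for the policy-evaluation step (Algorithm 3);
    it must return estimates (Y21, Y22, mu_hat, q_hat) of Upsilon^{21}_K,
    Upsilon^{22}_K, mu_{K,b}, and q_{K,b}.
    """
    K = K0.copy()
    for _ in range(N):                       # first loop: update K
        Y21, Y22, _, _ = critic(K, b0)
        K = K - gamma * (Y22 @ K - Y21)      # estimated natural gradient
    b = b0.copy()
    for _ in range(H):                       # second loop: update b
        Y21, Y22, mu_hat, q_hat = critic(K, b)
        grad_b = Y22 @ (-K @ mu_hat + b) + Y21 @ mu_hat + q_hat
        b = b - gamma_b * grad_b             # estimated gradient of J_2
    return K, b
```

The projection and stepsize schedules of the full algorithm are omitted here; they are what the convergence analysis above relies on.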
To the best of our knowledge, this theorem is the first to establish that reinforcement learning with function approximation finds the Nash equilibrium pair in mean-field games with theoretical guarantees, which lays the theoretical foundation for applying modern reinforcement learning techniques to general mean-field games.

5 Conclusion

For discrete-time linear-quadratic mean-field games, we provide sufficient conditions for the existence and uniqueness of the Nash equilibrium pair. Moreover, we propose a mean-field actor-critic algorithm with linear function approximation that is shown to converge to the Nash equilibrium pair at a linear rate. Our algorithm can be modified to use other parametrized function classes, including deep neural networks, for solving mean-field games. For future research, we aim to extend our algorithm to other variants of mean-field games, including risk-sensitive mean-field games (Saldi et al., 2018a; Tembine et al., 2014), robust mean-field games (Bauso et al., 2016), and partially observed mean-field games (Saldi et al., 2019).

References

Alizadeh, F., Haeberly, J.-P. A. and Overton, M. L. (1998). Primal-dual interior-point methods for semidefinite programming: convergence rates, stability and numerical results. SIAM Journal on Optimization, 8 746–768.

Anderson, B. D. and Moore, J. B. (2007). Optimal control: linear quadratic methods. Courier Corporation.

Araki, B., Strang, J., Pohorecky, S., Qiu, C., Naegeli, T. and Rus, D. (2017). Multi-robot path planning for a swarm of robots that can both fly and drive. In 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE.

Ash, C. (2000). Social-self-interest. Annals of Public and Cooperative Economics, 71 261–284.

Axtell, R. L. (2002). Non-cooperative dynamics of multi-agent teams.
In Autonomous Agents and Multiagent Systems.

Bardi, M. (2011). Explicit solutions of some linear-quadratic mean field games. Networks and Heterogeneous Media, 7 243–261.

Bardi, M. and Priuli, F. S. (2014). Linear-quadratic N-person and mean-field games with ergodic cost. SIAM Journal on Control and Optimization, 52 3022–3052.

Bauso, D., Tembine, H. and Başar, T. (2016). Robust mean field games. Dynamic Games and Applications, 6 277–303.

Bensoussan, A., Chau, M., Lai, Y. and Yam, S. C. P. (2017). Linear-quadratic mean field Stackelberg games with state and control delays. SIAM Journal on Control and Optimization, 55 2748–2781.

Bensoussan, A., Frehse, J. and Yam, P. (2013). Mean field games and mean field type control theory. Springer.

Bensoussan, A., Sung, K., Yam, S. C. P. and Yung, S.-P. (2016). Linear-quadratic mean field games. Journal of Optimization Theory and Applications, 169 496–529.

Bhandari, J., Russo, D. and Singal, R. (2018). A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450.

Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M. and Lee, M. (2009). Natural actor–critic algorithms. Automatica, 45 2471–2482.

Biswas, A. (2015). Mean field games with ergodic cost for discrete time Markov processes. arXiv preprint arXiv:1510.08968.

Borkar, V. S. and Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38 447–469.

Bowling, M. (2001). Rational and convergent learning in stochastic games. In International Conference on Artificial Intelligence.

Bowling, M. and Veloso, M. (2000). An analysis of stochastic game theory for multiagent reinforcement learning. Tech. rep., Carnegie Mellon University.

Bradtke, S. J. (1993).
Reinforcement learning applied to linear quadratic regulation. In Advances in Neural Information Processing Systems.

Bradtke, S. J., Ydstie, B. E. and Barto, A. G. (1994). Adaptive linear quadratic control using policy iteration. In American Control Conference, vol. 3. IEEE.

Briani, A. and Cardaliaguet, P. (2018). Stable solutions in potential mean field game systems. Nonlinear Differential Equations and Applications, 25 1.

Busoniu, L., Babuska, R. and De Schutter, B. (2008). A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38 156–172.

Caines, P. E. and Kizilkale, A. C. (2017). ε-Nash equilibria for partially observed LQG mean field games with a major player. IEEE Transactions on Automatic Control, 62 3225–3234.

Calderone, D. J. (2017). Models of Competition for Intelligent Transportation Infrastructure: Parking, Ridesharing, and External Factors in Routing Decisions. University of California, Berkeley.

Carmona, R. and Delarue, F. (2013). Probabilistic analysis of mean-field games. SIAM Journal on Control and Optimization, 51 2705–2734.

Carmona, R. and Delarue, F. (2018). Probabilistic Theory of Mean Field Games with Applications I-II. Springer.

Casgrain, P., Ning, B. and Jaimungal, S. (2019). Deep Q-learning for Nash equilibria: Nash-DQN. arXiv preprint arXiv:1904.10554.

Castro, D. D. and Meir, R. (2010). A convergent online single time scale actor critic algorithm. Journal of Machine Learning Research, 11 367–410.

Conitzer, V. and Sandholm, T. (2007). AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 67 23–43.

de Cote, E. M., Lazaric, A. and Restelli, M. (2006). Learning to cooperate in multi-agent social dilemmas.
In International Conference on Autonomous Agents and Multiagent Systems. ACM.

De Souza, C., Gevers, M. and Goodwin, G. (1986). Riccati equations in optimal filtering of nonstabilizable systems having singular state transition matrices. IEEE Transactions on Automatic Control, 31 831–838.

Dean, S., Mania, H., Matni, N., Recht, B. and Tu, S. (2017). On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688.

Dean, S., Mania, H., Matni, N., Recht, B. and Tu, S. (2018). Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems.

Doerr, B., Linares, R., Zhu, P. and Ferrari, S. (2018). Random finite set theory and optimal control for large spacecraft swarms. arXiv preprint arXiv:1810.00696.

Du, S. S., Chen, J., Li, L., Xiao, L. and Zhou, D. (2017). Stochastic variance reduction methods for policy evaluation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org.

Fang, J. (2014). The LQR controller design of two-wheeled self-balancing robot based on the particle swarm optimization algorithm. Mathematical Problems in Engineering, 2014.

Fazel, M., Ge, R., Kakade, S. M. and Mesbahi, M. (2018). Global convergence of policy gradient methods for the linear quadratic regulator. arXiv preprint arXiv:1801.05039.

Ganzfried, S. and Sandholm, T. (2009). Computing equilibria in multiplayer stochastic games of imperfect information. In Twenty-First International Joint Conference on Artificial Intelligence.

Gomes, D. A., Mohr, J. and Souza, R. R. (2010). Discrete time, finite state space mean field games. Journal de mathématiques pures et appliquées, 93 308–328.

Gomes, D. A. et al. (2014). Mean field games models: a brief survey. Dynamic Games and Applications, 4 110–154.

Guéant, O., Lasry, J.-M.
and Lions, P.-L. (2011). Mean field games and applications. In Paris-Princeton Lectures on Mathematical Finance 2010. Springer, 205–266.

Guo, X., Hu, A., Xu, R. and Zhang, J. (2019). Learning mean-field games. arXiv preprint arXiv:1901.09585.

Hardt, M., Ma, T. and Recht, B. (2016). Gradient descent learns linear dynamical systems. arXiv preprint arXiv:1609.05191.

Heinrich, J. and Silver, D. (2016). Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121.

Hu, J. and Wellman, M. P. (1998). Multiagent reinforcement learning: Theoretical framework and an algorithm. In International Conference on Machine Learning. Morgan Kaufmann Publishers Inc.

Hu, J. and Wellman, M. P. (2003). Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4 1039–1069.

Huang, J. and Huang, M. (2017). Robust mean field linear-quadratic-Gaussian games with unknown L^2-disturbance. SIAM Journal on Control and Optimization, 55 2811–2840.

Huang, J. and Li, N. (2018). Linear–quadratic mean-field game for stochastic delayed systems. IEEE Transactions on Automatic Control, 63 2722–2729.

Huang, J., Li, X. and Wang, T. (2016a). Mean-field linear-quadratic-Gaussian (LQG) games for stochastic integral systems. IEEE Transactions on Automatic Control, 61 2670–2675.

Huang, J., Wang, S. and Wu, Z. (2016b). Backward mean-field linear-quadratic-Gaussian (LQG) games: Full and partial information. IEEE Transactions on Automatic Control, 61 3784–3796.

Huang, M., Caines, P. E. and Malhamé, R. P. (2003). Individual and mass behaviour in large population stochastic wireless power control problems: centralized and Nash equilibrium solutions. In Conference on Decision and Control. IEEE.

Huang, M., Caines, P. E. and Malhamé, R. P. (2007).
Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized ε-Nash equilibria. IEEE Transactions on Automatic Control, 52 1560–1571.

Huang, M., Malhamé, R. P., Caines, P. E. et al. (2006). Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information & Systems, 6 221–252.

Huang, M. and Zhou, M. (2019). Linear quadratic mean field games: Asymptotic solvability and relation to the fixed point approach. arXiv preprint arXiv:1903.08776.

Hughes, E., Leibo, J. Z., Phillips, M., Tuyls, K., Dueñez-Guzman, E., Castañeda, A. G., Dunning, I., Zhu, T., McKee, K., Koster, R. et al. (2018). Inequity aversion improves cooperation in intertemporal social dilemmas. In Advances in Neural Information Processing Systems.

Jayakumar, S. and Aditya, M. (2019). Reinforcement learning in stationary mean-field games. In International Conference on Autonomous Agents and Multiagent Systems.

Kakade, S. M. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems.

Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems.

Korda, N. and La, P. (2015). On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence. In International Conference on Machine Learning.

Kushner, H. and Yin, G. G. (2003). Stochastic approximation and recursive algorithms and applications. Springer Science & Business Media.

Lagoudakis, M. G. and Parr, R. (2002). Value function approximation in zero-sum Markov games. In Uncertainty in Artificial Intelligence.

Lasry, J.-M. and Lions, P.-L. (2006a). Jeux à champ moyen. I – Le cas stationnaire.
Comptes Rendus Mathématique, 343 619–625.

Lasry, J.-M. and Lions, P.-L. (2006b). Jeux à champ moyen. II – Horizon fini et contrôle optimal. Comptes Rendus Mathématique, 343 679–684.

Lasry, J.-M. and Lions, P.-L. (2007). Mean field games. Japanese Journal of Mathematics, 2 229–260.

LeCun, Y., Bengio, Y. and Hinton, G. (2015). Deep learning. Nature, 521 436–444.

Leibo, J. Z., Zambaldi, V., Lanctot, M., Marecki, J. and Graepel, T. (2017). Multi-agent reinforcement learning in sequential social dilemmas. In International Conference on Autonomous Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems.

Lewis, F. L., Vrabie, D. and Syrmos, V. L. (2012). Optimal Control. John Wiley & Sons.

Li, S., Zhang, W. and Zhao, L. (2017). Connections between mean-field game and social welfare optimization. arXiv preprint arXiv:1703.10211.

Li, T. and Zhang, J.-F. (2008). Asymptotically optimal decentralized control for large population stochastic multiagent systems. IEEE Transactions on Automatic Control, 53 1643–1660.

Li, Y. (2018). Deep reinforcement learning. arXiv preprint arXiv:1810.06339.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994. Elsevier, 157–163.

Littman, M. L. (2001). Friend-or-foe Q-learning in general-sum games. In Proceedings of the Eighteenth International Conference on Machine Learning.

Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S. and Petrik, M. (2015). Finite-sample analysis of proximal gradient TD algorithms. In Conference on Uncertainty in Artificial Intelligence.

Maei, H. R. (2018). Convergent actor-critic algorithms under off-policy training and function approximation. arXiv preprint arXiv:1802.07842.

Magnus, J. R. (1979).
The expectation of products of quadratic forms in normal variables: the practice. Statistica Neerlandica, 33 131–136.

Magnus, J. R. et al. (1978). The moments of products of quadratic forms in normal variables. Univ., Instituut voor Actuariaat en Econometrie.

Malik, D., Pananjady, A., Bhatia, K., Khamaru, K., Bartlett, P. L. and Wainwright, M. J. (2018). Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. arXiv preprint arXiv:1812.08305.

Mguni, D., Jennings, J. and de Cote, E. M. (2018). Decentralised learning in systems with many, many strategic agents. In Thirty-Second AAAI Conference on Artificial Intelligence.

Minciardi, R. and Sacile, R. (2011). Optimal control in a cooperative network of smart power grids. IEEE Systems Journal, 6 126–133.

Moon, J. and Başar, T. (2014). Discrete-time LQG mean field games with unreliable communication. In Conference on Decision and Control. IEEE.

Moon, J. and Başar, T. (2018). Linear quadratic mean field Stackelberg differential games. Automatica, 97 200–213.

Moravčík, M., Schmid, M., Burch, N., Lisý, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M. and Bowling, M. (2017). Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356 508–513.

Nash, J. (1951). Non-cooperative games. Annals of Mathematics 286–295.

OpenAI (2018). OpenAI Five. https://blog.openai.com/openai-five/.

Pérolat, J., Piot, B., Geist, M., Scherrer, B. and Pietquin, O. (2016a). Softened approximate policy iteration for Markov games. In International Conference on Machine Learning.

Pérolat, J., Piot, B. and Pietquin, O. (2018). Actor-critic fictitious play in simultaneous move multistage games. In International Conference on Artificial Intelligence and Statistics.

Pérolat, J., Piot, B., Scherrer, B.
and Pietquin, O. (2016b). On the use of non-stationary strategies for solving two-player zero-sum Markov games. In International Conference on Artificial Intelligence and Statistics.

Perolat, J., Scherrer, B., Piot, B. and Pietquin, O. (2015). Approximate dynamic programming for two-player zero-sum Markov games. In International Conference on Machine Learning (ICML 2015).

Peters, J. and Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71 1180–1190.

Rudelson, M., Vershynin, R. et al. (2013). Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18.

Saldi, N., Basar, T. and Raginsky, M. (2018a). Discrete-time risk-sensitive mean-field games. arXiv preprint arXiv:1808.03929.

Saldi, N., Basar, T. and Raginsky, M. (2018b). Markov–Nash equilibria in mean-field games with discounted cost. SIAM Journal on Control and Optimization, 56 4256–4287.

Saldi, N., Basar, T. and Raginsky, M. (2019). Approximate Nash equilibria in partially observed stochastic games with mean-field interactions. Mathematics of Operations Research.

Sandholm, W. H. (2010). Population Games and Evolutionary Dynamics. MIT Press.

Shalev-Shwartz, S., Shammah, S. and Shashua, A. (2016). Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295.

Shoham, Y., Powers, R. and Grenager, T. (2003). Multi-agent reinforcement learning: a critical survey.

Shoham, Y., Powers, R. and Grenager, T. (2007). If multi-agent learning is the answer, what is the question? Artificial Intelligence, 171 365–377.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529 484–489.

Silver, D., Schrittwieser, J.
, Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A. et al. (2017). Mastering the game of Go without human knowledge. Nature, 550 354.

Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C. and Wiewiora, E. (2009a). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM.

Sutton, R. S., Maei, H. R. and Szepesvári, C. (2009b). A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems.

Sutton, R. S., McAllester, D. A., Singh, S. P. and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.

Sznitman, A.-S. (1991). Topics in propagation of chaos. In École d'été de probabilités de Saint-Flour XIX, 1989. Springer, 165–251.

Tembine, H. and Huang, M. (2011). Mean field difference games: McKean-Vlasov dynamics. In Conference on Decision and Control and European Control Conference. IEEE.

Tembine, H., Zhu, Q. and Başar, T. (2014). Risk-sensitive mean-field games. IEEE Transactions on Automatic Control, 59 835–850.

Tu, S. and Recht, B. (2017). Least-squares temporal difference learning for the linear quadratic regulator. arXiv preprint arXiv:1712.08642.

Tu, S. and Recht, B. (2018). The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. arXiv preprint arXiv:1812.03565.

uz Zaman, M. A., Zhang, K., Miehling, E. and Basar, T. (2019).
Approximate equilibrium computation for discrete-time linear-quadratic mean-field games. Manuscript.

Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czarnecki, W., Dudzik, A., Huang, A., Georgiev, P., Powell, R. et al. (2019). AlphaStar: Mastering the real-time strategy game StarCraft II.

Wai, H.-T., Yang, Z., Wang, P. Z. and Hong, M. (2018). Multi-agent reinforcement learning via double averaging primal-dual optimization. In Advances in Neural Information Processing Systems.

Wang, B.-C. and Zhang, J.-F. (2012). Mean field games for large-population multiagent systems with Markov jump parameters. SIAM Journal on Control and Optimization, 50 2308–2334.

Wang, J., Zhang, W., Yuan, S. et al. (2017a). Display advertising with real-time bidding (RTB) and behavioural targeting. Foundations and Trends in Information Retrieval, 11 297–435.

Wang, Y., Chen, W., Liu, Y., Ma, Z.-M. and Liu, T.-Y. (2017b). Finite sample analysis of the GTD policy evaluation algorithms in Markov setting. In Advances in Neural Information Processing Systems.

Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, 8 279–292.

Wei, C.-Y., Hong, Y.-T. and Lu, C.-J. (2017). Online reinforcement learning in stochastic games. In Advances in Neural Information Processing Systems.

Yang, E. and Gu, D. (2004). Multiagent reinforcement learning for multi-robot systems: A survey. Manuscript.

Yang, J., Ye, X., Trivedi, R., Xu, H. and Zha, H. (2018a). Deep mean field games for learning optimal behavior policy of large populations. In International Conference on Learning Representations.

Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W. and Wang, J. (2018b). Mean field multi-agent reinforcement learning. arXiv preprint arXiv:1802.05438.

Yang, Z., Chen, Y., Hong, M. and Wang, Z. (2019).
On the global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. arXiv preprint arXiv:1907.06246.

Yu, H. (2017). On convergence of some gradient-based temporal-differences algorithms for off-policy learning. arXiv preprint arXiv:1712.09652.

Zhang, K., Yang, Z., Liu, H., Zhang, T. and Başar, T. (2018). Finite-sample analyses for fully decentralized multi-agent reinforcement learning. arXiv preprint arXiv:1812.02783.

Zhou, X. Y. and Li, D. (2000). Continuous-time mean-variance portfolio selection: A stochastic LQ framework. Applied Mathematics and Optimization, 42 19–33.

Ziebart, B. D., Maas, A. L., Bagnell, J. A. and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, vol. 3.

Zou, S., Xu, T. and Liang, Y. (2019). Finite-sample analysis for SARSA and Q-learning with linear function approximation. arXiv preprint arXiv:1902.02234.

A Notations in the Appendix

In the proofs, for convenience, for any invertible matrix M, we denote M^{−⊤} = (M^{−1})^⊤ = (M^⊤)^{−1}, and we write ‖M‖_F for the Frobenius norm. We denote by svec(M) the symmetric vectorization of the symmetric matrix M, which is the vectorization of the upper triangular part of M with the off-diagonal entries scaled by √2, and by smat(·) the inverse operation. For any matrices G and H, we denote by G ⊗ H the Kronecker product and by G ⊗_s H the symmetric Kronecker product, which is defined as the mapping on a vector svec(M) such that (G ⊗_s H) svec(M) = 1/2 · svec(HMG^⊤ + GMH^⊤). For notational simplicity, we write E_π(·) to emphasize that the expectation is taken following the policy π.

B Auxiliary Algorithms and Analysis

B.1 Results on D-LQR

In this section, we provide auxiliary results for analyzing Problem 2.2.
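Since the svec/smat operators of Appendix A are used throughout this appendix, here is a minimal NumPy sketch of them (our own illustration, not part of the paper). The √2 scaling is what makes svec compatible with the trace inner product: svec(A)^⊤ svec(B) = tr(AB) for symmetric A and B.

```python
import numpy as np

def svec(M):
    """Symmetric vectorization of a symmetric matrix M: the upper-triangular
    entries, with off-diagonal entries scaled by sqrt(2)."""
    iu = np.triu_indices(M.shape[0])
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    return scale * M[iu]

def smat(v):
    """Inverse of svec: rebuild the symmetric matrix from its svec."""
    n = int((np.sqrt(8 * v.size + 1) - 1) / 2)   # solve n(n+1)/2 = len(v)
    M = np.zeros((n, n))
    iu = np.triu_indices(n)
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    M[iu] = v / scale
    return M + M.T - np.diag(np.diag(M))
```

These conventions match the definition above; the symmetric Kronecker product then acts on svec-coordinates.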
First, we introduce the value functions of the Markov decision process (MDP) induced by Problem 2.2. We define the state- and action-value functions V_{K,b}(x) and Q_{K,b}(x, u) as

V_{K,b}(x) = Σ_{t=0}^∞ { E[c(x_t, u_t) | x_0 = x] − J(K, b) },    (B.1)
Q_{K,b}(x, u) = c(x, u) − J(K, b) + E[V_{K,b}(x_1) | x_0 = x, u_0 = u],    (B.2)

where x_t follows the state transition and u_t follows the policy π_{K,b} given x_t. In other words, we have u_t = −K x_t + b + σ η_t, where η_t ∼ N(0, I). The following proposition establishes the closed forms of these value functions.

Proposition B.1. The state-value function V_{K,b}(x) takes the form of

V_{K,b}(x) = x^⊤ P_K x − tr(P_K Φ_K) + 2 f^⊤_{K,b}(x − μ_{K,b}) − μ^⊤_{K,b} P_K μ_{K,b},    (B.3)

and the action-value function Q_{K,b}(x, u) takes the form of

Q_{K,b}(x, u) = (x; u)^⊤ Υ_K (x; u) + 2 (p_{K,b}; q_{K,b})^⊤ (x; u) − tr(P_K Φ_K) − σ^2 · tr(R + P_K B B^⊤) − b^⊤ R b + 2 b^⊤ R K μ_{K,b} − μ^⊤_{K,b}(Q + K^⊤ R K + P_K) μ_{K,b} + 2 f^⊤_{K,b}[(Aμ + d) − μ_{K,b}] + (Aμ + d)^⊤ P_K (Aμ + d),    (B.4)

where (x; u) denotes the column vector stacking x and u, f_{K,b} = (I − A + BK)^{−⊤}[(A − BK)^⊤ P_K (Bb + Aμ + d) − K^⊤ R b], and Υ_K, p_{K,b}, and q_{K,b} are defined in (3.7).

Proof. See §E.6 for a detailed proof.

By Proposition B.1, we know that V_{K,b}(x) is quadratic in x, while Q_{K,b}(x, u) is quadratic in (x^⊤, u^⊤)^⊤. Now, we show that (3.5) holds.

Proposition B.2. The expected total cost J(K, b) defined in Problem 2.2 takes the form of

J(K, b) = J_1(K) + J_2(K, b) + σ^2 · tr(R) + μ^⊤ Q μ,

where

J_1(K) = tr[(Q + K^⊤ R K) Φ_K] = tr(P_K Ψ_ε),
J_2(K, b) = (μ_{K,b}; b)^⊤ ( Q + K^⊤ R K   −K^⊤ R
                             −R K            R ) (μ_{K,b}; b).

Here μ_{K,b} is defined in (3.2), Φ_K is defined in (3.3), and P_K is defined in (3.4).

Proof. See §E.3 for a detailed proof.

The following proposition establishes the gradients of J_1(K) and J_2(K, b), respectively.

Proposition B.3.
The gradient of J_1(K) and the gradient of J_2(K, b) with respect to b take the forms of

∇_K J_1(K) = 2(Υ^{22}_K K − Υ^{21}_K) · Φ_K,
∇_b J_2(K, b) = 2[Υ^{22}_K (−K μ_{K,b} + b) + Υ^{21}_K μ_{K,b} + q_{K,b}],

where Υ_K and q_{K,b} are defined in (3.7).

Proof. See §E.5 for a detailed proof.

The following theorem establishes the convergence of Algorithm 2.

Theorem B.4 (Convergence of Algorithm 2). Assume that ρ(A − BK_0) < 1. Let ε > 0 be a sufficiently small tolerance. We set

γ ≤ [‖R‖_2 + ‖B‖^2_2 · J(K_0, b_0) · σ^{−1}_min(Ψ_ε)]^{−1},
N ≥ C · ‖Φ_{K^*}‖_2 · γ^{−1} · log{4[J(K_0, b_0) − J(K^*, b^*)] · ε^{−1}},
T_n ≥ poly(‖K_n‖_F, ‖b_0‖_2, ‖μ‖_2, J(K_0, b_0)) · λ^{−4}_{K_n} · [1 − ρ(A − BK_n)]^{−9} · ε^{−5},
\tilde T_n ≥ poly(‖K_n‖_F, ‖b_0‖_2, ‖μ‖_2, J(K_0, b_0)) · λ^{−2}_{K_n} · [1 − ρ(A − BK_n)]^{−12} · ε^{−12},
γ_{n,t} = γ_0 · t^{−1/2},
γ^b ≤ min{ 1 − ρ(A − BK_N), [1 − ρ(A − BK_N)]^2 · [‖B‖^2_2 · ‖K_N‖^2_2 · ‖R‖_2 + ‖B‖^2_2 · ‖Q‖_2]^{−1} },
H ≥ C_0 · ν^{−1}_{K_N} · (γ^b)^{−1} · log{4[J(K_N, b_0) − J(K_N, b_{K_N})] · ε^{−1}},
T^b_h ≥ poly(‖K_N‖_F, ‖b_h‖_2, ‖μ‖_2, J(K_N, b_0)) · λ^{−4}_{K_N} · ν^{−4}_{K_N} · [1 − ρ(A − BK_N)]^{−11} · ε^{−5},
\tilde T^b_h ≥ poly(‖K_N‖_F, ‖b_h‖_2, ‖μ‖_2, J(K_N, b_0)) · λ^{−4}_{K_N} · ν^{−2}_{K_N} · [1 − ρ(A − BK_N)]^{−17} · ε^{−8},
γ^b_{h,t} = γ_0 · t^{−1/2},

where C, C_0, and γ_0 are positive absolute constants, {K_n}_{n ∈ [N]} and {b_h}_{h ∈ [H]} are the sequences generated by Algorithm 2, λ_{K_n} is specified in Proposition B.6, and ν_{K_N} is specified in Proposition 3.3. Then it holds with probability at least 1 − ε^{10} that

J(K_N, b_H) − J(K^*, b^*) < ε,
‖b_H − b^*‖_2 ≤ M_b(μ) · ε^{1/2},
‖K_N − K^*‖_F ≤ [σ^{−1}_min(Ψ_ε) · σ^{−1}_min(R) · ε]^{1/2},
‖\hat μ_{K_N, b_H} − μ_{K^*, b^*}‖_2 ≤ ε,

where M_b(μ) is defined in (4.3).

Proof.
See §D.2 for a detailed proof.

By Theorem B.4, given any mean-field state μ, Algorithm 2 converges linearly to the optimal policy π^*_μ of Problem 2.2.

B.2 Primal-Dual Policy Evaluation Algorithm

Note that the critic update steps in Algorithm 2 are built upon estimators of the matrix Υ_K and the vector q_{K,b}. We now derive a policy evaluation algorithm that constructs the estimators of Υ_K and q_{K,b}, based on the gradient temporal difference algorithm (Sutton et al., 2009a).

We define the feature vector as

ψ(x, u) = ( φ(x, u); x − μ_{K,b}; u − (−K μ_{K,b} + b) ),    (B.5)

where

φ(x, u) = svec[ ( x − μ_{K,b}; u − (−K μ_{K,b} + b) ) ( x − μ_{K,b}; u − (−K μ_{K,b} + b) )^⊤ ].

Recall that svec(M) gives the symmetric vectorization of the symmetric matrix M. We also define

α_{K,b} = ( svec(Υ_K); Υ_K (μ_{K,b}; −K μ_{K,b} + b) + (p_{K,b}; q_{K,b}) ),    (B.6)

where Υ_K, p_{K,b}, and q_{K,b} are defined in (3.7). To estimate Υ_K and q_{K,b}, it suffices to estimate α_{K,b}. Meanwhile, we define

Θ_{K,b} = E_{π_{K,b}}{ ψ(x, u)[ψ(x, u) − ψ(x′, u′)]^⊤ },    (B.7)

where (x′, u′) is the state-action pair after (x, u) following the policy π_{K,b} and the state transition. The following proposition characterizes the connection between Θ_{K,b} and α_{K,b}.

Proposition B.5. It holds that

( 1                      0
  E_{π_{K,b}}[ψ(x, u)]   Θ_{K,b} ) ( J(K, b); α_{K,b} ) = ( J(K, b); E_{π_{K,b}}[c(x, u) ψ(x, u)] ),

where ψ(x, u) is defined in (B.5), α_{K,b} is defined in (B.6), and Θ_{K,b} is defined in (B.7).

Proof. See §E.7 for a detailed proof.

By Proposition B.5, to obtain α_{K,b}, it suffices to solve the following linear system in ζ = (ζ^1, (ζ^2)^⊤)^⊤,

\tilde Θ_{K,b} · ζ = ( J(K, b); E_{π_{K,b}}[c(x, u) ψ(x, u)] ),    (B.8)

where for notational convenience, we define

\tilde Θ_{K,b} = ( 1                      0
                   E_{π_{K,b}}[ψ(x, u)]   Θ_{K,b} ).    (B.9)

The following proposition shows that Θ_{K,b} is invertible.

Proposition B.6.
If ρ(A − BK) < 1, then the matrix Θ_{K,b} is invertible, and

‖Θ_{K,b}‖_2 ≤ 4(1 + ‖K‖^2_F)^2 · ‖Φ_K‖^2_2.

Also, σ_min(\tilde Θ_{K,b}) ≥ λ_K, where λ_K only depends on ‖K‖_2 and ρ(A − BK).

Proof. See §E.8 for a detailed proof.

By Proposition B.6, Θ_{K,b} is invertible. Therefore, (B.8) admits the unique solution ζ_{K,b} = (J(K, b), α^⊤_{K,b})^⊤. Now, we present the primal-dual gradient temporal difference algorithm.

Primal-Dual Gradient Method. Instead of solving (B.8) directly, we minimize the following loss function with respect to ζ = (ζ^1, (ζ^2)^⊤)^⊤,

[ζ^1 − J(K, b)]^2 + ‖E_{π_{K,b}}[ψ(x, u)] ζ^1 + Θ_{K,b} ζ^2 − E_{π_{K,b}}[c(x, u) ψ(x, u)]‖^2_2.    (B.10)

By Fenchel duality, the minimization of (B.10) is equivalent to the following primal-dual min-max problem,

min_{ζ ∈ V_ζ} max_{ξ ∈ V_ξ} F(ζ, ξ) = { E_{π_{K,b}}[ψ(x, u)] ζ^1 + Θ_{K,b} ζ^2 − E_{π_{K,b}}[c(x, u) ψ(x, u)] }^⊤ ξ^2 + [ζ^1 − J(K, b)] · ξ^1 − ‖ξ‖^2_2 / 2,    (B.11)

where we restrict the primal variable ζ to a compact set V_ζ and the dual variable ξ to a compact set V_ξ, which are specified in Definition B.7. It holds that

∇_{ζ^1} F = ξ^1 + E_{π_{K,b}}[ψ(x, u)]^⊤ ξ^2,    ∇_{ζ^2} F = Θ^⊤_{K,b} ξ^2,
∇_{ξ^1} F = ζ^1 − J(K, b) − ξ^1,    ∇_{ξ^2} F = E_{π_{K,b}}[ψ(x, u)] ζ^1 + Θ_{K,b} ζ^2 − E_{π_{K,b}}[c(x, u) ψ(x, u)] − ξ^2.    (B.12)

The primal-dual gradient method descends in the primal variable ζ and ascends in the dual variable ξ,

ζ^1 ← ζ^1 − γ · ∇_{ζ^1} F(ζ, ξ),    ζ^2 ← ζ^2 − γ · ∇_{ζ^2} F(ζ, ξ),
ξ^1 ← ξ^1 + γ · ∇_{ξ^1} F(ζ, ξ),    ξ^2 ← ξ^2 + γ · ∇_{ξ^2} F(ζ, ξ).    (B.13)

Estimation of the Mean-Field State μ_{K,b}. To use the primal-dual gradient method in (B.13), it remains to evaluate the feature vector ψ(x, u). Note that by (B.5), the evaluation of the feature vector ψ(x, u) requires the mean-field state μ_{K,b}.
In what follows, we construct the estimator $\hat\mu_{K,b}$ of the mean-field state $\mu_{K,b}$ by simulating the MDP under the policy $\pi_{K,b}$ for $\tilde T$ steps, and compute the estimated feature vector $\hat\psi(x,u)$ as
$$\hat\psi(x,u)=\begin{pmatrix}\hat\phi(x,u)\\ x-\hat\mu_{K,b}\\ u-(-K\hat\mu_{K,b}+b)\end{pmatrix}, \qquad (B.14)$$
where $\hat\phi(x,u)$ takes the form
$$\hat\phi(x,u)=\mathrm{svec}\Biggl[\begin{pmatrix}x-\hat\mu_{K,b}\\ u-(-K\hat\mu_{K,b}+b)\end{pmatrix}\begin{pmatrix}x-\hat\mu_{K,b}\\ u-(-K\hat\mu_{K,b}+b)\end{pmatrix}^{\!\top}\Biggr].$$
We now define the sets $\mathcal V_\zeta$ and $\mathcal V_\xi$ in (B.11).

Definition B.7. Given $K_0$ and $b_0$ such that $\rho(A-BK_0)<1$ and $J(K_0,b_0)<\infty$, we define the sets $\mathcal V_\zeta$ and $\mathcal V_\xi$ as
$$\mathcal V_\zeta=\bigl\{\zeta:\ 0\le\zeta^1\le J(K_0,b_0),\ \|\zeta^2\|_2\le M_{\zeta,1}+M_{\zeta,2}\cdot(1+\|K\|_F)\cdot\bigl(1-\rho(A-BK)\bigr)^{-1}\bigr\},$$
$$\mathcal V_\xi=\bigl\{\xi:\ |\xi^1|\le J(K_0,b_0),\ \|\xi^2\|_2\le M_\xi\cdot\bigl(1+\|K\|_F^2\bigr)^3\cdot\bigl(1-\rho(A-BK)\bigr)^{-1}\bigr\}.$$
Here $M_{\zeta,1}$, $M_{\zeta,2}$, and $M_\xi$ are constants independent of $K$ and $b$, which take the forms
$$M_{\zeta,1}=\bigl[(\|Q\|_F+\|R\|_F)+(\|A\|_F^2+\|B\|_F^2)\cdot\sqrt d\cdot J(K_0,b_0)\cdot\sigma_{\min}^{-1}(\Psi_\omega)\bigr]+(\|A\|_2+\|B\|_2)\cdot J(K_0,b_0)^2\cdot\sigma_{\min}^{-1}(\Psi_\omega)\cdot\sigma_{\min}^{-1}(Q)$$
$$\qquad+\bigl[(\|Q\|_2+\|R\|_2)+(\|A\|_2+\|B\|_2)^2\cdot J(K_0,b_0)\cdot\sigma_{\min}^{-1}(\Psi_\omega)\bigr]\cdot J(K_0,b_0)\cdot\bigl(\sigma_{\min}^{-1}(Q)+\sigma_{\min}^{-1}(R)\bigr),$$
$$M_{\zeta,2}=(\|A\|_2+\|B\|_2)\cdot(\kappa_Q+\kappa_R),\qquad M_\xi=C\cdot(M_{\zeta,1}+M_{\zeta,2})\cdot J(K_0,b_0)^2\cdot\sigma_{\min}^{-2}(Q),$$
where $C$ is a positive absolute constant, and $\kappa_Q$ and $\kappa_R$ are the condition numbers of $Q$ and $R$, respectively.

We summarize the primal-dual gradient temporal difference algorithm in Algorithm 3. Hereafter, for notational convenience, we denote by $\hat\psi_t$ the estimated feature vector $\hat\psi(x_t,u_t)$.

Algorithm 3 Primal-Dual Gradient Temporal Difference Algorithm.
1: Input: Policy $\pi_{K,b}$, mean-field state $\mu$, numbers of iterations $\tilde T$ and $T$, stepsizes $\{\gamma_t\}_{t\in[T]}$, parameters $K_0$ and $b_0$.
2: Define the sets $\mathcal V_\zeta$ and $\mathcal V_\xi$ via Definition B.7 with $K_0$ and $b_0$.
3: Initialize the parameters by $\zeta_0\in\mathcal V_\zeta$ and $\xi_0\in\mathcal V_\xi$.
4: Sample $\tilde x_0$ from the stationary distribution $N(\mu_{K,b},\Phi_K)$.
5: for $t=0,\dots,\tilde T-1$ do
6:   Given the mean-field state $\mu$, take action $\tilde u_t$ following $\pi_{K,b}$ and generate the next state $\tilde x_{t+1}$.
7: end for
8: Set $\hat\mu_{K,b}\leftarrow 1/\tilde T\cdot\sum_{t=1}^{\tilde T}\tilde x_t$ and compute the estimated feature vector $\hat\psi$ via (B.14).
9: Sample $x_0$ from the stationary distribution $N(\mu_{K,b},\Phi_K)$.
10: for $t=0,\dots,T-1$ do
11:   Given the mean-field state $\mu$, take action $u_t$ following $\pi_{K,b}$, observe the cost $c_t$, and generate the next state $x_{t+1}$.
12:   Set $\delta_{t+1}\leftarrow\zeta^1_t+(\hat\psi_t-\hat\psi_{t+1})^\top\zeta^2_t-c_t$.
13:   Update parameters via
$$\zeta^1_{t+1}\leftarrow\zeta^1_t-\gamma_{t+1}\cdot(\xi^1_t+\hat\psi_t^\top\xi^2_t),\qquad \zeta^2_{t+1}\leftarrow\zeta^2_t-\gamma_{t+1}\cdot\hat\psi_t(\hat\psi_t-\hat\psi_{t+1})^\top\xi^2_t,$$
$$\xi^1_{t+1}\leftarrow(1-\gamma_{t+1})\cdot\xi^1_t+\gamma_{t+1}\cdot(\zeta^1_t-c_t),\qquad \xi^2_{t+1}\leftarrow(1-\gamma_{t+1})\cdot\xi^2_t+\gamma_{t+1}\cdot\delta_{t+1}\cdot\hat\psi_t.$$
14:   Project $\zeta_{t+1}$ and $\xi_{t+1}$ onto $\mathcal V_\zeta$ and $\mathcal V_\xi$, respectively.
15: end for
16: Set $\hat\alpha_{K,b}\leftarrow(\sum_{t=1}^T\gamma_t)^{-1}\cdot(\sum_{t=1}^T\gamma_t\cdot\zeta^2_t)$, and
$$\hat\Upsilon_K\leftarrow\mathrm{smat}(\hat\alpha_{K,b,1}),\qquad \begin{pmatrix}\hat p_{K,b}\\ \hat q_{K,b}\end{pmatrix}\leftarrow\hat\alpha_{K,b,2}-\hat\Upsilon_K\begin{pmatrix}\hat\mu_{K,b}\\ -K\hat\mu_{K,b}+b\end{pmatrix},$$
where $\hat\alpha_{K,b,1}=(\hat\alpha_{K,b})_1^{(k+d+1)(k+d)/2}$ and $\hat\alpha_{K,b,2}=(\hat\alpha_{K,b})_{(k+d+1)(k+d)/2+1}^{(k+d+3)(k+d)/2}$.
17: Output: Estimators $\hat\mu_{K,b}$, $\hat\Upsilon_K$, and $\hat q_{K,b}$.

We now characterize the rate of convergence of Algorithm 3.

Theorem B.8 (Convergence of Algorithm 3). Given $K_0$, $b_0$, $K$, and $b$ such that $\rho(A-BK_0)<1$ and $J(K,b)\le J(K_0,b_0)$, we define the sets $\mathcal V_\zeta$ and $\mathcal V_\xi$ through Definition B.7. Let $\gamma_t=\gamma_0t^{-1/2}$, where $\gamma_0$ is a positive absolute constant, and let $\rho\in(\rho(A-BK),1)$. For $\tilde T\ge\mathrm{poly}_0(\|K\|_F,\|b\|_2,\|\mu\|_2,J(K_0,b_0))\cdot(1-\rho)^{-6}$ and a sufficiently large $T$, it holds with probability at least $1-T^{-4}-\tilde T^{-6}$ that
$$\|\hat\alpha_{K,b}-\alpha_{K,b}\|_2^2\le\lambda_K^{-2}\cdot\mathrm{poly}_1\bigl(\|K\|_F,\|b\|_2,\|\mu\|_2,J(K_0,b_0)\bigr)\cdot\Bigl(\frac{\log^6T}{T^{1/2}\cdot(1-\rho)^4}+\frac{\log\tilde T}{\tilde T^{1/4}\cdot(1-\rho)^2}\Bigr),$$
where $\lambda_K$ is defined in Proposition B.6. The same bounds hold for $\|\hat\Upsilon_K-\Upsilon_K\|_F^2$, $\|\hat p_{K,b}-p_{K,b}\|_2^2$, and $\|\hat q_{K,b}-q_{K,b}\|_2^2$. Meanwhile, it holds with probability at least $1-\tilde T^{-6}$ that
$$\|\hat\mu_{K,b}-\mu_{K,b}\|_2\le\frac{\log\tilde T}{\tilde T^{1/4}}\cdot(1-\rho)^{-2}\cdot\mathrm{poly}_2\bigl(\|\Phi_K\|_2,\|K\|_F,\|b\|_2,\|\mu\|_2,J(K_0,b_0)\bigr).$$

Proof. See § D.3 for a detailed proof.

B.3 Temporal Difference Policy Evaluation Algorithm

Besides the primal-dual gradient temporal difference algorithm, in practice we can also evaluate $\alpha_{K,b}$ by the TD(0) method (Sutton and Barto, 2018), which is presented in Algorithm 4.

Algorithm 4 Temporal Difference Policy Evaluation Algorithm.
1: Input: Policy $\pi_{K,b}$, numbers of iterations $\tilde T$ and $T$, stepsizes $\{\gamma_t\}_{t\in[T]}$.
2: Sample $\tilde x_0$ from the stationary distribution $N(\mu_{K,b},\Phi_K)$.
3: for $t=0,\dots,\tilde T-1$ do
4:   Take action $\tilde u_t$ under the policy $\pi_{K,b}$ and generate the next state $\tilde x_{t+1}$.
5: end for
6: Set $\hat\mu_{K,b}\leftarrow 1/\tilde T\cdot\sum_{t=1}^{\tilde T}\tilde x_t$.
7: Sample $x_0$ from the stationary distribution $N(\mu_{K,b},\Phi_K)$.
8: for $t=0,\dots,T$ do
9:   Given the mean-field state $\mu$, take action $u_t$ following $\pi_{K,b}$, observe the cost $c_t$, and generate the next state $x_{t+1}$.
10:   Set $\delta_{t+1}\leftarrow\zeta^1_t+(\hat\psi_t-\hat\psi_{t+1})^\top\zeta^2_t-c_t$.
11:   Update parameters via $\zeta^1_{t+1}\leftarrow(1-\gamma_{t+1})\cdot\zeta^1_t+\gamma_{t+1}\cdot c_t$ and $\zeta^2_{t+1}\leftarrow\zeta^2_t-\gamma_{t+1}\cdot\delta_{t+1}\cdot\hat\psi_t$.
12:   Project $\zeta_t$ onto $\mathcal V'_\zeta$, where $\mathcal V'_\zeta$ is a compact set.
13: end for
14: Set $\hat\alpha_{K,b}\leftarrow(\sum_{t=1}^T\gamma_t)^{-1}\cdot(\sum_{t=1}^T\gamma_t\cdot\zeta^2_t)$, and
$$\hat\Upsilon_K\leftarrow\mathrm{smat}(\hat\alpha_{K,b,1}),\qquad \begin{pmatrix}\hat p_{K,b}\\ \hat q_{K,b}\end{pmatrix}\leftarrow\hat\alpha_{K,b,2}-\hat\Upsilon_K\begin{pmatrix}\hat\mu_{K,b}\\ -K\hat\mu_{K,b}+b\end{pmatrix},$$
where $\hat\alpha_{K,b,1}=(\hat\alpha_{K,b})_1^{(k+d+1)(k+d)/2}$ and $\hat\alpha_{K,b,2}=(\hat\alpha_{K,b})_{(k+d+1)(k+d)/2+1}^{(k+d+3)(k+d)/2}$.
15: Output: Estimators $\hat\mu_{K,b}$, $\hat\Upsilon_K$, and $\hat q_{K,b}$.

Note that in the related literature (Bhandari et al., 2018; Korda and La, 2015), non-asymptotic convergence analysis of the TD(0) method with linear function approximation applies only to discounted MDPs. In our ergodic setting, the convergence of the TD(0) method is established only asymptotically (Borkar and Meyn, 2000; Kushner and Yin, 2003) via the ordinary differential equation method. Therefore, in the convergence theorems in § 3, we focus on the primal-dual gradient temporal difference method (Algorithm 3) to establish non-asymptotic convergence results.

C General Formulation

In this section, we study a general formulation of LQ-MFG. Compared with Problem 2.1, this general formulation includes an additional cross term $x_t^\top P\,\mathbb{E}x_t^*$ in the cost function. We define it as follows.

Problem C.1 (General LQ-MFG). We consider the following formulation,
$$x_{t+1}=Ax_t+Bu_t+A\,\mathbb{E}x_t^*+d+\omega_t,$$
$$\tilde c(x_t,u_t)=x_t^\top Qx_t+u_t^\top Ru_t+(\mathbb{E}x_t^*)^\top Q(\mathbb{E}x_t^*)+2x_t^\top P(\mathbb{E}x_t^*),$$
$$\tilde J(\pi)=\lim_{T\to\infty}\mathbb{E}\Bigl[\frac1T\sum_{t=0}^T\tilde c(x_t,u_t)\Bigr],$$
where $x_t\in\mathbb R^m$ is the state vector, $u_t\in\mathbb R^k$ is the action vector generated by the policy $\pi$, $\{x_t^*\}_{t\ge0}$ is the trajectory generated by a Nash policy $\pi^*$ (assuming it exists), $\omega_t\in\mathbb R^m$ is an independent random noise term following the Gaussian distribution $N(0,\Psi_\omega)$, and $d\in\mathbb R^m$ is a drift term. Here the expectation in $\mathbb{E}x_t^*$ is taken across all the agents. We aim to find $\pi^*$ such that $\tilde J(\pi^*)=\inf_{\pi\in\Pi}\tilde J(\pi)$.
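With the population term frozen at a fixed mean-field state $\mu$, the dynamics above reduce to a drifted linear system under a linear policy $u=-Kx+b$. The following sketch (all matrices synthetic, chosen only so that $A-BK$ is stable) simulates such a system and checks that the long-run average state matches the closed-form fixed point $(I-A+BK)^{-1}(Bb+A\mu+d)$, which plays the role of $\mu_{K,b}$ in (3.2).

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 3, 2
# Synthetic instance of the drifted dynamics with a frozen mean-field state mu.
A = 0.4 * np.eye(m)
B = 0.3 * rng.standard_normal((m, k))
K = 0.1 * rng.standard_normal((k, m))
b = 0.5 * rng.standard_normal(k)
mu = rng.standard_normal(m)
d = rng.standard_normal(m)

# The closed-loop matrix A - B K must be stable (spectral radius < 1).
assert max(abs(np.linalg.eigvals(A - B @ K))) < 1

# Stationary mean of x_{t+1} = (A - B K) x_t + B b + A mu + d + omega_t.
mu_Kb = np.linalg.solve(np.eye(m) - A + B @ K, B @ b + A @ mu + d)

x = np.zeros(m)
total = np.zeros(m)
T = 100000
for t in range(T):
    u = -K @ x + b                                   # linear policy (no exploration noise)
    x = A @ x + B @ u + A @ mu + d + 0.05 * rng.standard_normal(m)
    total += x
print(np.linalg.norm(total / T - mu_Kb))             # small: ergodic average matches fixed point
```

This is exactly the estimation step used for $\hat\mu_{K,b}$ in Algorithms 3 and 4: average the simulated states and compare against the fixed point of the mean dynamics.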
Following a similar analysis to § 2, it suffices to study Problem C.1 with $t$ sufficiently large, which motivates us to formulate the following general drifted LQR (general D-LQR) problem.

Problem C.2 (General D-LQR). Given a mean-field state $\mu\in\mathbb R^m$, we consider the following formulation,
$$x_{t+1}=Ax_t+Bu_t+A\mu+d+\omega_t,$$
$$\tilde c_\mu(x_t,u_t)=x_t^\top Qx_t+u_t^\top Ru_t+\mu^\top Q\mu+2x_t^\top P\mu,$$
$$\tilde J_\mu(\pi)=\lim_{T\to\infty}\mathbb{E}\Bigl[\frac1T\sum_{t=0}^T\tilde c_\mu(x_t,u_t)\Bigr],$$
where $x_t\in\mathbb R^m$ is the state vector, $u_t\in\mathbb R^k$ is the action vector generated by the policy $\pi$, $\omega_t\in\mathbb R^m$ is an independent random noise term following the Gaussian distribution $N(0,\Psi_\omega)$, and $d\in\mathbb R^m$ is a drift term. We aim to find an optimal policy $\pi^*_\mu$ such that $\tilde J_\mu(\pi^*_\mu)=\inf_{\pi\in\Pi}\tilde J_\mu(\pi)$.

In Problem C.2, the unique optimal policy $\pi^*_\mu$ admits a linear form $\pi^*_\mu(x_t)=-K_{\pi^*_\mu}x_t+b_{\pi^*_\mu}$ (Anderson and Moore, 2007), where the matrix $K_{\pi^*_\mu}\in\mathbb R^{k\times m}$ and the vector $b_{\pi^*_\mu}\in\mathbb R^k$ are the parameters of the policy. It then suffices to find the optimal policy in the class $\Pi$ introduced in (2.1). As in § 3.3, we drop the subscript $\mu$ when we focus on Problem C.2 for a fixed $\mu$. We write $\pi_{K,b}(x)=-Kx+b+\sigma\eta_t$ to emphasize the dependence on $K$ and $b$, and accordingly $\tilde J(K,b)=\tilde J(\pi_{K,b})$. We derive a closed form of the expected total cost $\tilde J(K,b)$ in the following proposition.

Proposition C.3. The expected total cost $\tilde J(K,b)$ in Problem C.2 decomposes as
$$\tilde J(K,b)=\tilde J_1(K)+\tilde J_2(K,b)+\sigma^2\cdot\mathrm{tr}(R)+\mu^\top Q\mu,$$
where $\tilde J_1(K)$ and $\tilde J_2(K,b)$ take the forms
$$\tilde J_1(K)=\mathrm{tr}\bigl((Q+K^\top RK)\Phi_K\bigr)=\mathrm{tr}(P_K\Psi_\epsilon),$$
$$\tilde J_2(K,b)=\begin{pmatrix}\mu_{K,b}\\ b\end{pmatrix}^{\!\top}\begin{pmatrix}Q+K^\top RK & -K^\top R\\ -RK & R\end{pmatrix}\begin{pmatrix}\mu_{K,b}\\ b\end{pmatrix}+2\mu^\top P\mu_{K,b}.$$
Here $\mu_{K,b}$ is defined in (3.2), $\Phi_K$ is defined in (3.3), and $P_K$ is defined in (3.4).

Proof. The proof is similar to that of Proposition B.2, and is therefore omitted.
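The identity $\mathrm{tr}((Q+K^\top RK)\Phi_K)=\mathrm{tr}(P_K\Psi)$ in Proposition C.3 can be checked numerically. The sketch below assumes, consistently with (3.3) and (3.4), that $\Phi_K$ and $P_K$ are the solutions of the discrete Lyapunov equations $\Phi=M\Phi M^\top+\Psi$ and $P=M^\top PM+Q+K^\top RK$ with $M=A-BK$, and computes both by fixed-point iteration (the matrices themselves are synthetic).

```python
import numpy as np

rng = np.random.default_rng(2)
m, k = 3, 2
A = 0.5 * np.eye(m)
B = 0.3 * rng.standard_normal((m, k))
K = 0.1 * rng.standard_normal((k, m))
Q = np.eye(m)
R = np.eye(k)
Psi = 0.1 * np.eye(m)          # noise covariance
M = A - B @ K                  # stable closed-loop matrix

def lyap_fixed_point(M, C, iters=2000):
    """Iterate X <- M X M^T + C; converges to the Lyapunov solution when rho(M) < 1."""
    X = np.zeros_like(C)
    for _ in range(iters):
        X = M @ X @ M.T + C
    return X

Phi = lyap_fixed_point(M, Psi)                    # state covariance Phi_K
P = lyap_fixed_point(M.T, Q + K.T @ R @ K)        # cost-to-go matrix P_K
J1_a = np.trace((Q + K.T @ R @ K) @ Phi)
J1_b = np.trace(P @ Psi)
print(J1_a, J1_b)                                 # the two expressions agree
```

Both traces equal $\sum_{t\ge0}\mathrm{tr}\bigl((Q+K^\top RK)M^t\Psi(M^t)^\top\bigr)$, which is why the two ways of writing $\tilde J_1(K)$ coincide.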
Compared with the form of $J(K,b)$ in (3.5), $\tilde J(K,b)$ contains the additional term $2\mu^\top P\mu_{K,b}$ in $\tilde J_2(K,b)$. Recall that $\mu_{K,b}$ is linear in $b$ by (3.2). Therefore, $2\mu^\top P\mu_{K,b}$ is linear in $b$, which shows that $\tilde J_2(K,b)$ is still strongly convex in $b$. The following proposition formally characterizes this strong convexity.

Proposition C.4. Given any $K$, the function $\tilde J_2(K,b)$ is $\nu_K$-strongly convex in $b$, where $\nu_K=\sigma_{\min}(Y_{1,K}^\top Y_{1,K}+Y_{2,K}^\top Y_{2,K})$ with $Y_{1,K}=R^{1/2}K(I-A+BK)^{-1}B-R^{1/2}$ and $Y_{2,K}=Q^{1/2}(I-A+BK)^{-1}B$. Also, $\tilde J_2(K,b)$ has a $\iota_K$-Lipschitz continuous gradient in $b$, where $\iota_K\le[1-\rho(A-BK)]^{-2}\cdot\bigl(\|B\|_2^2\cdot\|K\|_2^2\cdot\|R\|_2+\|B\|_2^2\cdot\|Q\|_2\bigr)$.

Proof. The proof is similar to that of Proposition 3.3, and is therefore omitted.

We derive an analogue of Proposition 3.4 in the sequel.

Proposition C.5. We define $\tilde b_K=\mathrm{argmin}_b\,\tilde J_2(K,b)$. It holds that
$$\tilde b_K=\bigl(KQ^{-1}(I-A)^\top-R^{-1}B^\top\bigr)\cdot S\cdot\bigl((A\mu+d)+(I-A)Q^{-1}P^\top\mu\bigr)-KQ^{-1}P^\top\mu.$$
Moreover, $\tilde J_2(K,\tilde b_K)$ takes the form
$$\tilde J_2(K,\tilde b_K)=\begin{pmatrix}A\mu+d\\ P^\top\mu\end{pmatrix}^{\!\top}\begin{pmatrix}S & S(I-A)Q^{-1}\\ Q^{-1}(I-A)^\top S & Q^{-1}(I-A)^\top S(I-A)Q^{-1}-Q^{-1}\end{pmatrix}\begin{pmatrix}A\mu+d\\ P^\top\mu\end{pmatrix},$$
which is independent of $K$. Here $S=[(I-A)Q^{-1}(I-A)^\top+BR^{-1}B^\top]^{-1}$.

Proof. The proof is similar to that of Proposition 3.4, and is therefore omitted.

Similar to Problem 2.2, we define the state- and action-value functions as
$$\tilde V_{K,b}(x)=\sum_{t=0}^\infty\bigl\{\mathbb{E}\bigl[\tilde c(x_t,u_t)\,\big|\,x_0=x\bigr]-\tilde J(K,b)\bigr\},$$
$$\tilde Q_{K,b}(x,u)=\tilde c(x,u)-\tilde J(K,b)+\mathbb{E}\bigl[\tilde V_{K,b}(x_1)\,\big|\,x_0=x,\,u_0=u\bigr],$$
where $x_t$ follows the state transition and $u_t$ follows the policy $\pi_{K,b}$ given $x_t$; in other words, $u_t=-Kx_t+b+\sigma\eta_t$ with $\eta_t\sim N(0,I)$.

Similar to Proposition B.1, the following proposition establishes the closed forms of these value functions.

Proposition C.6. The state-value function $\tilde V_{K,b}(x)$ takes the form
$$\tilde V_{K,b}(x)=x^\top P_Kx-\mathrm{tr}(P_K\Phi_K)+2\tilde f_{K,b}^\top(x-\mu_{K,b})-\mu_{K,b}^\top P_K\mu_{K,b},$$
and the action-value function $\tilde Q_{K,b}(x,u)$ takes the form
$$\tilde Q_{K,b}(x,u)=\begin{pmatrix}x\\ u\end{pmatrix}^{\!\top}\Upsilon_K\begin{pmatrix}x\\ u\end{pmatrix}+2\begin{pmatrix}\tilde p_{K,b}\\ \tilde q_{K,b}\end{pmatrix}^{\!\top}\begin{pmatrix}x\\ u\end{pmatrix}-\mathrm{tr}(P_K\Phi_K)-\sigma^2\cdot\mathrm{tr}(R+P_KBB^\top)-b^\top Rb+2b^\top RK\mu_{K,b}$$
$$\qquad-\mu_{K,b}^\top(Q+K^\top RK+P_K)\mu_{K,b}+2\tilde f_{K,b}^\top\bigl((A\mu+d)-\mu_{K,b}\bigr)+(A\mu+d)^\top P_K(A\mu+d)-2\mu^\top P\mu_{K,b}.$$
Here $\Upsilon_K$ is defined in (3.7), and $\tilde p_{K,b}$, $\tilde q_{K,b}$ are defined as
$$\tilde p_{K,b}=A^\top\bigl(P_K\cdot(A\mu+d)+\tilde f_{K,b}\bigr)+P\mu,\qquad \tilde q_{K,b}=B^\top\bigl(P_K\cdot(A\mu+d)+\tilde f_{K,b}\bigr), \qquad (C.1)$$
where $\tilde f_{K,b}=(I-A+BK)^{-\top}\bigl[(A-BK)^\top P_K(Bb+A\mu+d)-K^\top Rb+P\mu\bigr]$.

Proof. The proof is similar to that of Proposition B.1, and is therefore omitted.

The following proposition establishes the gradients of $\tilde J_1(K)$ and $\tilde J_2(K,b)$, respectively.

Proposition C.7. The gradient of $\tilde J_1(K)$ and the gradient of $\tilde J_2(K,b)$ with respect to $b$ take the forms
$$\nabla_K\tilde J_1(K)=2(\Upsilon_K^{22}K-\Upsilon_K^{21})\cdot\Phi_K,\qquad \nabla_b\tilde J_2(K,b)=2\bigl(\Upsilon_K^{22}(-K\mu_{K,b}+b)+\Upsilon_K^{21}\mu_{K,b}+\tilde q_{K,b}\bigr),$$
where $\Upsilon_K$ and $\tilde q_{K,b}$ are defined in (3.7) and (C.1), respectively.

Proof. The proof is similar to that of Proposition B.3, and is therefore omitted.

Equipped with the above results, in parallel to the analysis in § 3, slight modifications of Algorithms 1, 2, and 3 yield analogous actor-critic algorithms for solving both Problem C.1 and Problem C.2, for which all the non-asymptotic convergence results continue to hold. We omit the algorithms and the convergence results here.
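The actor step driven by $\nabla_K\tilde J_1(K)=2(\Upsilon_K^{22}K-\Upsilon_K^{21})\Phi_K$ can be illustrated in the model-based (population) regime. The sketch below assumes the standard LQR forms $\Upsilon_K^{22}=R+B^\top P_KB$ and $\Upsilon_K^{21}=B^\top P_KA$ (an assumption consistent with (3.7), which is not restated in this appendix) and runs plain gradient descent on $\tilde J_1$, whose value should decrease monotonically for a small enough stepsize. All matrices are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
m, k = 3, 2
A = 0.5 * np.eye(m)
B = 0.3 * rng.standard_normal((m, k))
Q = np.eye(m)
R = np.eye(k)
Psi = 0.1 * np.eye(m)          # noise covariance

def lyap(M, C, iters=3000):
    """Fixed-point iteration for X = M X M^T + C, valid when rho(M) < 1."""
    X = np.zeros_like(C)
    for _ in range(iters):
        X = M @ X @ M.T + C
    return X

def J1(K):
    P = lyap((A - B @ K).T, Q + K.T @ R @ K)   # P_K
    return np.trace(P @ Psi)

def grad_K(K):
    M = A - B @ K
    P = lyap(M.T, Q + K.T @ R @ K)             # P_K
    Phi = lyap(M, Psi)                         # Phi_K
    Ups22 = R + B.T @ P @ B                    # assumed form of Upsilon^{22}_K
    Ups21 = B.T @ P @ A                        # assumed form of Upsilon^{21}_K
    return 2 * (Ups22 @ K - Ups21) @ Phi       # gradient of J1 w.r.t. K

K = np.zeros((k, m))
gamma = 0.05
costs = [J1(K)]
for _ in range(200):
    K = K - gamma * grad_K(K)                  # actor (policy gradient) step
    costs.append(J1(K))
print(costs[0], costs[-1])                     # cost decreases monotonically
```

The stepsize satisfies $\gamma\le[\|R\|_2+\|B\|_2^2J_1(K_0)\sigma_{\min}^{-1}(\Psi)]^{-1}$ for this instance, which is the condition under which the descent argument in § D.2 guarantees monotone improvement.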
D Proofs of Theorems

D.1 Proof of Theorem 4.1

We define $\mu^*_{s+1}=\Lambda(\mu_s)$, the mean-field state generated by the optimal policy $\pi_{K^*(\mu_s),b^*(\mu_s)}=\Lambda_1(\mu_s)$ under the current mean-field state $\mu_s$. By Proposition 3.4, the optimal $K^*(\mu)$ is independent of the mean-field state $\mu$; we therefore write $K^*=K^*(\mu)$ hereafter for notational convenience. By (3.2),
$$\mu^*_{s+1}=(I-A+BK^*)^{-1}\cdot\bigl(Bb^*(\mu_s)+A\mu_s+d\bigr).$$
We define
$$\tilde\mu_{s+1}=(I-A+BK_s)^{-1}(Bb_s+A\mu_s+d),$$
the mean-field state generated by the policy $\pi_s$ under the current mean-field state $\mu_s$, where $K_s$ and $b_s$ are the parameters of $\pi_s$. By the triangle inequality,
$$\|\mu_{s+1}-\mu^*\|_2\le\underbrace{\|\mu_{s+1}-\tilde\mu_{s+1}\|_2}_{E_1}+\underbrace{\|\tilde\mu_{s+1}-\mu^*_{s+1}\|_2}_{E_2}+\underbrace{\|\mu^*_{s+1}-\mu^*\|_2}_{E_3}, \qquad (D.1)$$
where $\mu_{s+1}$ is generated by Algorithm 1. We upper bound $E_1$, $E_2$, and $E_3$ in the sequel.

Upper bound of $E_1$. By Theorem B.4, it holds with probability at least $1-\varepsilon^{10}$ that
$$E_1=\|\mu_{s+1}-\tilde\mu_{s+1}\|_2<\varepsilon_s\le\varepsilon/8\cdot2^{-s}, \qquad (D.2)$$
where $\varepsilon_s$ is given in (4.2).

Upper bound of $E_2$. By the triangle inequality,
$$E_2=\bigl\|(I-A+BK_s)^{-1}(Bb_s+A\mu_s+d)-(I-A+BK^*)^{-1}\cdot\bigl(Bb^*(\mu_s)+A\mu_s+d\bigr)\bigr\|_2$$
$$\le\bigl\|Bb^*(\mu_s)+A\mu_s+d\bigr\|_2\cdot\bigl\|\bigl(I-A+BK^*+B(K_s-K^*)\bigr)^{-1}-(I-A+BK^*)^{-1}\bigr\|_2+\bigl\|(I-A+BK_s)^{-1}\bigr\|_2\cdot\|B\|_2\cdot\bigl\|b_s-b^*(\mu_s)\bigr\|_2. \qquad (D.3)$$
By Taylor expansion,
$$\bigl\|\bigl(I-A+BK^*+B(K_s-K^*)\bigr)^{-1}-(I-A+BK^*)^{-1}\bigr\|_2=\bigl\|(I-A+BK^*)^{-1}\bigl[I+(I-A+BK^*)^{-1}B(K_s-K^*)\bigr]^{-1}-(I-A+BK^*)^{-1}\bigr\|_2$$
$$\le2\bigl\|(I-A+BK^*)^{-1}B(K_s-K^*)(I-A+BK^*)^{-1}\bigr\|_2. \qquad (D.4)$$
Meanwhile, by Taylor expansion, it holds with probability at least $1-\varepsilon^{10}$ that
$$\bigl\|(I-A+BK_s)^{-1}\bigr\|_2=\bigl\|(I-A+BK^*)^{-1}\bigl[I+(I-A+BK^*)^{-1}B(K_s-K^*)\bigr]^{-1}\bigr\|_2$$
$$\le\bigl(1-\rho(A-BK^*)\bigr)^{-1}\cdot\bigl(1+\bigl\|(I-A+BK^*)^{-1}B\bigr\|_2\cdot\|K^*-K_s\|_2\bigr)\le2\bigl(1-\rho(A-BK^*)\bigr)^{-2}, \qquad (D.5)$$
where the last inequality comes from Theorem B.4. Plugging (D.4) and (D.5) into (D.3), it holds with probability at least $1-\varepsilon^{10}$ that
$$E_2\le2\bigl\|Bb^*(\mu_s)+A\mu_s+d\bigr\|_2\cdot\bigl(1-\rho(A-BK^*)\bigr)^{-2}\cdot\|B\|_2\cdot\|K_s-K^*\|_2+2\bigl(1-\rho(A-BK^*)\bigr)^{-2}\cdot\|B\|_2\cdot\bigl\|b_s-b^*(\mu_s)\bigr\|_2. \qquad (D.6)$$
By Proposition 3.4,
$$\bigl\|Bb^*(\mu_s)+A\mu_s+d\bigr\|_2\le L_1\cdot\|B\|_2\cdot\|\mu_s\|_2+\|A\|_2\cdot\|\mu_s\|_2+\|d\|_2\le\bigl(L_1\cdot\|B\|_2+\|A\|_2\bigr)\cdot\|\mu_s\|_2+\|d\|_2, \qquad (D.7)$$
where the scalar $L_1$ is defined in Assumption 3.1. Meanwhile, by Theorem B.4, it holds with probability at least $1-\varepsilon^{10}$ that
$$\|K_s-K^*\|_F\le\bigl(\sigma_{\min}^{-1}(\Psi_\epsilon)\cdot\sigma_{\min}^{-1}(R)\cdot\varepsilon_s\bigr)^{1/2},\qquad \bigl\|b_s-b^*(\mu_s)\bigr\|_2\le M_b(\mu_s)\cdot\varepsilon_s^{1/2}, \qquad (D.8)$$
where $M_b(\mu_s)$ is defined in (4.3). Combining (D.6), (D.7), (D.8), and the choice of $\varepsilon_s$ in (4.2), it holds with probability at least $1-\varepsilon^{10}$ that
$$E_2\le\varepsilon/8\cdot2^{-s}. \qquad (D.9)$$

Upper bound of $E_3$. By Proposition 3.2,
$$E_3=\|\mu^*_{s+1}-\mu^*\|_2=\bigl\|\Lambda(\mu_s)-\Lambda(\mu^*)\bigr\|_2\le L_0\cdot\|\mu_s-\mu^*\|_2, \qquad (D.10)$$
where $L_0=L_1L_3+L_2$ by Assumption 3.1. Plugging (D.2), (D.9), and (D.10) into (D.1), we obtain
$$\|\mu_{s+1}-\mu^*\|_2\le L_0\cdot\|\mu_s-\mu^*\|_2+\varepsilon\cdot2^{-s-2}, \qquad (D.11)$$
which holds with probability at least $1-\varepsilon^{10}$.

Following from (D.11) and a union bound argument with $S=O(\log(1/\varepsilon))$, it holds with probability at least $1-\varepsilon^5$ that
$$\|\mu_S-\mu^*\|_2\le L_0^S\cdot\|\mu_0-\mu^*\|_2+\varepsilon/2,$$
where we use the fact that $L_0<1$ by Assumption 3.1. By the choice of $S$ in (4.1), it further holds with probability at least $1-\varepsilon^6$ that
$$\|\mu_S-\mu^*\|_2\le\varepsilon. \qquad (D.12)$$
By Theorem B.4 and the choice of $\varepsilon_s$ in (4.2), it holds with probability at least $1-\varepsilon^5$ that
$$\|K_S-K^*\|_F=\bigl\|K_S-K^*(\mu_S)\bigr\|_F\le\bigl(\sigma_{\min}^{-1}(\Psi_\epsilon)\cdot\sigma_{\min}^{-1}(R)\cdot\varepsilon_S\bigr)^{1/2}\le\varepsilon. \qquad (D.13)$$
Meanwhile, by the triangle inequality and the choice of $\varepsilon_s$ in (4.2), it holds with probability at least $1-\varepsilon^5$ that
$$\|b_S-b^*\|_2\le\bigl\|b_S-b^*(\mu_S)\bigr\|_2+\bigl\|b^*(\mu_S)-b^*\bigr\|_2\le M_b(\mu_S)\cdot\varepsilon_S^{1/2}+L_1\cdot\|\mu_S-\mu^*\|_2\le(1+L_1)\cdot\varepsilon, \qquad (D.14)$$
where the second inequality comes from Theorem B.4 and Proposition 3.4, and the last inequality comes from (D.12). By (D.12), (D.13), and (D.14), we conclude the proof of the theorem.

D.2 Proof of Theorem B.4

Proof. We first show that $J_1(K_N)-J_1(K^*)<\varepsilon/2$ with high probability, and then that $J_2(K_N,b_H)-J_2(K^*,b^*)<\varepsilon/2$ with high probability. Together these give
$$J(K_N,b_N)-J(K^*,b^*)=J_1(K_N)+J_2(K_N,b_H)-J_1(K^*)-J_2(K^*,b^*)<\varepsilon$$
with high probability, which proves Theorem B.4.

Part 1. We show that $J_1(K_N)-J_1(K^*)<\varepsilon/2$ with high probability. We first bound $J_1(K_1)-J_1(K_2)$ for any $K_1$ and $K_2$. By Proposition B.2, $J_1(K)$ takes the form
$$J_1(K)=\mathrm{tr}(P_K\Psi_\epsilon)=\mathbb{E}_{y\sim N(0,\Psi_\epsilon)}\bigl(y^\top P_Ky\bigr). \qquad (D.15)$$
The following lemma calculates $y^\top P_{K_1}y-y^\top P_{K_2}y$ for any $K_1$ and $K_2$.

Lemma D.1. Assume that $\rho(A-BK_1)<1$ and $\rho(A-BK_2)<1$.
For any state vector $y$, denote by $\{y_t\}_{t\ge0}$ the sequence generated by the state transition $y_{t+1}=(A-BK_2)y_t$ with initial state $y_0=y$. It holds that
$$y^\top P_{K_2}y-y^\top P_{K_1}y=\sum_{t\ge0}D_{K_1,K_2}(y_t),$$
where
$$D_{K_1,K_2}(y)=2y^\top(K_2-K_1)^\top(\Upsilon_{K_1}^{22}K_1-\Upsilon_{K_1}^{21})y+y^\top(K_2-K_1)^\top\Upsilon_{K_1}^{22}(K_2-K_1)y.$$
Here $\Upsilon_K$ is defined in (3.7).

Proof. See § F.1 for a detailed proof.

The following lemma shows that $J_1(K)$ is gradient dominated.

Lemma D.2. Let $K^*$ be the optimal parameter and $K$ be a parameter such that $J_1(K)<\infty$. Then it holds that
$$J_1(K)-J_1(K^*)\le\sigma_{\min}^{-1}(R)\cdot\|\Phi_{K^*}\|_2\cdot\mathrm{tr}\bigl((\Upsilon_K^{22}K-\Upsilon_K^{21})^\top(\Upsilon_K^{22}K-\Upsilon_K^{21})\bigr), \qquad (D.16)$$
$$J_1(K)-J_1(K^*)\ge\sigma_{\min}(\Psi_\omega)\cdot\|\Upsilon_K^{22}\|_2^{-1}\cdot\mathrm{tr}\bigl((\Upsilon_K^{22}K-\Upsilon_K^{21})^\top(\Upsilon_K^{22}K-\Upsilon_K^{21})\bigr). \qquad (D.17)$$

Proof. See § F.2 for a detailed proof.

Recall that in Algorithm 2 the parameter $K$ is updated via
$$K_{n+1}=K_n-\gamma\cdot(\hat\Upsilon_{K_n}^{22}K_n-\hat\Upsilon_{K_n}^{21}), \qquad (D.18)$$
where $\hat\Upsilon_{K_n}$ is the output of Algorithm 3. We upper bound $|J_1(K_{n+1})-J_1(K^*)|$ in the sequel. First, we show that if $J_1(K_n)-J_1(K^*)\ge\varepsilon/2$ holds for all $n\le N$, then
$$J_1(K_N)\le J_1(K_{N-1})\le\cdots\le J_1(K_0), \qquad (D.19)$$
which holds with probability at least $1-\varepsilon^{13}$. We prove (D.19) by mathematical induction. Suppose that
$$J_1(K_n)\le J_1(K_{n-1})\le\cdots\le J_1(K_0), \qquad (D.20)$$
which holds trivially for $n=0$. In what follows, we define $\tilde K_{n+1}$ as
$$\tilde K_{n+1}=K_n-\gamma\cdot(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21}), \qquad (D.21)$$
where $\Upsilon_{K_n}$ is given in (3.7). By (D.21),
$$J_1(\tilde K_{n+1})-J_1(K_n)=\mathbb{E}_{y\sim N(0,\Psi_\epsilon)}\bigl[y^\top(P_{\tilde K_{n+1}}-P_{K_n})y\bigr]$$
$$=-2\gamma\cdot\mathrm{tr}\bigl(\Phi_{\tilde K_{n+1}}\cdot(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})^\top(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})\bigr)+\gamma^2\cdot\mathrm{tr}\bigl(\Phi_{\tilde K_{n+1}}\cdot(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})^\top\Upsilon_{K_n}^{22}(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})\bigr)$$
$$\le-2\gamma\cdot\mathrm{tr}\bigl(\Phi_{\tilde K_{n+1}}\cdot(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})^\top(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})\bigr)+\gamma^2\cdot\|\Upsilon_{K_n}^{22}\|_2\cdot\mathrm{tr}\bigl(\Phi_{\tilde K_{n+1}}\cdot(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})^\top(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})\bigr), \qquad (D.22)$$
where the first equality comes from (D.15), the second equality comes from Lemma D.1, and the last inequality comes from the trace inequality. By the definition of $\Upsilon_K$ in (3.7), we obtain
$$\|\Upsilon_{K_n}^{22}\|_2\le\|R\|_2+\|B\|_2^2\cdot\|P_{K_n}\|_2\le\|R\|_2+\|B\|_2^2\cdot J_1(K_n)\cdot\sigma_{\min}^{-1}(\Psi_\epsilon)\le\|R\|_2+\|B\|_2^2\cdot J_1(K_0)\cdot\sigma_{\min}^{-1}(\Psi_\epsilon), \qquad (D.23)$$
where the second inequality comes from Proposition B.2. Plugging (D.23) and the choice of stepsize $\gamma\le[\|R\|_2+\|B\|_2^2\cdot J_1(K_0)\cdot\sigma_{\min}^{-1}(\Psi_\epsilon)]^{-1}$ into (D.22), we obtain
$$J_1(\tilde K_{n+1})-J_1(K_n)\le-\gamma\cdot\mathrm{tr}\bigl(\Phi_{\tilde K_{n+1}}\cdot(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})^\top(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})\bigr)$$
$$\le-\gamma\cdot\sigma_{\min}(\Psi_\epsilon)\cdot\mathrm{tr}\bigl((\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})^\top(\Upsilon_{K_n}^{22}K_n-\Upsilon_{K_n}^{21})\bigr)\le-\gamma\cdot\sigma_{\min}(\Psi_\epsilon)\cdot\sigma_{\min}(R)\cdot\|\Phi_{K^*}\|_2^{-1}\cdot\bigl(J_1(K_n)-J_1(K^*)\bigr)<0, \qquad (D.24)$$
where the last inequality comes from Lemma D.2. The following lemma upper bounds $|J_1(\tilde K_{n+1})-J_1(K_{n+1})|$.

Lemma D.3. Assume that $J_1(K_n)\le J_1(K_0)$. It holds with probability at least $1-\varepsilon^{15}$ that
$$\bigl|J_1(\tilde K_{n+1})-J_1(K_{n+1})\bigr|\le\gamma\cdot\sigma_{\min}(\Psi_\epsilon)\cdot\sigma_{\min}(R)\cdot\|\Phi_{K^*}\|_2^{-1}\cdot\varepsilon/4,$$
where $K_{n+1}$ and $\tilde K_{n+1}$ are defined in (D.18) and (D.21), respectively.

Proof. See § F.3 for a detailed proof.

Combining (D.24) and Lemma D.3, if $J_1(K_n)-J_1(K^*)\ge\varepsilon/2$, it holds with probability at least $1-\varepsilon^{15}$ that
$$J_1(K_{n+1})-J_1(K_n)\le J_1(\tilde K_{n+1})-J_1(K_n)+\bigl|J_1(\tilde K_{n+1})-J_1(K_{n+1})\bigr|\le-\gamma\cdot\sigma_{\min}(\Psi_\epsilon)\cdot\sigma_{\min}(R)\cdot\|\Phi_{K^*}\|_2^{-1}\cdot\varepsilon/4<0. \qquad (D.25)$$
Combining (D.20) and (D.25), it holds with probability at least $1-\varepsilon^{15}$ that
$$J_1(K_{n+1})\le J_1(K_n)\le\cdots\le J_1(K_0).$$
Finally, by a union bound argument and the choice of $N$ in Theorem B.4, if $J_1(K_n)-J_1(K^*)\ge\varepsilon/2$ holds for all $n\le N$, then $J_1(K_N)\le J_1(K_{N-1})\le\cdots\le J_1(K_0)$ holds with probability at least $1-\varepsilon^{13}$. This completes the proof of (D.19).

Combining (D.24) and (D.25), for $J_1(K_n)-J_1(K^*)\ge\varepsilon/2$ we have
$$J_1(K_{n+1})-J_1(K^*)\le\bigl(1-\gamma\cdot\sigma_{\min}(\Psi_\epsilon)\cdot\sigma_{\min}(R)\cdot\|\Phi_{K^*}\|_2^{-1}\bigr)\cdot\bigl(J_1(K_n)-J_1(K^*)\bigr),$$
which holds with probability at least $1-\varepsilon^{13}$. Meanwhile, by a union bound argument and the choice of $N$ in Theorem B.4, it holds with probability at least $1-\varepsilon^{11}$ that
$$J_1(K_N)-J_1(K^*)\le\varepsilon/2. \qquad (D.26)$$
The following lemma upper bounds $\|K_N-K^*\|_F$.

Lemma D.4. For any $K$, we have
$$\|K-K^*\|_F^2\le\sigma_{\min}^{-1}(\Psi_\epsilon)\cdot\sigma_{\min}^{-1}(R)\cdot\bigl(J_1(K)-J_1(K^*)\bigr).$$

Proof. See § F.4 for a detailed proof.

Combining (D.26) and Lemma D.4, we have
$$\|K_N-K^*\|_F\le\bigl(\sigma_{\min}^{-1}(\Psi_\epsilon)\cdot\sigma_{\min}^{-1}(R)\cdot\varepsilon/2\bigr)^{1/2}, \qquad (D.27)$$
which holds with probability at least $1-\varepsilon^{11}$.

Part 2. We show that $J_2(K_N,b_H)-J_2(K^*,b^*)<\varepsilon/2$ with high probability. By Proposition 3.4, $J_2(K^*,b^*)=J_2(K_N,b_{K_N})$. Therefore, it suffices to show that $J_2(K_N,b_H)-J_2(K_N,b_{K_N})<\varepsilon/2$.
First, we show that if $J_2(K_N,b_h)-J_2(K_N,b_{K_N})\ge\varepsilon/2$ holds for all $h\le H$, then
$$J_2(K_N,b_H)\le J_2(K_N,b_{H-1})\le\cdots\le J_2(K_N,b_1)\le J_2(K_N,b_0), \qquad (D.28)$$
which holds with probability at least $1-\varepsilon^{13}$. We prove (D.28) by mathematical induction. Suppose that
$$J_2(K_N,b_h)\le J_2(K_N,b_{h-1})\le\cdots\le J_2(K_N,b_0). \qquad (D.29)$$
Recall that in Algorithm 2 the parameter $b$ is updated via
$$b_{h+1}=b_h-\gamma_b\cdot\hat\nabla_bJ_2(K_N,b_h). \qquad (D.30)$$
Here
$$\hat\nabla_bJ_2(K_N,b_h)=\hat\Upsilon_{K_N}^{22}(-K_N\hat\mu_{K_N,b_h}+b_h)+\hat\Upsilon_{K_N}^{21}\hat\mu_{K_N,b_h}+\hat q_{K_N,b_h}, \qquad (D.31)$$
where $\hat\Upsilon_{K_N}$ and $\hat q_{K_N,b_h}$ are the outputs of Algorithm 3. We define $\tilde b_{h+1}$ as
$$\tilde b_{h+1}=b_h-\gamma_b\cdot\nabla_bJ_2(K_N,b_h). \qquad (D.32)$$
Here
$$\nabla_bJ_2(K_N,b_h)=\Upsilon_{K_N}^{22}(-K_N\mu_{K_N,b_h}+b_h)+\Upsilon_{K_N}^{21}\mu_{K_N,b_h}+q_{K_N,b_h}, \qquad (D.33)$$
where $\Upsilon_{K_N}$ and $q_{K_N,b_h}$ are defined in (3.7). We upper bound $J_2(K_N,b_{h+1})-J_2(K_N,b_{K_N})$ in the sequel. By (D.32) and Proposition 3.3,
$$J_2(K_N,\tilde b_{h+1})-J_2(K_N,b_h)\le-\gamma_b/2\cdot\bigl\|\nabla_bJ_2(K_N,b_h)\bigr\|_2^2\le-\nu_{K_N}\cdot\gamma_b\cdot\bigl(J_2(K_N,b_h)-J_2(K_N,b_{K_N})\bigr)\le-\nu_{K_N}\cdot\gamma_b\cdot\varepsilon<0, \qquad (D.34)$$
where $\nu_{K_N}$ is specified in Proposition 3.3. The following lemma upper bounds $|J_2(K_N,b_{h+1})-J_2(K_N,\tilde b_{h+1})|$.

Lemma D.5. Assume that $J_2(K_N,b_h)\le J_2(K_N,b_0)$. It holds with probability at least $1-\varepsilon^{15}$ that
$$\bigl|J_2(K_N,b_{h+1})-J_2(K_N,\tilde b_{h+1})\bigr|\le\nu_{K_N}\cdot\gamma_b\cdot\varepsilon/2,$$
where $b_{h+1}$ and $\tilde b_{h+1}$ are defined in (D.30) and (D.32), respectively.

Proof. See § F.5 for a detailed proof.

Combining (D.34) and Lemma D.5, if $J_2(K_N,b_h)-J_2(K_N,b_{K_N})\ge\varepsilon$, it holds with probability at least $1-\varepsilon^{15}$ that
$$J_2(K_N,b_{h+1})-J_2(K_N,b_h)\le J_2(K_N,\tilde b_{h+1})-J_2(K_N,b_h)+\bigl|J_2(K_N,b_{h+1})-J_2(K_N,\tilde b_{h+1})\bigr|\le-\nu_{K_N}\cdot\gamma_b\cdot\varepsilon/2<0. \qquad (D.35)$$
Combining (D.29) and (D.35), it holds with probability at least $1-\varepsilon^{15}$ that
$$J_2(K_N,b_{h+1})\le J_2(K_N,b_h)\le\cdots\le J_2(K_N,b_0).$$
By a union bound argument and the choice of $H$ in Theorem B.4, if $J_2(K_N,b_h)-J_2(K_N,b_{K_N})\ge\varepsilon$ holds for all $h\le H$, then (D.28) holds with probability at least $1-\varepsilon^{13}$, which finishes the proof of (D.28).

Combining (D.34) and Lemma D.5, for $J_2(K_N,b_h)-J_2(K_N,b_{K_N})\ge\varepsilon/2$ we have
$$J_2(K_N,b_{h+1})-J_2(K_N,b_{K_N})\le(1-\nu_{K_N}\cdot\gamma_b)\cdot\bigl(J_2(K_N,b_h)-J_2(K_N,b_{K_N})\bigr),$$
which holds with probability at least $1-\varepsilon^{13}$. Meanwhile, by a union bound argument and the choice of $H$ in Theorem B.4, it holds with probability at least $1-\varepsilon^{11}$ that
$$J_2(K_N,b_H)-J_2(K_N,b_{K_N})\le\varepsilon/2. \qquad (D.36)$$
By Proposition 3.3 and (D.36), it holds with probability at least $1-\varepsilon^{11}$ that
$$\|b_H-b_{K_N}\|_2\le(2\varepsilon/\nu_{K^*})^{1/2}. \qquad (D.37)$$
Following from Proposition 3.4, we have
$$b_{K_N}-b^*=(K_N-K^*)Q^{-1}(I-A)^\top\cdot\bigl[(I-A)Q^{-1}(I-A)^\top+BR^{-1}B^\top\bigr]^{-1}\cdot(A\mu+d). \qquad (D.38)$$
Combining (D.27), (D.37), and (D.38), it holds with probability at least $1-\varepsilon^{10}$ that
$$\|b_H-b^*\|_2\le M_b\cdot\varepsilon^{1/2},$$
where
$$M_b(\mu)=4\Bigl\|Q^{-1}(I-A)^\top\cdot\bigl[(I-A)Q^{-1}(I-A)^\top+BR^{-1}B^\top\bigr]^{-1}\cdot(A\mu+d)\Bigr\|_2\cdot\bigl(\nu_{K^*}^{-1}+\sigma_{\min}^{-1}(\Psi_\epsilon)\cdot\sigma_{\min}^{-1}(R)\bigr)^{1/2}.$$
This finishes the proof of the theorem.

D.3 Proof of Theorem B.8

Proof.
We follow the proof of Theorem 4.2 in Yang et al. (2019), which considers LQR without drift terms; since our setting requires a more delicate analysis, we present the proof here.

Part 1. We denote by $\hat\zeta$ and $\hat\xi$ the primal and dual variables generated by Algorithm 3. We define the primal-dual gap of (B.11) as
$$\mathrm{gap}(\hat\zeta,\hat\xi)=\max_{\xi\in\mathcal V_\xi}F(\hat\zeta,\xi)-\min_{\zeta\in\mathcal V_\zeta}F(\zeta,\hat\xi). \qquad (D.39)$$
In the sequel, we upper bound $\|\hat\alpha_{K,b}-\alpha_{K,b}\|_2$ using (D.39). We define $\zeta_{K,b}$ and $\xi(\zeta)$ as
$$\zeta_{K,b}=\bigl(J(K,b),\alpha_{K,b}^\top\bigr)^\top,\qquad \xi(\zeta)=\mathrm{argmax}_\xi\,F(\zeta,\xi). \qquad (D.40)$$
Following from (B.12), we know that
$$\xi^1(\zeta)=\zeta^1-J(K,b),\qquad \xi^2(\zeta)=\mathbb{E}_{\pi_{K,b}}[\psi(x,u)]\zeta^1+\Theta_{K,b}\zeta^2-\mathbb{E}_{\pi_{K,b}}[c(x,u)\psi(x,u)]. \qquad (D.41)$$
The following lemma shows that $\zeta_{K,b}\in\mathcal V_\zeta$ and $\xi(\zeta)\in\mathcal V_\xi$ for any $\zeta\in\mathcal V_\zeta$.

Lemma D.6. Under the assumptions of Theorem B.8, it holds that $\zeta_{K,b}=(J(K,b),\alpha_{K,b}^\top)^\top\in\mathcal V_\zeta$. Also, for any $\zeta\in\mathcal V_\zeta$, the vector $\xi(\zeta)$ defined in (D.40) satisfies $\xi(\zeta)\in\mathcal V_\xi$.

Proof. See § F.6 for a detailed proof.

By (B.12), we know that $\nabla_\zeta F(\zeta_{K,b},0)=0$ and $\nabla_\xi F(\zeta_{K,b},0)=0$. Combined with Lemma D.6, this shows that $(\zeta_{K,b},0)$ is a saddle point of the function $F(\zeta,\xi)$ defined in (B.11). Following from (D.39),
$$\bigl\|\mathbb{E}_{\pi_{K,b}}[\psi(x,u)]\hat\zeta^1+\Theta_{K,b}\hat\zeta^2-\mathbb{E}_{\pi_{K,b}}[c(x,u)\psi(x,u)]\bigr\|_2^2+\bigl(\hat\zeta^1-J(K,b)\bigr)^2=F\bigl(\hat\zeta,\xi(\hat\zeta)\bigr)=\max_{\xi\in\mathcal V_\xi}F(\hat\zeta,\xi)=\mathrm{gap}(\hat\zeta,\hat\xi)+\min_{\zeta\in\mathcal V_\zeta}F(\zeta,\hat\xi), \qquad (D.42)$$
where the first equality comes from (D.41), and the second equality comes from the fact that $\xi(\hat\zeta)=\mathrm{argmax}_{\xi\in\mathcal V_\xi}F(\hat\zeta,\xi)$ by (D.40) and Lemma D.6. We upper bound the right-hand side of (D.42) and lower bound the left-hand side of (D.42) in the sequel.

As for the right-hand side of (D.42), it holds for any $\xi\in\mathcal V_\xi$ that
$$\min_{\zeta\in\mathcal V_\zeta}F(\zeta,\xi)\le\min_{\zeta\in\mathcal V_\zeta}\max_{\xi\in\mathcal V_\xi}F(\zeta,\xi)=\min_{\zeta\in\mathcal V_\zeta}F\bigl(\zeta,\xi(\zeta)\bigr)=\frac12\min_{\zeta\in\mathcal V_\zeta}\Bigl\{\bigl\|\mathbb{E}_{\pi_{K,b}}[\psi(x,u)]\zeta^1+\Theta_{K,b}\zeta^2-\mathbb{E}_{\pi_{K,b}}[c(x,u)\psi(x,u)]\bigr\|_2^2+\bigl(\zeta^1-J(K,b)\bigr)^2\Bigr\}=0, \qquad (D.43)$$
where the first equality comes from the fact that $\xi(\zeta)=\mathrm{argmax}_{\xi\in\mathcal V_\xi}F(\zeta,\xi)$ by (D.40) and Lemma D.6, the second equality comes from (D.41), and the last equality holds by taking $\zeta=\zeta_{K,b}\in\mathcal V_\zeta$. Meanwhile, we lower bound the left-hand side of (D.42) as
$$\bigl\|\mathbb{E}_{\pi_{K,b}}[\psi(x,u)]\hat\zeta^1+\Theta_{K,b}\hat\zeta^2-\mathbb{E}_{\pi_{K,b}}[c(x,u)\psi(x,u)]\bigr\|_2^2+\bigl(\hat\zeta^1-J(K,b)\bigr)^2=\bigl\|\widetilde\Theta_{K,b}(\hat\zeta-\zeta_{K,b})\bigr\|_2^2\ge\lambda_K^2\cdot\|\hat\zeta-\zeta_{K,b}\|_2^2\ge\lambda_K^2\cdot\|\hat\alpha_{K,b}-\alpha_{K,b}\|_2^2, \qquad (D.44)$$
where the first equality comes from the definition of $\widetilde\Theta_{K,b}$ in (B.9), and the first inequality comes from Proposition B.6. Here $\lambda_K$ is defined in Proposition B.6. Combining (D.42), (D.43), and (D.44), it holds that
$$\|\hat\alpha_{K,b}-\alpha_{K,b}\|_2^2\le\lambda_K^{-2}\cdot\mathrm{gap}(\hat\zeta,\hat\xi), \qquad (D.45)$$
which finishes the proof of this part.

Part 2. We now upper bound $\mathrm{gap}(\hat\zeta,\hat\xi)$. We write $\tilde z_t=(\tilde x_t^\top,\tilde u_t^\top)^\top$ for $t\in[\tilde T]$, where $\tilde x_t$ and $\tilde u_t$ are generated in Line 6 of Algorithm 3. Following from the state transition in Problem 2.1 and the form of the linear policy, $\{\tilde z_t\}_{t\in[\tilde T]}$ follows the transition
$$\tilde z_{t+1}=L\tilde z_t+\nu+\delta_t, \qquad (D.46)$$
where
$$\nu=\begin{pmatrix}A\mu+d\\ -K(A\mu+d)+b\end{pmatrix},\qquad \delta_t=\begin{pmatrix}\omega_t\\ -K\omega_t+\sigma\eta\end{pmatrix},\qquad L=\begin{pmatrix}A & B\\ -KA & -KB\end{pmatrix}.$$
Note that we have
$$L=\begin{pmatrix}A & B\\ -KA & -KB\end{pmatrix}=\begin{pmatrix}I\\ -K\end{pmatrix}\begin{pmatrix}A & B\end{pmatrix}.$$
Then, by the property of the spectral radius, it holds that
$$\rho(L)=\rho\Bigl(\begin{pmatrix}A & B\end{pmatrix}\begin{pmatrix}I\\ -K\end{pmatrix}\Bigr)=\rho(A-BK)<1.$$
Thus, the Markov chain generated by (D.46) admits a unique stationary distribution $N(\mu_z,\Sigma_z)$, where
$$\mu_z=(I-L)^{-1}\nu,\qquad \Sigma_z=L\Sigma_zL^\top+\begin{pmatrix}\Psi_\omega & -\Psi_\omega K^\top\\ -K\Psi_\omega & K\Psi_\omega K^\top+\sigma^2I\end{pmatrix}. \qquad (D.47)$$
(D.47) The follo wing lemma c hara cterizes the av erage b µ z = 1 / e T · e T X t =1 e z t . (D.48) Lemma D.7. It holds that b µ z ∼ N  µ z + 1 e T µ e T , 1 e T e Σ e T  , where k µ e T k 2 ≤ M µ · (1 − ρ ) − 2 · k µ z k 2 and k e Σ e T k F ≤ M Σ · (1 − ρ ) − 1 · k Σ z k F . Here M µ and M Σ are p ositiv e absolute constan ts. Moreo ver, it holds with probabilit y at least 1 − e T − 6 that k b µ z − µ z k 2 ≤ log e T e T 1 / 4 · (1 − ρ ) − 2 · p o ly  k Φ K k 2 , k K k F , k b k 2 , k µ k 2  . Pr o of. See § F.7 for a detailed pro of . Lemma D.7 gives that k b µ K,b − µ K,b k 2 ≤ log e T e T 1 / 4 · (1 − ρ ) − 2 · p oly  k Φ K k 2 , k K k F , k b k 2 , k µ k 2  , whic h holds with probability at least 1 − e T − 6 . 47 W e now apply a truncation argumen t to sho w that gap( b ζ , b ξ ) is upp er b ounded. W e define the ev en t E in the sequel. F ollowing from Lemma D.7 , it holds for an y z ∼ N ( µ z , Σ z ) that z − b µ z + 1 / e T · µ e T ∼ N (0 , Σ z + 1 / e T · e Σ e T ) . By Lemma G.3 , there exists a p ositiv e a bsolute constan t C 0 suc h that P h   k z − b µ z + 1 / e T · µ e T k 2 2 − tr( e Σ z )   > τ i ≤ 2 exp h − C 0 · min  τ 2 k e Σ z k − 2 F , τ k e Σ z k − 1 2  i , (D .49) where w e write e Σ z = Σ z + 1 / e T · e Σ e T for notational con v enience. By taking τ = C 1 · log T · k e Σ z k F in ( D.49 ) fo r a sufficien tly large p ositiv e absolute constan t C 1 , it holds that P h   k z − b µ z + 1 / e T · µ e T k 2 2 − tr( e Σ z )   > C 1 · log T · k e Σ z k F i ≤ T − 6 . (D.50) W e define t he ev ent E t, 1 for an y t ∈ [ T ] as E t, 1 = n   k z t − b µ z + 1 / e T · µ e T k 2 2 − tr( e Σ z )   ≤ C 1 · log T · k e Σ z k F o . Then b y ( D.50 ), it holds for an y t ∈ [ T ] that P ( E t, 1 ) ≥ 1 − T − 6 . (D.51) Also, w e define E 1 = \ t ∈ [ T ] E t, 1 . (D.52) F ollow ing from a union b ound argumen t and ( D.51 ), it holds that P ( E 1 ) ≥ 1 − T − 5 . 
(D.53)

Also, conditioning on $\mathcal{E}_1$, it holds for sufficiently large $\widetilde{T}$ that
\[
\max_{t \in [T]} \|z_t - \widehat{\mu}_z\|_2^2 \le C_1 \cdot \log T \cdot \|\widetilde{\Sigma}_z\|_F + \mathrm{tr}(\widetilde{\Sigma}_z) + \|1/\widetilde{T} \cdot \mu_{\widetilde{T}}\|_2^2 \le 2\widetilde{C}_1 \cdot \bigl( 1 + M_\Sigma(1-\rho)^{-1}/\widetilde{T}^2 \bigr) \cdot \log T \cdot \|\Sigma_z\|_2 + M_\mu(1-\rho)^{-2}/\widetilde{T}^2 \cdot \|\mu_z\|_2^2 \le C_2 \cdot \log T \cdot \bigl( 1 + \|K\|_F^2 \bigr) \cdot \|\Phi_K\|_2 \cdot (1-\rho)^{-1} + C_3 \cdot \bigl( \|b\|_2^2 + \|\mu\|_2^2 \bigr) \cdot (1-\rho)^{-4} \cdot \widetilde{T}^{-2} \le 2C_2 \cdot \log T \cdot \bigl( 1 + \|K\|_F^2 \bigr) \cdot \|\Phi_K\|_2 \cdot (1-\rho)^{-1}, \tag{D.54}
\]
where $\widetilde{C}_1$, $C_2$, and $C_3$ are positive absolute constants. Here, the first inequality comes from the definition of $\mathcal{E}_1$ in (D.52), the second inequality comes from Lemma D.7, and the third inequality comes from (D.47). Also, we define the event
\[
\mathcal{E}_2 = \bigl\{ \|\widehat{\mu}_z - \mu_z + 1/\widetilde{T} \cdot \mu_{\widetilde{T}}\|_2 \le C_1 \bigr\}. \tag{D.55}
\]
Then by Lemma D.7, we know that
\[
\mathbb{P}(\mathcal{E}_2) \ge 1 - \widetilde{T}^{-6} \tag{D.56}
\]
for $\widetilde{T}$ sufficiently large. We define the event $\mathcal{E}$ as $\mathcal{E} = \mathcal{E}_1 \cap \mathcal{E}_2$. Then following from (D.53), (D.56), and a union bound argument, we know that
\[
\mathbb{P}(\mathcal{E}) \ge 1 - T^{-5} - \widetilde{T}^{-6}.
\]
Now, we define the truncated feature vector $\widetilde{\psi}(x,u) = \widehat{\psi}(x,u)\mathbb{1}_{\mathcal{E}}$, the truncated cost function $\widetilde{c}(x,u) = c(x,u)\mathbb{1}_{\mathcal{E}}$, and the truncated objective function
\[
\widetilde{F}(\zeta, \xi) = \Bigl\{ \mathbb{E}(\widetilde{\psi})\zeta_1 + \mathbb{E}\bigl[ (\widetilde{\psi} - \widetilde{\psi}')\widetilde{\psi}^\top \bigr]\zeta_2 - \mathbb{E}(\widetilde{c}\,\widetilde{\psi}) \Bigr\}^\top \xi_2 + \bigl( \zeta_1 - \mathbb{E}(\widetilde{c}) \bigr) \cdot \xi_1 - \|\xi\|_2^2/2, \tag{D.57}
\]
where we write $\widetilde{\psi} = \widetilde{\psi}(x,u)$ and $\widetilde{c} = \widetilde{c}(x,u)$ for notational convenience. Here the expectation is taken following the policy $\pi_{K,b}$ and the state transition. The following lemma establishes an upper bound of $|F(\zeta,\xi) - \widetilde{F}(\zeta,\xi)|$, where $F(\zeta,\xi)$ and $\widetilde{F}(\zeta,\xi)$ are defined in (B.11) and (D.57), respectively.

Lemma D.8.
It holds with probability at least $1 - \widetilde{T}^{-6}$ that
\[
\bigl| F(\zeta,\xi) - \widetilde{F}(\zeta,\xi) \bigr| \le \Bigl( \frac{1}{2T} + \frac{\log \widetilde{T}}{\widetilde{T}^{1/4}} \Bigr) \cdot (1-\rho)^{-2} \cdot \mathrm{poly}\bigl( \|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0) \bigr).
\]

Proof. See §F.8 for a detailed proof.

Following from (D.39) and Lemma D.8, it holds with probability at least $1 - \widetilde{T}^{-6}$ that
\[
\bigl| \mathrm{gap}(\widehat{\zeta}, \widehat{\xi}) - \widetilde{\mathrm{gap}}(\widehat{\zeta}, \widehat{\xi}) \bigr| \le \Bigl( \frac{1}{2T} + \frac{\log \widetilde{T}}{\widetilde{T}^{1/4}} \Bigr) \cdot (1-\rho)^{-2} \cdot \mathrm{poly}\bigl( \|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0) \bigr), \tag{D.58}
\]
where we define
\[
\widetilde{\mathrm{gap}}(\widehat{\zeta}, \widehat{\xi}) = \max_{\xi \in \mathcal{V}_\xi} \widetilde{F}(\widehat{\zeta}, \xi) - \min_{\zeta \in \mathcal{V}_\zeta} \widetilde{F}(\zeta, \widehat{\xi}).
\]
Therefore, to upper bound $\mathrm{gap}(\widehat{\zeta}, \widehat{\xi})$, we only need to upper bound $\widetilde{\mathrm{gap}}(\widehat{\zeta}, \widehat{\xi})$.

Part 3. We upper bound $\widetilde{\mathrm{gap}}(\widehat{\zeta}, \widehat{\xi})$ in the sequel. We first show that the trajectory generated by the policy $\pi_{K,b}$ and the state transition in Problem 2.2 is $\beta$-mixing.

Lemma D.9. Consider a linear system $y_{t+1} = Dy_t + \vartheta + \upsilon_t$, where $\{y_t\}_{t \ge 0} \subset \mathbb{R}^m$, the matrix $D \in \mathbb{R}^{m \times m}$ satisfies $\rho(D) < 1$, the vector $\vartheta \in \mathbb{R}^m$, and $\upsilon_t \sim N(0, \Sigma)$ is Gaussian noise. We denote by $\varrho_t$ the marginal distribution of $y_t$ for any $t \ge 0$. Meanwhile, assume that the stationary distribution of $\{y_t\}_{t \ge 0}$ is a Gaussian distribution $N((I-D)^{-1}\vartheta, \Sigma_\infty)$, where $\Sigma_\infty$ is the covariance matrix. We define the $\beta$-mixing coefficients for any $n \ge 1$ as
\[
\beta(n) = \sup_{t \ge 0} \mathbb{E}_{y \sim \varrho_t}\Bigl[ \bigl\| \mathbb{P}_{y_n}(\cdot \,|\, y_0 = y) - \mathbb{P}_{N((I-D)^{-1}\vartheta, \Sigma_\infty)}(\cdot) \bigr\|_{\mathrm{TV}} \Bigr].
\]
Then, for any $\rho \in (\rho(D), 1)$, the $\beta$-mixing coefficients satisfy
\[
\beta(n) \le C_{\rho, D, \vartheta} \cdot \bigl( \mathrm{tr}(\Sigma_\infty) + m \cdot (1-\rho)^{-2} \bigr)^{1/2} \cdot \rho^n,
\]
where $C_{\rho, D, \vartheta}$ is a constant that only depends on $\rho$, $D$, and $\vartheta$. We say that the sequence $\{y_t\}_{t \ge 0}$ is $\beta$-mixing with parameter $\rho$.

Proof. See Proposition 3.1 in Tu and Recht (2017) for details.
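The geometric decay promised by Lemma D.9 can be illustrated on the marginal means of such a linear system. The sketch below uses hypothetical matrices $D$, $\vartheta$ and dimension $m$; it tracks only the means (which satisfy $m_{t+1} = Dm_t + \vartheta$, so $m_t - m_\infty = D^t(m_0 - m_\infty)$), not the total-variation distance in the lemma:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 3
D = 0.6 * np.eye(m) + 0.05 * rng.standard_normal((m, m))  # hypothetical stable matrix
theta = rng.standard_normal(m)
assert max(abs(np.linalg.eigvals(D))) < 1                 # rho(D) < 1

# the stationary mean of y_{t+1} = D y_t + theta + upsilon_t is (I - D)^{-1} theta
m_inf = np.linalg.solve(np.eye(m) - D, theta)

# the marginal mean obeys m_{t+1} = D m_t + theta, hence m_t - m_inf = D^t (m_0 - m_inf)
m_t = np.zeros(m)
errs = []
for _ in range(60):
    errs.append(np.linalg.norm(m_t - m_inf))
    m_t = D @ m_t + theta

# geometric decay with rate ||D||_2 < 1 (a crude stand-in for the rho of the lemma)
rate = np.linalg.norm(D, 2)
assert rate < 1
assert all(errs[t + 1] <= rate * errs[t] + 1e-12 for t in range(59))
assert errs[50] < 1e-4 * errs[0]
```

The full lemma controls the whole Gaussian marginal, not just its mean, but the mean recursion already exhibits the $\rho^n$ rate.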
Recall that by (3.1), the sequence $\{x_t\}_{t \ge 0}$ follows
\[
x_{t+1} = (A - BK)x_t + (Bb + A\mu + d) + \epsilon_t, \qquad \epsilon_t \sim N(0, \Psi_\epsilon),
\]
where the matrix $A - BK$ satisfies $\rho(A - BK) < 1$. Therefore, by Lemma D.9, the sequence $\{z_t\}_{t \ge 0}$ is $\beta$-mixing with parameter $\rho \in (\rho(A - BK), 1)$, where $z_t = (x_t^\top, u_t^\top)^\top$. The following lemma upper bounds the primal-dual gap for a convex-concave problem.

Lemma D.10. Let $\mathcal{X}$ and $\mathcal{Y}$ be two compact and convex sets such that $\|x - x'\|_2 \le M$ and $\|y - y'\|_2 \le M$ for any $x, x' \in \mathcal{X}$ and $y, y' \in \mathcal{Y}$. We consider solving the minimax problem
\[
\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} H(x, y) = \mathbb{E}_{\epsilon \sim \varrho_\epsilon}\bigl[ G(x, y; \epsilon) \bigr],
\]
where the objective function $H(x, y)$ is convex in $x$ and concave in $y$. In addition, we assume that the distribution $\varrho_\epsilon$ is $\beta$-mixing with $\beta(n) \le C_\epsilon \cdot \rho^n$, where $C_\epsilon$ is a constant. Meanwhile, we assume that almost surely $G(x, y; \epsilon)$ is $\widetilde{L}_0$-Lipschitz in both $x$ and $y$, the gradient $\nabla_x G(x, y; \epsilon)$ is $\widetilde{L}_1$-Lipschitz in $x$ for any $y \in \mathcal{Y}$, and the gradient $\nabla_y G(x, y; \epsilon)$ is $\widetilde{L}_1$-Lipschitz in $y$ for any $x \in \mathcal{X}$, where $C_\epsilon, \widetilde{L}_0, \widetilde{L}_1 > 1$. Each step of our gradient-based method takes the form
\[
x_{t+1} = \Gamma_{\mathcal{X}}\bigl( x_t - \gamma_{t+1} \cdot \nabla_x G(x_t, y_t; \epsilon_t) \bigr), \qquad y_{t+1} = \Gamma_{\mathcal{Y}}\bigl( y_t + \gamma_{t+1} \cdot \nabla_y G(x_t, y_t; \epsilon_t) \bigr),
\]
where the operators $\Gamma_{\mathcal{X}}$ and $\Gamma_{\mathcal{Y}}$ project the variables back onto $\mathcal{X}$ and $\mathcal{Y}$, respectively, and the stepsizes take the form $\gamma_t = \gamma_0 \cdot t^{-1/2}$ for a constant $\gamma_0 > 0$.
Moreover, let $\widehat{x} = (\sum_{t=1}^T \gamma_t)^{-1}(\sum_{t=1}^T \gamma_t x_t)$ and $\widehat{y} = (\sum_{t=1}^T \gamma_t)^{-1}(\sum_{t=1}^T \gamma_t y_t)$ be the final output of the gradient method after $T$ iterations. Then there exists a positive absolute constant $C$ such that for any $\delta \in (0, 1)$, the primal-dual gap of the minimax problem is upper bounded as
\[
\max_{y \in \mathcal{Y}} H(\widehat{x}, y) - \min_{x \in \mathcal{X}} H(x, \widehat{y}) \le C \cdot \frac{M^2 + \widetilde{L}_0^2 + \widetilde{L}_0\widetilde{L}_1 M}{\log(1/\rho)} \cdot \frac{\log^2 T + \log(1/\delta)}{\sqrt{T}} + C \cdot \frac{C_\epsilon \widetilde{L}_0 M}{T},
\]
which holds with probability at least $1 - \delta$.

Proof. See Theorem 5.4 in Yang et al. (2019) for details.

To use Lemma D.10, we define the function $G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}')$ as
\[
G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}') = \bigl( \widetilde{\psi}\zeta_1 + (\widetilde{\psi} - \widetilde{\psi}')\widetilde{\psi}^\top \zeta_2 - \widetilde{c}\,\widetilde{\psi} \bigr)^\top \xi_2 + (\zeta_1 - \widetilde{c}) \cdot \xi_1 - 1/2 \cdot \|\xi\|_2^2,
\]
where $\widetilde{\psi} = \widetilde{\psi}(x, u)$ and $\widetilde{\psi}' = \widetilde{\psi}(x', u')$. Note that the gradients of $G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}')$ take the form
\[
\nabla_\zeta G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}') = \begin{pmatrix} \widetilde{\psi}^\top \xi_2 + \xi_1 \\ \widetilde{\psi}(\widetilde{\psi} - \widetilde{\psi}')^\top \xi_2 \end{pmatrix}, \qquad \nabla_\xi G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}') = \begin{pmatrix} \zeta_1 - \widetilde{c} - \xi_1 \\ \widetilde{\psi}\zeta_1 + (\widetilde{\psi} - \widetilde{\psi}')\widetilde{\psi}^\top \zeta_2 - \widetilde{c}\,\widetilde{\psi} - \xi_2 \end{pmatrix}.
\]
By Definition B.7 and Lemma D.6, we know that
\[
\bigl\| \nabla_\zeta G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}') \bigr\|_2 \le \mathrm{poly}\bigl( \|K\|_F, J(K_0, b_0) \bigr) \cdot \log^2 T \cdot (1-\rho)^{-2}, \qquad \bigl\| \nabla_\xi G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}') \bigr\|_2 \le \mathrm{poly}\bigl( \|K\|_F, \|\mu\|_2, J(K_0, b_0) \bigr) \cdot \log^2 T \cdot (1-\rho)^{-2}. \tag{D.59}
\]
This gives the Lipschitz constant $\widetilde{L}_0$ in Lemma D.10 for $G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}')$. Also, the Hessians of $G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}')$ take the forms
\[
\nabla^2_{\zeta\zeta} G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}') = 0, \qquad \nabla^2_{\xi\xi} G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}') = -I,
\]
which implies that
\[
\bigl\| \nabla^2_{\zeta\zeta} G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}') \bigr\|_2 = 0, \qquad \bigl\| \nabla^2_{\xi\xi} G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}') \bigr\|_2 = 1. \tag{D.60}
\]
This gives the Lipschitz constant $\widetilde{L}_1$ in Lemma D.10 for $\nabla_\zeta G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}')$ and $\nabla_\xi G(\zeta, \xi; \widetilde{\psi}, \widetilde{\psi}')$.
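The projected primal-dual scheme of Lemma D.10, with stepsizes $\gamma_t = \gamma_0 t^{-1/2}$ and stepsize-weighted averaging, can be illustrated on a toy convex-concave objective. The objective, the sets, the noise level, and all constants below are hypothetical stand-ins, not the paper's $F$ or $G$:

```python
import numpy as np

rng = np.random.default_rng(2)

# toy convex-concave objective H(x, y) = x^2 + x*y - y^2 on X = Y = [-2, 2];
# the unique saddle point is (0, 0)
def grad_x(x, y): return 2 * x + y
def grad_y(x, y): return x - 2 * y

proj = lambda v: float(np.clip(v, -2.0, 2.0))   # projection onto the box

x, y = 1.5, -1.5
gamma0, T = 0.5, 20000
sx = sy = wsum = 0.0
for t in range(1, T + 1):
    g = gamma0 / np.sqrt(t)                      # gamma_t = gamma_0 * t^{-1/2}
    nx, ny = 0.3 * rng.standard_normal(2)        # noisy (stochastic) gradients
    x_new = proj(x - g * (grad_x(x, y) + nx))    # descent in the min variable
    y_new = proj(y + g * (grad_y(x, y) + ny))    # ascent in the max variable
    x, y = x_new, y_new
    sx += g * x; sy += g * y; wsum += g

x_hat, y_hat = sx / wsum, sy / wsum              # stepsize-weighted averages
assert abs(x_hat) < 0.2 and abs(y_hat) < 0.2     # averages approach the saddle
```

As in the lemma, it is the averaged iterates, not the last iterate, that carry the $O(1/\sqrt{T})$ primal-dual-gap guarantee.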
Moreover, note that (D.54) provides an upper bound of $M$. Combining (D.59), (D.60), and Lemma D.10, it holds with probability at least $1 - T^{-5}$ that
\[
\widetilde{\mathrm{gap}}(\widehat{\zeta}, \widehat{\xi}) \le \mathrm{poly}\bigl( \|K\|_F, \|\mu\|_2, J(K_0, b_0) \bigr) \cdot \frac{\log^6 T}{(1-\rho)^4 \cdot \sqrt{T}}. \tag{D.61}
\]
Combining (D.45), (D.58), and (D.61), we know that
\[
\|\widehat{\alpha}_{K,b} - \alpha_{K,b}\|_2^2 \le \lambda_K^{-2} \cdot \mathrm{poly}_1\bigl( \|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0) \bigr) \cdot \Bigl( \frac{\log^6 T}{T^{1/2} \cdot (1-\rho)^4} + \frac{\log \widetilde{T}}{\widetilde{T}^{1/4} \cdot (1-\rho)^2} \Bigr).
\]
The same bounds hold for $\|\widehat{\Upsilon}_K - \Upsilon_K\|_F^2$, $\|\widehat{p}_{K,b} - p_{K,b}\|_2^2$, and $\|\widehat{q}_{K,b} - q_{K,b}\|_2^2$. We finish the proof of the theorem.

E Proofs of Propositions

E.1 Proof of Proposition 3.2

Proof. We follow proofs similar to those of Theorem 1.1 in Sznitman (1991) and Theorem 3.2 in Bensoussan et al. (2016). Note that for any policy $\pi_{K,b} \in \Pi$, the parameters $K$ and $b$ uniquely determine the policy. We define the following metric on $\Pi$.

Definition E.1. For any $\pi_{K_1, b_1}, \pi_{K_2, b_2} \in \Pi$, we define the metric
\[
\|\pi_{K_1, b_1} - \pi_{K_2, b_2}\|_2 = c_1 \cdot \|K_1 - K_2\|_2 + c_2 \cdot \|b_1 - b_2\|_2,
\]
where $c_1$ and $c_2$ are positive constants.

One can verify that Definition E.1 satisfies the requirements of a metric. We first evaluate the forms of the operators $\Lambda_1(\cdot)$ and $\Lambda_2(\cdot, \cdot)$.

Forms of the operators $\Lambda_1(\cdot)$ and $\Lambda_2(\cdot, \cdot)$. By the definition of $\Lambda_1(\mu)$, which gives the optimal policy under the mean-field state $\mu$, it holds that $\Lambda_1(\mu) = \pi^*_\mu$, where $\pi^*_\mu$ solves Problem 2.2. This gives the form of $\Lambda_1(\cdot)$. We now turn to $\Lambda_2(\mu, \pi)$, which gives the mean-field state $\mu_{\mathrm{new}}$ generated by the policy $\pi$ under the current mean-field state $\mu$. In Problem 2.2, the sequence of states $\{x_t\}_{t \ge 0}$ constitutes a Markov chain, which admits a unique stationary distribution.
Thus, by the state transition in Problem 2.2 and the form of the linear-Gaussian policy, we have
\[
\mu_{\mathrm{new}} = (A - BK^\pi)\mu_{\mathrm{new}} + (Bb^\pi + A\mu + d), \tag{E.1}
\]
where $K^\pi$ and $b^\pi$ are the parameters of the policy $\pi$. By solving (E.1) for $\mu_{\mathrm{new}}$, it holds that
\[
\Lambda_2(\mu, \pi) = \mu_{\mathrm{new}} = (I - A + BK^\pi)^{-1}(Bb^\pi + A\mu + d).
\]
This gives the form of $\Lambda_2(\cdot, \cdot)$. Next, we compute the Lipschitz constants for $\Lambda_1(\cdot)$ and $\Lambda_2(\cdot, \cdot)$.

Lipschitz constant for $\Lambda_1(\cdot)$. By Proposition 3.4, for any $\mu_1, \mu_2 \in \mathbb{R}^m$, the optimal $K^*$ is fixed for Problem 2.2. Therefore, by the form of the optimal $b_K$ given in Proposition 3.4, it holds that
\[
\bigl\| \Lambda_1(\mu_1) - \Lambda_1(\mu_2) \bigr\|_2 \le c_2 \cdot \Bigl\| \bigl( (I - A)Q^{-1}(I - A)^\top + BR^{-1}B^\top \bigr)^{-1} A \Bigr\|_2 \cdot \bigl\| K^* Q^{-1}(I - A)^\top - R^{-1}B^\top \bigr\|_2 \cdot \|\mu_1 - \mu_2\|_2 = c_2 L_1 \cdot \|\mu_1 - \mu_2\|_2, \tag{E.2}
\]
where $L_1$ is defined in Assumption 3.1.

Lipschitz constants for $\Lambda_2(\cdot, \cdot)$. By Proposition 3.4, for any $\mu_1, \mu_2 \in \mathbb{R}^m$, the optimal $K^*$ is fixed for Problem 2.2. Thus, for any $\pi \in \Pi$ such that $\pi$ is an optimal policy under some $\mu \in \mathbb{R}^m$, it holds that
\[
\bigl\| \Lambda_2(\mu_1, \pi) - \Lambda_2(\mu_2, \pi) \bigr\|_2 = \bigl\| (I - A + BK^\pi)^{-1} \cdot A \cdot (\mu_1 - \mu_2) \bigr\|_2 \le \bigl( 1 - \rho(A - BK^*) \bigr)^{-1} \cdot \|A\|_2 \cdot \|\mu_1 - \mu_2\|_2 = L_2 \cdot \|\mu_1 - \mu_2\|_2, \tag{E.3}
\]
where $L_2$ is defined in Assumption 3.1, and $K^\pi = K^*$ is the parameter of the policy $\pi$. Meanwhile, for any mean-field state $\mu \in \mathbb{R}^m$, and any policies $\pi_1, \pi_2 \in \Pi$ that are optimal under some mean-field states $\mu_1, \mu_2$, respectively, we have
\[
\bigl\| \Lambda_2(\mu, \pi_1) - \Lambda_2(\mu, \pi_2) \bigr\|_2 = \bigl\| (I - A + BK^*)^{-1} B \cdot (b^{\pi_1} - b^{\pi_2}) \bigr\|_2 \le \bigl( 1 - \rho(A - BK^*) \bigr)^{-1} \cdot \|B\|_2 \cdot \|b^{\pi_1} - b^{\pi_2}\|_2 = c_2^{-1} L_3 \cdot \|\pi_1 - \pi_2\|_2, \tag{E.4}
\]
where in the last equality, we use the fact that $K^{\pi_1} = K^{\pi_2} = K^*$ by Proposition 3.4.
Here $L_3$ is defined in Assumption 3.1, and $b^{\pi_1}$ and $b^{\pi_2}$ are the parameters of the policies $\pi_1$ and $\pi_2$. Now we show that the operator $\Lambda(\cdot)$ is a contraction. For any $\mu_1, \mu_2 \in \mathbb{R}^m$, it holds that
\[
\bigl\| \Lambda(\mu_1) - \Lambda(\mu_2) \bigr\|_2 = \bigl\| \Lambda_2\bigl(\mu_1, \Lambda_1(\mu_1)\bigr) - \Lambda_2\bigl(\mu_2, \Lambda_1(\mu_2)\bigr) \bigr\|_2 \le \bigl\| \Lambda_2\bigl(\mu_1, \Lambda_1(\mu_1)\bigr) - \Lambda_2\bigl(\mu_1, \Lambda_1(\mu_2)\bigr) \bigr\|_2 + \bigl\| \Lambda_2\bigl(\mu_1, \Lambda_1(\mu_2)\bigr) - \Lambda_2\bigl(\mu_2, \Lambda_1(\mu_2)\bigr) \bigr\|_2 \le c_2^{-1} L_3 \cdot \bigl\| \Lambda_1(\mu_1) - \Lambda_1(\mu_2) \bigr\|_2 + L_2 \cdot \|\mu_1 - \mu_2\|_2 \le c_2^{-1} L_3 \cdot c_2 L_1 \cdot \|\mu_1 - \mu_2\|_2 + L_2 \cdot \|\mu_1 - \mu_2\|_2 = (L_1 L_3 + L_2) \cdot \|\mu_1 - \mu_2\|_2,
\]
where the first inequality comes from the triangle inequality, the second inequality comes from (E.3) and (E.4), and the last inequality comes from (E.2). By Assumption 3.1, we know that $L_0 = L_1 L_3 + L_2 < 1$, which shows that the operator $\Lambda(\cdot)$ is a contraction. Moreover, by the Banach fixed-point theorem, we obtain that $\Lambda(\cdot)$ has a unique fixed point, which gives the unique equilibrium pair of Problem 2.1. We finish the proof of the proposition.

E.2 Proof of Proposition 3.4

Proof. By the definition of $J_2(K, b)$ in (3.6) and the definition of $\mu_{K,b}$ in (3.2), the problem $\min_b J_2(K, b)$ is equivalent to the following constrained problem,
\[
\min_{\mu, b} \; \begin{pmatrix} \mu \\ b \end{pmatrix}^\top \begin{pmatrix} Q + K^\top RK & -K^\top R \\ -RK & R \end{pmatrix} \begin{pmatrix} \mu \\ b \end{pmatrix} \quad \text{s.t.} \quad (I - A + BK)\mu - (Bb + A\mu + d) = 0. \tag{E.5}
\]
Following from the KKT conditions of (E.5), it holds that
\[
2M_K \begin{pmatrix} \mu \\ b \end{pmatrix} + N_K \lambda = 0, \qquad N_K^\top \begin{pmatrix} \mu \\ b \end{pmatrix} + A\mu + d = 0, \tag{E.6}
\]
where
\[
M_K = \begin{pmatrix} Q + K^\top RK & -K^\top R \\ -RK & R \end{pmatrix}, \qquad N_K = \begin{pmatrix} -(I - A + BK)^\top \\ B^\top \end{pmatrix}.
\]
By solving (E.6), the minimizer of (E.5) takes the form
\[
\begin{pmatrix} \mu_{K, b_K} \\ b_K \end{pmatrix} = -M_K^{-1} N_K (N_K^\top M_K^{-1} N_K)^{-1} (A\mu + d). \tag{E.7}
\]
By substituting (E.7) into the definition of $J_2(K, b)$ in (3.6), we have
\[
J_2(K, b_K) = (A\mu + d)^\top (N_K^\top M_K^{-1} N_K)^{-1} (A\mu + d).
\]
(E.8)

Meanwhile, by calculation, we have
\[
M_K^{-1} = \begin{pmatrix} Q^{-1} & Q^{-1}K^\top \\ KQ^{-1} & KQ^{-1}K^\top + R^{-1} \end{pmatrix}.
\]
Therefore, the term $N_K^\top M_K^{-1} N_K$ in (E.8) takes the form
\[
N_K^\top M_K^{-1} N_K = (I - A)Q^{-1}(I - A)^\top + BR^{-1}B^\top. \tag{E.9}
\]
By plugging (E.9) into (E.8), we have
\[
J_2(K, b_K) = (A\mu + d)^\top \bigl( (I - A)Q^{-1}(I - A)^\top + BR^{-1}B^\top \bigr)^{-1} (A\mu + d).
\]
Also, by plugging (E.9) into (E.7), we have
\[
\begin{pmatrix} \mu_{K, b_K} \\ b_K \end{pmatrix} = \begin{pmatrix} Q^{-1}(I - A)^\top \\ KQ^{-1}(I - A)^\top - R^{-1}B^\top \end{pmatrix} \bigl( (I - A)Q^{-1}(I - A)^\top + BR^{-1}B^\top \bigr)^{-1} (A\mu + d).
\]
We finish the proof of the proposition.

E.3 Proof of Proposition B.2

Proof. By the definition of the cost function $c(x, u)$ in Problem 2.2 (recall that we drop the subscript $\mu$ when we focus on Problem 2.2), we have
\[
\mathbb{E} c_t = \mathbb{E}(x_t^\top Q x_t + u_t^\top R u_t + \mu^\top Q \mu) = \mathbb{E}(x_t^\top Q x_t + x_t^\top K^\top RK x_t - 2b^\top RK x_t + b^\top Rb + \sigma^2 \eta_t^\top R \eta_t + \mu^\top Q \mu) = \mathbb{E}\bigl[ x_t^\top (Q + K^\top RK) x_t - 2b^\top RK x_t \bigr] + b^\top Rb + \sigma^2 \cdot \mathrm{tr}(R) + \mu^\top Q \mu, \tag{E.10}
\]
where we write $c_t = c(x_t, u_t)$ for notational convenience. Here in the second equality we use $u_t = \pi_{K,b}(x_t) = -Kx_t + b + \sigma\eta_t$. Therefore, combining (E.10) and the definition of $J(K, b)$ in Problem 2.2, we have
\[
J(K, b) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T} \Bigl\{ \mathbb{E}\bigl[ x_t^\top (Q + K^\top RK) x_t - 2b^\top RK x_t \bigr] + b^\top Rb + \sigma^2 \cdot \mathrm{tr}(R) + \mu^\top Q \mu \Bigr\} = \mathbb{E}_{x \sim N(\mu_{K,b}, \Phi_K)}\bigl[ x^\top (Q + K^\top RK) x - 2b^\top RK x \bigr] + b^\top Rb + \sigma^2 \cdot \mathrm{tr}(R) + \mu^\top Q \mu = \mathrm{tr}\bigl( (Q + K^\top RK)\Phi_K \bigr) + \mu_{K,b}^\top (Q + K^\top RK)\mu_{K,b} - 2b^\top RK \mu_{K,b} + b^\top Rb + \sigma^2 \cdot \mathrm{tr}(R) + \mu^\top Q \mu. \tag{E.11}
\]
Now, by iteratively applying (3.3) and (3.4), we have
\[
\mathrm{tr}\bigl( (Q + K^\top RK)\Phi_K \bigr) = \mathrm{tr}(P_K \Psi_\epsilon), \tag{E.12}
\]
where $P_K$ is given in (3.4). Combining (E.11) and (E.12), we know that
\[
J(K, b) = J_1(K) + J_2(K, b) + \sigma^2 \cdot \mathrm{tr}(R) + \mu^\top Q \mu,
\]
where
\[
J_1(K) = \mathrm{tr}\bigl( (Q + K^\top RK)\Phi_K \bigr) = \mathrm{tr}(P_K \Psi_\epsilon), \qquad J_2(K, b) = \begin{pmatrix} \mu_{K,b} \\ b \end{pmatrix}^\top \begin{pmatrix} Q + K^\top RK & -K^\top R \\ -RK & R \end{pmatrix} \begin{pmatrix} \mu_{K,b} \\ b \end{pmatrix}.
\]
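The block inverse of $M_K$ and the identity (E.9) above are straightforward to check numerically. The sketch below uses hypothetical positive definite $Q$, $R$ and random $A$, $B$, $K$ of small, arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(5)
m, k = 3, 2
Qh = rng.standard_normal((m, m)); Q = Qh @ Qh.T + np.eye(m)   # Q > 0 (hypothetical)
Rh = rng.standard_normal((k, k)); R = Rh @ Rh.T + np.eye(k)   # R > 0 (hypothetical)
K = rng.standard_normal((k, m))
A = 0.5 * rng.standard_normal((m, m))
B = rng.standard_normal((m, k))

M_K = np.block([[Q + K.T @ R @ K, -K.T @ R], [-R @ K, R]])
Qi, Ri = np.linalg.inv(Q), np.linalg.inv(R)
# claimed block inverse of M_K
M_inv = np.block([[Qi, Qi @ K.T], [K @ Qi, K @ Qi @ K.T + Ri]])
assert np.allclose(M_K @ M_inv, np.eye(m + k), atol=1e-8)

# identity (E.9): N_K^T M_K^{-1} N_K = (I - A) Q^{-1} (I - A)^T + B R^{-1} B^T
N_K = np.vstack([-(np.eye(m) - A + B @ K).T, B.T])
lhs = N_K.T @ M_inv @ N_K
rhs = (np.eye(m) - A) @ Qi @ (np.eye(m) - A).T + B @ Ri @ B.T
assert np.allclose(lhs, rhs, atol=1e-8)
```

Note how the $BK$ terms inside $N_K$ cancel against the $K$-dependent blocks of $M_K^{-1}$, leaving an expression independent of $K$.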
We finish the proof of the proposition.

E.4 Proof of Proposition 3.3

Proof. By calculating the Hessian matrix of $J_2(K, b)$, we have
\[
\nabla^2_{bb} J_2(K, b) = B^\top (I - A + BK)^{-\top}(Q + K^\top RK)(I - A + BK)^{-1}B - \bigl( RK(I - A + BK)^{-1}B + B^\top (I - A + BK)^{-\top}K^\top R \bigr) + R = \bigl( R^{1/2}K(I - A + BK)^{-1}B - R^{1/2} \bigr)^\top \bigl( R^{1/2}K(I - A + BK)^{-1}B - R^{1/2} \bigr) + B^\top (I - A + BK)^{-\top} Q (I - A + BK)^{-1}B,
\]
which is a positive definite matrix independent of $b$. We denote its minimum singular value by $\nu_K$. Also, note that $\|\nabla^2_{bb} J_2(K, b)\|_2$ is upper bounded as
\[
\bigl\| \nabla^2_{bb} J_2(K, b) \bigr\|_2 \le \bigl( 1 - \rho(A - BK) \bigr)^{-2} \cdot \bigl( \|B\|_2^2 \cdot \|K\|_2^2 \cdot \|R\|_2 + \|B\|_2^2 \cdot \|Q\|_2 \bigr).
\]
Therefore, it holds that
\[
\iota_K \le \bigl( 1 - \rho(A - BK) \bigr)^{-2} \cdot \bigl( \|B\|_2^2 \cdot \|K\|_2^2 \cdot \|R\|_2 + \|B\|_2^2 \cdot \|Q\|_2 \bigr),
\]
where $\iota_K$ is the maximum singular value of $\nabla^2_{bb} J_2(K, b)$. We finish the proof of the proposition.

E.5 Proof of Proposition B.3

Proof. Following from Proposition B.2, it holds that
\[
J_1(K) = \mathrm{tr}(P_K \Psi_\epsilon) = \mathbb{E}_{y \sim N(0, \Psi_\epsilon)}(y^\top P_K y) = \mathbb{E}_{y \sim N(0, \Psi_\epsilon)}\bigl[ f_K(y) \bigr], \tag{E.13}
\]
where $f_K(y) = y^\top P_K y$. By the definition of $P_K$ in (3.4), we obtain
\[
\nabla_K f_K(y) = \nabla_K \Bigl\{ y^\top (Q + K^\top RK) y + \bigl( (A - BK)y \bigr)^\top P_K \bigl( (A - BK)y \bigr) \Bigr\} = 2RKyy^\top + \nabla_K \Bigl[ f_K\bigl( (A - BK)y \bigr) \Bigr]. \tag{E.14}
\]
Also, we have
\[
\nabla_K \Bigl[ f_K\bigl( (A - BK)y \bigr) \Bigr] = \nabla_K f_K\bigl( (A - BK)y \bigr) - 2B^\top P_K (A - BK) yy^\top. \tag{E.15}
\]
By plugging (E.15) into (E.14), we have
\[
\nabla_K f_K(y) = 2\bigl( (R + B^\top P_K B)K - B^\top P_K A \bigr) yy^\top + \nabla_K f_K\bigl( (A - BK)y \bigr). \tag{E.16}
\]
By iteratively applying (E.16), it holds that
\[
\nabla_K f_K(y) = 2\bigl( (R + B^\top P_K B)K - B^\top P_K A \bigr) \cdot \sum_{t=0}^{\infty} y_t y_t^\top, \tag{E.17}
\]
where $y_{t+1} = (A - BK)y_t$ with $y_0 = y$.
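Taking the expectation of (E.17) over $y \sim N(0, \Psi_\epsilon)$ yields the familiar LQR policy-gradient expression $\nabla_K J_1(K) = 2\bigl((R + B^\top P_K B)K - B^\top P_K A\bigr)\Phi_K$, which admits a finite-difference sanity check. The matrices below are hypothetical, and the Lyapunov equations are solved by a naive fixed-point iteration rather than a production solver:

```python
import numpy as np

def lyap(M, W, iters=2000):
    # fixed-point iteration for X = M X M^T + W (converges when rho(M) < 1)
    X = np.zeros_like(W)
    for _ in range(iters):
        X = M @ X @ M.T + W
    return X

rng = np.random.default_rng(7)
m, k = 3, 2
A = 0.4 * np.eye(m)                       # hypothetical stable dynamics
B = rng.standard_normal((m, k))
K = 0.05 * rng.standard_normal((k, m))
Q, R, Psi = np.eye(m), np.eye(k), np.eye(m)

def J1(K):
    Acl = A - B @ K
    P = lyap(Acl.T, Q + K.T @ R @ K)      # P = Q + K^T R K + Acl^T P Acl
    return np.trace(P @ Psi)

Acl = A - B @ K
P = lyap(Acl.T, Q + K.T @ R @ K)
Phi = lyap(Acl, Psi)                      # Phi = Acl Phi Acl^T + Psi
grad = 2 * ((R + B.T @ P @ B) @ K - B.T @ P @ A) @ Phi

# central finite differences of J1 entry by entry
eps = 1e-6
fd = np.zeros_like(K)
for i in range(k):
    for j in range(m):
        E = np.zeros_like(K); E[i, j] = eps
        fd[i, j] = (J1(K + E) - J1(K - E)) / (2 * eps)
assert np.allclose(grad, fd, atol=1e-4)
```

The closed-form and finite-difference gradients agree, reflecting that the series $\sum_t y_t y_t^\top$ in (E.17) averages to $\Phi_K$.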
Now, combining (E.13) and (E.17), it holds that
\[
\nabla_K J_1(K) = 2\bigl( (R + B^\top P_K B)K - B^\top P_K A \bigr)\Phi_K = 2(\Upsilon^{22}_K K - \Upsilon^{21}_K) \cdot \Phi_K,
\]
where $\Upsilon_K$ is defined in (3.7). Meanwhile, combining the form of $\mu_{K,b}$ in (3.2), it holds by calculation that
\[
\nabla_b J_2(K, b) = 2\bigl( \Upsilon^{22}_K(-K\mu_{K,b} + b) + \Upsilon^{21}_K \mu_{K,b} + q_{K,b} \bigr),
\]
where $q_{K,b}$ is defined in (3.7). We finish the proof of the proposition.

E.6 Proof of Proposition B.1

Proof. From the definition of $V_{K,b}(x)$ in (B.1) and the definition of the cost function $c(x, u)$ in Problem 2.2, it holds that
\[
V_{K,b}(x) = \sum_{t=0}^{\infty} \Bigl\{ \mathbb{E}\bigl[ x_t^\top (Q + K^\top RK)x_t - 2b^\top RKx_t + b^\top Rb + \sigma^2 \eta_t^\top R \eta_t + \mu^\top Q\mu \,\big|\, x_0 = x \bigr] - J(K, b) \Bigr\}.
\]
Combining (3.1), we know that $V_{K,b}(x)$ is a quadratic function taking the form $V_{K,b}(x) = x^\top Gx + r^\top x + h$, where $G$, $r$, and $h$ are functions of $K$ and $b$. Note that $V_{K,b}(x)$ satisfies
\[
V_{K,b}(x) = c(x, -Kx + b) - J(K, b) + \mathbb{E}\bigl[ V_{K,b}(x') \,\big|\, x \bigr]. \tag{E.18}
\]
By substituting the form of $c(x, -Kx + b)$ in Problem 2.2 and $J(K, b)$ in (3.5) into (E.18), we obtain
\[
x^\top Gx + r^\top x + h = x^\top (Q + K^\top RK)x - 2b^\top RKx + b^\top Rb + \mu^\top Q\mu - \bigl( \mathrm{tr}(P_K \Psi_\epsilon) + \mu_{K,b}^\top (Q + K^\top RK)\mu_{K,b} - 2b^\top RK\mu_{K,b} + \mu^\top Q\mu + b^\top Rb \bigr) + \bigl( (A - BK)x + (Bb + A\mu + d) \bigr)^\top G \bigl( (A - BK)x + (Bb + A\mu + d) \bigr) + \mathrm{tr}(G\Psi_\epsilon) + r^\top \bigl( (A - BK)x + (Bb + A\mu + d) \bigr) + h - \sigma^2 \cdot \mathrm{tr}(R). \tag{E.19}
\]
By comparing the quadratic terms and linear terms on both the LHS and RHS in (E.19), we obtain $G = P_K$ and $r = 2f_{K,b}$, where $f_{K,b} = (I - A + BK)^{-\top}\bigl[ (A - BK)^\top P_K(Bb + A\mu + d) - K^\top Rb \bigr]$. Also, by the definition of $V_{K,b}(x)$ in (B.1), we know that $\mathbb{E}[V_{K,b}(x)] = 0$, where the expectation is taken following the stationary distribution generated by the policy $\pi_{K,b}$ and the state transition.
Therefore, we have
\[
h = -2f_{K,b}^\top \mu_{K,b} - \mu_{K,b}^\top P_K \mu_{K,b} - \mathrm{tr}(P_K \Phi_K),
\]
which shows that
\[
V_{K,b}(x) = x^\top P_K x - \mathrm{tr}(P_K \Phi_K) + 2f_{K,b}^\top(x - \mu_{K,b}) - \mu_{K,b}^\top P_K \mu_{K,b}. \tag{E.20}
\]
For the action-value function $Q_{K,b}(x, u)$, by plugging (E.20) into (B.2), we obtain
\[
Q_{K,b}(x, u) = \begin{pmatrix} x \\ u \end{pmatrix}^\top \Upsilon_K \begin{pmatrix} x \\ u \end{pmatrix} + 2\begin{pmatrix} p_{K,b} \\ q_{K,b} \end{pmatrix}^\top \begin{pmatrix} x \\ u \end{pmatrix} - \mathrm{tr}(P_K \Phi_K) - \sigma^2 \cdot \mathrm{tr}(R + P_K BB^\top) - b^\top Rb + 2b^\top RK\mu_{K,b} - \mu_{K,b}^\top (Q + K^\top RK + P_K)\mu_{K,b} + 2f_{K,b}^\top\bigl( (A\mu + d) - \mu_{K,b} \bigr) + (A\mu + d)^\top P_K (A\mu + d).
\]
We finish the proof of the proposition.

E.7 Proof of Proposition B.5

Proof. By Proposition B.1, it holds that $Q_{K,b}$ takes the linear form
\[
Q_{K,b}(x, u) = \psi(x, u)^\top \alpha_{K,b} + \beta_{K,b}, \tag{E.21}
\]
where $\beta_{K,b}$ is a scalar independent of $x$ and $u$. Note that $Q_{K,b}(x, u)$ satisfies
\[
Q_{K,b}(x, u) = c(x, u) - J(K, b) + \mathbb{E}_{\pi_{K,b}}\bigl[ Q_{K,b}(x', u') \,\big|\, x, u \bigr], \tag{E.22}
\]
where $(x', u')$ is the state-action pair after $(x, u)$ following the policy $\pi_{K,b}$ and the state transition. Combining (E.21) and (E.22), we obtain
\[
\psi(x, u)^\top \alpha_{K,b} = c(x, u) - J(K, b) + \mathbb{E}_{\pi_{K,b}}\bigl[ \psi(x', u') \,\big|\, x, u \bigr]^\top \alpha_{K,b}. \tag{E.23}
\]
By left multiplying both sides of (E.23) by $\psi(x, u)$ and taking the expectation, we have
\[
\mathbb{E}_{\pi_{K,b}}\Bigl\{ \psi(x, u)\bigl( \psi(x, u) - \psi(x', u') \bigr)^\top \Bigr\} \cdot \alpha_{K,b} + \mathbb{E}_{\pi_{K,b}}\bigl[ \psi(x, u) \bigr] \cdot J(K, b) = \mathbb{E}_{\pi_{K,b}}\bigl[ c(x, u)\psi(x, u) \bigr].
\]
Combining the definition of the matrix $\Theta_{K,b}$ in (B.7), we have
\[
\begin{pmatrix} 1 & 0 \\ \mathbb{E}_{\pi_{K,b}}[\psi(x, u)] & \Theta_{K,b} \end{pmatrix} \begin{pmatrix} J(K, b) \\ \alpha_{K,b} \end{pmatrix} = \begin{pmatrix} J(K, b) \\ \mathbb{E}_{\pi_{K,b}}[c(x, u)\psi(x, u)] \end{pmatrix},
\]
which concludes the proof of the proposition.

E.8 Proof of Proposition B.6

Proof. Invertibility and upper bound. We denote by $z_t = (x_t^\top, u_t^\top)^\top$ for any $t \ge 0$.
Then following from the state transition and the policy $\pi_{K,b}$, the transition of $\{z_t\}_{t \ge 0}$ takes the form
\[
z_{t+1} = Lz_t + \nu + \delta_t, \tag{E.24}
\]
where $L$, $\nu$, and $\delta_t$ are defined as
\[
L = \begin{pmatrix} A & B \\ -KA & -KB \end{pmatrix}, \qquad \nu = \begin{pmatrix} A\mu + d \\ -K(A\mu + d) + b \end{pmatrix}, \qquad \delta_t = \begin{pmatrix} \omega_t \\ -K\omega_t + \sigma\eta_t \end{pmatrix}.
\]
Note that $L$ also takes the form
\[
L = \begin{pmatrix} I \\ -K \end{pmatrix} \begin{pmatrix} A & B \end{pmatrix}.
\]
Combining the fact that $\rho(UV) = \rho(VU)$ for any matrices $U$ and $V$, we know that $\rho(L) = \rho(A - BK) < 1$, which verifies the stability of (E.24). Following from the stability of (E.24), we know that the Markov chain generated by (E.24) admits a unique stationary distribution $N(\mu_z, \Sigma_z)$, where $\mu_z$ and $\Sigma_z$ satisfy
\[
\mu_z = L\mu_z + \nu, \qquad \Sigma_z = L\Sigma_z L^\top + \Psi_\delta, \qquad \text{where } \Psi_\delta = \begin{pmatrix} \Psi_\omega & -\Psi_\omega K^\top \\ -K\Psi_\omega & K\Psi_\omega K^\top + \sigma^2 I \end{pmatrix}.
\]
Also, we know that $\Sigma_z$ takes the form
\[
\Sigma_z = \mathrm{Cov}\begin{pmatrix} x \\ u \end{pmatrix} = \begin{pmatrix} \Phi_K & -\Phi_K K^\top \\ -K\Phi_K & K\Phi_K K^\top + \sigma^2 I \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & \sigma^2 I \end{pmatrix} + \begin{pmatrix} I \\ -K \end{pmatrix} \Phi_K \begin{pmatrix} I \\ -K \end{pmatrix}^\top, \tag{E.25}
\]
where $\Phi_K$ is defined in (3.3). The following lemma establishes the form of $\Theta_{K,b}$.

Lemma E.2. The matrix $\Theta_{K,b}$ in (B.7) takes the form
\[
\Theta_{K,b} = \begin{pmatrix} 2(\Sigma_z \otimes_s \Sigma_z)(I - L \otimes_s L)^\top & 0 \\ 0 & \Sigma_z(I - L)^\top \end{pmatrix}.
\]
Proof. See §F.9 for a detailed proof.

Note that since $\rho(L) < 1$, both $I - L \otimes_s L$ and $I - L$ are invertible. Therefore, by Lemma E.2, the matrix $\Theta_{K,b}$ is invertible. This finishes the proof of the invertibility of $\Theta_{K,b}$. Moreover, by (E.25) and Lemma E.2, we upper bound the spectral norm of $\Theta_{K,b}$ as
\[
\|\Theta_{K,b}\|_2 \le 2\max\Bigl\{ \|\Sigma_z\|_2^2 \cdot \bigl( 1 + \|L\|_2^2 \bigr), \; \|\Sigma_z\|_2 \cdot \bigl( 1 + \|L\|_2 \bigr) \Bigr\} \le 4\bigl( 1 + \|K\|_F^2 \bigr)^2 \cdot \|\Phi_K\|_2^2,
\]
which proves the upper bound of $\|\Theta_{K,b}\|_2$.

Minimum singular value. To lower bound $\sigma_{\min}(\widetilde{\Theta}_{K,b})$, we only need to upper bound $\sigma_{\max}(\widetilde{\Theta}_{K,b}^{-1}) = \|\widetilde{\Theta}_{K,b}^{-1}\|_2$. We first calculate $\widetilde{\Theta}_{K,b}^{-1}$. Recall that the matrix $\widetilde{\Theta}_{K,b}$ in (B.8) takes the form
\[
\widetilde{\Theta}_{K,b} = \begin{pmatrix} 1 & 0 \\ \mathbb{E}_{\pi_{K,b}}[\psi(x, u)] & \Theta_{K,b} \end{pmatrix}.
\]
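Three facts used here are easy to check numerically: the spectral-radius identity $\rho(L) = \rho(A - BK)$, the stationary fixed-point equations for $\mu_z$ and $\Sigma_z$, and the inverse pattern of a block lower-triangular matrix of the same shape as $\widetilde{\Theta}_{K,b}$. All matrices and dimensions below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(8)
m, k = 3, 2
A = 0.5 * np.eye(m)                        # hypothetical stable dynamics
B = rng.standard_normal((m, k))
K = 0.02 * rng.standard_normal((k, m))

L = np.block([[A, B], [-K @ A, -K @ B]])   # L = [I; -K] [A  B]
# rho(UV) = rho(VU) gives rho(L) = rho(A - BK)
assert abs(max(abs(np.linalg.eigvals(L)))
           - max(abs(np.linalg.eigvals(A - B @ K)))) < 1e-8

# stationary mean solves mu_z = L mu_z + nu
nu = rng.standard_normal(m + k)
mu_z = np.linalg.solve(np.eye(m + k) - L, nu)
assert np.allclose(mu_z, L @ mu_z + nu)

# stationary covariance solves Sigma_z = L Sigma_z L^T + Psi_delta
sigma, Psi_w = 0.3, np.eye(m)
Psi_delta = np.block([[Psi_w, -Psi_w @ K.T],
                      [-K @ Psi_w, K @ Psi_w @ K.T + sigma**2 * np.eye(k)]])
Sigma_z = np.zeros_like(L)
for _ in range(500):                       # fixed-point iteration, rho(L) < 1
    Sigma_z = L @ Sigma_z @ L.T + Psi_delta
assert np.allclose(Sigma_z, L @ Sigma_z @ L.T + Psi_delta, atol=1e-10)

# a matrix [[1, 0], [v, Theta]] with invertible Theta has inverse
# [[1, 0], [-Theta^{-1} v, Theta^{-1}]], the pattern used for Theta_tilde
p = m + k
Theta = rng.standard_normal((p, p)) + 3 * np.eye(p)   # generic invertible block
v = rng.standard_normal((p, 1))
Tt = np.block([[np.ones((1, 1)), np.zeros((1, p))], [v, Theta]])
Ti = np.linalg.inv(Theta)
Tt_inv = np.block([[np.ones((1, 1)), np.zeros((1, p))], [-Ti @ v, Ti]])
assert np.allclose(Tt @ Tt_inv, np.eye(p + 1), atol=1e-8)
```

The last check mirrors the closed-form inverse of $\widetilde{\Theta}_{K,b}$ derived below.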
By the definition of the feature vector $\psi(x, u)$ in (B.5), the vector $\widetilde{\sigma}_z = \mathbb{E}_{\pi_{K,b}}[\psi(x, u)]$ takes the form
\[
\widetilde{\sigma}_z = \mathbb{E}_{\pi_{K,b}}\bigl[ \psi(x, u) \bigr] = \begin{pmatrix} \mathrm{svec}(\Sigma_z) \\ 0_{k+m} \end{pmatrix},
\]
where $0_{k+m}$ denotes the all-zero column vector of dimension $k + m$. Also, since $\Theta_{K,b}$ is invertible, the matrix $\widetilde{\Theta}_{K,b}$ is also invertible, whose inverse takes the form
\[
\widetilde{\Theta}_{K,b}^{-1} = \begin{pmatrix} 1 & 0 \\ -\Theta_{K,b}^{-1} \cdot \widetilde{\sigma}_z & \Theta_{K,b}^{-1} \end{pmatrix}.
\]
The following lemma upper bounds the spectral norm of $\widetilde{\Theta}_{K,b}^{-1}$.

Lemma E.3. The spectral norm of the matrix $\widetilde{\Theta}_{K,b}^{-1}$ is upper bounded by a positive constant $\widetilde{\lambda}_K$, where $\widetilde{\lambda}_K$ only depends on $\|K\|_2$ and $\rho(A - BK)$.

Proof. See §F.10 for a detailed proof.

By Lemma E.3, we know that $\sigma_{\min}(\widetilde{\Theta}_{K,b})$ is lower bounded by a positive constant $\lambda_K = 1/\widetilde{\lambda}_K$, which only depends on $\|K\|_2$ and $\rho(A - BK)$. This concludes the proof of the proposition.

F Proofs of Lemmas

F.1 Proof of Lemma D.1

Proof. Following from (3.4), it holds that
\[
y^\top P_{K_2} y = \sum_{t \ge 0} y^\top \bigl( (A - BK_2)^t \bigr)^\top (Q + K_2^\top RK_2)(A - BK_2)^t y. \tag{F.1}
\]
Meanwhile, by the state transition $y_{t+1} = (A - BK_2)y_t$, we know that
\[
y_t = (A - BK_2)^t y_0 = (A - BK_2)^t y. \tag{F.2}
\]
By plugging (F.2) into (F.1), it holds that
\[
y^\top P_{K_2} y = \sum_{t \ge 0} y_t^\top (Q + K_2^\top RK_2) y_t = \sum_{t \ge 0} (y_t^\top Q y_t + y_t^\top K_2^\top RK_2 y_t). \tag{F.3}
\]
Also, since $y_t \to 0$ as $t \to \infty$, the telescoping sum gives
\[
y^\top P_{K_1} y = -\sum_{t \ge 0} (y_{t+1}^\top P_{K_1} y_{t+1} - y_t^\top P_{K_1} y_t). \tag{F.4}
\]
Combining (F.3) and (F.4), we have
\[
y^\top P_{K_2} y - y^\top P_{K_1} y = \sum_{t \ge 0} (y_t^\top Q y_t + y_t^\top K_2^\top RK_2 y_t + y_{t+1}^\top P_{K_1} y_{t+1} - y_t^\top P_{K_1} y_t).
\]
(F.5)

Also, by the state transition $y_{t+1} = (A - BK_2)y_t$, it holds for any $t \ge 0$ that
\[
y_t^\top Q y_t + y_t^\top K_2^\top RK_2 y_t + y_{t+1}^\top P_{K_1} y_{t+1} - y_t^\top P_{K_1} y_t = y_t^\top \bigl( Q + (K_2 - K_1 + K_1)^\top R(K_2 - K_1 + K_1) \bigr) y_t + y_t^\top \bigl( A - BK_1 - B(K_2 - K_1) \bigr)^\top P_{K_1} \bigl( A - BK_1 - B(K_2 - K_1) \bigr) y_t - y_t^\top P_{K_1} y_t = 2y_t^\top (K_2 - K_1)^\top \bigl( (R + B^\top P_{K_1} B)K_1 - B^\top P_{K_1} A \bigr) y_t + y_t^\top (K_2 - K_1)^\top (R + B^\top P_{K_1} B)(K_2 - K_1) y_t = 2y_t^\top (K_2 - K_1)^\top (\Upsilon^{22}_{K_1} K_1 - \Upsilon^{21}_{K_1}) y_t + y_t^\top (K_2 - K_1)^\top \Upsilon^{22}_{K_1}(K_2 - K_1) y_t, \tag{F.6}
\]
where the matrix $\Upsilon_{K_1}$ is defined in (3.7). By plugging (F.6) into (F.5), we have
\[
y^\top P_{K_2} y - y^\top P_{K_1} y = \sum_{t \ge 0} \Bigl[ 2y_t^\top (K_2 - K_1)^\top (\Upsilon^{22}_{K_1} K_1 - \Upsilon^{21}_{K_1}) y_t + y_t^\top (K_2 - K_1)^\top \Upsilon^{22}_{K_1}(K_2 - K_1) y_t \Bigr] = \sum_{t \ge 0} D_{K_1, K_2}(y_t),
\]
where
\[
D_{K_1, K_2}(y) = 2y^\top (K_2 - K_1)^\top (\Upsilon^{22}_{K_1} K_1 - \Upsilon^{21}_{K_1}) y + y^\top (K_2 - K_1)^\top \Upsilon^{22}_{K_1}(K_2 - K_1) y.
\]
We finish the proof of the lemma.

F.2 Proof of Lemma D.2

Proof. We prove (D.16) and (D.17) separately in the sequel.

Proof of (D.16). From the definition of $J_1(K)$ in (3.6), we have
\[
J_1(K) - J_1(K^*) = \mathrm{tr}(P_K \Psi_\epsilon - P_{K^*}\Psi_\epsilon) = \mathbb{E}_{y \sim N(0, \Psi_\epsilon)}(y^\top P_K y - y^\top P_{K^*} y) = -\mathbb{E}\Bigl[ \sum_{t \ge 0} D_{K, K^*}(y_t) \Bigr], \tag{F.7}
\]
where in the last equality, we apply Lemma D.1 and the expectation is taken following the transition $y_{t+1} = (A - BK^*)y_t$ with initial state $y_0 \sim N(0, \Psi_\epsilon)$. Here we denote
\[
D_{K, K^*}(y) = 2y^\top (K^* - K)^\top (\Upsilon^{22}_K K - \Upsilon^{21}_K) y + y^\top (K^* - K)^\top \Upsilon^{22}_K (K^* - K) y.
\]
Also, by completing the square, we write $D_{K, K^*}(y)$ as
\[
D_{K, K^*}(y) = y^\top \bigl( K^* - K + (\Upsilon^{22}_K)^{-1}(\Upsilon^{22}_K K - \Upsilon^{21}_K) \bigr)^\top \Upsilon^{22}_K \bigl( K^* - K + (\Upsilon^{22}_K)^{-1}(\Upsilon^{22}_K K - \Upsilon^{21}_K) \bigr) y - y^\top (\Upsilon^{22}_K K - \Upsilon^{21}_K)^\top (\Upsilon^{22}_K)^{-1}(\Upsilon^{22}_K K - \Upsilon^{21}_K) y. \tag{F.8}
\]
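The cost-difference decomposition of Lemma D.1, $y^\top P_{K_2} y - y^\top P_{K_1} y = \sum_{t \ge 0} D_{K_1, K_2}(y_t)$ along the $K_2$-controlled trajectory, can be verified numerically. The matrices below are hypothetical, and the Riccati/Lyapunov recursions are solved by naive fixed-point iteration:

```python
import numpy as np

def lyap_T(Acl, W, iters=2000):
    # solves P = W + Acl^T P Acl by fixed-point iteration (rho(Acl) < 1)
    P = np.zeros_like(W)
    for _ in range(iters):
        P = W + Acl.T @ P @ Acl
    return P

rng = np.random.default_rng(9)
m, k = 3, 2
A = 0.4 * np.eye(m)                           # hypothetical stable dynamics
B = rng.standard_normal((m, k))
K1 = 0.05 * rng.standard_normal((k, m))
K2 = K1 + 0.05 * rng.standard_normal((k, m))
Q, R = np.eye(m), np.eye(k)

P1 = lyap_T(A - B @ K1, Q + K1.T @ R @ K1)
P2 = lyap_T(A - B @ K2, Q + K2.T @ R @ K2)

Ups22 = R + B.T @ P1 @ B                      # the (2,2) block of Upsilon_{K_1}
Ups21 = B.T @ P1 @ A                          # the (2,1) block
dK = K2 - K1

def D(y):
    return (2 * y @ dK.T @ (Ups22 @ K1 - Ups21) @ y
            + y @ dK.T @ Ups22 @ dK @ y)

y = rng.standard_normal(m)
lhs = y @ P2 @ y - y @ P1 @ y
rhs, yt = 0.0, y.copy()
for _ in range(500):                          # sum D along the K2-controlled trajectory
    rhs += D(yt)
    yt = (A - B @ K2) @ yt
assert abs(lhs - rhs) < 1e-8
```

The identity is exact; only the truncation of the infinite sum and the Lyapunov iterations introduce (negligible) error here.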
Note that the first term on the RHS of (F.8) is nonnegative, since it is a quadratic form with a positive definite matrix. We therefore lower bound $D_{K, K^*}(y)$ as
\[
D_{K, K^*}(y) \ge -y^\top (\Upsilon^{22}_K K - \Upsilon^{21}_K)^\top (\Upsilon^{22}_K)^{-1}(\Upsilon^{22}_K K - \Upsilon^{21}_K) y. \tag{F.9}
\]
Combining (F.7) and (F.9), it holds that
\[
J_1(K) - J_1(K^*) \le \Bigl\| \mathbb{E}\Bigl[ \sum_{t \ge 0} y_t y_t^\top \Bigr] \Bigr\|_2 \cdot \mathrm{tr}\bigl( (\Upsilon^{22}_K K - \Upsilon^{21}_K)^\top (\Upsilon^{22}_K)^{-1}(\Upsilon^{22}_K K - \Upsilon^{21}_K) \bigr) = \|\Phi_{K^*}\|_2 \cdot \mathrm{tr}\bigl( (\Upsilon^{22}_K K - \Upsilon^{21}_K)^\top (\Upsilon^{22}_K)^{-1}(\Upsilon^{22}_K K - \Upsilon^{21}_K) \bigr) \le \bigl\| (\Upsilon^{22}_K)^{-1} \bigr\|_2 \cdot \|\Phi_{K^*}\|_2 \cdot \mathrm{tr}\bigl( (\Upsilon^{22}_K K - \Upsilon^{21}_K)^\top (\Upsilon^{22}_K K - \Upsilon^{21}_K) \bigr) \le \sigma_{\min}^{-1}(R) \cdot \|\Phi_{K^*}\|_2 \cdot \mathrm{tr}\bigl( (\Upsilon^{22}_K K - \Upsilon^{21}_K)^\top (\Upsilon^{22}_K K - \Upsilon^{21}_K) \bigr),
\]
where the last line comes from the fact that $\Upsilon^{22}_K = R + B^\top P_K B \succeq R$. This completes the proof of (D.16).

Proof of (D.17). Note that for any $\widetilde{K}$, it holds by the optimality of $K^*$ that
\[
J_1(K) - J_1(K^*) \ge J_1(K) - J_1(\widetilde{K}) = -\mathbb{E}\Bigl[ \sum_{t \ge 0} D_{K, \widetilde{K}}(y_t) \Bigr], \tag{F.10}
\]
where the expectation is taken following the transition $y_{t+1} = (A - B\widetilde{K})y_t$ with initial state $y_0 \sim N(0, \Psi_\epsilon)$. By taking $\widetilde{K} = K - (\Upsilon^{22}_K)^{-1}(\Upsilon^{22}_K K - \Upsilon^{21}_K)$ and following from a calculation similar to (F.8), the function $D_{K, \widetilde{K}}(y)$ takes the form
\[
D_{K, \widetilde{K}}(y) = -y^\top (\Upsilon^{22}_K K - \Upsilon^{21}_K)^\top (\Upsilon^{22}_K)^{-1}(\Upsilon^{22}_K K - \Upsilon^{21}_K) y. \tag{F.11}
\]
Combining (F.10) and (F.11), it holds that
\[
J_1(K) - J_1(K^*) \ge \mathrm{tr}\bigl( \Phi_{\widetilde{K}}(\Upsilon^{22}_K K - \Upsilon^{21}_K)^\top (\Upsilon^{22}_K)^{-1}(\Upsilon^{22}_K K - \Upsilon^{21}_K) \bigr) \ge \sigma_{\min}(\Psi_\epsilon) \cdot \|\Upsilon^{22}_K\|_2^{-1} \cdot \mathrm{tr}\bigl( (\Upsilon^{22}_K K - \Upsilon^{21}_K)^\top (\Upsilon^{22}_K K - \Upsilon^{21}_K) \bigr),
\]
where we use the fact that $\Phi_{\widetilde{K}} = (A - B\widetilde{K})\Phi_{\widetilde{K}}(A - B\widetilde{K})^\top + \Psi_\epsilon \succeq \Psi_\epsilon$ in the last line. This finishes the proof of (D.17).

F.3 Proof of Lemma D.3

Proof. By Proposition B.2, we have
\[
\bigl| J_1(\widetilde{K}_{n+1}) - J_1(K_{n+1}) \bigr| = \bigl| \mathrm{tr}\bigl( (P_{\widetilde{K}_{n+1}} - P_{K_{n+1}})\Psi_\epsilon \bigr) \bigr| \le \|P_{\widetilde{K}_{n+1}} - P_{K_{n+1}}\|_2 \cdot \|\Psi_\epsilon\|_F.
\]
(F.12)

The following lemma upper bounds the term $\|P_{\widetilde{K}_{n+1}} - P_{K_{n+1}}\|_2$.

Lemma F.1. Suppose that the parameters $K$ and $\widetilde{K}$ satisfy
\[
\|\widetilde{K} - K\|_2 \cdot \bigl( \|A - BK\|_2 + 1 \bigr) \cdot \|\Phi_K\|_2 \le \sigma_{\min}(\Psi_\omega)/4 \cdot \|B\|_2^{-1}. \tag{F.13}
\]
Then it holds that
\[
\|P_{\widetilde{K}} - P_K\|_2 \le 6 \cdot \sigma_{\min}^{-1}(\Psi_\omega) \cdot \|\Phi_K\|_2 \cdot \|K\|_2 \cdot \|R\|_2 \cdot \|\widetilde{K} - K\|_2 \cdot \bigl( \|B\|_2 \cdot \|K\|_2 \cdot \|A - BK\|_2 + \|B\|_2 \cdot \|K\|_2 + 1 \bigr). \tag{F.14}
\]
Proof. See Lemma 5.7 in Yang et al. (2019) for a detailed proof.

To use Lemma F.1, it suffices to verify that $\widetilde{K}_{n+1}$ and $K_{n+1}$ satisfy (F.13). Note that from the definitions of $K_{n+1}$ and $\widetilde{K}_{n+1}$ in (D.18) and (D.21), respectively, we have
\[
\|\widetilde{K}_{n+1} - K_{n+1}\|_2 \cdot \bigl( \|A - B\widetilde{K}_{n+1}\|_2 + 1 \bigr) \cdot \|\Phi_{\widetilde{K}_{n+1}}\|_2 \le \gamma \cdot \|\widehat{\Upsilon}_{K_n} - \Upsilon_{K_n}\|_F \cdot \bigl( 1 + \|K_n\|_2 \bigr) \cdot \bigl( \|A - B\widetilde{K}_{n+1}\|_2 + 1 \bigr) \cdot \|\Phi_{\widetilde{K}_{n+1}}\|_2. \tag{F.15}
\]
Now, we upper bound the RHS of (F.15). For the term $\|A - B\widetilde{K}_{n+1}\|_2$, it holds by the definition of $\widetilde{K}_{n+1}$ in (D.21) that
\[
\|A - B\widetilde{K}_{n+1}\|_2 \le \|A - BK_n\|_2 + \gamma \cdot \|B\|_2 \cdot \|\Upsilon^{22}_{K_n} K_n - \Upsilon^{21}_{K_n}\|_2 \le \|A - BK_n\|_2 + \gamma \cdot \|B\|_2 \cdot \|\Upsilon_{K_n}\|_2 \cdot \bigl( 1 + \|K_n\|_2 \bigr). \tag{F.16}
\]
By the definition of $\Upsilon_{K_n}$ in (3.7), we upper bound $\|\Upsilon_{K_n}\|_2$ as
\[
\|\Upsilon_{K_n}\|_2 \le \|Q\|_2 + \|R\|_2 + \bigl( \|A\|_F + \|B\|_F \bigr)^2 \cdot \|P_{K_n}\|_2 \le \|Q\|_2 + \|R\|_2 + \bigl( \|A\|_F + \|B\|_F \bigr)^2 \cdot J_1(K_0) \cdot \sigma_{\min}^{-1}(\Psi_\epsilon), \tag{F.17}
\]
where the last line comes from the fact that
\[
J_1(K_0) \ge J_1(K_n) = \mathrm{tr}\bigl( (Q + K_n^\top RK_n)\Phi_{K_n} \bigr) = \mathrm{tr}(P_{K_n}\Psi_\epsilon) \ge \|P_{K_n}\|_2 \cdot \sigma_{\min}(\Psi_\epsilon).
\]
As for the term $\|\Phi_{\widetilde{K}_{n+1}}\|_2$ in (F.15), from the fact that
\[
J_1(K_0) \ge J_1(\widetilde{K}_{n+1}) = \mathrm{tr}\bigl( (Q + \widetilde{K}_{n+1}^\top R\widetilde{K}_{n+1})\Phi_{\widetilde{K}_{n+1}} \bigr) \ge \|\Phi_{\widetilde{K}_{n+1}}\|_2 \cdot \sigma_{\min}(Q),
\]
it holds that
\[
\|\Phi_{\widetilde{K}_{n+1}}\|_2 \le J_1(K_0) \cdot \sigma_{\min}^{-1}(Q).
\]
(F.18)

Therefore, combining (F.15), (F.16), (F.17), and (F.18), we know that
\[
\|\widetilde{K}_{n+1} - K_{n+1}\|_2 \cdot \bigl( \|A - B\widetilde{K}_{n+1}\|_2 + 1 \bigr) \cdot \|\Phi_{\widetilde{K}_{n+1}}\|_2 \le \mathrm{poly}_1\bigl( \|K_n\|_2 \bigr) \cdot \|\widehat{\Upsilon}_{K_n} - \Upsilon_{K_n}\|_F. \tag{F.19}
\]
From Theorem B.8, it holds with probability at least $1 - T_n^{-4} - \widetilde{T}_n^{-6}$ that
\[
\|\widehat{\Upsilon}_{K_n} - \Upsilon_{K_n}\|_F \le \frac{\mathrm{poly}_3\bigl( \|K_n\|_F, \|\mu\|_2 \bigr)}{\lambda_{K_n} \cdot (1-\rho)^2} \cdot \frac{\log^3 T_n}{T_n^{1/4}} + \frac{\mathrm{poly}_4\bigl( \|K_n\|_F, \|b_0\|_2, \|\mu\|_2 \bigr)}{\lambda_{K_n}} \cdot \frac{\log^{1/2}\widetilde{T}_n}{\widetilde{T}_n^{1/8} \cdot (1-\rho)}, \tag{F.20}
\]
which holds for any $\rho \in (\rho(A - BK_n), 1)$. Note that from the choice of $T_n$ and $\widetilde{T}_n$ in the statement of Theorem B.4, namely
\[
T_n \ge \mathrm{poly}_5\bigl( \|K_n\|_F, \|b_0\|_2, \|\mu\|_2 \bigr) \cdot \lambda_{K_n}^{-4} \cdot \bigl( 1 - \rho(A - BK_n) \bigr)^{-9} \cdot \varepsilon^{-5}, \qquad \widetilde{T}_n \ge \mathrm{poly}_6\bigl( \|K_n\|_F, \|b_0\|_2, \|\mu\|_2 \bigr) \cdot \lambda_{K_n}^{-2} \cdot \bigl( 1 - \rho(A - BK_n) \bigr)^{-12} \cdot \varepsilon^{-12},
\]
it holds that
\[
\frac{\mathrm{poly}_3\bigl( \|K_n\|_F, \|\mu\|_2 \bigr)}{\lambda_{K_n} \cdot (1-\rho)^2} \cdot \frac{\log^3 T_n}{T_n^{1/4}} + \frac{\mathrm{poly}_4\bigl( \|K_n\|_F, \|b_0\|_2, \|\mu\|_2 \bigr)}{\lambda_{K_n}} \cdot \frac{\log^{1/2}\widetilde{T}_n}{\widetilde{T}_n^{1/8} \cdot (1-\rho)} \le \min\Bigl\{ \bigl[ \mathrm{poly}_1\bigl( \|K_n\|_2 \bigr) \bigr]^{-1} \cdot \sigma_{\min}(\Psi_\omega)/4 \cdot \|B\|_2^{-1}, \; \bigl[ \mathrm{poly}_2\bigl( \|K_n\|_2 \bigr) \bigr]^{-1} \cdot \varepsilon/8 \cdot \gamma \cdot \sigma_{\min}(\Psi_\epsilon) \cdot \sigma_{\min}(R) \cdot \|\Phi_{K^*}\|_2^{-1} \cdot \|\Psi_\epsilon\|_F^{-1} \Bigr\}. \tag{F.21}
\]
Combining (F.19), (F.20), and (F.21), we know that (F.13) holds with probability at least $1 - \varepsilon^{15}$ for sufficiently small $\varepsilon > 0$. Meanwhile, by (F.16), (F.17), and (F.18), the RHS of (F.14) is upper bounded as
\[
6 \cdot \sigma_{\min}^{-1}(\Psi_\omega) \cdot \|\Phi_{\widetilde{K}_{n+1}}\|_2 \cdot \|\widetilde{K}_{n+1}\|_2 \cdot \|R\|_2 \cdot \|\widetilde{K}_{n+1} - K_{n+1}\|_2 \cdot \bigl( \|B\|_2 \cdot \|\widetilde{K}_{n+1}\|_2 \cdot \|A - B\widetilde{K}_{n+1}\|_2 + \|B\|_2 \cdot \|\widetilde{K}_{n+1}\|_2 + 1 \bigr) \le \mathrm{poly}_2\bigl( \|K_n\|_2 \bigr) \cdot \|\widehat{\Upsilon}_{K_n} - \Upsilon_{K_n}\|_F.
\]
(F.22)

Now, by Lemma F.1, it holds with probability at least $1 - \varepsilon^{15}$ that
\[
\|P_{\widetilde{K}_{n+1}} - P_{K_{n+1}}\|_2 \le 6 \cdot \sigma_{\min}^{-1}(\Psi_\omega) \cdot \|\Phi_{\widetilde{K}_{n+1}}\|_2 \cdot \|\widetilde{K}_{n+1}\|_2 \cdot \|R\|_2 \cdot \|\widetilde{K}_{n+1} - K_{n+1}\|_2 \cdot \bigl( \|B\|_2 \cdot \|\widetilde{K}_{n+1}\|_2 \cdot \|A - B\widetilde{K}_{n+1}\|_2 + \|B\|_2 \cdot \|\widetilde{K}_{n+1}\|_2 + 1 \bigr) \le \mathrm{poly}_2\bigl( \|K_n\|_2 \bigr) \cdot \|\widehat{\Upsilon}_{K_n} - \Upsilon_{K_n}\|_F \le \varepsilon/8 \cdot \gamma \cdot \sigma_{\min}(\Psi_\epsilon) \cdot \sigma_{\min}(R) \cdot \|\Phi_{K^*}\|_2^{-1} \cdot \|\Psi_\epsilon\|_F^{-1}, \tag{F.23}
\]
where the second inequality comes from (F.22), and the last inequality comes from (F.20) and (F.21). Combining (F.12) and (F.23), it holds with probability at least $1 - \varepsilon^{15}$ that
\[
\bigl| J_1(\widetilde{K}_{n+1}) - J_1(K_{n+1}) \bigr| \le \gamma \cdot \sigma_{\min}(\Psi_\epsilon) \cdot \sigma_{\min}(R) \cdot \|\Phi_{K^*}\|_2^{-1} \cdot \varepsilon/4,
\]
which concludes the proof of the lemma.

F.4 Proof of Lemma D.4

Proof. Note that $\Upsilon^{22}_{K^*} K^* - \Upsilon^{21}_{K^*}$ is the natural gradient of $J_1$ at the minimizer $K^*$, which implies that
\[
\Upsilon^{22}_{K^*} K^* - \Upsilon^{21}_{K^*} = 0. \tag{F.24}
\]
By Lemma D.1, it holds that
\[
J_1(K) - J_1(K^*) = \mathrm{tr}(P_K \Psi_\epsilon - P_{K^*}\Psi_\epsilon) = \mathbb{E}_{y \sim N(0, \Psi_\epsilon)}(y^\top P_K y - y^\top P_{K^*} y) = \mathbb{E}\Bigl[ \sum_{t \ge 0} \bigl( 2y_t^\top (K - K^*)^\top (\Upsilon^{22}_{K^*} K^* - \Upsilon^{21}_{K^*}) y_t + y_t^\top (K - K^*)^\top \Upsilon^{22}_{K^*}(K - K^*) y_t \bigr) \Bigr] = \mathbb{E}\Bigl[ \sum_{t \ge 0} y_t^\top (K - K^*)^\top \Upsilon^{22}_{K^*}(K - K^*) y_t \Bigr], \tag{F.25}
\]
where we use (F.24) in the last equality. Here the expectations are taken following the transition $y_{t+1} = (A - BK)y_t$ with initial state $y_0 \sim N(0, \Psi_\epsilon)$. Also, we have
\[
\mathbb{E}\Bigl[ \sum_{t \ge 0} y_t^\top (K - K^*)^\top \Upsilon^{22}_{K^*}(K - K^*) y_t \Bigr] = \mathrm{tr}\bigl( \Phi_K(K - K^*)^\top \Upsilon^{22}_{K^*}(K - K^*) \bigr) \ge \sigma_{\min}(\Phi_K) \cdot \sigma_{\min}(\Upsilon^{22}_{K^*}) \cdot \mathrm{tr}\bigl( (K - K^*)^\top (K - K^*) \bigr) \ge \sigma_{\min}(\Psi_\epsilon) \cdot \sigma_{\min}(R) \cdot \|K - K^*\|_F^2, \tag{F.26}
\]
where we use the facts that $\Phi_K = (A - BK)\Phi_K(A - BK)^\top + \Psi_\epsilon \succeq \Psi_\epsilon$ and $\Upsilon^{22}_{K^*} = R + B^\top P_{K^*} B \succeq R$ in the last line. Combining (F.25) and (F.26), we have
\[
J_1(K) - J_1(K^*) \ge \sigma_{\min}(\Psi_\epsilon) \cdot \sigma_{\min}(R) \cdot \|K - K^*\|_F^2.
\]
We conclude the proof of the lemma.

F.5 Proof of Lemma D.5

Proof.
Following from Proposition 3.3, we have
\[
J_2(K_N, b_{h+1}) - J_2(K_N, \widetilde b_{h+1}) \le \gamma_b\cdot\nabla_bJ_2(K_N, \widetilde b_{h+1})^\top\bigl(\nabla_bJ_2(K_N, b_h) - \widehat\nabla_bJ_2(K_N, b_h)\bigr) + (\gamma_b)^2\cdot\nu_{K_N}/2\cdot\bigl\|\nabla_bJ_2(K_N, b_h) - \widehat\nabla_bJ_2(K_N, b_h)\bigr\|_2^2,
\]
\[
J_2(K_N, \widetilde b_{h+1}) - J_2(K_N, b_{h+1}) \le -\gamma_b\cdot\nabla_bJ_2(K_N, \widetilde b_{h+1})^\top\bigl(\nabla_bJ_2(K_N, b_h) - \widehat\nabla_bJ_2(K_N, b_h)\bigr) - (\gamma_b)^2\cdot\iota_{K_N}/2\cdot\bigl\|\nabla_bJ_2(K_N, b_h) - \widehat\nabla_bJ_2(K_N, b_h)\bigr\|_2^2, \tag{F.27}
\]
where $\nu_{K_N}$ and $\iota_{K_N}$ are defined in Proposition 3.3. Also, following from Proposition B.3, it holds that
\[
\bigl\|\nabla_bJ_2(K_N, \widetilde b_{h+1})\bigr\|_2 \le \mathrm{poly}_1\bigl(\|K_N\|_F, \|b_h\|_2, \|\mu\|_2, J(K_N, b_0)\bigr)\cdot\bigl(1 - \rho(A - BK_N)\bigr)^{-1}. \tag{F.28}
\]
Combining (F.27), (F.28), and the fact that $\nu_{K_N} \le \iota_{K_N} \le [1 - \rho(A - BK_N)]^{-2}\cdot\mathrm{poly}_2(\|K_N\|_2)$, we know that
\[
\bigl|J_2(K_N, b_{h+1}) - J_2(K_N, \widetilde b_{h+1})\bigr| \le (\gamma_b)^2\cdot\mathrm{poly}_2\bigl(\|K_N\|_2\bigr)\cdot\bigl\|\nabla_bJ_2(K_N, b_h) - \widehat\nabla_bJ_2(K_N, b_h)\bigr\|_2^2\cdot\bigl(1 - \rho(A - BK_N)\bigr)^{-2} + \gamma_b\cdot\mathrm{poly}_1\bigl(\|K_N\|_F, \|b_h\|_2, \|\mu\|_2, J(K_N, b_0)\bigr)\cdot\bigl\|\nabla_bJ_2(K_N, b_h) - \widehat\nabla_bJ_2(K_N, b_h)\bigr\|_2\cdot\bigl(1 - \rho(A - BK_N)\bigr)^{-1}. \tag{F.29}
\]
Note that from the definitions of $\widehat\nabla_bJ_2(K_N, b_h)$ and $\nabla_bJ_2(K_N, b_h)$ in (D.31) and (D.33), respectively, it holds by the triangle inequality that
\[
\bigl\|\nabla_bJ_2(K_N, b_h) - \widehat\nabla_bJ_2(K_N, b_h)\bigr\|_2 \le \|\widehat\Upsilon^{22}_{K_N} - \Upsilon^{22}_{K_N}\|_2\cdot\|K_N\|_2\cdot\|\widehat\mu_{K_N,b_h}\|_2 + \|\Upsilon^{22}_{K_N}\|_2\cdot\|K_N\|_2\cdot\|\widehat\mu_{K_N,b_h} - \mu_{K_N,b_h}\|_2 + \|\widehat\Upsilon^{22}_{K_N} - \Upsilon^{22}_{K_N}\|_2\cdot\|b_h\|_2 + \|\widehat\Upsilon^{21}_{K_N} - \Upsilon^{21}_{K_N}\|_2\cdot\|\widehat\mu_{K_N,b_h}\|_2 + \|\Upsilon^{21}_{K_N}\|_2\cdot\|\widehat\mu_{K_N,b_h} - \mu_{K_N,b_h}\|_2 + \|\widehat q_{K_N,b_h} - q_{K_N,b_h}\|_2.
\]
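The two inequalities in (F.27) are smoothness (descent-lemma) bounds comparing the exact update $\widetilde b_{h+1} = b_h - \gamma_b\nabla_bJ_2$ with the inexact update $b_{h+1} = b_h - \gamma_b\widehat\nabla_bJ_2$. The mechanism is easy to check numerically for a quadratic objective standing in for $J_2(K_N,\cdot)$; the following is a minimal sketch, with all matrices and the gradient error chosen arbitrarily for illustration:

```python
import numpy as np

# Quadratic stand-in for J_2(K_N, .): J(b) = 0.5 b'Hb + g'b,
# whose curvature is bounded between nu and iota.
H = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -0.5])
J = lambda b: 0.5 * b @ H @ b + g @ b
grad = lambda b: H @ b + g

nu, iota = np.linalg.eigvalsh(H)      # smallest / largest curvature
gamma = 0.1
b_h = np.array([0.5, 0.5])
err = np.array([0.02, -0.01])          # gradient estimation error: grad - grad_hat

b_exact = b_h - gamma * grad(b_h)               # step with the true gradient
b_inexact = b_h - gamma * (grad(b_h) - err)     # step with the estimated gradient

# Smoothness bound in the spirit of (F.27): the cost gap between the two updates
# is controlled by a first-order term in the error plus a curvature * ||err||^2 term.
gap = J(b_inexact) - J(b_exact)
bound = gamma * abs(grad(b_exact) @ err) + gamma**2 * iota / 2 * (err @ err)
assert gap <= bound + 1e-12
```

Since $b_{h+1} = \widetilde b_{h+1} + \gamma_b\cdot\mathrm{err}$ exactly, a Taylor expansion of the quadratic around $\widetilde b_{h+1}$ reproduces the first-order and second-order terms appearing in (F.27).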
By Theorem B.8, combining the facts that $J_2(K_N, b_h) \le J_2(K_N, b_0)$ and $\|\mu_{K_N,b}\|_2 \le J(K_N, b_0)/\sigma_{\min}(Q)$, we know that with probability at least $1 - (T^b_n)^{-4} - (\widetilde T^b_n)^{-6}$, it holds for any $\rho \in (\rho(A - BK_N), 1)$ that
\[
\bigl\|\nabla_bJ_2(K_N, b_h) - \widehat\nabla_bJ_2(K_N, b_h)\bigr\|_2 \le \lambda_{K_N}^{-1}\cdot\mathrm{poly}_3\bigl(\|K_N\|_F, \|b_h\|_2, \|\mu\|_2, J_2(K_N, b_0)\bigr)\cdot\Bigl(\frac{\log^3 T^b_n}{(T^b_n)^{1/4}(1-\rho)^2} + \frac{\log^{1/2}\widetilde T^b_n}{(\widetilde T^b_n)^{1/8}\cdot(1-\rho)}\Bigr). \tag{F.30}
\]
Following from the choices of $\gamma_b$, $T^b_n$, and $\widetilde T^b_n$ in the statement of Theorem B.4, it holds that
\[
\gamma_b\cdot\mathrm{poly}_1\bigl(\|K_N\|_F, \|b_h\|_2, \|\mu\|_2, J(K_N, b_0)\bigr)\cdot\lambda_{K_N}^{-1}\cdot\mathrm{poly}_3\bigl(\|K_N\|_F, \|b_h\|_2, \|\mu\|_2, J_2(K_N, b_0)\bigr)\cdot\Bigl(\frac{\log^3 T^b_n}{(T^b_n)^{1/4}(1-\rho)^2} + \frac{\log^{1/2}\widetilde T^b_n}{(\widetilde T^b_n)^{1/8}\cdot(1-\rho)}\Bigr)\cdot\bigl(1 - \rho(A - BK_N)\bigr)^{-1}
+ \bigl(1 - \rho(A - BK_N)\bigr)^{-2}\cdot\mathrm{poly}_3\bigl(\|K_N\|_F, \|b_h\|_2, \|\mu\|_2, J_2(K_N, b_0)\bigr)\cdot\Bigl(\frac{\log^6 T^b_n}{(T^b_n)^{1/2}(1-\rho)^4} + \frac{\log\widetilde T^b_n}{(\widetilde T^b_n)^{1/4}\cdot(1-\rho)^2}\Bigr)\cdot(\gamma_b)^2\cdot\mathrm{poly}_2\bigl(\|K_N\|_2\bigr)\cdot\lambda_{K_N}^{-1} \le \nu_{K_N}\cdot\gamma_b\cdot\varepsilon/2.
\]
Further combining (F.29) and (F.30), it holds with probability at least $1 - \varepsilon^{15}$ that
\[
\bigl|J_2(K_N, b_{h+1}) - J_2(K_N, \widetilde b_{h+1})\bigr| \le \nu_{K_N}\cdot\gamma_b\cdot\varepsilon/2.
\]
We then finish the proof of the lemma.

F.6 Proof of Lemma D.6

Proof. We show that $\zeta_{K,b} \in \mathcal V_\zeta$ and $\xi(\zeta) \in \mathcal V_\xi$ for any $\zeta \in \mathcal V_\zeta$ separately.

Part 1. First we show that $\zeta_{K,b} \in \mathcal V_\zeta$. Note that from Definition B.7, we know that $\zeta^1_{K,b} = J(K, b)$ satisfies $0 \le \zeta^1_{K,b} \le J(K_0, b_0)$. It remains to show that $\zeta^2_{K,b} = \alpha_{K,b}$ satisfies $\|\zeta^2_{K,b}\|_2 \le M_\zeta$.
By the definition of $\alpha_{K,b}$ in (B.6), we know that
\[
\|\alpha_{K,b}\|_2^2 \le \|\Upsilon_K\|_F^2 + \|\Upsilon_K\|_2^2\cdot\bigl(\|\mu_{K,b}\|_2^2 + \|\mu^u_{K,b}\|_2^2\bigr) + \bigl(\|A\|_2 + \|B\|_2\bigr)^2\cdot\bigl(\|P_K\|_2\cdot\|A\mu + d\|_2 + \|f_{K,b}\|_2\bigr)^2, \tag{F.31}
\]
where $f_{K,b} = (I - A + BK)^{-\top}\bigl[(A - BK)^\top P_K(Bb + A\mu + d) - K^\top Rb\bigr]$ and, for notational simplicity, we denote $\mu^u_{K,b} = -K\mu_{K,b} + b$. We only need to bound $\Upsilon_K$, $\mu_{K,b}$, $\mu^u_{K,b}$, $P_K$, and $f_{K,b}$.

Note that by Proposition B.2, the expected total cost $J(K, b)$ takes the form
\[
J(K, b) = \mathrm{tr}(P_K\Psi_\epsilon) + \mu_{K,b}^\top Q\mu_{K,b} + (\mu^u_{K,b})^\top R\mu^u_{K,b} + \sigma^2\cdot\mathrm{tr}(R) + \mu^\top Q\mu.
\]
Thus, we have
\[
J(K_0, b_0) \ge J(K, b) \ge \sigma_{\min}(\Psi_\omega)\cdot\mathrm{tr}(P_K) \ge \sigma_{\min}(\Psi_\omega)\cdot\|P_K\|_2,\qquad
J(K_0, b_0) \ge J(K, b) \ge \mu_{K,b}^\top Q\mu_{K,b} \ge \sigma_{\min}(Q)\cdot\|\mu_{K,b}\|_2,\qquad
J(K_0, b_0) \ge J(K, b) \ge (\mu^u_{K,b})^\top R\mu^u_{K,b} \ge \sigma_{\min}(R)\cdot\|\mu^u_{K,b}\|_2,
\]
which imply that
\[
\|P_K\|_2 \le J(K_0, b_0)/\sigma_{\min}(\Psi_\omega),\qquad \|\mu_{K,b}\|_2 \le J(K_0, b_0)/\sigma_{\min}(Q),\qquad \|\mu^u_{K,b}\|_2 \le J(K_0, b_0)/\sigma_{\min}(R). \tag{F.32}
\]
For $\Upsilon_K$, it holds that
\[
\Upsilon_K = \begin{pmatrix} Q & 0 \\ 0 & R \end{pmatrix} + \begin{pmatrix} A^\top \\ B^\top \end{pmatrix} P_K \begin{pmatrix} A & B \end{pmatrix},
\]
which gives
\[
\|\Upsilon_K\|_F \le \bigl(\|Q\|_F + \|R\|_F\bigr) + \bigl(\|A\|_F^2 + \|B\|_F^2\bigr)\cdot\|P_K\|_F,\qquad
\|\Upsilon_K\|_2 \le \bigl(\|Q\|_2 + \|R\|_2\bigr) + \bigl(\|A\|_2 + \|B\|_2\bigr)^2\cdot\|P_K\|_2.
\]
Combining (F.32) and the fact that $\|P_K\|_F \le \sqrt m\cdot\|P_K\|_2$, we know that
\[
\|\Upsilon_K\|_F \le \bigl(\|Q\|_F + \|R\|_F\bigr) + \bigl(\|A\|_F^2 + \|B\|_F^2\bigr)\cdot\sqrt m\cdot J(K_0, b_0)/\sigma_{\min}(\Psi_\omega),\qquad
\|\Upsilon_K\|_2 \le \bigl(\|Q\|_2 + \|R\|_2\bigr) + \bigl(\|A\|_2 + \|B\|_2\bigr)^2\cdot J(K_0, b_0)/\sigma_{\min}(\Psi_\omega). \tag{F.33}
\]
Now we upper bound the vector $f_{K,b}$. Note that by algebra, the vector $f_{K,b}$ takes the form
\[
f_{K,b} = -P_K\mu_{K,b} + (I - A + BK)^{-\top}\bigl(Q\mu_{K,b} - K^\top R\mu^u_{K,b}\bigr).
\]
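The bounds feeding into (F.32) all instantiate the same elementary fact: for positive semidefinite $P$ and positive definite $\Psi$, $\mathrm{tr}(P\Psi) \ge \sigma_{\min}(\Psi)\cdot\mathrm{tr}(P) \ge \sigma_{\min}(\Psi)\cdot\|P\|_2$. A quick numerical sanity check (a sketch with arbitrary random matrices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random positive semidefinite P and positive definite Psi.
M = rng.standard_normal((4, 4))
P = M @ M.T
N = rng.standard_normal((4, 4))
Psi = N @ N.T + 0.5 * np.eye(4)

sigma_min = np.linalg.eigvalsh(Psi)[0]       # smallest eigenvalue of Psi
spec_norm_P = np.linalg.eigvalsh(P)[-1]      # ||P||_2 for symmetric PSD P

# tr(P Psi) >= sigma_min(Psi) tr(P): follows from tr(P Psi) = tr(P^{1/2} Psi P^{1/2})
# and Psi >= sigma_min(Psi) I.
assert np.trace(P @ Psi) >= sigma_min * np.trace(P) - 1e-10
# tr(P) >= ||P||_2 since all eigenvalues of P are nonnegative.
assert np.trace(P) >= spec_norm_P - 1e-10
```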
Therefore, we upper bound $f_{K,b}$ as
\[
\|f_{K,b}\|_2 \le J(K_0, b_0)^2\cdot\sigma_{\min}^{-1}(\Psi_\omega)\cdot\sigma_{\min}^{-1}(Q) + \bigl(1 - \rho(A - BK)\bigr)^{-1}\cdot\bigl(\kappa_Q + \kappa_R\cdot\|K\|_F\bigr). \tag{F.34}
\]
Combining (F.31), (F.32), (F.33), and (F.34), it holds that
\[
\|\zeta^2_{K,b}\|_2 = \|\alpha_{K,b}\|_2 \le M_{\zeta,1} + M_{\zeta,2}\cdot\bigl(1 + \|K\|_F\bigr)\cdot\bigl[1 - \rho(A - BK)\bigr]^{-1}.
\]
Therefore, it holds that $\zeta_{K,b} \in \mathcal V_\zeta$.

Part 2. Now we show that for any $\zeta \in \mathcal V_\zeta$, we have $\xi(\zeta) \in \mathcal V_\xi$. Recall from (D.41) that
\[
\xi^1(\zeta) = \zeta^1 - J(K, b),\qquad \xi^2(\zeta) = \mathbb E_{\pi_{K,b}}\bigl[\psi(x, u)\bigr]\cdot\zeta^1 + \Theta_{K,b}\,\zeta^2 - \mathbb E_{\pi_{K,b}}\bigl[c(x, u)\psi(x, u)\bigr]. \tag{F.35}
\]
Then we have
\[
\bigl|\xi^1(\zeta)\bigr| = \bigl|\zeta^1 - J(K, b)\bigr| \le J(K_0, b_0), \tag{F.36}
\]
where we use the fact that, since $\zeta \in \mathcal V_\zeta$, we have $0 \le \zeta^1 \le J(K_0, b_0)$ by Definition B.7. Also, by (F.35), we have
\[
\bigl\|\xi^2(\zeta)\bigr\|_2 \le \underbrace{\bigl\|\mathbb E_{\pi_{K,b}}[\psi(x, u)]\,\zeta^1\bigr\|_2}_{B_1} + \underbrace{\|\Theta_{K,b}\|_2\cdot\|\zeta^2\|_2}_{B_2} + \underbrace{\bigl\|\mathbb E_{\pi_{K,b}}[c(x, u)\psi(x, u)]\bigr\|_2}_{B_3}. \tag{F.37}
\]
Note that we upper bound $B_1$ as
\[
B_1 \le J(K_0, b_0)\cdot\bigl\|\mathbb E_{\pi_{K,b}}[\psi(x, u)]\bigr\|_2. \tag{F.38}
\]
Following from the definition of $\psi(x, u)$ in (B.5), we know that
\[
\bigl\|\mathbb E_{\pi_{K,b}}[\psi(x, u)]\bigr\|_2 \le \|\Sigma_z\|_F, \tag{F.39}
\]
where $\Sigma_z$ is defined as
\[
\Sigma_z = \mathrm{Cov}\begin{pmatrix} x \\ u \end{pmatrix} = \begin{pmatrix} \Phi_K & -\Phi_KK^\top \\ -K\Phi_K & K\Phi_KK^\top + \sigma^2I \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & \sigma^2I \end{pmatrix} + \begin{pmatrix} I \\ -K \end{pmatrix}\Phi_K\begin{pmatrix} I \\ -K \end{pmatrix}^\top.
\]
Combining (F.38) and (F.39), we have
\[
B_1 \le J(K_0, b_0)\cdot\|\Sigma_z\|_F. \tag{F.40}
\]
By Proposition B.6, we upper bound $B_2$ as
\[
B_2 \le 4\bigl(1 + \|K\|_F^2\bigr)^3\cdot\|\Phi_K\|_2^2\cdot\bigl(M_{\zeta,1} + M_{\zeta,2}\bigr)\cdot\bigl(1 - \rho(A - BK)\bigr)^{-1}, \tag{F.41}
\]
where we use the fact that $\zeta \in \mathcal V_\zeta$ and Definition B.7. As for the term $B_3$ in (F.37), we utilize the following lemma to provide an upper bound.

Lemma F.2. The vector $\mathbb E_{\pi_{K,b}}[c(x, u)\psi(x, u)]$ takes the following form,
\[
\mathbb E_{\pi_{K,b}}\bigl[c(x, u)\psi(x, u)\bigr] = \begin{pmatrix} 2\,\mathrm{svec}\bigl(\Sigma_z\,\mathrm{diag}(Q, R)\,\Sigma_z + \bigl\langle\Sigma_z, \mathrm{diag}(Q, R)\bigr\rangle\,\Sigma_z\bigr) \\[2pt] \Sigma_z\begin{pmatrix} 2Q\mu_{K,b} \\ 2R\mu^u_{K,b} \end{pmatrix} \end{pmatrix}
\]
   +  µ ⊤ K,b Qµ K,b + ( µ u K,b ) ⊤ Rµ u K,b + µ ⊤ Qµ     sv ec(Σ z ) 0 m 0 k    . Here the matrix Σ z tak es the form of Σ z = Φ K − Φ K K ⊤ − K Φ K K Φ K K ⊤ + σ 2 · I ! . Pr o of. See § F.11 for a detailed pro o f. F rom Lemma F.2 a nd ( F.3 2 ), it holds that B 3 ≤ 3  k Q k F + k R k F + J ( K 0 , b 0 ) · k Q k 2 /σ min ( Q ) (F.42) + J ( K 0 , b 0 ) · k R k 2 /σ min ( R )  · k Σ z k 2 2 . Moreo ve r, by the definition of Σ z in ( E.25 ), com bining the triangle inequalit y , w e hav e the follo wing b ounds for k Σ z k F and k Σ z k 2 , k Σ z k F ≤ 2( d + k K k 2 F ) · k Φ K k 2 , k Σ z k 2 ≤ 2(1 + k K k 2 F ) · k Φ K k 2 . (F.43) Also, w e ha v e J ( K 0 , b 0 ) ≥ J ( K , b ) ≥ tr  ( Q + K ⊤ RK )Φ K  ≥ k Φ K k 2 · σ min ( Q ) , 71 whic h give s the upp er b ound for Φ K as follo ws, k Φ K k 2 ≤ J ( K 0 , b 0 ) /σ min ( Q ) . (F.44) Therefore, com bining ( F.37 ), ( F.40 ), ( F .41 ), ( F.4 2 ), ( F .43 ), and ( F.44 ), w e kno w that   ξ 2 ( ζ )   2 ≤ C · ( M ζ , 1 + M ζ , 2 ) · J ( K 0 , b 0 ) 2 /σ 2 min ( Q ) (F.45) ·  1 + k K k 2 F  3 ·  1 − ρ ( A − B K )  − 1 . By ( F.36 ) and ( F.45 ), we kno w that ξ ( ζ ) ∈ V ξ for any ζ ∈ V ζ . W e conclude the pro of of the lemma. F.7 Pro of of Lemma D.7 Pr o of. Assume that e z 0 ∼ N ( µ † , Σ † ). F ollo wing from t he fact that e z t +1 = L e z t + ν + δ t , it holds t ha t e z t ∼ N L t µ † + t − 1 X i =0 L i · ν, ( L ⊤ ) t Σ † L t + t − 1 X i =0 ( L ⊤ ) i Ψ δ L i ! , ( F .46) where Ψ δ = Ψ ω K Ψ ω K Ψ ω K Ψ ω K ⊤ + σ 2 I ! . F rom ( D.47 ), w e kno w that µ z tak es the form of µ z = ( I − L ) − 1 ν = ∞ X j =0 L j ν. (F.47) Therefore, com bining ( F.46 ) and ( F.4 7 ), we hav e E ( b µ z ) = µ z + 1 e T e T X t =1 L t µ † − 1 e T e T X t =1 ∞ X i = t L i ν. (F.48) W e denote by µ e T = e T X t =1 L t µ † − e T X t =1 ∞ X i = t L i ν. 
Meanwhile, it holds that
\[
\Bigl\|\sum_{t=1}^{\widetilde T}L^t\mu^\dagger - \sum_{t=1}^{\widetilde T}\sum_{i=t}^\infty L^i\nu\Bigr\|_2 \le \sum_{t=1}^{\widetilde T}\rho(L)^t\cdot\|\mu^\dagger\|_2 + \sum_{t=1}^{\widetilde T}\sum_{i=t}^\infty\rho(L)^i\cdot\|\nu\|_2 \le \bigl(1 - \rho(L)\bigr)^{-1}\cdot\|\mu^\dagger\|_2 + \bigl(1 - \rho(L)\bigr)^{-2}\cdot\|\nu\|_2 \le M_\mu\cdot(1-\rho)^{-2}\cdot\|\mu_z\|_2, \tag{F.49}
\]
where $M_\mu$ is a positive absolute constant. For the covariance, note that for any random variables $X \sim N(\mu_1, \Sigma_1)$ and $Y \sim N(\mu_2, \Sigma_2)$, we know that $Z = X + Y \sim N(\mu_1 + \mu_2, \Sigma)$, where $\|\Sigma\|_F \le 2\|\Sigma_1\|_F + 2\|\Sigma_2\|_F$. Combining (F.46), we know that $\widehat\mu_z \sim N(\mathbb E\widehat\mu_z, \widetilde\Sigma_{\widetilde T}/\widetilde T)$, where $\widetilde\Sigma_{\widetilde T}$ satisfies
\[
\widetilde T/2\cdot\|\widetilde\Sigma_{\widetilde T}\|_F \le \sum_{t=1}^{\widetilde T}\rho(L)^{2t}\cdot\|\Sigma^\dagger\|_F + \sum_{t=1}^{\widetilde T}\sum_{i=0}^{t-1}\rho(L)^{2i}\cdot\|\Psi_\delta\|_F \le \bigl(1 - \rho(L)^2\bigr)^{-1}\cdot\|\Sigma^\dagger\|_F + \widetilde T\cdot\bigl(1 - \rho(L)^2\bigr)^{-1}\cdot\|\Psi_\delta\|_F,
\]
which implies that
\[
\|\widetilde\Sigma_{\widetilde T}\|_F \le M_\Sigma\cdot(1-\rho)^{-1}\cdot\|\Sigma_z\|_F, \tag{F.50}
\]
where $M_\Sigma$ is a positive absolute constant. Combining (F.48), (F.49), and (F.50), we obtain that
\[
\widehat\mu_z \sim N\Bigl(\mu_z + \frac1{\widetilde T}\mu_{\widetilde T},\ \frac1{\widetilde T}\widetilde\Sigma_{\widetilde T}\Bigr),
\]
where $\|\mu_{\widetilde T}\|_2 \le M_\mu\cdot(1-\rho)^{-2}\cdot\|\mu_z\|_2$ and $\|\widetilde\Sigma_{\widetilde T}\|_F \le M_\Sigma\cdot(1-\rho)^{-1}\cdot\|\Sigma_z\|_F$. Moreover, by the Gaussian tail inequality, it holds with probability at least $1 - \widetilde T^{-6}$ that
\[
\|\widehat\mu_z - \mu_z\|_2 \le \frac{\log\widetilde T}{\widetilde T^{1/4}}\cdot(1-\rho)^{-2}\cdot\mathrm{poly}\bigl(\|\Phi_K\|_2, \|K\|_F, \|b\|_2, \|\mu\|_2\bigr).
\]
Then we finish the proof of the lemma.

F.8 Proof of Lemma D.8

Proof. We continue using the notation given in § D.3. We define
\[
\widehat F(\zeta, \xi) = \Bigl\{\mathbb E(\widehat\psi)\,\zeta^1 + \mathbb E\bigl[(\widehat\psi - \widehat\psi')\widehat\psi^\top\bigr]\zeta^2 - \mathbb E(c\,\widehat\psi)\Bigr\}^\top\xi^2 + \bigl(\zeta^1 - \mathbb E(c)\bigr)\cdot\xi^1 - 1/2\cdot\|\xi\|_2^2,
\]
where $\widehat\psi = \widehat\psi(x, u)$ is the estimated feature vector. Here the expectation is only taken over the trajectory generated by the state transition and the policy $\pi_{K,b}$, conditioning on the randomness induced when calculating the estimated feature vectors. Thus, the function $\widehat F(\zeta, \xi)$ is still random, where the randomness comes from the estimated feature vectors. Note that
\[
|F(\zeta, \xi) - \widetilde F(\zeta, \xi)| \le |F(\zeta, \xi) - \widehat F(\zeta, \xi)| + |\widehat F(\zeta, \xi) - \widetilde F(\zeta, \xi)|.
\]
Thus, we only need to upper bound $|F(\zeta, \xi) - \widehat F(\zeta, \xi)|$ and $|\widehat F(\zeta, \xi) - \widetilde F(\zeta, \xi)|$.

Part 1. First we upper bound $|F(\zeta, \xi) - \widehat F(\zeta, \xi)|$. Note that by algebra, we have
\[
\bigl|F(\zeta, \xi) - \widehat F(\zeta, \xi)\bigr| = \Bigl|\Bigl\{\mathbb E(\psi - \widehat\psi)\,\zeta^1 + \mathbb E\bigl[(\psi - \psi')\psi^\top - (\widehat\psi - \widehat\psi')\widehat\psi^\top\bigr]\zeta^2 - \mathbb E\bigl[c(\psi - \widehat\psi)\bigr]\Bigr\}^\top\xi^2\Bigr|
\le \mathbb E\bigl(\|\psi - \widehat\psi\|_2\bigr)\cdot\Bigl[|\zeta^1| + \mathbb E\bigl(\|\psi - \psi'\|_2 + 2\|\widehat\psi\|_2\bigr)\cdot\|\zeta^2\|_2 + \mathbb E(c)\Bigr]\cdot\|\xi^2\|_2, \tag{F.51}
\]
where the expectation is only taken over the trajectory generated by the state transition and the policy $\pi_{K,b}$. From Lemma D.7, it holds that
\[
\mathbb P\bigl(\|\widehat\mu_z - \mu_z + 1/\widetilde T\cdot\mu_{\widetilde T}\|_2 \le C_1\bigr) \ge 1 - \widetilde T^{-6}. \tag{F.52}
\]
Therefore, combining (F.52), it holds with probability at least $1 - \widetilde T^{-6}$ that
\[
\mathbb E\bigl(\|\psi - \psi'\|_2 + 2\|\widehat\psi\|_2\bigr) \le \mathrm{poly}\bigl(\|\Phi_K\|_2, \|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0)\bigr), \tag{F.53}
\]
where the expectation is conditioned on the randomness induced when calculating the estimated feature vectors. Also, we know that
\[
\mathbb E(c) \le \mathrm{poly}\bigl(\|\Phi_K\|_2, \|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0)\bigr). \tag{F.54}
\]
Therefore, combining (F.51), (F.53), (F.54), and Definition B.7, it holds with probability at least $1 - \widetilde T^{-6}$ that
\[
\bigl|F(\zeta, \xi) - \widehat F(\zeta, \xi)\bigr| \le \mathbb E\bigl(\|\psi - \widehat\psi\|_2\bigr)\cdot\mathrm{poly}\bigl(\|\Phi_K\|_2, \|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0)\bigr). \tag{F.55}
\]
Following from the definitions of $\psi(x, u)$ in (B.5) and $\widehat\psi(x, u)$ in (B.14), we upper bound $\|\psi(x, u) - \widehat\psi(x, u)\|_2$ for any $x$ and $u$ as
\[
\|\psi(x, u) - \widehat\psi(x, u)\|_2^2 = \|\widehat\mu_z - \mu_z\|_2^2 + \bigl\|z(\widehat\mu_z - \mu_z)^\top + (\widehat\mu_z - \mu_z)z^\top\bigr\|_F^2 + \|\mu_z\mu_z^\top - \widehat\mu_z\widehat\mu_z^\top\|_F^2 \le \mathrm{poly}\bigl(\|\Phi_K\|_2, \|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0)\bigr)\cdot\|\widehat\mu_z - \mu_z\|_2^2, \tag{F.56}
\]
where $\mu_z$ is defined in (D.47), $\widehat\mu_z$ is defined in (D.48), and $z = (x^\top, u^\top)^\top$. Also, by Lemma D.7, we know that
\[
\|\widehat\mu_z - \mu_z\|_2 \le \frac{\log\widetilde T}{\widetilde T^{1/4}}\cdot(1-\rho)^{-2}\cdot\mathrm{poly}\bigl(\|\Phi_K\|_2, \|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0)\bigr), \tag{F.57}
\]
which holds with probability at least $1 - \widetilde T^{-6}$. Combining (F.55), (F.56), and (F.57), it holds with probability at least $1 - \widetilde T^{-6}$ that
\[
\bigl|F(\zeta, \xi) - \widehat F(\zeta, \xi)\bigr| \le \frac{\log\widetilde T}{\widetilde T^{1/4}}\cdot(1-\rho)^{-2}\cdot\mathrm{poly}\bigl(\|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0)\bigr). \tag{F.58}
\]
Part 2. We now upper bound $|\widehat F(\zeta, \xi) - \widetilde F(\zeta, \xi)|$ in the sequel. By definitions, we have
\[
\bigl|\widetilde F(\zeta, \xi) - \widehat F(\zeta, \xi)\bigr| = \Bigl|\Bigl\{\mathbb E(\widetilde\psi - \widehat\psi)\,\zeta^1 + \mathbb E\bigl[(\widetilde\psi - \widetilde\psi')\widetilde\psi^\top - (\widehat\psi - \widehat\psi')\widehat\psi^\top\bigr]\zeta^2 - \mathbb E(\widetilde c\,\widetilde\psi - \widehat c\,\widehat\psi)\Bigr\}^\top\xi^2 + \mathbb E(\widehat c - \widetilde c)\,\xi^1\Bigr|
\le \Bigl|\Bigl\{\mathbb E(\widehat\psi)\,\zeta^1 + \mathbb E(\widehat\psi\widehat\psi^\top)\zeta^2 - \mathbb E(\widehat c\,\widehat\psi)\Bigr\}^\top\xi^2 + \mathbb E(\widehat c)\,\xi^1\Bigr|\cdot\mathbf 1_{\mathcal E^c} + \Bigl|\bigl(\mathbb E(\widehat\psi'\widehat\psi^\top)\zeta^2\bigr)^\top\xi^2\Bigr|\cdot\mathbf 1_{(\mathcal E'\cap\mathcal E)^c}, \tag{F.59}
\]
where we define the event $\mathcal E'$ as
\[
\mathcal E' = \Bigl\{\bigcap_{t\in[T]}\Bigl\{\bigl|\|z_t' - \mu_z + 1/\widetilde T\cdot\mu_{\widetilde T}\|_2^2 - \mathrm{tr}(\widetilde\Sigma_z)\bigr| \le C_1\cdot\log T\cdot\|\widetilde\Sigma_z\|_2\Bigr\}\Bigr\}\bigcap\mathcal E_2,
\]
where $\mathcal E_2$ is defined in (D.55). Combining the fact that $\mathbb P(\mathcal E_2) \ge 1 - \widetilde T^{-6}$ and Lemma G.3, it holds that $\mathbb P(\mathcal E') \ge 1 - T^{-5} - \widetilde T^{-6}$. Following a similar argument as in Part 1, it holds from (F.59) that
\[
\bigl|\widetilde F(\zeta, \xi) - \widehat F(\zeta, \xi)\bigr| \le \Bigl(\frac1T + \frac1{\widetilde T^{1/4}}\Bigr)\cdot\mathrm{poly}\bigl(\|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0)\bigr) \tag{F.60}
\]
for sufficiently large $T$ and $\widetilde T$. Now, combining (F.58) and (F.60), by the triangle inequality, it holds with probability at least $1 - \widetilde T^{-6}$ that
\[
\bigl|F(\zeta, \xi) - \widetilde F(\zeta, \xi)\bigr| \le \Bigl(\frac1{2T} + \frac{\log\widetilde T}{\widetilde T^{1/4}}\Bigr)\cdot(1-\rho)^{-2}\cdot\mathrm{poly}\bigl(\|K\|_F, \|b\|_2, \|\mu\|_2, J(K_0, b_0)\bigr).
\]
We finish the proof of the lemma.

F.9 Proof of Lemma E.2

Proof. Recall that the feature vector $\psi(x, u)$ takes the following form,
\[
\psi(x, u) = \begin{pmatrix} \mathrm{svec}\bigl((z - \mu_z)(z - \mu_z)^\top\bigr) \\ z - \mu_z \end{pmatrix}.
\]
We then have
\[
\psi(x, u) - \psi(x', u') = \begin{pmatrix} \mathrm{svec}\bigl(yy^\top - (Ly + \delta)(Ly + \delta)^\top\bigr) \\ y - (Ly + \delta) \end{pmatrix}, \tag{F.61}
\]
where we denote $y = z - \mu_z$, and $(x', u')$ is the state-action pair after $(x, u)$ following the state transition and the policy $\pi_{K,b}$. Therefore, for any symmetric matrices $M$, $N$ and any vectors $m$, $n$, it holds from (B.7) and (F.61) that
\[
\begin{pmatrix} \mathrm{svec}(M) \\ m \end{pmatrix}^\top\Theta_{K,b}\begin{pmatrix} \mathrm{svec}(N) \\ n \end{pmatrix} = \mathbb E_{y,\delta}\Biggl\{\begin{pmatrix} \mathrm{svec}(M) \\ m \end{pmatrix}^\top\begin{pmatrix} \mathrm{svec}(yy^\top) \\ y \end{pmatrix}\begin{pmatrix} \mathrm{svec}\bigl(yy^\top - (Ly + \delta)(Ly + \delta)^\top\bigr) \\ y - (Ly + \delta) \end{pmatrix}^\top\begin{pmatrix} \mathrm{svec}(N) \\ n \end{pmatrix}\Biggr\}
= \mathbb E_{y,\delta}\Bigl\{\bigl(\langle M, yy^\top\rangle + m^\top y\bigr)\cdot\bigl(\langle N, yy^\top - (Ly + \delta)(Ly + \delta)^\top\rangle + n^\top(y - Ly - \delta)\bigr)\Bigr\}
= \underbrace{\mathbb E_y\bigl[\langle yy^\top, M\rangle\cdot\langle yy^\top - Lyy^\top L^\top - \Psi_\delta, N\rangle\bigr]}_{A_1} + \underbrace{\mathbb E_y\bigl[\langle yy^\top, M\rangle\cdot n^\top(y - Ly)\bigr]}_{A_2} + \underbrace{\mathbb E_y\bigl[m^\top y\cdot\langle yy^\top - Lyy^\top L^\top - \Psi_\delta, N\rangle\bigr]}_{A_3} + \underbrace{\mathbb E_y\bigl[m^\top y\cdot n^\top(y - Ly)\bigr]}_{A_4}, \tag{F.62}
\]
where the expectations are taken over $y \sim N(0, \Sigma_z)$ and $\delta \sim N(0, \Psi_\delta)$. We evaluate the terms $A_1$, $A_2$, $A_3$, and $A_4$ in the sequel. For the terms $A_2$ and $A_3$ in (F.62), by the fact that $y = z - \mu_z \sim N(0, \Sigma_z)$, we know that these two terms vanish, since they involve only odd moments of a centered Gaussian. For $A_4$, it holds that
\[
A_4 = \mathbb E_y\bigl[m^\top y\cdot(y - Ly)^\top n\bigr] = \mathbb E_y\bigl[m^\top yy^\top(I - L)^\top n\bigr] = m^\top\Sigma_z(I - L)^\top n.
\]
(F.63)

For $A_1$, by algebra, we have
\[
A_1 = \mathbb E_y\bigl[\langle yy^\top, M\rangle\cdot\langle yy^\top - Lyy^\top L^\top - \Psi_\delta, N\rangle\bigr] = \mathbb E_y\bigl[\langle yy^\top, M\rangle\cdot\langle yy^\top - Lyy^\top L^\top, N\rangle\bigr] - \mathbb E_y\bigl[\langle yy^\top, M\rangle\bigr]\cdot\langle\Psi_\delta, N\rangle
= \mathbb E_y\bigl[y^\top My\cdot y^\top(N - L^\top NL)y\bigr] - \langle\Sigma_z, M\rangle\cdot\langle\Psi_\delta, N\rangle
= \mathbb E_{u\sim N(0,I)}\bigl[u^\top\Sigma_z^{1/2}M\Sigma_z^{1/2}u\cdot u^\top\Sigma_z^{1/2}(N - L^\top NL)\Sigma_z^{1/2}u\bigr] - \langle\Sigma_z, M\rangle\cdot\langle\Psi_\delta, N\rangle. \tag{F.64}
\]
Now, by applying Lemma G.1 to the first term on the RHS of (F.64), we know that
\[
A_1 = 2\,\mathrm{tr}\bigl(\Sigma_z^{1/2}M\Sigma_z^{1/2}\cdot\Sigma_z^{1/2}(N - L^\top NL)\Sigma_z^{1/2}\bigr) + \mathrm{tr}\bigl(\Sigma_z^{1/2}M\Sigma_z^{1/2}\bigr)\cdot\mathrm{tr}\bigl(\Sigma_z^{1/2}(N - L^\top NL)\Sigma_z^{1/2}\bigr) - \langle\Sigma_z, M\rangle\cdot\langle\Psi_\delta, N\rangle
= 2\bigl\langle M, \Sigma_z(N - L^\top NL)\Sigma_z\bigr\rangle + \langle\Sigma_z, M\rangle\cdot\bigl\langle\Sigma_z - L\Sigma_zL^\top - \Psi_\delta, N\bigr\rangle = 2\bigl\langle M, \Sigma_z(N - L^\top NL)\Sigma_z\bigr\rangle,
\]
where we use the fact that $\Sigma_z = L\Sigma_zL^\top + \Psi_\delta$ in the last equality. By using the property of the operator $\mathrm{svec}(\cdot)$ and the definition of the symmetric Kronecker product, we obtain that
\[
A_1 = 2\,\mathrm{svec}(M)^\top\mathrm{svec}\bigl(\Sigma_z(N - L^\top NL)\Sigma_z\bigr) = 2\,\mathrm{svec}(M)^\top\bigl(\Sigma_z\otimes_s\Sigma_z - (\Sigma_zL^\top)\otimes_s(\Sigma_zL^\top)\bigr)\mathrm{svec}(N) = 2\,\mathrm{svec}(M)^\top\bigl((\Sigma_z\otimes_s\Sigma_z)(I - L\otimes_sL)^\top\bigr)\mathrm{svec}(N). \tag{F.65}
\]
Combining (F.62), (F.63), and (F.65), we obtain that
\[
\begin{pmatrix} \mathrm{svec}(M) \\ m \end{pmatrix}^\top\Theta_{K,b}\begin{pmatrix} \mathrm{svec}(N) \\ n \end{pmatrix} = \mathrm{svec}(M)^\top\bigl(2(\Sigma_z\otimes_s\Sigma_z)(I - L\otimes_sL)^\top\bigr)\mathrm{svec}(N) + m^\top\Sigma_z(I - L)^\top n = \begin{pmatrix} \mathrm{svec}(M) \\ m \end{pmatrix}^\top\begin{pmatrix} 2(\Sigma_z\otimes_s\Sigma_z)(I - L\otimes_sL)^\top & 0 \\ 0 & \Sigma_z(I - L)^\top \end{pmatrix}\begin{pmatrix} \mathrm{svec}(N) \\ n \end{pmatrix}.
\]
Thus, the matrix $\Theta_{K,b}$ takes the following form,
\[
\Theta_{K,b} = \begin{pmatrix} 2(\Sigma_z\otimes_s\Sigma_z)(I - L\otimes_sL)^\top & 0 \\ 0 & \Sigma_z(I - L)^\top \end{pmatrix},
\]
which concludes the proof of the lemma.

F.10 Proof of Lemma E.3

Proof. From the definition of $\widetilde\Theta_{K,b}$ in (B.9), it holds that
\[
\|\widetilde\Theta_{K,b}^{-1}\|_2^2 \le 1 + \|\Theta_{K,b}^{-1}\|_2^2 + \|\Theta_{K,b}^{-1}\widetilde\sigma_z\|_2^2, \tag{F.66}
\]
where $\widetilde\sigma_z$ is defined as
\[
\widetilde\sigma_z = \mathbb E_{\pi_{K,b}}\bigl[\psi(x, u)\bigr] = \begin{pmatrix} \mathrm{svec}(\Sigma_z) \\ 0_{k+m} \end{pmatrix}.
\]
We bound the RHS of (F.66) in the sequel.
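The manipulations in the proofs of Lemmas E.2 and E.3 rely on the $\mathrm{svec}(\cdot)$ operator and the symmetric Kronecker product $\otimes_s$. Both are easy to implement directly from their definitions, and the identities used above can then be verified numerically. The sketch below is an ad hoc implementation (not from the paper); `svec` scales off-diagonal entries by $\sqrt2$ so that $\mathrm{svec}(M)^\top\mathrm{svec}(N) = \langle M, N\rangle$:

```python
import numpy as np
from itertools import combinations_with_replacement

def svec(S):
    """Stack the upper-triangular entries of symmetric S, off-diagonals scaled
    by sqrt(2), so that svec(M) @ svec(N) == <M, N> = tr(M N)."""
    n = S.shape[0]
    return np.array([S[i, j] * (1.0 if i == j else np.sqrt(2.0))
                     for i, j in combinations_with_replacement(range(n), 2)])

def skron(A, B):
    """Symmetric Kronecker product: (A o_s B) svec(S) = svec((B S A^T + A S B^T)/2)."""
    n = A.shape[0]
    cols = []
    for i, j in combinations_with_replacement(range(n), 2):
        E = np.zeros((n, n))
        E[i, j] = E[j, i] = 1.0 if i == j else 1.0 / np.sqrt(2.0)
        cols.append(svec(0.5 * (B @ E @ A.T + A @ E @ B.T)))
    return np.column_stack(cols)

rng = np.random.default_rng(0)
S = rng.standard_normal((3, 3)); Sigma = S @ S.T + np.eye(3)   # positive definite
L = 0.4 * rng.standard_normal((3, 3))
N = rng.standard_normal((3, 3)); N = N + N.T                    # symmetric

# Inner-product property of svec.
assert np.isclose(svec(Sigma) @ svec(N), np.trace(Sigma @ N))
# Identity behind (F.65): svec(Sigma (N - L^T N L) Sigma)
# equals (Sigma o_s Sigma - (Sigma L^T) o_s (Sigma L^T)) svec(N).
lhs = svec(Sigma @ (N - L.T @ N @ L) @ Sigma)
rhs = (skron(Sigma, Sigma) - skron(Sigma @ L.T, Sigma @ L.T)) @ svec(N)
assert np.allclose(lhs, rhs)
# Inverse property used in (F.67): (Sigma o_s Sigma)^{-1} = Sigma^{-1} o_s Sigma^{-1}.
assert np.allclose(np.linalg.inv(skron(Sigma, Sigma)),
                   skron(np.linalg.inv(Sigma), np.linalg.inv(Sigma)))
```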
For the term $\Theta_{K,b}^{-1}\widetilde\sigma_z$, combining Lemma E.2, we have
\[
\Theta_{K,b}^{-1}\widetilde\sigma_z = 1/2\cdot\begin{pmatrix} (I - L\otimes_sL)^{-\top}(\Sigma_z\otimes_s\Sigma_z)^{-1}\,\mathrm{svec}(\Sigma_z) \\ 0_{k+m} \end{pmatrix} = 1/2\cdot\begin{pmatrix} (I - L\otimes_sL)^{-\top}(\Sigma_z^{-1}\otimes_s\Sigma_z^{-1})\,\mathrm{svec}(\Sigma_z) \\ 0_{k+m} \end{pmatrix} = 1/2\cdot\begin{pmatrix} (I - L\otimes_sL)^{-\top}\,\mathrm{svec}(\Sigma_z^{-1}) \\ 0_{k+m} \end{pmatrix}, \tag{F.67}
\]
where we use the properties of the symmetric Kronecker product in the second and last equalities. By taking the spectral norm on both sides of (F.67), it holds that
\[
\|\Theta_{K,b}^{-1}\widetilde\sigma_z\|_2 = 1/2\cdot\bigl\|(I - L\otimes_sL)^{-\top}\,\mathrm{svec}(\Sigma_z^{-1})\bigr\|_2 \le 1/2\cdot\bigl\|(I - L\otimes_sL)^{-\top}\bigr\|_2\cdot\bigl\|\mathrm{svec}(\Sigma_z^{-1})\bigr\|_2 \le 1/2\cdot\bigl(1 - \rho^2(L)\bigr)^{-1}\cdot\|\Sigma_z^{-1}\|_F \le 1/2\cdot\sqrt{k + m}\cdot\bigl(1 - \rho^2(L)\bigr)^{-1}\cdot\|\Sigma_z^{-1}\|_2 = 1/2\cdot\sqrt{k + m}\cdot\bigl(1 - \rho^2(L)\bigr)^{-1}\cdot\sigma_{\min}^{-1}(\Sigma_z), \tag{F.68}
\]
where in the second inequality we apply Lemma G.2 to the matrix $L\otimes_sL$. Similarly, we upper bound $\|\Theta_{K,b}^{-1}\|_2$ as
\[
\|\Theta_{K,b}^{-1}\|_2 \le \min\Bigl\{1/2\cdot\bigl(1 - \rho^2(L)\bigr)^{-1}\sigma_{\min}^{-2}(\Sigma_z),\ \bigl(1 - \rho(L)\bigr)^{-1}\sigma_{\min}^{-1}(\Sigma_z)\Bigr\}. \tag{F.69}
\]
Thus, combining (F.66), (F.68), and (F.69), we obtain that
\[
\|\widetilde\Theta_{K,b}^{-1}\|_2^2 \le 1 + 1/2\cdot\sqrt{k + m}\cdot\bigl(1 - \rho^2(L)\bigr)^{-1}\cdot\sigma_{\min}^{-1}(\Sigma_z) + \min\Bigl\{1/2\cdot\bigl(1 - \rho^2(L)\bigr)^{-1}\sigma_{\min}^{-2}(\Sigma_z),\ \bigl(1 - \rho(L)\bigr)^{-1}\sigma_{\min}^{-1}(\Sigma_z)\Bigr\}. \tag{F.70}
\]
Now it remains to characterize $\sigma_{\min}(\Sigma_z)$. For any vectors $s \in \mathbb R^m$ and $r \in \mathbb R^k$, we have
\[
\begin{pmatrix} s \\ r \end{pmatrix}^\top\Sigma_z\begin{pmatrix} s \\ r \end{pmatrix} = \mathbb E_{x\sim N(\mu_{K,b},\Phi_K),\,u\sim\pi_{K,b}(\cdot\mid x)}\Bigl[\bigl(s^\top(x - \mu_{K,b}) + r^\top(u + K\mu_{K,b} - b)\bigr)^2\Bigr] = \mathbb E_{x\sim N(\mu_{K,b},\Phi_K),\,\eta\sim N(0,I)}\Bigl[\bigl((s - K^\top r)^\top(x - \mu_{K,b}) + \sigma r^\top\eta\bigr)^2\Bigr] = \mathbb E_{x\sim N(\mu_{K,b},\Phi_K)}\Bigl[\bigl((s - K^\top r)^\top(x - \mu_{K,b})\bigr)^2\Bigr] + \mathbb E_{\eta\sim N(0,I)}\bigl[(\sigma r^\top\eta)^2\bigr]. \tag{F.71}
\]
The first term on the RHS of (F.71) is lower bounded as
\[
\mathbb E_{x\sim N(\mu_{K,b},\Phi_K)}\Bigl[\bigl((s - K^\top r)^\top(x - \mu_{K,b})\bigr)^2\Bigr] = (s - K^\top r)^\top\Phi_K(s - K^\top r) \ge \|s - K^\top r\|_2^2\cdot\sigma_{\min}(\Phi_K) \ge \|s - K^\top r\|_2^2\cdot\sigma_{\min}(\Psi_\omega), \tag{F.72}
\]
where the last inequality comes from the fact that $\sigma_{\min}(\Phi_K) \ge \sigma_{\min}(\Psi_\omega)$ by (3.3). The second term on the RHS of (F.71) takes the form
\[
\mathbb E_{\eta\sim N(0,I)}\bigl[(\sigma r^\top\eta)^2\bigr] = \sigma^2\|r\|_2^2. \tag{F.73}
\]
Therefore, combining (F.71), (F.72), and (F.73), we have
\[
\begin{pmatrix} s \\ r \end{pmatrix}^\top\Sigma_z\begin{pmatrix} s \\ r \end{pmatrix} \ge \|s - K^\top r\|_2^2\cdot\sigma_{\min}(\Psi_\omega) + \sigma^2\|r\|_2^2 \ge \sigma_{\min}(\Psi_\omega)\cdot\|s\|_2^2 + \bigl(\sigma^2 - \|K\|_2^2\cdot\sigma_{\min}(\Psi_\omega)\bigr)\cdot\|r\|_2^2.
\]
From this, we know that
\[
\sigma_{\min}(\Sigma_z) \ge \min\bigl\{\sigma_{\min}(\Psi_\omega),\ \sigma^2 - \|K\|_2^2\cdot\sigma_{\min}(\Psi_\omega)\bigr\}. \tag{F.74}
\]
Thus, combining (F.70) and (F.74), we know that $\|\widetilde\Theta_{K,b}^{-1}\|_2$ is upper bounded by a constant $\widetilde\lambda_K$, where $\widetilde\lambda_K$ only depends on $\|K\|_2$ and $\rho(L) = \rho(A - BK)$. This finishes the proof of the lemma.

F.11 Proof of Lemma F.2

Proof. First, note that the cost function $c(x, u)$ takes the following form,
\[
c(x, u) = \psi(x, u)^\top\begin{pmatrix} \mathrm{svec}\bigl(\mathrm{diag}(Q, R)\bigr) \\ 2Q\mu_{K,b} \\ 2R\mu^u_{K,b} \end{pmatrix} + \bigl(\mu_{K,b}^\top Q\mu_{K,b} + (\mu^u_{K,b})^\top R\mu^u_{K,b} + \mu^\top Q\mu\bigr).
\]
For any matrix $V$ and vectors $v_x$, $v_u$, it holds that
\[
\mathbb E_{\pi_{K,b}}\bigl[c(x, u)\psi(x, u)\bigr]^\top\begin{pmatrix} \mathrm{svec}(V) \\ v_x \\ v_u \end{pmatrix} = \underbrace{\mathbb E_{\pi_{K,b}}\Biggl[\psi(x, u)^\top\begin{pmatrix} \mathrm{svec}(\mathrm{diag}(Q, R)) \\ 2Q\mu_{K,b} \\ 2R\mu^u_{K,b} \end{pmatrix}\cdot\psi(x, u)^\top\begin{pmatrix} \mathrm{svec}(V) \\ v_x \\ v_u \end{pmatrix}\Biggr]}_{D_1} + \underbrace{\mathbb E_{\pi_{K,b}}\Biggl[\psi(x, u)^\top\bigl(\mu_{K,b}^\top Q\mu_{K,b} + (\mu^u_{K,b})^\top R\mu^u_{K,b} + \mu^\top Q\mu\bigr)\begin{pmatrix} \mathrm{svec}(V) \\ v_x \\ v_u \end{pmatrix}\Biggr]}_{D_2}. \tag{F.75}
\]
In the sequel, we calculate $D_1$ and $D_2$ respectively.

Calculation of $D_1$. Note that by the definition of $\psi(x, u)$ in (B.5), it holds that
\[
D_1 = \mathbb E_{\pi_{K,b}}\Biggl[\Biggl((z - \mu_z)^\top\mathrm{diag}(Q, R)(z - \mu_z) + (z - \mu_z)^\top\begin{pmatrix} 2Q\mu_{K,b} \\ 2R\mu^u_{K,b} \end{pmatrix}\Biggr)\cdot\Biggl((z - \mu_z)^\top V(z - \mu_z) + (z - \mu_z)^\top\begin{pmatrix} v_x \\ v_u \end{pmatrix}\Biggr)\Biggr]
= \mathbb E_{\pi_{K,b}}\bigl[(z - \mu_z)^\top\mathrm{diag}(Q, R)(z - \mu_z)\cdot(z - \mu_z)^\top V(z - \mu_z)\bigr] + \mathbb E_{\pi_{K,b}}\Biggl[\begin{pmatrix} 2Q\mu_{K,b} \\ 2R\mu^u_{K,b} \end{pmatrix}^\top(z - \mu_z)(z - \mu_z)^\top\begin{pmatrix} v_x \\ v_u \end{pmatrix}\Biggr]. \tag{F.76}
\]
Here $z = (x^\top, u^\top)^\top$ and $\mu_z = \mathbb E_{\pi_{K,b}}(z)$. For the first term on the RHS of (F.76), note that $z - \mu_z \sim N(0, \Sigma_z)$. Therefore, by Lemma G.1, we obtain that
\[
\mathbb E_{\pi_{K,b}}\bigl[(z - \mu_z)^\top\mathrm{diag}(Q, R)(z - \mu_z)\cdot(z - \mu_z)^\top V(z - \mu_z)\bigr] = 2\bigl\langle\Sigma_z\,\mathrm{diag}(Q, R)\,\Sigma_z, V\bigr\rangle + \bigl\langle\Sigma_z, \mathrm{diag}(Q, R)\bigr\rangle\cdot\langle\Sigma_z, V\rangle = \mathrm{svec}\Bigl[2\Sigma_z\,\mathrm{diag}(Q, R)\,\Sigma_z + \bigl\langle\Sigma_z, \mathrm{diag}(Q, R)\bigr\rangle\cdot\Sigma_z\Bigr]^\top\mathrm{svec}(V). \tag{F.77}
\]
Meanwhile, the second term on the RHS of (F.76) takes the form
\[
\mathbb E_{\pi_{K,b}}\Biggl[\begin{pmatrix} 2Q\mu_{K,b} \\ 2R\mu^u_{K,b} \end{pmatrix}^\top(z - \mu_z)(z - \mu_z)^\top\begin{pmatrix} v_x \\ v_u \end{pmatrix}\Biggr] = \Biggl[\Sigma_z\begin{pmatrix} 2Q\mu_{K,b} \\ 2R\mu^u_{K,b} \end{pmatrix}\Biggr]^\top\begin{pmatrix} v_x \\ v_u \end{pmatrix}. \tag{F.78}
\]
Combining (F.76), (F.77), and (F.78), we obtain that
\[
D_1 = \begin{pmatrix} 2\,\mathrm{svec}\bigl(\Sigma_z\,\mathrm{diag}(Q, R)\,\Sigma_z + \langle\Sigma_z, \mathrm{diag}(Q, R)\rangle\,\Sigma_z\bigr) \\[2pt] \Sigma_z\begin{pmatrix} 2Q\mu_{K,b} \\ 2R\mu^u_{K,b} \end{pmatrix} \end{pmatrix}^\top\begin{pmatrix} \mathrm{svec}(V) \\ v_x \\ v_u \end{pmatrix}. \tag{F.79}
\]
Calculation of $D_2$. By the definition of the feature vector $\psi(x, u)$ in (B.5), we know that
\[
D_2 = \bigl(\mu_{K,b}^\top Q\mu_{K,b} + (\mu^u_{K,b})^\top R\mu^u_{K,b} + \mu^\top Q\mu\bigr)\begin{pmatrix} \mathrm{svec}(\Sigma_z) \\ 0_m \\ 0_k \end{pmatrix}^\top\begin{pmatrix} \mathrm{svec}(V) \\ v_x \\ v_u \end{pmatrix}. \tag{F.80}
\]
Now, combining (F.75), (F.79), and (F.80), it holds that
\[
\mathbb E_{\pi_{K,b}}\bigl[c(x, u)\psi(x, u)\bigr] = \begin{pmatrix} 2\,\mathrm{svec}\bigl(\Sigma_z\,\mathrm{diag}(Q, R)\,\Sigma_z + \langle\Sigma_z, \mathrm{diag}(Q, R)\rangle\,\Sigma_z\bigr) \\[2pt] \Sigma_z\begin{pmatrix} 2Q\mu_{K,b} \\ 2R\mu^u_{K,b} \end{pmatrix} \end{pmatrix} + \bigl(\mu_{K,b}^\top Q\mu_{K,b} + (\mu^u_{K,b})^\top R\mu^u_{K,b} + \mu^\top Q\mu\bigr)\begin{pmatrix} \mathrm{svec}(\Sigma_z) \\ 0_m \\ 0_k \end{pmatrix},
\]
which concludes the proof of the lemma.

G Auxiliary Results

Lemma G.1.
Assume that the random variable $w \sim N(0, I)$, and let $U$ and $V$ be two symmetric matrices. Then it holds that
\[
\mathbb E\bigl(w^\top Uw\cdot w^\top Vw\bigr) = 2\,\mathrm{tr}(UV) + \mathrm{tr}(U)\cdot\mathrm{tr}(V).
\]
Proof. See Magnus et al. (1978) and Magnus (1979) for a detailed proof.

Lemma G.2. Let $M$, $N$ be commuting symmetric matrices, and let $\alpha_1, \ldots, \alpha_n$ and $\beta_1, \ldots, \beta_n$ denote their eigenvalues, with $v_1, \ldots, v_n$ a common basis of orthogonal eigenvectors. Then the $n(n+1)/2$ eigenvalues of $M\otimes_sN$ are given by $(\alpha_i\beta_j + \alpha_j\beta_i)/2$, where $1 \le i \le j \le n$.

Proof. See Lemma 2 in Alizadeh et al. (1998) for a detailed proof.

Lemma G.3. For any integer $m > 0$, let $A \in \mathbb R^{m\times m}$ and $\eta \sim N(0, I_m)$. Then there exists some absolute constant $C > 0$ such that for any $t \ge 0$, we have
\[
\mathbb P\Bigl[\bigl|\eta^\top A\eta - \mathbb E(\eta^\top A\eta)\bigr| > t\Bigr] \le 2\cdot\exp\Bigl[-C\cdot\min\bigl(t^2\|A\|_F^{-2},\ t\|A\|_2^{-1}\bigr)\Bigr].
\]
Proof. See Rudelson et al. (2013) for a detailed proof.
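The Gaussian fourth-moment identity of Lemma G.1 is easy to validate by simulation. The sketch below uses arbitrary positive semidefinite instances of $U$ and $V$ (so the right-hand side is comfortably positive), and the tolerance reflects Monte Carlo error:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Two arbitrary symmetric PSD matrices.
A = rng.standard_normal((n, n)); U = A.T @ A
B = rng.standard_normal((n, n)); V = B.T @ B

# Monte Carlo estimate of E[w'Uw * w'Vw] for w ~ N(0, I).
W = rng.standard_normal((500_000, n))
lhs = np.mean(np.einsum('ti,ij,tj->t', W, U, W) *
              np.einsum('ti,ij,tj->t', W, V, W))
rhs = 2 * np.trace(U @ V) + np.trace(U) * np.trace(V)

# The estimate should agree with 2 tr(UV) + tr(U) tr(V) up to sampling error.
assert rhs > 0
assert abs(lhs - rhs) / rhs < 0.05
```

The identity follows from Isserlis' theorem for Gaussian fourth moments, $\mathbb E[w_aw_bw_cw_d] = \delta_{ab}\delta_{cd} + \delta_{ac}\delta_{bd} + \delta_{ad}\delta_{bc}$, contracted against $U_{ab}V_{cd}$.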
