Actor-Critic Algorithms for Learning Nash Equilibria in N-player General-Sum Games
Authors: H. L. Prasad, L. A. Prashanth, Shalabh Bhatnagar
Prasad H. L.∗1, Prashanth L. A.†2 and Shalabh Bhatnagar‡3

1 Astrome Technologies Pvt Ltd, Bangalore, India
2 Institute for Systems Research, University of Maryland, College Park, US
3 Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India

Abstract

We consider the problem of finding stationary Nash equilibria (NE) in a finite discounted general-sum stochastic game. We first generalize a non-linear optimization problem from Filar and Vrieze [2004] to an N-player setting and break down this problem into simpler sub-problems that ensure there is no Bellman error for a given state and agent. We then provide a characterization of solution points of these sub-problems that correspond to Nash equilibria of the underlying game and, for this purpose, derive a set of necessary and sufficient SG-SP (Stochastic Game - Sub-Problem) conditions. Using these conditions, we develop two actor-critic algorithms: OFF-SGSP (model-based) and ON-SGSP (model-free). Both algorithms use a critic that estimates the value function for a fixed policy and an actor that performs descent in the policy space using a descent direction that avoids local minima. We establish that both algorithms converge, in self-play, to the equilibria of a certain ordinary differential equation (ODE), whose stable limit points coincide with stationary NE of the underlying general-sum stochastic game. On a single-state non-generic game (see Hart and Mas-Colell [2005]) as well as on a synthetic two-player game setup with 810,000 states, we establish that ON-SGSP consistently outperforms the NashQ [Hu and Wellman, 2003] and FFQ [Littman, 2001] algorithms.

Keywords: General-sum discounted stochastic games, Nash equilibrium, multi-agent reinforcement learning, multi-timescale stochastic approximation.
∗ prasad@astrome.co
† prashla@umd.edu
‡ shalabh@csa.iisc.ernet.in

1 Introduction

Traditional game-theoretic developments have been for single-shot games, where all agents participate, perform their preferred actions, receive rewards and the game is over. However, several emerging applications have the concept of multiple stages of action, or most often the concept of time, in them. One intermediate class of games that handles multiple stages of decision is called repeated games. However, repeated games do not provide for characterizing the influence of decisions made in one stage on future stages. Markov chains (or Markov processes) are a popular and widely applicable class of random processes used for modeling practical systems. Markov chains allow system designers to model the states of a given system and then model its time behavior by connecting those states via suitable probabilities of transition from the current state to a next state. A popular extension of Markov chains, used for modeling optimal control scenarios, is the class of Markov Decision Processes (MDPs). Here, in a given state, an action is allowed to be selected from a set of available actions. Based on the choice of action, a suitable reward/cost is obtained/incurred. Also, the selected action influences the probabilities of transition from one state to another. Shapley [1953] merged these concepts of MDPs (or basically Markov behavior) and games to come up with a new class of games called stochastic games. In a stochastic game, all participating agents select their actions, each of which influences the rewards received by all agents as well as the transition probabilities.

[Figure 1: Multi-agent RL setting - N agents interact with a common environment; the joint action a = (a^1, a^2, ..., a^N) yields the reward vector r = (r^1, r^2, ..., r^N) and the next state y.]
Since the inception of stochastic games by Shapley [1953], they have been an important class of models for multi-agent systems. A comprehensive treatment of stochastic games under various pay-off criteria is given by Filar and Vrieze [2004]. Many problems like fishery games, advertisement games and several oligopolistic situations can be modelled as stochastic games [Breton, 1986, Filar and Vrieze, 2004, Olley, 1992, Pakes, 1998, Pakes, 2001, Bergemann, 1996].

We consider a finite stochastic game (also referred to as a Markov game, cf. Littman [2001]) setting that evolves over discrete time instants. As illustrated in Fig. 1, at each stage and in any given state x ∈ X, all agents act simultaneously with an action vector a ∈ A(x), resulting in a transition to the next state y ∈ X according to the transition probability p(y | x, a), as well as a reward vector r(x, a). No agent gets to know what the other agents' actions are before selecting its own action, and the reward r^i(x, a) obtained by any agent i in each stage depends on both the system state x (common to all agents) and the aggregate action a (which includes other agents' actions). Each individual agent's sole objective is maximization of his/her value function, i.e., the expected discounted sum of rewards. However, the transition dynamics as well as the rewards depend on the actions of all agents and hence the dynamics of the game is coupled and not independent. We assume that r(x, a) and the action vector a are made known to all agents after every agent i has picked his/her action a^i - this is the canonical model-free setting¹. However, we do not assume that each agent knows the other agents' policies, i.e., the distributions from which the actions are picked. The central concept of stability in a stochastic game is that of a Nash equilibrium.
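Concretely, one stage of this interaction can be sketched as follows. The game tables below are randomly generated, illustrative stand-ins (not an instance from this paper): both agents act simultaneously, and the environment returns the reward vector and a next state drawn from the transition probabilities.

```python
import numpy as np

# Minimal sketch of one stage of the game described above, for a randomly
# generated two-player instance (all tables here are illustrative stand-ins).
rng = np.random.default_rng(0)
n_states, n_agents, n_actions = 2, 2, 2

# p[x, a1, a2] is a distribution over next states y; r[x, a1, a2] is the
# reward vector (one entry per agent) for joint action (a1, a2) in state x.
p = rng.random((n_states, n_actions, n_actions, n_states))
p /= p.sum(axis=-1, keepdims=True)
r = rng.random((n_states, n_actions, n_actions, n_agents))

def step(x, a):
    """All agents act simultaneously; the environment returns the reward
    vector r(x, a) and a next state y drawn from p(. | x, a)."""
    a1, a2 = a
    y = rng.choice(n_states, p=p[x, a1, a2])
    return r[x, a1, a2], int(y)

rewards, y = step(0, (1, 0))   # joint action: agent 1 plays 1, agent 2 plays 0
```

Note that, as in the model-free setting above, each agent observes the realized joint action and reward vector after the stage, but not the distributions from which the other agents drew their actions.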
At a Nash equilibrium point (with a corresponding Nash strategy), each agent plays a best-response strategy, assuming all the other agents play their equilibrium strategies (see Definition 2 for a precise statement). This notion of equilibrium makes perfect sense in a game setting, where agents do not have any incentive to unilaterally deviate from their Nash strategies. Breton [1991] and Filar and Vrieze [2004] establish that finding the stationary NE of a two-player discounted stochastic game is equivalent to solving an optimization problem with a non-linear objective function and linear constraints. We extend this formulation to general N-player stochastic games and observe that this generalization causes the constraints to be non-linear as well. Previous approaches to solving the optimization problem have not been able to guarantee convergence to global minima, even for the case of N = 2. In this light, our contribution is significant, as we develop an algorithm to find a global minimum for any N ≥ 2 via the following steps:²

Step 1 (Sub-problems): We break down the main optimization problem into several sub-problems. Each sub-problem can be seen as ensuring no Bellman error for a particular state x ∈ X and agent i ∈ {1, ..., N}, where X is the state space and N is the number of agents of the stochastic game considered.

¹ While the ON-SGSP algorithm that we propose is for this setting, we also propose another algorithm - OFF-SGSP - that is model-based.
² A preliminary version of this paper, without the proofs, was published in AAMAS 2015 - see Prasad et al. [2015]. In comparison to the conference version, this paper includes a more detailed problem formulation, formal proofs of convergence of the two proposed algorithms, some additional experiments and a revised presentation.
Step 2 (Solution points): We provide a characterization of solution points that correspond to Nash equilibria of the underlying game. As a result, we also derive a set of necessary and sufficient conditions, henceforth referred to as SG-SP (Stochastic Game - Sub-Problem) conditions.

Step 3 (Descent direction): Using SG-SP conditions, we derive a descent direction that avoids local minima. This is not a steepest descent direction, but a carefully chosen descent direction, specific to this optimization problem, which ensures convergence only to points of global minima that correspond to SG-SP points (and hence Nash strategies).

Step 4 (Actor-critic algorithms): We propose algorithms that incorporate the aforementioned descent direction to ensure convergence to stationary NE of the underlying game. The algorithms that we propose are as follows:

OFF-SGSP. This is an offline, centralized and model-based scheme, i.e., it assumes that the transition structure of the underlying game is known.

ON-SGSP. This is an online, model-free scheme that is decentralized, i.e., learning is localized to each agent, with one instance of ON-SGSP running on each agent. ON-SGSP only requires that other agents' actions and rewards are observed, and not their policies, i.e., maps from states to actions.

We make the assumption that, for all strategies, the resulting Markov chain is irreducible and positive recurrent. This assumption is common to the analysis of previous multi-agent RL algorithms as well (cf. Hu and Wellman [1999], Littman [2001])³. To the best of our knowledge, ON-SGSP is the first model-free online algorithm that converges in self-play to stationary NE for any finite discounted general-sum stochastic game where the aforementioned assumption holds.
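For a given strategy, the irreducibility part of this assumption can be checked on the induced transition matrix p(y | x, π). The sketch below is ours, not from the paper: it uses the standard fact that a finite chain with n states and transition matrix P is irreducible iff (I + P)^(n-1) is entrywise positive.

```python
import numpy as np

def is_irreducible(P):
    """True iff the finite Markov chain with transition matrix P is
    irreducible, i.e., every state is reachable from every other state.
    For an n-state chain this holds iff (I + P)^(n-1) is entrywise positive."""
    n = P.shape[0]
    reach = np.linalg.matrix_power(np.eye(n) + P, n - 1)
    return bool((reach > 0).all())

# A two-state chain that always switches state is irreducible (though periodic),
assert is_irreducible(np.array([[0.0, 1.0], [1.0, 0.0]]))
# while a chain with an absorbing state that cannot be left is not.
assert not is_irreducible(np.array([[1.0, 0.0], [0.5, 0.5]]))
```

Since the state space is finite, irreducibility already implies positive recurrence, so a single reachability check of this kind covers the assumption for a given strategy.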
As suggested by Bowling and Veloso [2001], two desirable properties of any multi-agent learning algorithm are as follows: (a) Rationality⁴: learn to play optimally when other agents follow stationary strategies; and (b) Self-play convergence: converge to a Nash equilibrium, assuming all agents are using the same learning algorithm. Our ON-SGSP algorithm can be seen to meet both the properties mentioned above. However, unlike the repeated-game setting of [Bowling and Veloso, 2001, Conitzer and Sandholm, 2007], ON-SGSP solves discounted general-sum stochastic games and possesses theoretical convergence guarantees as well.

As illustrated in Fig. 2, the basic idea in both OFF-SGSP and ON-SGSP is to employ two recursions, referred to as the actor and the critic, respectively. Conceptually, these can be seen as two nested loops that operate as follows:

Critic recursion: This performs policy evaluation, i.e., estimates the value function for a fixed policy. In the model-based setting (i.e., of OFF-SGSP), this corresponds to the well-known dynamic programming procedure - value iteration. In the model-free setting (i.e., of ON-SGSP), the critic is based on temporal difference (TD) learning [Sutton and Barto, 1998].

Actor recursion: This incrementally updates the policy using gradient descent. For this purpose, it uses a descent direction that ensures convergence to a global minimum (and hence NE) of the optimization problem we mentioned earlier.

³ For the case of stochastic games where there are multiple communicating classes of states or even transient states, a possible work-around is to re-start the game periodically in a random state.
⁴ The term rationality is not to be confused with its common interpretation in economics parlance.
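The interplay between the two recursions rests on multi-timescale stochastic approximation: the critic uses larger step-sizes than the actor, so the value estimate effectively equilibrates between policy moves. The toy recursion below is illustrative only (it is not the SGSP update itself): the fast iterate w tracks the slow iterate theta, which in turn drifts to where the coupled system is at rest.

```python
# Toy two-timescale recursion: w plays the "critic" (fast), theta the "actor"
# (slow).  The fast update drives w toward theta; the slow update moves theta
# until the residual (1 - w) vanishes.  Step-sizes satisfy c_n / b_n -> 0.
w, theta = 0.0, 0.0
for n in range(1, 20000):
    b_n = 1.0 / n ** 0.6      # fast (critic) step-size
    c_n = 1.0 / n             # slow (actor) step-size
    w += b_n * (theta - w)    # critic: track the quantity induced by theta
    theta += c_n * (1.0 - w)  # actor: update using the critic's estimate
# Both iterates settle at the coupled equilibrium w = theta = 1.
```

Because b_n dominates c_n asymptotically, the fast iterate sees the slow one as quasi-static, which is exactly the nested two-loop behavior described above.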
[Figure 2: Operational flow of our algorithms - the critic (policy evaluation) supplies the value v^i_π to the actor (policy improvement), which feeds the updated policy π^i back to the critic.]

Using multi-timescale stochastic approximation (see Chapter 6 of Borkar [2008]), both the recursions above are run simultaneously, albeit with varying step-size parameters, and this mimics the nested two-loop procedure outlined above. The formal proof of convergence requires considerable sophistication, as we base our approach on the ordinary differential equations (ODE) method for stochastic approximation [Borkar, 2008]. While a few previous papers in the literature have adopted this approach (cf. Akchurina [2009], Weibull [1996]), their results do not usually start with an algorithm that is shown to track an ODE. Instead, an ODE is reached first via analysis and an approximate method is used to solve this ODE. On the other hand, we adopt the former approach and show that both OFF-SGSP and ON-SGSP converge using the following steps:

1. Using two-timescale stochastic approximation, we show that the value and policy updates, on the fast and slow timescales, converge respectively to the limit points of a system of ODEs.

2. Next, we provide a simplified representation for the limiting set of the policy ODE and use this to establish that the asymptotically stable limit points of the policy ODE correspond to SG-SP (and hence Nash) points.

While the first step above uses a well-known result (the Kushner-Clark lemma) for analysing stochastic approximation recursions, the techniques used in the second step are quite different from those used previously. The latter step is crucial in establishing overall convergence, as the strategy π corresponding to each stable limit gives a stationary NE of the underlying general-sum discounted stochastic game.

We demonstrate the practicality of our algorithms on two synthetic two-player setups.
The first is a single-state non-generic game, adopted from Hart and Mas-Colell [2005], that contains two NEs (one pure, the other mixed), while the second is a stick-together game with 810,000 states (to the best of our knowledge, previous works on general-sum games have never considered state spaces of this size). On the first setup, we show that ON-SGSP always converges to NE, while NashQ [Hu and Wellman, 2003] and FFQ [Littman, 2001] do not in a significant number of experimental runs. On the second setup, we show that ON-SGSP outperforms NashQ and FFQ, while exhibiting a relatively quick convergence rate - requiring approximately 21 iterations per state.

Map of the paper. Section 2 reviews relevant previous works in the literature. Section 3 formalizes a general-sum stochastic game and sets the notation used throughout the paper. Section 4 formulates a general non-linear optimization problem for solving stochastic games and also forms sub-problems whose solutions correspond to Nash strategies. Section 5 presents necessary and sufficient SG-SP conditions for Nash equilibria of general-sum stochastic games. Section 6 presents the offline algorithm OFF-SGSP, while Section 7 provides its online counterpart ON-SGSP. Section 8 sketches the proof of convergence for both the algorithms. Simulation results for the single-state non-generic game and the stick-together game settings are presented in Section 9. Finally, concluding remarks are provided in Section 10.

2 Related Work

Various approaches have been proposed in the literature for computing Nash equilibria of general-sum discounted stochastic games and we discuss some of them below.

Multi-agent RL. Littman [1994] proposed a minimax Q-learning algorithm for two-player zero-sum stochastic games.
Hu and Wellman [1999, 2003] extended the Q-learning approach to general-sum games, but their algorithms do not provide meaningful convergence guarantees. Friend-or-foe Q-learning (FFQ) [Littman, 2001] is a further improvement based on Q-learning and with guaranteed convergence. However, FFQ converges to Nash equilibria only in restricted settings (see conditions A and B in [Littman, 2001]). Moreover, the approaches in [Hu and Wellman, 1999, 2003] require computation of Nash equilibria of a bimatrix game, while the approach of Littman [2001] requires solving a linear program, in each round of their algorithms, and this is a computationally expensive operation. In contrast, ON-SGSP does not require any such equilibria computations. Zinkevich et al. [2006] show that the traditional Q-learning based approaches are not sufficient to compute Nash equilibria in general-sum games⁵.

Policy hill climbing. This is a category of previous works that is closely related to the ON-SGSP algorithm that we propose. Important references here include Bowling and Veloso [2001], Bowling [2005], Conitzer and Sandholm [2007], as well as Zhang and Lesser [2010]. All these algorithms are gradient-based, model-free and are proven to converge to NE for stationary opponents in self-play. However, these convergence guarantees are for repeated games only, i.e., the setting is a single-state stochastic game, where the objective is to learn the Nash strategy for a stage-game (see Definition 1 in [Conitzer and Sandholm, 2007]) that is repeatedly played. On the other hand, we consider general-sum stochastic games, where the objective is to learn the best-response strategy against stationary opponents in order to maximize the value function (which is an infinite-horizon discounted sum). Further, we work with a more general state space that is not restricted to be a singleton.

Homotopy.
Herings and Peeters [2004] propose an algorithm where a homotopic path between equilibrium points of N independent MDPs and the N-player stochastic game in question is traced numerically. This, in turn, gives a Nash equilibrium point of the stochastic game of interest. This approach is extended by Herings and Peeters [2006] and Borkovsky et al. [2010]. OFF-SGSP shares similarities with the aforementioned homotopic algorithms in the sense that (i) both are offline and model-based, as they assume complete information (esp. the transition dynamics) about the game; and (ii) the computational complexity of each iteration of both algorithms grows exponentially with the number of agents N. (iii) Further, both algorithms are proven to converge to stationary NE, though the approaches adopted are vastly different: OFF-SGSP is a gradient descent algorithm designed to converge to the global minimum of a nonlinear program, while the algorithm by Herings and Peeters [2004] involves a tracing procedure to find an equilibrium point.

Linear programming. Mac Dermed and Isbell [2009] solve stochastic games by formulating intermediate optimization problems, called Multi-Objective Linear Programs (MOLPs). However, the solution concept there is that of correlated equilibria, and Nash points are a strict subset of this class (and hence are harder to find). Also, the complexity of their algorithm scales exponentially with the problem size. Both the homotopy and linear programming methods proposed by Herings and Peeters [2004] and Mac Dermed and Isbell [2009] are tractable only for small-sized problems. The computational complexity of these algorithms may render them infeasible on large state games.
In contrast, ON-SGSP is a model-free algorithm with a per-iteration complexity that is linear in N, allowing for practical implementations on large state-game settings (see Section 9 for one such example with a state space cardinality of 810,000). We however mention that per-iteration complexity alone is not sufficient to quantify the performance of an algorithm - see Remark 7.

⁵ We avoid this impossibility result by searching for both values and policies, instead of just values, in our proposed algorithms.

Rational learning. A popular algorithm with guaranteed convergence to Nash equilibria in general-sum stochastic games is rational learning, proposed by Kalai and Lehrer [1993]. In their algorithm, each agent i maintains a prior on what he believes to be the other agents' strategy and updates it in a Bayesian manner. Combining this with certain assumptions of absolute continuity and grain of truth, the algorithm there is shown to converge to NE. ON-SGSP operates in a similar setting as that in [Kalai and Lehrer, 1993], except that we do not assume knowledge of the reward functions. ON-SGSP is a model-free online algorithm and, unlike [Kalai and Lehrer, 1993], any agent's strategy in ON-SGSP does not depend upon Bayesian estimates of other agents' strategies; hence, their absolute continuity/grain of truth assumptions do not apply.

Evolutionary algorithms. Akchurina [2009] employs numerical methods in order to solve a system of ODEs and only establishes empirical convergence to NE for a group of randomly generated games. In contrast, ON-SGSP is a model-free algorithm that is provably convergent to NE in self-play. We also note that the system of ODEs given by Akchurina [2009] (also found in [Weibull, 1996, pp. 189]) turns out to be similar to a portion of the ODEs that are tracked by ON-SGSP.

Remark 1. Shoham et al. [2003] and Shoham et al.
[2007] question whether Nash equilibrium is a useful solution concept for general-sum games. However, if we are willing to concede that the prescriptive, equilibrium agenda is indeed useful for stochastic games, then we believe our work is theoretically significant. Our ON-SGSP algorithm is a prescriptive, co-operative learning algorithm that observes a sample path from the underlying game and converges to stationary NE. To the best of our knowledge, this is the first algorithm to do so, with proven convergence.

3 Formal Definitions

A stochastic game can be seen as an extension of the single-agent Markov decision process. A discounted reward stochastic game is described by a tuple <N, X, A, p, r, β>, where N represents the number of agents, X denotes the state space and A = ∪_{x∈X} A(x) is the aggregate action space, where A(x) = ∏_{i=1}^{N} A^i(x) is the Cartesian product of the action spaces A^i(x) of the individual agents when the state of the game is x ∈ X. We assume both state and action spaces to be finite. Let p(y | x, a) denote the probability of going from the current state x ∈ X to y ∈ X when the vector of actions a ∈ A(x) (of the N players) is chosen, and let r(x, a) = ⟨r^i(x, a) : i = 1, 2, ..., N⟩ denote the vector of reward functions of all agents when the state is x ∈ X and the vector of actions a ∈ A(x) is chosen. Also, 0 < β < 1 denotes the discount factor that controls the influence of the rewards obtained in the future on the agents' strategy (see Definition 1 below).

Notation. ⟨· · ·⟩ represents a column vector and 1_m is a vector of ones with m elements. The various constituents of the stochastic game considered are denoted as follows:

Action: a = ⟨a^1, a^2, ..., a^N⟩ ∈ A(x) is the aggregate action, a^{-i} is the tuple of actions of all agents except i, and A^{-i}(x) := ∏_{j≠i} A^j(x) is the set of feasible actions in state x ∈ X of all agents except i.
Policy: π^i(x, a^i) is the probability of picking action a^i ∈ A^i(x) by agent i in state x ∈ X, π^i(x) = ⟨π^i(x, a^i) : a^i ∈ A^i(x)⟩ is the randomized policy vector in state x ∈ X for agent i, π^i = ⟨π^i(x) : x ∈ X⟩, π = ⟨π^i : i = 1, 2, ..., N⟩ is the strategy-tuple of all agents and π^{-i} = ⟨π^j : j = 1, 2, ..., N, j ≠ i⟩ is the strategy-tuple of all agents except agent i.⁶ We focus only on stationary strategies in this paper, as suggested by Theorem 1.

⁶ We use the terms policy and strategy interchangeably in the paper.

Transition probability: Let π(x, a) = ∏_{i=1}^{N} π^i(x, a^i) and π^{-i}(x, a^{-i}) = ∏_{j=1, j≠i}^{N} π^j(x, a^j). Then, the (Markovian) transition probability from state x ∈ X to state y ∈ X, when each agent i plays according to its randomized strategy π^i, can be written as:

p(y | x, π) = Σ_{a ∈ A(x)} p(y | x, a) π(x, a).

Reward: r^i(x, a) is the single-stage reward obtained by agent i in state x ∈ X, where a ∈ A(x) is the aggregate action taken.

Definition 1 (Value function). The value function is the expected return for any agent i ∈ {1, 2, ..., N} and is defined as

v^i_π(s_0) = E[ Σ_t β^t Σ_{a ∈ A(s_t)} r^i(s_t, a) π(s_t, a) ].   (1)

Given the above notion of the value function, the goal of each agent is to find a strategy that achieves a Nash equilibrium. The latter is defined as follows:

Definition 2 (Nash equilibrium). A stationary Markov strategy π* = ⟨π^1_*, π^2_*, ..., π^N_*⟩ is said to be Nash if

v^i_{π*}(s) ≥ v^i_{⟨π^i, π^{-i}_*⟩}(s), ∀ π^i, ∀ i, ∀ s ∈ X.

The corresponding equilibrium of the game is said to be a Nash equilibrium. Since we consider a discounted stochastic game with a finite state space, we have the following well-known result that ensures the existence of a stationary equilibrium:

Theorem 1. Any finite discounted stochastic game has an equilibrium in stationary strategies.
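For fixed stationary strategies, the expectation in Definition 1 can be computed exactly: stacking states, v^i solves the linear Bellman system v^i = R^i_π + β P_π v^i, where P_π(x, y) = p(y | x, π) as above and R^i_π(x) is agent i's expected single-stage reward. A minimal sketch, with a randomly generated two-player game standing in for a real instance:

```python
import numpy as np

# Hypothetical 2-state, 2-action, 2-player game; tables are random stand-ins.
rng = np.random.default_rng(1)
nS, nA, beta = 2, 2, 0.9
p = rng.random((nS, nA, nA, nS)); p /= p.sum(-1, keepdims=True)
r = rng.random((nS, nA, nA, 2))                  # r[x, a1, a2] = (r^1, r^2)
pi1 = np.full((nS, nA), 0.5)                     # agent 1's randomized policy
pi2 = np.full((nS, nA), 0.5)                     # agent 2's randomized policy

joint = np.einsum('xi,xj->xij', pi1, pi2)        # pi(x, a) = pi^1 * pi^2
P = np.einsum('xij,xijy->xy', joint, p)          # induced chain p(y | x, pi)
R = np.einsum('xij,xijn->xn', joint, r)          # expected reward per state

# v^i = R^i + beta P v^i  =>  v = (I - beta P)^{-1} R  (column i = agent i)
v = np.linalg.solve(np.eye(nS) - beta * P, R)
```

This exact solve is the policy-evaluation step that the critic recursion approximates - via value iteration in OFF-SGSP and via TD learning in ON-SGSP.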
We shall refer to such stationary randomized strategies as Nash strategies. The reader is referred to Fink [1964], Takahashi [1964] and Sobel [1971] for a proof of Theorem 1.

4 A Generalized Optimization Problem

Basic idea. Using dynamic programming, the Nash equilibrium condition in Definition 2 can be re-written as: ∀ x ∈ X, ∀ i = 1, 2, ..., N,

v^i_{π*}(x) = max_{π^i(x) ∈ ∆(A^i(x))} E_{π^i(x)} [ Q^i_{π^{-i}_*}(x, a^i) ],   (2)

where

Q^i_{π^{-i}}(x, a^i) = E_{π^{-i}(x)} [ r^i(x, a) + β Σ_{y ∈ U(x)} p(y | x, a) v^i(y) ]

represents the marginal value associated with picking action a^i ∈ A^i(x) in state x ∈ X for agent i, while the other agents act according to π^{-i}. Also, ∆(A^i(x)) denotes the set of all possible probability distributions over A^i(x). The basic idea behind the optimization problem that we formulate below is to model the objective such that the value function is correct w.r.t. the agents' strategies, while adding a constraint to ensure that a feasible solution to the problem corresponds to a Nash equilibrium.

Objective. A possible optimization objective for agent i would be

f^i(v^i, π) = Σ_{x ∈ X} ( v^i(x) - E_{π^i} [ Q^i_{π^{-i}}(x, a^i) ] ).

The objective above has to be minimized over all policies π^i ∈ ∆(A^i(x)). But Q^i_{π^{-i}}(x, a^i), by definition, is dependent on the strategies of all other agents. So, an isolated minimization of f^i(v^i, π^i) would not be meaningful and hence, we consider the aggregate objective f(v, π) = Σ_{i=1}^{N} f^i(v^i, π). This objective is minimized over all policies π^i ∈ ∆(A^i(x)) of all agents. Thus, we have an optimization problem with objective f(v, π), along with the natural constraints ensuring that the policy vectors π^i(x) remain probabilities over all feasible actions A^i(x), in all states x ∈ X and for all agents i = 1, ..., N.

Constraints.
Notice that an optimization problem with the objective discussed above has only a set of simple constraints ensuring that π remains a valid strategy. However, this is not sufficient to accurately represent Nash equilibria of the underlying game. Here, we look at a possible set of additional constraints which make the optimization problem more useful. Note that the term being maximized in equation (2), i.e., E_{π^i} [ Q^i_{π^{-i}}(x, a^i) ], represents a convex combination of the values of Q^i_{π^{-i}}(x, a^i) over all possible actions a^i ∈ A^i(x), in a given state x ∈ X and for a given agent i. Thus, it is implicitly implied that

Q^i_{π^{-i}}(x, a^i) ≤ v^i_{π*}(x), ∀ a^i ∈ A^i(x), x ∈ X, i = 1, 2, ..., N.

Formally, the optimization problem for any N ≥ 2 is given below:

min_{v, π} f(v, π) = Σ_{i=1}^{N} Σ_{x ∈ X} ( v^i(x) - E_{π^i} [ Q^i_{π^{-i}}(x, a^i) ] )
s.t. (a) π^i(x, a^i) ≥ 0, ∀ a^i ∈ A^i(x), x ∈ X, i = 1, 2, ..., N,
     (b) Σ_{a^i ∈ A^i(x)} π^i(x, a^i) = 1, ∀ x ∈ X, i = 1, 2, ..., N,
     (c) Q^i_{π^{-i}}(x, a^i) ≤ v^i(x), ∀ a^i ∈ A^i(x), x ∈ X, i = 1, 2, ..., N.   (3)

In the above, 3(a)-3(b) ensure that π is a valid policy, while 3(c) is necessary for any valid policy to be a NE of the underlying game.

Theorem 2. A feasible point (v*, π*) of the optimization problem (3) provides a Nash equilibrium in stationary strategies to the corresponding general-sum discounted stochastic game if and only if f(v*, π*) = 0.

Proof. See [Filar and Vrieze, 2004, Theorem 3.8.2] for a proof in the case of N = 2. The proof works in a similar manner for general N.

A few remarks about the difficulties involved in solving (3) are in order.

Remark 2. (Non-linearity) For the case of N = 2, the objective f(v, π) in (3) is of order 3, while the toughest constraint 3(c) is quadratic.
This is apparent from the fact that the second term inside the summation in f(v, π) has the following variables multiplied: π^1 in the first expectation, π^2 inside the expectation in the definition of the Q-function, and v inside the second term of the expectation in the Q-function. Along similar lines, the constraint 3(c) can be inferred to be quadratic. Thus, we have an optimization problem with a third-order objective and quadratic constraints, even for the case of N = 2, and the constituent functions (both objective and constraints) can easily be seen to be non-linear. For a general N > 2, the problem (3) gets more complicated, as more policy variables π^1, ..., π^N are thrown in.

Remark 3. (Beyond local minima) Filar and Vrieze [2004] have formulated a non-linear optimization problem for the case of two-player zero-sum stochastic games. An associated result (Theorem 3.9.4, page 141, in [Filar and Vrieze, 2004]) states that every local minimum of that optimization problem is also a global minimum. However, this result does not hold for a general-sum game, even in the two-player (and also N ≥ 3) setting, and hence the requirement is for a global optimization scheme that solves (3).

Remark 4. (No steepest descent) From the previous remark, it is apparent that a simple steepest descent scheme is not enough to solve (3), even for the two-player setting. This is because there can be local minima of (3) that do not correspond to the global minimum, and steepest descent schemes guarantee convergence to local minima only. Note that steepest descent schemes were sufficient to solve for Nash equilibrium strategies in two-player zero-sum stochastic games, while this is not the case with general-sum N-player games, with N ≥ 2. Sections 3 and 4 of [Prasad and Bhatnagar, 2015] contain a detailed discussion on the inadequacies of steepest descent for general-sum games.

Remark 5.
(No Newton method) A natural question that arises is whether one can employ a Newton method in order to solve (3), and the answer is in the negative. Observe that the Hessian of the objective function f(v, π) in (3) has its diagonal elements equal to zero, and this makes it indefinite. This makes Newton methods infeasible, as they require invertibility of the Hessian to work.

We overcome the above difficulties by deriving a descent direction (that is not necessarily steepest) in order to solve (3). Before we present the descent direction, we break down (3) into simpler sub-problems. Subsequently, we describe Stochastic Game - Sub-Problem (SG-SP) conditions that are both necessary and sufficient for the problem (3).

Sub-problems for each state and agent

We form sub-problems from the main optimization problem (3), along the lines of Prasad and Bhatnagar [2012], for each state x ∈ X and each agent i ∈ {1, 2, ..., N}. The sub-problems are formed with the objective of ensuring that there is no Bellman error (see g^i_{x,z}(θ) below). For any x ∈ X, z = 1, 2, ..., |A^i(x)| and i ∈ {1, 2, ..., N}, let θ := ⟨v^i, π^{-i}(x)⟩ denote the value-policy tuple and let

g^i_{x,z}(θ) := Q^i_{π^{-i}}(x, a^i_z) - v^i(x)   (4)

denote the Bellman error. Further, let p_z := π^i(x, a^i_z) and p = ⟨p_z : z = 1, 2, ..., |A^i(x)|⟩. Then, the sub-problems are formulated as follows:

min_{θ, p} h_x(θ, p) = Σ_{z=1}^{|A^i(x)|} p_z ( -g^i_{x,z}(θ) )   (5)
s.t. g^i_{x,z}(θ) ≤ 0, -p_z ≤ 0, for z = 1, 2, ..., |A^i(x)|, and Σ_z p_z = 1.

5 Stochastic Game - Sub-Problem (SG-SP) Conditions

In this section, we derive a set of necessary and sufficient conditions for solutions of (3) and establish their equivalence with Nash strategies.

Definition 3 (SG-SP Point).
A point $(v^*, \pi^*)$ of the optimization problem (3) is said to be an SG-SP point if it is a feasible point of (3) and for every sub-problem, i.e., for all $x \in \mathcal{X}$ and $i \in \{1, 2, \ldots, N\}$,
\[ p^*_z\, g^i_{x,z}(\theta^*) = 0, \quad \forall z = 1, 2, \ldots, |\mathcal{A}^i(x)|. \quad (6) \]
The above conditions, which define a point to be an SG-SP point, are called SG-SP conditions.

5.1 Equivalence of SG-SP with Nash strategies

The connection between SG-SP points and Nash equilibria can be seen intuitively as follows: (i) The objective function $f(v^*, \pi^*)$ in (3) can be expressed as a summation of terms of the form $p^*_z [-g^i_{x,z}(\theta^*)]$ over $z = 1, 2, \ldots, |\mathcal{A}^i(x)|$ and over all sub-problems. Condition (6) suggests that each of these terms is zero, which implies $f(v^*, \pi^*) = 0$. (ii) The objective of the sub-problem is to ensure that there is no Bellman error, which in turn implies that the value estimates $v^*$ are correct with respect to the policy $\pi^*$ of all agents. Combining (i) and (ii) with Theorem 3.8.2 of [Filar and Vrieze, 2004], we have the following result:

Theorem 3 (Nash ⇔ SG-SP). A strategy $\pi^*$ is Nash if and only if $(v^*, \pi^*)$ for the corresponding optimization problem (3) is an SG-SP point.

The proof of the above theorem follows from a combination of Lemmas 1 and 2, presented below.

Lemma 1 (SG-SP ⇒ Nash). An SG-SP point $(v^*, \pi^*)$ gives a Nash strategy-tuple for the underlying stochastic game.

Proof. The objective function value $f(v^*, \pi^*)$ of the optimization problem (3) can be expressed as a summation of terms of the form $p^*_z [-g^i_{x,z}(\theta^*)]$ over $z = 1, 2, \ldots, m$ (where $m := |\mathcal{A}^i(x)|$) and over all sub-problems. Condition (6) suggests that each of these terms is zero, which implies $f(v^*, \pi^*) = 0$.
From Filar and Vrieze [2004, Theorem 3.8.2, page 132], since $(v^*, \pi^*)$ is a feasible point of (3) and $f(v^*, \pi^*) = 0$, $(v^*, \pi^*)$ corresponds to a Nash equilibrium of the underlying stochastic game.

Lemma 2 (Nash ⇒ SG-SP). If a strategy $\pi^*$ is Nash, then the point $(v^*, \pi^*)$ for the corresponding optimization problem (3) is an SG-SP point.

Proof. From Filar and Vrieze [2004, Theorem 3.8.2, page 132], if a strategy $\pi^*$ is Nash, then a feasible point $(v^*, \pi^*)$ exists for the corresponding optimization problem (3), where $f(v^*, \pi^*) = 0$. From the constraints of (3), it is clear that for a feasible point, $p^*_z [-g^i_{x,z}(\theta^*)] \ge 0$, for $z = 1, 2, \ldots, m$, for every sub-problem. Since the sum of all these terms, i.e., $f(v^*, \pi^*)$, is zero, each of these terms is zero, i.e., $(v^*, \pi^*)$ satisfies (6). Thus, $(v^*, \pi^*)$ is an SG-SP point.

5.2 Kinship to Karush-Kuhn-Tucker - Sub-Problem (KKT-SP) conditions

Prasad and Bhatnagar [2012] consider an optimization problem similar to (3) for the case of two agents, i.e., $N = 2$, and derive a set of verifiable necessary and sufficient conditions that they call KKT-SP conditions. In the following, we first extend the KKT-SP conditions to $N$-player stochastic games, for any $N \ge 2$, and later establish the equivalence of the KKT-SP conditions with the SG-SP conditions formulated above. The Lagrangian corresponding to (5) can be written as
\[ k(\theta, p, \lambda, \delta, s, t) = h_x(\theta, p) + \sum_{z=1}^{|\mathcal{A}^i(x)|} \lambda_z \left( g^i_{x,z}(\theta) + s_z^2 \right) + \sum_{z=1}^{|\mathcal{A}^i(x)|} \delta_z \left( -p_z + t_z^2 \right), \quad (7) \]
where $\lambda_z$ and $\delta_z$ are the Lagrange multipliers and $s_z$ and $t_z$, $z = 1, 2, \ldots, |\mathcal{A}^i(x)|$, are the slack variables corresponding to the first and second constraints of the sub-problem (5), respectively. Using the Lagrangian (7), the associated KKT conditions for the sub-problem (5) corresponding to a state $x \in \mathcal{X}$ and agent $i \in \{1, \ldots, N\}$ at a point $\langle \theta^*, p^* \rangle$ are the following:
\[ (a)\; \nabla_\theta h_x(\theta^*, p^*) + \sum_{z=1}^{m} \lambda_z \nabla_\theta g^i_{x,z}(\theta^*) = 0, \]
\[ (b)\; \frac{\partial h_x(\theta^*, p^*)}{\partial p_z} - \delta_z + \delta_m = 0, \quad z = 1, 2, \ldots, m, \]
\[ (c)\; \delta_z p^*_z = 0, \quad z = 1, 2, \ldots, m, \]
\[ (d)\; \lambda_z g^i_{x,z}(\theta^*) = 0, \quad z = 1, 2, \ldots, m, \]
\[ (e)\; \lambda_z \ge 0, \quad z = 1, 2, \ldots, m, \]
\[ (f)\; \delta_z \ge 0, \quad z = 1, 2, \ldots, m. \quad (8) \]
The KKT-SP conditions are shown to be necessary and sufficient for $(v^*, \pi^*)$ to represent a Nash equilibrium point of the underlying stochastic game and for $\pi^*$ to be a Nash strategy-tuple in the case of $N = 2$ (see Theorem 3.8 in [Prasad and Bhatnagar, 2012]). However, this requires an additional assumption that for each sub-problem, $\{\nabla_\theta g^i_{x,z}(\theta^*) : z = 1, 2, \ldots, m\}$ is a set of linearly independent vectors. On the other hand, the SG-SP conditions (see Definition 3) that we formulate do not impose any additional linear independence requirement in order to ensure that the solution points of the sub-problems correspond to Nash equilibria. The following lemma establishes the kinship between SG-SP and KKT-SP conditions.

Lemma 3 (KKT-SP ⇒ SG-SP). A KKT-SP point is also an SG-SP point.

Proof. A KKT-SP point $(v^*, \pi^*)$ is a feasible point of (3). For every sub-problem, substitute and eliminate $\lambda^*_z = p^*_z$ and $\delta^*_z = -g^i_{x,z}(\theta^*)$, $z = 1, 2, \ldots, m$. Then

1. Conditions (8(a)) and (8(b)) are satisfied;
2. Conditions (8(c)) and (8(d)) reduce to (6); and
3. Conditions (8(e)) and (8(f)) are satisfied, as the point $(v^*, \pi^*)$ is assumed to be feasible.

From the above, it is evident that the simpler and more general (for any $N$) SG-SP conditions can be used for Nash equilibria as compared to the KKT-SP conditions because: (i) every KKT-SP point is also an SG-SP point and (ii) SG-SP conditions do not impose any additional linear independence requirement in order to be Nash points.

6 OFF-SGSP: Offline, Model-Based

Basic idea.
As outlined in the introduction, OFF-SGSP is an actor-critic algorithm that operates using two timescale recursions as follows:

Critic recursion: This estimates the value function $v$ using value iteration, along the faster timescale; and

Actor recursion: This operates along the slower timescale and updates the policy in the descent direction, so as to ensure convergence to an SG-SP point.

As mentioned before, OFF-SGSP is a model-based algorithm, and the transition dynamics and reward structure of the underlying game are used for both steps above.

Update rule. Using two timescale recursions, OFF-SGSP updates the value-policy tuple $(v, \pi)$ as follows: For all $x \in \mathcal{X}$ and $a^i \in \mathcal{A}^i(x)$,

Actor:
\[ \pi^i_{n+1}(x, a^i) = \Gamma\left( \pi^i_n(x, a^i) - b(n) \sqrt{\pi^i_n(x, a^i)}\; g^i_{x,a^i}(v^i_n, \pi^{-i}_n)\; \mathrm{sgn}\!\left( \frac{\partial f(v_n, \pi_n)}{\partial \pi^i} \right) \right), \quad (9) \]

Critic:
\[ v^i_{n+1}(x) = v^i_n(x) + c(n) \sum_{a^i \in \mathcal{A}^i(x)} \pi^i_n(x, a^i)\, g^i_{x,a^i}(v^i_n, \pi^{-i}_n), \quad (10) \]

where $g^i_{x,a^i}(v^i, \pi^{-i}) := Q^i_{\pi^{-i}}(x, a^i) - v^i(x)$ denotes the Bellman error, $f(v, \pi)$ is the objective function in (3) and $\Gamma$ is a projection operator that ensures that the updates to $\pi$ stay within the simplex $D = \{(d_2, \ldots, d_{|\mathcal{A}^i(x)|}) \mid d_j \ge 0,\ \forall j = 2, \ldots, |\mathcal{A}^i(x)|,\ \sum_{j=2}^{|\mathcal{A}^i(x)|} d_j \le 1\}$. Further, using $\Gamma$, one ensures that $d_1 = 1 - \sum_{j=2}^{|\mathcal{A}^i(x)|} d_j \in [0, 1]$. Here $\mathrm{sgn}(\cdot)$ is a continuous version of the sign function that projects any $x$ outside of a very small interval around $0$ to $\pm 1$ according to the sign of $x$ (see Remark 10 for a precise definition). Continuity of $\mathrm{sgn}(\cdot)$ is a technical requirement that helps in providing strong convergence guarantees (see footnote 7). The following assumption on the step-sizes ensures that the $\pi$-recursion (9) proceeds on a slower timescale in comparison to the $v$-recursion (10):

Assumption 1.
The step-sizes $\{b(n)\}$, $\{c(n)\}$ satisfy
\[ \sum_{n=1}^{\infty} b(n) = \sum_{n=1}^{\infty} c(n) = \infty, \quad \sum_{n=1}^{\infty} \left( b^2(n) + c^2(n) \right) < \infty, \quad \frac{b(n)}{c(n)} \to 0. \]

Justification for descent direction. The following proposition proves that the decrement for the policy in (9) is a valid descent direction for the objective function $f(\cdot, \cdot)$ in (3).

Proposition 1. For each $i = 1, 2, \ldots, N$, $x \in \mathcal{X}$, $a^i \in \mathcal{A}^i(x)$, we have that
\[ -\sqrt{\pi^i(x, a^i)}\; g^i_{x,a^i}(v^i, \pi^{-i})\; \mathrm{sgn}\!\left( \frac{\partial f(v, \pi)}{\partial \pi^i} \right) \]
is a non-ascent direction, and in particular a descent direction if $\sqrt{\pi^i(x, a^i)}\, g^i_{x,a^i}(v^i, \pi^{-i}) \neq 0$, in the objective $f(v, \pi)$ of (3).

Proof. The objective $f(v, \pi)$ can be rewritten as
\[ f(v, \pi) = \sum_{i=1}^{N} \sum_{x \in \mathcal{X}} \sum_{a^i \in \mathcal{A}^i(x)} \pi^i(x, a^i) \left[ -g^i_{x,a^i}(v^i, \pi^{-i}) \right]. \]
Assume $f(v, \pi) > 0$; otherwise the solution to (3) is already achieved. For an $a^i \in \mathcal{A}^i(x)$ for some $x \in \mathcal{X}$ and $i \in \{1, 2, \ldots, N\}$, let
\[ \hat{\pi}^i(x, a^i) = \pi^i(x, a^i) - \delta \sqrt{\pi^i(x, a^i)}\; g^i_{x,a^i}(v^i, \pi^{-i})\; \mathrm{sgn}\!\left( \frac{\partial f(v, \pi)}{\partial \pi^i} \right), \]
for a small $\delta > 0$. Let $\hat{\pi}$ be the same as $\pi$ except that action $a^i$ is picked as defined above. Then by a Taylor series expansion of $f(v, \hat{\pi})$ up to the first-order term, we obtain
\[ f(v, \hat{\pi}) = f(v, \pi) + \delta \left[ -\sqrt{\pi^i(x, a^i)}\; g^i_{x,a^i}(v^i, \pi^{-i}) \right] \mathrm{sgn}\!\left( \frac{\partial f(v, \pi)}{\partial \pi^i} \right) \frac{\partial f(v, \pi)}{\partial \pi^i(x, a^i)} + o(\delta). \]
The rest of the proof amounts to showing that the second term in the expansion above is $\le 0$. This can be inferred as follows:
\[ -\sqrt{\pi^i(x, a^i)}\; g^i_{x,a^i}(v^i, \pi^{-i})\; \mathrm{sgn}\!\left( \frac{\partial f(v, \pi)}{\partial \pi^i} \right) \frac{\partial f(v, \pi)}{\partial \pi^i} = -\sqrt{\pi^i(x, a^i)}\; g^i_{x,a^i}(v^i, \pi^{-i}) \left| \frac{\partial f(v, \pi)}{\partial \pi^i} \right| \le 0, \]
and is in particular $< 0$ if $\sqrt{\pi^i(x, a^i)}\, g^i_{x,a^i}(v^i, \pi^{-i}) \neq 0$. Thus, for $a^i \in \mathcal{A}^i(x)$, $x \in \mathcal{X}$ and $i \in \{1, 2, \ldots, N\}$ where $\pi^i(x, a^i) > 0$ and $g^i_{x,a^i}(v^i, \pi^{-i}) \neq 0$, we have $f(v, \hat{\pi}) < f(v, \pi)$ for small enough $\delta$. The claim follows.
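The coupled recursions (9)-(10) can be sketched numerically. The following is a minimal single-controller sketch on a hypothetical random MDP (so the interaction between agents is absent): `project_simplex` is one concrete choice for the projection operator $\Gamma$, `smooth_sgn` stands in for the continuous $\mathrm{sgn}(\cdot)$, and the model-based gradient $\partial f / \partial \pi^i$ is replaced by the stand-in $-g$; none of these specific choices is prescribed here.

```python
import numpy as np

def smooth_sgn(u, nu=1e-6):
    # Continuous version of the sign function: linear on [-nu, nu], +/-1 outside.
    return np.clip(u / nu, -1.0, 1.0)

def project_simplex(p):
    # Euclidean projection onto the probability simplex -- one concrete
    # choice for the projection operator Gamma in (9).
    u = np.sort(p)[::-1]
    css = np.cumsum(u)
    idx = np.nonzero(u + (1.0 - css) / (np.arange(len(p)) + 1) > 0)[0][-1]
    tau = (1.0 - css[idx]) / (idx + 1)
    return np.maximum(p + tau, 0.0)

# Hypothetical toy model: one agent, 3 states, 2 actions (illustration only).
rng = np.random.default_rng(0)
nX, nA, beta = 3, 2, 0.8
P = rng.dirichlet(np.ones(nX), size=(nX, nA))   # P[x, a, y] = p(y | x, a)
r = rng.standard_normal((nX, nA))
v = np.zeros(nX)
pi = np.full((nX, nA), 1.0 / nA)

for n in range(1, 2001):
    b_n, c_n = 0.05 / n ** 0.9, 0.5 / n ** 0.6  # schedules satisfying Assumption 1
    Q = r + beta * (P @ v)                       # Q[x, a]
    g = Q - v[:, None]                           # Bellman error g_{x,a}
    v = v + c_n * np.sum(pi * g, axis=1)         # critic recursion, cf. (10)
    grad_f = -g                                  # stand-in for df/dpi^i (own term only)
    for x in range(nX):                          # actor recursion, cf. (9)
        step = b_n * np.sqrt(pi[x]) * g[x] * smooth_sgn(grad_f[x])
        pi[x] = project_simplex(pi[x] - step)
```

Note how the two step-size schedules keep the actor slow relative to the critic: $b(n)/c(n) = 0.1\, n^{-0.3} \to 0$, while both schedules sum to infinity and are square-summable.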
Footnote 7: Using the ordinary sign function is problematic for an ODE-based analysis, as it is discontinuous. In other words, with the ordinary sign function the underlying system for the policy $\pi^i_n$ would be a stochastic recursive inclusion, and providing meaningful guarantees for such inclusions would require more assumptions. In comparison, the results are stronger for the ODE approach. This is the motivation behind employing $\mathrm{sgn}(\cdot)$, a continuous extension of the ordinary sign function. The function $\mathrm{sgn}(x)$ projects any $x$ outside of a small interval around $0$ (say $[-\nu, \nu]$ for some small $\nu > 0$) to either $+1$ or $-1$, as the ordinary sign function would, and within the interval $[-\nu, \nu]$ one may choose $\mathrm{sgn}(x) = x/\nu$ or any other continuous function with compatible end-point values. One could choose $\nu$ arbitrarily close to $0$, making $\mathrm{sgn}(\cdot)$ practically very close to the ordinary sign function.

[Figure 3: ON-SGSP's decentralized on-line learning model with $N$ agents]

Main result. Let $R^i_\pi = (r^i(x, \pi), x \in \mathcal{X})$ be a column vector of rewards to agent $i$ and $P_\pi = [p(y \mid x, \pi), x \in \mathcal{X}, y \in \mathcal{X}]$ be the transition probability matrix, both for a given $\pi$. Then, the value function for a given policy $\pi$ is defined as
\[ v^i_\pi = [I - \beta P_\pi]^{-1} R^i_\pi, \quad i = 1, 2, \ldots, N. \quad (11) \]
The above will be used to characterize the limit of the critic recursion (10). Before presenting the main result, we specify the ODE that underlies the actor recursion (9): For all $a^i \in \mathcal{A}^i(x)$, $x \in \mathcal{X}$, $i = 1, 2, \ldots, N$,
\[ \frac{d\pi^i(x, a^i)}{dt} = \bar{\Gamma}\left( -\sqrt{\pi^i(x, a^i)}\; g^i_{x,a^i}(v^i_\pi, \pi^{-i})\; \mathrm{sgn}\!\left( \frac{\partial f(v_\pi, \pi)}{\partial \pi^i} \right) \right), \quad (12) \]
where $\bar{\Gamma}$ is a projection operator that restricts the evolution of the above ODE to the simplex $D$ (see Section 8 for a precise definition). The main result regarding the convergence of OFF-SGSP is as follows:

Theorem 4.
Let $G$ denote the set of all feasible points of the optimization problem (3) and $K$ the set of limit points of the ODE (12). Further, let $K_1 = K \cap G$ and $K^* = \{(v^i_{\pi^*}, \pi^*) \mid \pi^* \in K_1\}$. Then, for any agent $i = 1, \ldots, N$, the sequence of iterates $(v^i_n, \pi^i_n)$, $n \ge 0$, satisfies $(v^i_n, \pi^i_n) \to K^*$ a.s.

Proof. See Section 8.1.

From the above theorem, we can infer the following: (i) The set of infeasible limit points of the ODE (12), i.e., $K_2 = K \setminus K_1$, are asymptotically unstable (see Lemma 5 in Section 8 for a formal proof); and (ii) OFF-SGSP converges to the set $K^*$, which is the set of all asymptotically stable limit points of the system of ODEs (19). Further, $K^*$ corresponds to SG-SP (and hence Nash) points, and hence OFF-SGSP is shown to converge almost surely to a NE of the underlying discounted stochastic game.

7 ON-SGSP: Online and Model-Free

Though OFF-SGSP is suitable only for off-line learning of Nash strategies, it is amenable to extension to the general (on-line) multi-agent RL setting where neither the transition probability $p$ nor the reward function $r$ is explicitly known. ON-SGSP operates in the latter model-free setting and uses the stochastic game as a generative model. As illustrated in Fig. 3, every iteration in ON-SGSP represents a discrete-time interaction with the environment, where each agent presents its action to the environment and observes the next state and the reward vector of all agents. The learning is localized to each agent $i \in \{1, 2, \ldots, N\}$, making the setting decentralized. This is in the spirit of earlier multi-agent RL approaches (cf. Hu and Wellman [1999], Hu and Wellman [2003] and Littman [2001]). Algorithm 1 presents the complete structure of ON-SGSP along with update rules for the value and policy parameters.
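The interaction protocol of Fig. 3 can be made concrete with a small sketch. The environment class, its `step` signature and the matching-pennies payoffs below are all hypothetical, used only to illustrate that in each round every agent submits its own action and observes the next state together with the full reward vector $\langle r^1_n, \ldots, r^N_n \rangle$:

```python
import random
from dataclasses import dataclass

@dataclass
class StepResult:
    next_state: int
    rewards: tuple  # full reward vector <r^1, ..., r^N>, visible to every agent

class MatchingPenniesEnv:
    # Hypothetical single-state, two-agent environment (N = 2, |X| = 1),
    # used only to illustrate the decentralized interaction protocol.
    def step(self, state, actions):
        a1, a2 = actions
        r1 = 1.0 if a1 == a2 else -1.0
        return StepResult(next_state=0, rewards=(r1, -r1))

env = MatchingPenniesEnv()
x = 0
trace = []
for n in range(5):
    joint_action = tuple(random.randrange(2) for _ in range(2))  # one action per agent
    out = env.step(x, joint_action)
    trace.append(out)
    x = out.next_state  # each agent now updates its v, xi, pi from (x_n, a_n, r_n, y_n)
```

In ON-SGSP, each agent would run this loop with its own local copies of the value, gradient, and policy recursions (13)-(15); the environment itself is shared but no agent needs the model $p$ or $r$.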
The algorithm operates along two timescales as follows:

Critic (faster timescale): Each agent estimates its own value function as well as that of the other agents, using a temporal-difference (TD) [Sutton, 1988] type update in (13). Moreover, the gradient $\frac{\partial f(v_n, \pi_n)}{\partial \pi^i(x, a^i)}$ is also estimated in an online manner via the $\xi$-recursion in (14). Note that the $\xi$-recursion is made necessary by the fact that ON-SGSP operates in a model-free setting.

Actor (slower timescale): The policy update is similar to OFF-SGSP, except that the estimates of the value $v$ and gradient $\xi$ are used to derive the decrement in (15).

Note that, since ON-SGSP operates in a model-free setting, both the value and policy updates are different in comparison to OFF-SGSP. The value update (13) on the faster timescale can be seen to be the stochastic approximation variant of value iteration, and it converges to the same limit as in OFF-SGSP, without knowing the model. On the other hand, the policy update (15) on the slower timescale involves a decrement that is motivated by the descent direction suggested by Proposition 1.

Algorithm 1 ON-SGSP
Input: starting state $x_0$, initial point $\theta^i_0 = (v^i_0, \pi^i_0)$, step-sizes $\{b(n), c(n)\}_{n \ge 1}$, number of iterations to run $M \gg 0$.
Initialization: $n \leftarrow 1$, $\theta^i \leftarrow \theta^i_0$, $x \leftarrow x_0$.
for $n = 1, \ldots, M$ do
  Play action $a^i_n := \pi^i_n(x_n)$ along with the other agents in the current state $x_n \in \mathcal{X}$
  Obtain the next state $y_n \in \mathcal{X}$
  Observe the reward vector $r_n = \langle r^1_n, \ldots, r^N_n \rangle$
  Value update: For $j = 1, \ldots, N$,
  \[ v^j_{n+1}(x_n) = v^j_n(x_n) + c(n) \left( r^j_n + \beta v^j_n(y_n) - v^j_n(x_n) \right) \quad (13) \]
  Gradient estimation:
  \[ \xi^i_{n+1}(x_n, a^i_n) = \xi^i_n(x_n, a^i_n) + c(n) \left( \sum_{j=1}^{N} \left( r^j_n + \beta v^j_n(y_n) - v^j_n(x_n) \right) - \xi^i_n(x_n, a^i_n) \right) \quad (14) \]
  Policy update:
  \[ \pi^i_{n+1}(x_n, a^i_n) = \Gamma\Big( \pi^i_n(x_n, a^i_n) - b(n) \sqrt{\pi^i_n(x_n, a^i_n)} \times \left( r^i_n + \beta v^i_n(y_n) - v^i_n(x_n) \right) \mathrm{sgn}\left( -\xi^i_{n+1}(x_n, a^i_n) \right) \Big) \quad (15) \]
end for

A few remarks about ON-SGSP are in order.

Remark 6. (Coupled dynamics) In the ON-SGSP algorithm, an agent $i$ observes the rewards of the other agents and uses this information to compute the respective value estimates. These quantities are then used to derive the decrement in the policy update (15). This is meaningful in the light of the impossibility result of Hart and Mas-Colell [2003], where the authors show that in order to converge to a Nash equilibrium, each agent's strategy needs to factor in the rewards of the other agents.

Remark 7. (Per-iteration complexity)
OFF-SGSP: Let $A$ be the typical number of actions available to any agent in any given state and let $U$ be the typical number of next states for any state $x \in \mathcal{X}$. Then, the typical number of multiplications in OFF-SGSP per iteration is $\left( N(U+1)A^N + 4A \right) \times |\mathcal{X}|$. Thus, the computational complexity grows exponentially in the number of agents while being linear in the size of the state space. Note that the exponential behaviour in $N$ appears because of the computation of the expectation over possible next states and strategies of agents. This computation is avoided in ON-SGSP.
ON-SGSP: For each agent, a typical iteration would take just $(2A + 1)$ multiplications. Thus, the per-iteration complexity of OFF-SGSP is $\Theta(2^N)$ while that of ON-SGSP is $\Theta(1)$ (from the point of view of each agent).
Thus, ON-SGSP is computationally efficient, and this is also confirmed by simulation results, which establish that the total run-time till convergence of ON-SGSP is indeed very small when compared to that of the off-line algorithm OFF-SGSP. In comparison, the stochastic tracing procedure of Herings and Peeters [2004] has a complexity of $O(|\mathcal{X}| \times A^N)$ per iteration, which is similar to that of OFF-SGSP. However, the per-iteration complexity alone is not sufficient, and an analysis of the number of iterations required is necessary to complete the picture (see footnote 8). On the other hand, convergence rate results for general multi-timescale stochastic approximation schemes are not available; see, however, Konda and Tsitsiklis [2004] for rate results of two-timescale schemes with linear recursions.

Remark 8. (Descent directions) It is shown in Proposition 1 that $-\sqrt{\pi^i(x, a^i)}\, g^i_{x,a^i}(v^i, \pi^{-i})\, \mathrm{sgn}\left( \frac{\partial f(v, \pi)}{\partial \pi^i} \right)$ is a descent direction in $\pi^i(x, a^i)$ for every $i = 1, 2, \ldots, N$, $x \in \mathcal{X}$, $a^i \in \mathcal{A}^i(x)$. Since $\pi^i(x, a^i)^\alpha \ge 0$ for any $\alpha \ge 0$, the direction $-\pi^i(x, a^i)^\alpha \sqrt{\pi^i(x, a^i)}\, g^i_{x,a^i}(v^i, \pi^{-i})\, \mathrm{sgn}\left( \frac{\partial f(v, \pi)}{\partial \pi^i} \right)$ can also be seen to be a descent direction in $\pi^i(x, a^i)$ for every $i$, $x$, $a^i$. In other words, the following is a descent direction in $\pi^i(x, a^i)$ for every $i = 1, 2, \ldots, N$, $x \in \mathcal{X}$, $a^i \in \mathcal{A}^i(x)$:
\[ -\pi^i(x, a^i)^{\alpha'} g^i_{x,a^i}(v^i, \pi^{-i})\, \mathrm{sgn}\!\left( \frac{\partial f(v, \pi)}{\partial \pi^i} \right), \quad \text{for any } \alpha' \ge \frac{1}{2}. \]
So, the policy updates in OFF-SGSP/ON-SGSP can be generalized as follows: With $\alpha' \ge \frac{1}{2}$,
\[ \text{OFF-SGSP:} \quad \pi^i(x, a^i) \leftarrow \Gamma\left( \pi^i(x, a^i) - \gamma\, \pi^i(x, a^i)^{\alpha'} g^i_{x,a^i}(v^i, \pi^{-i})\, \mathrm{sgn}\!\left( \frac{\partial f(v, \pi)}{\partial \pi^i} \right) \right), \]
\[ \text{ON-SGSP:} \quad \pi^i(x, a^i) \leftarrow \Gamma\left( \pi^i(x, a^i) - \gamma\, \pi^i(x, a^i)^{\alpha'} \left( r^i + \beta v^i(y) - v^i(x) \right) \mathrm{sgn}\left( -\xi^i(x, a^i) \right) \right), \]
where $\gamma$ is a step-size parameter.

Remark 9.
(Convergence result) Theorem 4 holds for ON-SGSP as well, while the proof deviates significantly. OFF-SGSP assumes model information, i.e., knowledge of the transition dynamics. On the other hand, ON-SGSP operates in a model-free setting and hence the analyses for both timescales change. In particular, ON-SGSP uses a TD-critic, and using standard stochastic approximation arguments (as in earlier literature), it is straightforward to prove that the value updates in (13) converge to the true value function. However, the analysis of the $\xi$-recursion changes significantly. The latter is a consequence of the fact that ON-SGSP operates in a model-free setting and hence does not have access to $f(v_n, \pi_n)$ (and hence $\frac{\partial f(v_n, \pi_n)}{\partial \pi^i(x, a^i)}$, which is required for the policy update). Finally, the policy updates can be shown to track the system of ODEs (19) as in OFF-SGSP, after handling an additional martingale sequence that arises due to the $\xi$-recursion. The detailed proof is available in Section 8.2.

Footnote 8: A well-known complexity result [Papadimitriou, 1994] establishes that finding the Nash equilibrium of a two-player game is PPAD-complete.

8 Proof of Convergence

We provide a proof of convergence of the two proposed algorithms, OFF-SGSP and ON-SGSP, respectively. In addition to Assumption 1, we make the following assumption for the analysis of our algorithms:

Assumption 2. The underlying Markov chain with transition probabilities $p(y \mid x, \pi)$, $x, y \in \mathcal{X}$, corresponding to the general-sum discounted stochastic game, is irreducible and positive recurrent for all possible strategies $\pi$.

The above assumption is standard in the analysis of multi-agent RL algorithms and can be seen in earlier works as well (for instance, see Hu and Wellman [1999], Littman [2001]).
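Assumption 2 is easy to check numerically for a given $P_\pi$: a finite chain is irreducible iff every state can reach every other state, and an irreducible finite chain is automatically positive recurrent. A minimal sketch (the function name and tolerance are ours, not from the text):

```python
import numpy as np

def is_irreducible(P, tol=1e-12):
    # P: stochastic matrix P_pi.  The chain is irreducible iff
    # sum_{k=0}^{n-1} P^k has all entries positive, i.e. every state
    # reaches every other state in at most n-1 steps.
    n = P.shape[0]
    reach = np.eye(n, dtype=bool)
    power = np.eye(n)
    for _ in range(n - 1):
        power = power @ P
        reach |= power > tol
    return bool(reach.all())

cycle = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])   # deterministic 3-cycle: irreducible
absorbing = np.eye(2)                 # two absorbing states: reducible
```

For the game setting, one would run such a check over (a sample of) the strategies $\pi$ of interest, since Assumption 2 quantifies over all possible strategies.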
In the following section, we provide the detailed analysis for OFF-SGSP and later, in Section 8.2, provide the necessary modifications to the analysis for ON-SGSP.

8.1 Proof of Theorem 4 for OFF-SGSP

As mentioned earlier, OFF-SGSP employs two-timescale stochastic approximation [Borkar, 2008, Chapter 6]. That is, it comprises iteration sequences that are updated using two different timescales or step-size schedules, defined via $\{b(n)\}$ and $\{c(n)\}$, respectively. The step-sizes, satisfying Assumption 1, ensure the following: (i) The policy $\pi$ (on the slower timescale) appears quasi-static for updates of $v$; and (ii) The value $v$ (on the faster timescale) appears almost equilibrated for updates of $\pi$. We let $v_\pi$ denote the value for a given policy $\pi$.

Claim (i) above can be inferred as follows: First rewrite the $\pi$-recursion in (9) as
\[ \pi^i_{n+1}(x, a^i) = \Gamma\left( \pi^i_n(x, a^i) - c(n) H(n) \right), \;\text{where}\; H(n) = \frac{b(n)}{c(n)} \sqrt{\pi^i_n(x, a^i)}\; g^i_{x,a^i}(v^i_n, \pi^{-i}_n)\; \mathrm{sgn}\!\left( \frac{\partial f(v_n, \pi_n)}{\partial \pi^i} \right), \]
with $g^i(\cdot, \cdot)$ as defined in (4) and $f(\cdot, \cdot)$ as defined in (3). Since we consider a finite state-action spaced stochastic game, $g^i_{x,a^i}$ is bounded, while one can trivially upper bound $\pi$ and $\mathrm{sgn}$. Thus, $\sup_n |H(n)|$ is finite. Since $\frac{b(n)}{c(n)} = o(1)$ by Assumption 1, it can be clearly seen that the $\pi$-recursion in (9) tracks the ODE $\frac{d\pi^i(x, a^i)}{dt} = 0$. Claim (i) now follows. Inferring claim (ii) above is technically more involved, but follows using arguments similar to those used in Theorem 2 of Chapter 6 in [Borkar, 2008].

In order to prove Theorem 4, we analyse each timescale separately in the following.

Step 1: Analysis of the $v$-recursion

We first show that the updates of $v$, which are on the faster timescale, converge to a limit point of the following system of ODEs: $\forall x \in \mathcal{X}$, $i = 1, 2, \ldots, N$,
\[ \frac{dv^i(x)}{dt} = r^i(x, \pi) + \beta \sum_{y \in U(x)} p(y \mid x, \pi)\, v^i(y) - v^i(x), \quad (16) \]
where $\pi$ is assumed to be time-invariant. We will also see that the system of ODEs above has a unique limit point, henceforth referred to as $v_\pi$, which is stable. Let $R^i_\pi = (r^i(x, \pi), x \in \mathcal{X})$ be a column vector of rewards to agent $i$ and $P_\pi = [p(y \mid x, \pi), x \in \mathcal{X}, y \in \mathcal{X}]$ be the transition probability matrix, both for a given $\pi$.

Lemma 4. The system of ODEs (16) has a unique globally asymptotically stable limit point given by
\[ v^i_\pi = [I - \beta P_\pi]^{-1} R^i_\pi, \quad i = 1, 2, \ldots, N. \quad (17) \]

Proof. The system of ODEs (16) can be rewritten in vector form as
\[ \frac{dv^i}{dt} = R^i_\pi + \beta P_\pi v^i - v^i. \quad (18) \]
Rearranging terms, we get $\frac{dv^i}{dt} = R^i_\pi + (\beta P_\pi - I) v^i$, where $I$ is the identity matrix of suitable dimension. Note that for a fixed $\pi$, this ODE is linear in $v^i$ with state transition matrix $(\beta P_\pi - I)$. Since $P_\pi$ is a stochastic matrix, the magnitude of all its eigenvalues is upper bounded by 1. Hence all the eigenvalues of the state transition matrix $(\beta P_\pi - I)$ have negative real parts, and the matrix $(\beta P_\pi - I)$ is in particular non-singular. Thus, by standard linear systems theory, the above ODE has a unique globally asymptotically stable limit point, which can be computed by setting $\frac{dv^i}{dt} = 0$, $i = 1, 2, \ldots, N$, i.e., $R^i_\pi + (\beta P_\pi - I) v^i = 0$, $i = 1, 2, \ldots, N$. In view of the above, the trajectories of the ODE (18) converge to this point starting from any initial condition.

For a given $\pi$, the updates of $v$ in equation (10) (OFF-SGSP) can be seen as an Euler discretization of the system of ODEs (16). We now show that $v_n$ in equation (10) of OFF-SGSP converges to $v_\pi$ as given in equation (17). While the following claim is identical for both OFF-SGSP/ON-SGSP, the proofs are quite different.
In the former case, it amounts to proving that value iteration converges (a standard result in dynamic programming), while the latter case amounts to proving that a stochastic approximation variant of value iteration converges (also a standard result in RL).

Proposition 2. For a given $\pi$, i.e., with $\pi^i_n \equiv \pi^i$, updates of $v$ governed by (10) (OFF-SGSP) satisfy $v_n \to v_\pi$ as $n \to \infty$, where $v_\pi$ is the globally asymptotically stable equilibrium point of the system of ODEs (16).

Proof. We verify here assumptions (A1) and (A2) of Borkar and Meyn [2000] in order to use their result [Borkar and Meyn, 2000, Theorem 2.2]. Let $h(v^i) = R^i_\pi + (\beta P_\pi - I) v^i$. Since $h(v^i)$ is linear in $v^i$, it is Lipschitz continuous. Let $h_r(v^i) = \frac{h(r v^i)}{r}$ for a scalar $r > 0$. It is easy to see that $h_r(v^i) = \frac{R^i_\pi}{r} + (\beta P_\pi - I) v^i$. Now, $h_\infty(v^i) = \lim_{r \to \infty} h_r(v^i) = (\beta P_\pi - I) v^i$. Since all eigenvalues of $(\beta P_\pi - I)$ have negative real parts, the ODE $\frac{dv^i}{dt} = h_\infty(v^i)$ has the origin as its unique globally asymptotically stable equilibrium. Further, as shown in Lemma 4, $v^i_\pi$ is the unique globally asymptotically stable equilibrium for the ODE (18). Assumption (A1) of Borkar and Meyn [2000] is thus satisfied. Since the updates of $v$ in equation (10) do not have any noise term in them, assumption (A2) of Borkar and Meyn [2000] is trivially satisfied. Thus, by Borkar and Meyn [2000, Theorem 2.2], $v_n$ in equation (10) converges to the globally asymptotically stable limit point $v_\pi$ given in equation (17).

Thus, on the faster timescale $\{c(n)\}$, the updates of $v$ obtained from (10) converge to $v_\pi$, as given by (17).
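Lemma 4 and Proposition 2 can be checked numerically for a hypothetical random $P_\pi$ and $R_\pi$ (both generated here for illustration only): the eigenvalues of $\beta P_\pi - I$ all have negative real parts, and iterating the fixed-policy critic update (with unit step size it is exactly value iteration) converges to $v_\pi = [I - \beta P_\pi]^{-1} R_\pi$ of (17). A sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
beta, n = 0.9, 4
P = rng.dirichlet(np.ones(n), size=n)      # hypothetical stochastic matrix P_pi
R = rng.standard_normal(n)                 # hypothetical reward vector R_pi

# Lemma 4: the spectral radius of beta*P is at most beta < 1, so every
# eigenvalue of (beta*P - I) has real part at most beta - 1 < 0.
eigs = np.linalg.eigvals(beta * P - np.eye(n))

# Equation (17): the unique equilibrium of the ODE (16), via a linear solve.
v_star = np.linalg.solve(np.eye(n) - beta * P, R)

# Fixed-policy value iteration; it is a beta-contraction in the sup norm,
# so the error to v_star shrinks geometrically.
v = np.zeros(n)
for _ in range(500):
    v = R + beta * (P @ v)
```

The linear solve and the fixed-point iteration agree to numerical precision, which is exactly the content of the lemma: the ODE (18) is linear with a Hurwitz transition matrix, so its unique equilibrium is the discounted value (17).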
Step 2: Analysis of the $\pi$-recursion

Using the converged values of $v$ corresponding to the strategy update $\pi_n$, i.e., $v_{\pi_n}$, on the slower timescale, we show that updates of $\pi$ converge to a limit point of the following system of ODEs: For all $a^i \in \mathcal{A}^i(x)$, $x \in \mathcal{X}$, $i = 1, 2, \ldots, N$,
\[ \frac{d\pi^i(x, a^i)}{dt} = \bar{\Gamma}\left( -\sqrt{\pi^i(x, a^i)}\; g^i_{x,a^i}(v^i_\pi, \pi^{-i})\; \mathrm{sgn}\!\left( \frac{\partial f(v_\pi, \pi)}{\partial \pi^i} \right) \right), \quad (19) \]
where $g^i_{x,a^i}(\cdot, \cdot)$ is the Bellman error (see (4)), $f(\cdot, \cdot)$ is the objective in (3) and $\bar{\Gamma}$ is a projection operator that restricts the evolution of the above ODE to the simplex $D$ and is defined as follows:
\[ \bar{\Gamma}(v(x)) = \lim_{\eta \to 0} \frac{\Gamma(x + \eta v(x)) - x}{\eta}, \quad (20) \]
for any continuous $v : D \to \mathbb{R}^N$.

Let $K$ denote the limit set of the ODE (19). Before we analyse the $\pi$-recursion in (9), we show that the points in $K$ that are infeasible for the optimization problem (3) are asymptotically unstable. In other words, each stable limit point of the ODE (19) is an SG-SP point. Define the set of all feasible points of the optimization problem (3) as follows:
\[ G = \left\{ \pi \in L \mid g^i_{x,a^i}(v^i_\pi, \pi^{-i}) \le 0,\; \forall a^i \in \mathcal{A}^i(x),\, x \in \mathcal{X},\, i = 1, 2, \ldots, N \right\}. \quad (21) \]
The limit set $K$ of the ODE (19) can be partitioned using the feasible set $G$ as $K = K_1 \cup K_2$, where $K_1 = K \cap G$ and $K_2 = K \setminus K_1$. In the following lemma, we show that the set $K_2$ is the set of locally unstable equilibrium points of (19).

Lemma 5. All $\pi^* \in K_2$ are unstable equilibrium points of the system of ODEs (19).

Proof. For any $\pi^* \in K_2$, there exists some $a^i \in \mathcal{A}^i(x)$, $x \in \mathcal{X}$, $i \in \{1, 2, \ldots, N\}$, such that $g^i_{x,a^i}(v^i_\pi, \pi^{-i}) > 0$ and $\pi^i(x, a^i) = 0$, because $K_2$ is not in the feasible set $G$. Let $B_\delta(\pi^*) = \{\pi \in L \mid \|\pi - \pi^*\| < \delta\}$. Choose $\delta > 0$ such that $g^i_{x,a^i}(v^i_\pi, \pi^{-i}) > 0$ for all $\pi \in B_\delta(\pi^*) \setminus K$ and consequently $\frac{\partial f(v_\pi, \pi)}{\partial \pi^i} < 0$.
So, $\bar{\Gamma}\left( -\sqrt{\pi^i(x, a^i)}\, g^i_{x,a^i}(v^i_\pi, \pi^{-i})\, \mathrm{sgn}\left( \frac{\partial f(v_\pi, \pi)}{\partial \pi^i} \right) \right) > 0$ for any $\pi \in B_\delta(\pi^*) \setminus K$, which suggests that $\pi^i(x, a^i)$ will increase when moving away from $\pi^*$. Thus, $\pi^*$ is an unstable equilibrium point of the system of ODEs (19).

Remark 10. (On the sign function) Recall that the continuous $\mathrm{sgn}(\cdot)$ was employed because the ordinary sign function is discontinuous. Since $\mathrm{sgn}(\cdot)$ can result in the value 0, one can no longer conclude that $\sqrt{\pi^*}\, g = 0$ for the points in the equilibrium set $K$. Note that the former condition (coupled with feasibility) implies that a point is an SG-SP point. A naive fix would be to change OFF-SGSP/ON-SGSP to repeat an action if $\mathrm{sgn}(\cdot)$ returned 0. This would ensure that there are no spurious points in the set $K$ due to $\mathrm{sgn}(\cdot)$ being 0. Henceforth, we shall assume that there are no such $\mathrm{sgn}$-induced spurious limit points in the set $K$.

Lemma 6. For all $a^i \in \mathcal{A}^i(x)$, $x \in \mathcal{X}$ and $i = 1, 2, \ldots, N$,
\[ \pi \in K \;\Rightarrow\; \pi \in L \;\text{and}\; \sqrt{\pi^i(x, a^i)}\, g^i_{x,a^i}(v^i_\pi, \pi^{-i}) = 0, \quad (22) \]
where $L = \{\pi \mid \pi(x) \text{ is a probability vector over } \mathcal{A}^i(x), \forall x \in \mathcal{X}\}$.

Proof. The operator $\bar{\Gamma}$, by definition, ensures that $\pi \in L$. Suppose for some $a^i \in \mathcal{A}^i(x)$, $x \in \mathcal{X}$ and $i \in \{1, 2, \ldots, N\}$, we have $\bar{\Gamma}\left( -\sqrt{\pi^i(x, a^i)}\, g^i_{x,a^i}(v^i_\pi, \pi^{-i})\, \mathrm{sgn}\left( \frac{\partial f(v_\pi, \pi)}{\partial \pi^i} \right) \right) = 0$, but $\sqrt{\pi^i(x, a^i)}\, g^i_{x,a^i}(v^i_\pi, \pi^{-i}) \neq 0$. Then $g^i_{x,a^i}(v^i_\pi, \pi^{-i}) \neq 0$ and, since $\pi \in L$, $1 \ge \pi^i(x, a^i) > 0$. We analyze this condition by considering the following two cases.

Case $1 > \pi^i(x, a^i) > 0$ and $g^i_{x,a^i}(v^i_\pi, \pi^{-i}) \neq 0$. In this case, it is possible to find a $\Delta > 0$ such that for all $\delta \le \Delta$,
\[ 1 > \pi^i(x, a^i) - \delta \sqrt{\pi^i(x, a^i)}\, g^i_{x,a^i}(v^i_\pi, \pi^{-i})\, \mathrm{sgn}\!\left( \frac{\partial f(v_\pi, \pi)}{\partial \pi^i} \right) > 0. \]
This implies that
\[ \bar{\Gamma}\left( -\sqrt{\pi^i(x, a^i)}\, g^i_{x,a^i}(v^i_\pi, \pi^{-i})\, \mathrm{sgn}\!\left( \frac{\partial f(v_\pi, \pi)}{\partial \pi^i} \right) \right) = -\sqrt{\pi^i(x, a^i)}\, g^i_{x,a^i}(v^i_\pi, \pi^{-i})\, \mathrm{sgn}\!\left( \frac{\partial f(v_\pi, \pi)}{\partial \pi^i} \right) \neq 0, \]
\[ \Rightarrow \sqrt{\pi^i(x, a^i)}\, g^i_{x,a^i}(v^i_\pi, \pi^{-i}) \neq 0, \]
which contradicts the initial supposition.

Case $\pi^i(x, a^i) = 1$ and $g^i_{x,a^i}(v^i_\pi, \pi^{-i}) \neq 0$. Since $v^i_\pi$ is a solution of the system of ODEs (16), the following should hold:
\[ \sum_{\hat{a}^i \in \mathcal{A}^i(x)} \pi^i(x, \hat{a}^i)\, g^i_{x,\hat{a}^i}(v^i_\pi, \pi^{-i}) = \pi^i(x, a^i)\, g^i_{x,a^i}(v^i_\pi, \pi^{-i}) = 0. \]
This again leads to a contradiction. The result follows.

In order to prove Theorem 4, we require the well-known Kushner-Clark lemma (see [Kushner and Clark, 1978, pp. 191-196]). For the sake of completeness, we recall this result below.

Theorem 5. (Kushner-Clark lemma) Consider the following recursion in $d$ dimensions:
\[ x_{n+1} = \Gamma\left( x_n + b(n)\left( h(x_n) + \zeta_n + \beta_n \right) \right), \quad (23) \]
where $\Gamma$ projects the iterate $x_n$ onto a compact and convex set, say $C \subset \mathbb{R}^d$. The ODE associated with (23) is given by
\[ \dot{x}(t) = \bar{\Gamma}(h(x(t))), \quad (24) \]
where $\bar{\Gamma}$ is a projection operator that keeps the ODE evolution within the set $C$ and is defined as in (20). We make the following assumptions:
(B1) $h$ is a continuous $\mathbb{R}^d$-valued function.
(B2) The sequence $\beta_n$, $n \ge 0$, is a bounded random sequence with $\beta_n \to 0$ almost surely as $n \to \infty$.
(B3) The step-sizes $b(n)$, $n \ge 0$, satisfy $b(n) \to 0$ as $n \to \infty$ and $\sum_n b(n) = \infty$.
(B4) $\{\zeta_n, n \ge 0\}$ is a sequence such that for any $\epsilon > 0$,
\[ \lim_{n \to \infty} P\left( \sup_{m \ge n} \left\| \sum_{i=n}^{m} b(i) \zeta_i \right\| \ge \epsilon \right) = 0. \]
Suppose that the ODE (24) has a compact set $K^*$ as its set of asymptotically stable equilibrium points. Then $x_n$ converges almost surely to $K^*$ as $n \to \infty$.

Proof of Theorem 4

Proof. The updates of $\pi$ given by (9) on the slower timescale $\{b(n)\}$ can be rewritten as: For all $a^i \in \mathcal{A}^i(x)$, $x \in \mathcal{X}$ and $i = 1, 2, \ldots, N$,
\[ \pi^i_{n+1}(x, a^i) = \Gamma\left( \pi^i_n(x, a^i) - b(n)\left( H(\pi^i_n) + \beta_n \right) \right), \quad (25) \]
where
\[ H(\pi^i_n) = \sqrt{\pi^i_n(x, a^i)}\, g^i_{x,a^i}(v^i_{\pi_n}, \pi^{-i}_n)\, \mathrm{sgn}\!\left( \frac{\partial f(v_{\pi_n}, \pi_n)}{\partial \pi^i} \right), \]
\[ \beta_n = \sqrt{\pi^i_n(x, a^i)}\, g^i_{x,a^i}(v^i_n, \pi^{-i}_n)\, \mathrm{sgn}\!\left( \frac{\partial f(v_n, \pi_n)}{\partial \pi^i} \right) - H(\pi^i_n). \]
We now verify the assumptions (B1)-(B4) for the recursion above:
• $H(\pi^i_n)$ is continuous, since each of its components $\sqrt{\pi^i}$, $g^i_{x,a^i}(\cdot, \cdot)$ and $\mathrm{sgn}(\cdot)$ is continuous. In particular, the continuity of $g^i_{x,a^i}$ follows from the fact that both the value function $v^i(\cdot)$ and the Q-value function $Q^i_{\pi^{-i}}(x, a^i)$ are continuous in $\pi^i$. This verifies assumption (B1).
• $\beta_n \to 0$ almost surely, since $|v_n - v_{\pi_n}| \to 0$ as $n \to \infty$ from Proposition 2. Further, $\beta_n$ is bounded, as each of its components is bounded. In particular, $g^i_{x,a^i}$ is bounded as we consider finite state-action spaced stochastic games, while $\pi^i$ and $\mathrm{sgn}$ are trivially upper-bounded. Thus (B2) is satisfied.
• Assumption 1 implies that (B3) is satisfied.
• $\zeta_n$ is absent, obviating assumption (B4).
The claim now follows from the Kushner-Clark lemma.

Remark 11. Note that, from the foregoing, the set $K$ comprises both stable and unstable attractors, and in principle, from Lemma 5, the iterates $\pi^i_n$ governed by (19) can converge to an unstable equilibrium. In most practical scenarios, however, a gradient descent scheme is observed to converge to a stable equilibrium. In fact, the $\delta$-offset policy computed using $\mathrm{perturb}(\cdot, \delta)$ every $Q > 0$ iterations (see Section 9 below) for both of our algorithms ensures numerically that, as $n \to \infty$, $\pi_n \not\to \pi^* \in K_2$. In other words, convergence of the strategy sequence $\pi_n$ governed by (9) is to the stable set $K_1$.

8.2 Proof of Theorem 4 for ON-SGSP

As mentioned earlier, the analysis for ON-SGSP changes for both timescales, and we outline the crucial differences below, before presenting the detailed analysis.
Step 1: This step establishes that the TD updates along the faster timescale converge to the true value functions, using standard techniques from stochastic approximation. Unlike OFF-SGSP, this step also involves the analysis of the $\xi$-recursion. The latter is a consequence of the fact that we work in a model-free setting and hence do not have access to $f(v_n, \pi_n)$ (and hence to $\frac{\partial f(v_n,\pi_n)}{\partial \pi^i(x,a^i)}$, which is required for the policy update).

Step 2: This step establishes that the policy updates track the same ODE (i.e., (19)) as that of OFF-SGSP; the analysis involves an additional martingale sequence that needs to be bounded.

Step 1: Analysis of the v- and ξ-recursions

Proposition 3. For a given $\pi$, i.e., with $\pi^i_n \equiv \pi^i$, the updates of $v$ governed by (13) (ON-SGSP) satisfy $v_n \to v_\pi$ almost surely as $n \to \infty$, where $v_\pi$ is the globally asymptotically stable equilibrium point of the system of ODEs (16).

Proof. Fix a state $x \in X$. Let $\{\bar n\}$ represent a sub-sequence of iterations in ON-SGSP at which the state is $x \in X$. Also, let $Q_n = \{\bar n : \bar n < n\}$. For a given $\pi$, the updates of $v$ on the faster time-scale $\{c(n)\}$ given in equation (13) can be re-written as
$$v^i_{\bar n+1}(x) = v^i_{\bar n}(x) + c(\bar n)\big(J(v^i_{\bar n}) + \tilde\chi_{\bar n}\big),$$
where
$$J(v^i_{\bar n}) = \sum_{a^i \in A^i(x)} \pi^i(x,a^i)\, g^i_{x,a^i}(v^i_{\bar n}, \pi^{-i}), \qquad \tilde\chi_{\bar n} = r^i_n + \beta v^i_n(y_n) - v^i_n(x_n) - \sum_{a^i \in A^i(x)} \pi^i(x,a^i)\, g^i_{x,a^i}(v^i_{\bar n}, \pi^{-i}).$$
Using arguments as before, it is easy to see that $J(v^i_{\bar n})$ is continuous and that $\tilde\chi_{\bar n}$ satisfies $E[\tilde\chi^2_{\bar n}] < \hat C < \infty$. Thus,
$$\lim_{\bar n \to \infty} P\Big(\sup_{m \ge \bar n} \Big\| \sum_{l=\bar n}^m c(l)\tilde\chi_l \Big\| \ge \epsilon\Big) \le \frac{\hat C}{\epsilon^2} \lim_{\bar n \to \infty} \sum_{l=\bar n}^\infty c(l)^2 = 0.$$
For the last equality, we have used the fact that the step-sizes are square-summable (see Assumption 1). Thus, all the assumptions of the Kushner-Clark lemma (see Theorem 5 above) are satisfied, and we can conclude that $v_n$ governed by (13) converges to the globally asymptotically stable limit point $v_\pi$ (see equation (17)) of the system of ODEs (16).

Before establishing the convergence of the gradient estimation recursion (14), we require the following technical result.

Lemma 7.
$$\frac{\partial f(v,\pi)}{\partial \pi^i(x,a^i)} = -\sum_{j=1}^N g^j_{x,a^i}(v^j, \pi^{-i}), \quad \text{where} \qquad (26)$$
$$g^j_{x,a^i}(v^j, \pi^{-i}) = \bar r^j(x, \pi^{-i}, a^i) + \beta \sum_{y \in U(x)} \bar p^j(y \mid x, \pi^{-i}, a^i)\, v^j(y) - v^j(x). \qquad (27)$$

Proof. Let $a \in A(x)$ denote the aggregate action vector. Then we have
$$\frac{\partial f(v,\pi)}{\partial \pi^i(x,a^i)} = -g^i_{x,a^i}(v^i, \pi^{-i}) - \sum_{j \ne i}\; \sum_{a^k \in A^k(x),\, k \ne i}\; \prod_{k \ne i} \pi^k(x,a^k) \Big( r^j(x,a) + \beta \sum_{y \in U(x)} p(y \mid x,a)\, v^j(y) - v^j(x) \Big).$$
Now, from the definition of $g^j_{x,a^i}(v^j, \pi^{-i})$ in (27), it is easy to see that
$$\frac{\partial f(v,\pi)}{\partial \pi^i(x,a^i)} = -\sum_{j=1}^N g^j_{x,a^i}(v^j, \pi^{-i}).$$

Recall that the gradient estimation recursion is as follows:
$$\xi^i_{n+1}(x_n, a^i_n) = \xi^i_n(x_n, a^i_n) + c(n)\Big( \sum_{j=1}^N \big( r^j_n + \beta v^j_n(y_n) - v^j_n(x_n) \big) - \xi^i_n(x_n, a^i_n) \Big).$$
The following result establishes that $\xi^i_n(x,a^i)$ converges to $-\frac{\partial f(v_n,\pi_n)}{\partial \pi^i(x,a^i)}$ in the long run.

Proposition 4. $\Big| \xi^i_n(x,a^i) - \Big( -\frac{\partial f(v_n,\pi_n)}{\partial \pi^i(x,a^i)} \Big) \Big| \to 0$ as $n \to \infty$ almost surely.

Proof. As in the proof of Proposition 3, let $\{\bar n\}$ represent a sub-sequence of iterations in ON-SGSP at which the state is $x \in X$, and let $Q_n := \{\bar n : \bar n < n\}$.
For a given $\pi$, the updates of $\xi$ on the faster time-scale $\{c(n)\}$ given in equation (14) can be re-written as
$$\xi^i_{\bar n+1}(x,a^i) = (1 - c(\bar n))\, \xi^i_{\bar n}(x,a^i) + c(\bar n)\Big( \sum_{j=1}^N g^j_{x,a^i}(v^j_{\bar n}, \pi^{-i}) + \hat\xi_{\bar n} \Big),$$
where $\hat\xi_{\bar n} := \sum_{j=1}^N \big( r^j_{\bar n} + \beta v^j_{\bar n}(y) - v^j_{\bar n}(x) \big) - \sum_{j=1}^N g^j_{x,a^i}(v^j_{\bar n}, \pi^{-i})$. (We write $\hat\xi_{\bar n}$ for the noise term to distinguish it from the iterate $\xi_{\bar n}$.) Let $\mathcal{F}_l := \sigma(v_k, \xi_k,\, k \le l)$, $l \ge 0$, denote an increasing family of $\sigma$-fields. By the definition of $g^j_{x,a^i}$, we have
$$E\Big[ \sum_{j=1}^N \big( r^j_{\bar n} + \beta v^j_{\bar n}(y) - v^j_{\bar n}(x) \big) \,\Big|\, \mathcal{F}_{\bar n},\, x_{\bar n} = x,\, a^i_{\bar n} = a^i \Big] = \sum_{j=1}^N g^j_{x,a^i}(v^j_{\bar n}, \pi^{-i}).$$
Hence, $\{\hat\xi_{\bar n}\}$ is a martingale difference sequence. As in Proposition 3, define $\tilde M_n = \sum_{m \in Q_n} c(m)\hat\xi_m$. It can be easily verified that $(\tilde M_m, \mathcal{F}_m)$, $m \ge 0$, is a square-integrable martingale sequence obtained from the corresponding martingale difference $\{\hat\xi_m\}$. Further, from the square-summability of $c(n)$, $n \ge 0$, and Assumption 2, which ensures that the underlying Markov chain is ergodic for any given $\pi$, it can be verified from the martingale convergence theorem that $\{\tilde M_m, m \ge 0\}$ converges almost surely. Hence, $|\hat\xi_m| \to 0$ almost surely on the 'natural timescale' as $m \to \infty$. The 'natural timescale' is clearly faster than the algorithm's timescale, and hence $\hat\xi_m$ can be ignored in the analysis of the Bellman error recursion ((14) in the main paper); see [Borkar, 2008, Chapter 6.2] for a detailed treatment of natural timescale algorithms. The final claim follows from the Kushner-Clark lemma.

Step 2: Analysis of the π-recursion

Proof. We first re-write the update of $\pi$ as follows: for all $i = 1, 2, \ldots, N$,
$$\pi^i_{\bar n+1}(x,a^i) = \Gamma\Big( \pi^i_{\bar n}(x,a^i) - b(\bar n)\Big( \sqrt{\pi^i_{\bar n}(x,a^i)}\, g^i_{x,a^i}(v^i_{\pi_{\bar n}}, \pi^{-i}_{\bar n})\, \mathrm{sgn}\Big( \tfrac{\partial f(v_{\pi_{\bar n}}, \pi_{\bar n})}{\partial \pi^i} \Big) + \zeta_{\bar n} \Big) \Big),$$
where
$$\zeta_{\bar n} = \sqrt{\pi^i_{\bar n}(x,a^i)} \Big[ \hat g^i_{x,a^i} - g^i_{x,a^i}(v^i_{\pi_{\bar n}}, \pi^{-i}_{\bar n}) \Big] \mathrm{sgn}\Big( \tfrac{\partial f(v_{\pi_{\bar n}}, \pi_{\bar n})}{\partial \pi^i} \Big).$$
In the above, we have used the converged values of the value update $v_n$ and the gradient estimate $\xi_n$; this is allowed owing to the timescale separation and the facts that $v_n \to v_{\pi_n}$ (see Proposition 3) and $\xi_n \to -\frac{\partial f(v_n,\pi_n)}{\partial \pi^i(x,a^i)}$ (see Proposition 4). Now, in order to apply the Kushner-Clark lemma (see Theorem 5 above), it is enough to verify that $E[\zeta^2_{\bar n}] < \infty$, since the remaining terms are as in OFF-SGSP (which implies that assumptions (B1)-(B3) in Theorem 5 are verified). Arguing as before, it is straightforward to infer that $E[\zeta^2_{\bar n}] < \infty$, since we consider finite state-action spaces and the square of each of the quantities in the first term of $\zeta_{\bar n}$ can be upper-bounded. Thus, assumption (B4) in Theorem 5 is verified, and the updates of $\pi$ in ON-SGSP converge to a stable limit point of the system of ODEs (19).

9 Simulation Experiments

We test the ON-SGSP, NashQ [Hu and Wellman, 2003] and FFQ [Littman, 2001] algorithms on two general-sum game setups. We implemented the Friend Q-learning variant of FFQ, as each iteration of its Foe Q-learning variant involves a computationally intensive operation to solve a linear program.

(a) Payoff matrix:

                 Player 2:   a1      a2      a3
    Player 1: a1            1, 0    0, 1    1, 0
              a2            0, 1    1, 0    1, 0
              a3            0, 1    0, 1    1, 1

(b) Results from 100 simulation runs:

                                                  NashQ   FFQ (Friend Q)   ON-SGSP
    Oscillate or converge to non-Nash strategy     95%         40%            0%
    Converge to (0.5, 0.5, 0)                       2%          0%           99%
    Converge to (0, 0, 1)                           3%         60%            1%

Figure 4: Payoff matrix and simulation results for a single-state non-generic two-player game.

9.1 Single State (Non-Generic) Game

This is a simple two-player game adopted from Hart and Mas-Colell [2005], where the payoffs to the individual agents are given in Table 4a. In this game, a strategy that picks $a_3$ (denoted by $(0, 0, 1)$) constitutes a pure-strategy NE, while a strategy that picks either $a_1$ or $a_2$ with equal probability (denoted by $(0.5, 0.5, 0)$) is a mixed-strategy NE. We conduct a stochastic game experiment where, at each stage, the payoffs to the agents are according to Table 4a, and the payoffs accumulate with a discount factor $\beta = 0.8$. We performed 100 experimental runs, with each run corresponding to a length of 10,000 stages. The aggregated results from this experiment are presented in Fig. 4b. It is evident that NashQ oscillates and does not converge to NE in most of the runs, while Friend Q-learning converges to a non-Nash strategy tuple in most of the runs. On the other hand, ON-SGSP converges to NE in all of the runs.

Figure 5: Stick-Together Game for M = 3.

9.2 Stick-Together Game (STG)

We also define a simple general-sum discounted stochastic game, named the "Stick-Together Game" (in short, STG), where two participating agents located on a rectangular terrain would like to come together and stay close to each other (see Fig. 5). A precise description of the various components of STG is provided below:

1. State space $X$: The state specifies the location of both agents on a rectangular grid of size $M \times M$. More precisely, let $O = \{(x, y) \mid x, y \in \mathbb{Z}\}$. Denote the possible positions of an agent by $W := \{s = (x, y) \in O \mid 0 \le x, y < M\}$. Then the state space is given by the Cartesian product $X = W \times W$.

2. Action space $A$: The actions available to each agent are to either move to one of the neighbouring cells or stay in the current location. For $s \in W$, let $\|s\|_1 = |x| + |y|$ denote its $L_1$ norm. Then $A(s) = \{a \in O \mid \|s - a\|_1 \le 1\}$ represents the actions available for an agent to move to one-step neighbouring positions of $s \in W$ (an action is identified with the cell it points to). The action space is then defined by $A = \cup_{s_1, s_2 \in W}\, A(s_1) \times A(s_2)$. Let $U(s) = \{s' \in W \mid \|s' - s\|_1 \le 1\}$ represent the set of all next states for an agent in state $s \in W$.

3. Transition probability $p$: We assume that the state transitions of the individual agents are independent. Let $q(s' \mid s, a^i)$ represent, for agent $i$, the probability of transition from state $s \in W$ to $s' \in W$ upon taking action $a^i \in A^i(s)$. We define
$$q(s' \mid s, a) = \frac{2^{-\|s' - a\|_1}}{\sum_{s'' \in U(s)} 2^{-\|s'' - a\|_1}}.$$
Then the transition probability is given by $p((s'_1, s'_2) \mid (s_1, s_2), (a^1, a^2)) = q(s'_1 \mid s_1, a^1)\, q(s'_2 \mid s_2, a^2)$. This transition probability function assigns the highest value to the next state towards which the action points.

4. Reward $r$: The reward for the two agents is defined as $r^i((s_1, s_2), (a^1, a^2)) = 1 - e^{\|s_1 - s_2\|_1}$, for state $(s_1, s_2) \in X$ and action $(a^1, a^2) \in A(s_1) \times A(s_2)$. Thus, the reward is zero if the distance between the two agents is zero; otherwise, it is negative and monotonically decreasing in the distance between the two agents.

Results. We first show simulation results for a small-sized version of the STG game, with $M = 3$. The number of states with $M = 3$ is $|X| = 81$. We use $\beta = 0.8$ for all our experiments. Also, we use the following step-size sequences:
$$b(n) = \begin{cases} 0.2 & \text{for } n < 1000, \\ n^{-0.75} & \text{otherwise,} \end{cases} \qquad c(n) = \begin{cases} 0.1 & \text{for } n < 1000, \\ 1/n & \text{otherwise.} \end{cases}$$
It is easy to see that $\{c(n)\}$ corresponds to a slower time-scale than $\{b(n)\}$. It was observed in our experiments that using constant step-sizes up to $n = 1000$ leads to better initial exploration and faster convergence.
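To make the STG specification above concrete, here is a minimal sketch of the single-agent transition kernel $q$ and the reward. This is our own illustrative rendering, not the authors' code; in particular, it reads an action as the target cell the agent aims for, consistently with the kernel $q(s' \mid s, a) \propto 2^{-\|s' - a\|_1}$.

```python
import math

M = 3  # grid size, matching the small STG experiment above

def neighbours(s):
    """U(s): the cell s together with its one-step (L1) neighbours inside the M x M grid."""
    x, y = s
    cells = [(x, y), (x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(u, v) for (u, v) in cells if 0 <= u < M and 0 <= v < M]

def q(s_next, s, a):
    """Single-agent kernel q(s'|s,a) = 2^{-||s'-a||_1} / sum over s'' in U(s) of 2^{-||s''-a||_1},
    with the action a read as the target cell."""
    if s_next not in neighbours(s):
        return 0.0
    weight = lambda t: 2.0 ** -(abs(t[0] - a[0]) + abs(t[1] - a[1]))
    return weight(s_next) / sum(weight(t) for t in neighbours(s))

def reward(s1, s2):
    """r = 1 - exp(||s1 - s2||_1): zero when the agents coincide, negative otherwise."""
    return 1.0 - math.exp(abs(s1[0] - s2[0]) + abs(s1[1] - s2[1]))
```

The joint transition probability is then the product of the two single-agent kernels, as in item 3 above; note that $q$ places its largest mass on the cell the action points to.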
To ensure sufficient exploration of the state space (see Assumption 2), and also to push the policy $\pi$ out of the domain of attraction of any local equilibrium, we perturb the policy as follows: every $Q > 0$ iterations, perturb($\cdot, \delta$) is used to derive a $\delta$-offset policy for picking actions, i.e., $\hat\pi^i(x)$ is used instead of $\pi^i(x)$, where
$$\hat\pi^i(x, a^i) = \frac{\pi^i(x, a^i) + \delta}{\sum_{\bar a^i \in A^i(x)} \big( \pi^i(x, \bar a^i) + \delta \big)}, \quad a^i \in A^i(x). \qquad (28)$$

Figure 6: Performance of our algorithms for STG. (a) Objective value $f$ vs. number of iterations for OFF-SGSP on STG with $M = 3$. (b) Average distance $d_n$ vs. number of iterations for ON-SGSP, NashQ and FFQ on STG with $M = 30$.

Fig. 6a shows the evolution of the objective function $f$ as a function of the number of iterations for OFF-SGSP. Note that $f$ should go to zero at a Nash equilibrium point. Fig. 6b shows the evolution of the distance $d_n$ (in $\ell_1$ norm) between the agents for an STG game with $M = 30$, which corresponds to $|X| = 810{,}000$. Notice that the results are shown only for the model-free algorithms ON-SGSP, NashQ and FFQ. This is because OFF-SGSP, and even the homotopy methods [Herings and Peeters, 2004], have a computational complexity that blows up exponentially with $M$ and hence are practically infeasible for STG with $M = 30$. From Fig. 6b, it is evident that, following the ON-SGSP strategy, the agents converge to a $4 \times 4$ grid within the $30 \times 30$ grid. For achieving this result, ON-SGSP takes about $2 \times 10^7$ iterations, implying an average of $2 \times 10^7 / |X| \approx 25$ iterations per state. In contrast, NashQ gets the agents only to an $8 \times 8$ grid after a large number of iterations ($\approx 5 \times 10^7$). Moreover, from Fig. 6b it is clear that NashQ has not stabilized its strategy in the end. Friend Q-learning also gets the agents to an $8 \times 8$ grid, by driving them to one of the corners of the $30 \times 30$ grid.
While Friend Q-learning takes a small number of iterations ($\approx 30{,}000$) to achieve this, it does not explore the state space well; hence, FFQ's strategy on the rest of the grid (excluding the corner to which it takes the agents) is not Nash.

Remark 12 (Runtime performance). We observed that, to complete $5 \times 10^7$ iterations, ON-SGSP took $\approx 42$ minutes, while NashQ [Hu and Wellman, 2003] took nearly 50 hours, as it involves solving for the Nash equilibria of a bimatrix game in each iteration. The Friend Q-learning variant of FFQ [Littman, 2001] took $\approx 33$ minutes. The Foe Q-learning variant of FFQ was not implemented owing to its high per-iteration complexity.

Simple Function Approximation for STG

While OFF-SGSP assumes full information of the game, ON-SGSP assumes that neither rewards nor state transition probabilities are known. Here we explore an intermediate-information case, albeit restricted to STG, where a partial structure of the rewards is made known. In particular, we assume that the reward depends on the difference $\Delta = (|x^1_1 - x^2_1|, |x^1_2 - x^2_2|) \in W$ between the positions $x^1 = (x^1_1, x^1_2)$ and $x^2 = (x^2_1, x^2_2) \in W$ of the two agents. We approximate the value function $v$ and the strategy $\pi$ as follows: $v^i(x) \approx \hat v^i(\Delta)$ and $\pi^i(x) \approx \hat\pi^i(\Delta)$, for all $x \in X$. Thus, the algorithms need to compute $\hat v$ and $\hat\pi$ only on the low-dimensional subspace $W$ of $X$.

Figure 7: ON-SGSP with partial information for STG with $M = 30$ (average distance $d_n$ vs. number of iterations).

Fig. 7 presents the results of ON-SGSP in this setting for $M = 30$, which corresponds to $|W| = 900$. The solution is seen to have converged by $200{,}000$ iterations ($\approx 5$ seconds of runtime), which suggests that it took on average $200{,}000 / |W| \approx 222$ iterations per $\Delta \in W$ to converge.
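The state aggregation just described can be sketched as follows. This is a minimal illustration under our own naming (`delta`, `v_hat` and `value` are hypothetical helpers, not from the paper): tabular quantities are indexed by the coordinate-wise distance $\Delta$ rather than by the joint state.

```python
M = 30  # grid size; the aggregated space has |W| = M*M = 900 elements

def delta(x1, x2):
    """Feature Delta = (|x1_1 - x2_1|, |x1_2 - x2_2|): coordinate-wise
    distance between the two agents' positions."""
    return (abs(x1[0] - x2[0]), abs(x1[1] - x2[1]))

# One table entry per Delta instead of per joint state: 900 entries
# instead of |X| = (M*M)^2 = 810,000.
v_hat = {(dx, dy): 0.0 for dx in range(M) for dy in range(M)}

def value(x1, x2):
    """Approximate value v(x) ~ v_hat(Delta), shared by all joint states
    with the same coordinate-wise distance."""
    return v_hat[delta(x1, x2)]
```

Since $\Delta$ is symmetric in the two agents, all joint states with the same coordinate-wise distance share one table entry, which is what shrinks the problem from $|X|$ to $|W|$ parameters.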
10 Conclusions

In this paper, we derived necessary and sufficient SG-SP conditions to solve a generalized optimization problem and established their equivalence with Nash strategies. We derived a descent (not necessarily steepest) direction that avoids local minima. Incorporating this direction, we proposed two algorithms: the offline, model-based algorithm OFF-SGSP and the online, model-free algorithm ON-SGSP. Both algorithms were shown to converge, in self-play, to the equilibria of a certain ordinary differential equation (ODE), whose stable limit points coincide with stationary Nash equilibria of the underlying general-sum stochastic game. Synthetic experiments on two general-sum game setups show that ON-SGSP outperforms two well-known multi-agent RL algorithms. The experimental evaluation also suggests that convergence is relatively quick.

There are several future directions to be explored, and we outline a few of them below:

1. In simulations, we observed that ON-SGSP can successfully run for large state spaces ($|X| \approx 810{,}000$). However, in many cases the state spaces can be huge, and it would be necessary to look for function approximation techniques for both the value function $v$ and the strategy-tuple $\pi$. Function approximation techniques are popular in reinforcement learning approaches for high-dimensional MDPs, and they bring in the following advantages: (a) they can cater to huge state and action spaces; and (b) they also aid a designer in bringing in an understanding of the underlying system in terms of the features used for function approximation.

2. Extensions to the case of constrained games: by constrained stochastic games, we mean those stochastic games that have additional constraints on the value functions or strategy-tuples, which might arise from the modelling of a practical scenario. There are some results and applications in this direction by Altman et al. [2005] and Altman et al. [2007], which provide the necessary motivation for extending our results to general-sum constrained stochastic games.

3. Detailed experimental evaluation on a sophisticated benchmark for $N$-player general-sum stochastic games.

References

N. Akchurina. Multi-agent reinforcement learning: Algorithm converging to Nash equilibrium in general-sum stochastic games. In 8th International Conference on Autonomous Agents and Multi-agent Systems, 2009.

E. Altman, K. Avrachenkov, R. Marquez, and G. Miller. Zero-sum constrained stochastic games with independent state processes. Mathematical Methods of Operations Research, 62(3):375-386, 2005.

E. Altman, K. Avratchenkov, N. Bonneau, M. Debbah, R. El-Azouzi, and D. S. Menasche. Constrained stochastic games in wireless networks. In IEEE Global Telecommunications Conference, pages 315-320. IEEE, 2007.

V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.

V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447-469, 2000.

R. N. Borkovsky, U. Doraszelski, and Y. Kryukov. A user's guide to solving dynamic stochastic games using the homotopy method. Operations Research, 58(4-Part-2):1116-1132, 2010.

M. Bowling. Convergence and no-regret in multiagent learning. Advances in Neural Information Processing Systems, 17:209-216, 2005.

M. Bowling and M. Veloso. Rational and convergent learning in stochastic games. In International Joint Conference on Artificial Intelligence, volume 17, pages 1021-1026, 2001.

M. Breton. Algorithms for stochastic games. Springer, 1991.

V. Conitzer and T. Sandholm. AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 67(1-2):23-43, 2007.

J. Filar and K. Vrieze. Competitive Markov Decision Processes. Springer-Verlag, New York, 1st edition, November 2004. ISBN 0-387-94805-8.

A. M. Fink. Equilibrium in a stochastic n-person game. Journal of Science of the Hiroshima University, Series A-I (Mathematics), 28(1):89-93, 1964.

S. Hart and A. Mas-Colell. Uncoupled dynamics do not lead to Nash equilibrium. American Economic Review, pages 1830-1836, 2003.

S. Hart and A. Mas-Colell. Stochastic uncoupled dynamics and Nash equilibrium. In Proceedings of the 10th Conference on Theoretical Aspects of Rationality and Knowledge, pages 52-61. National University of Singapore, 2005.

P. J. Herings and R. Peeters. Homotopy methods to compute equilibria in game theory. Research Memoranda 046, Maastricht: METEOR, Maastricht Research School of Economics of Technology and Organization, 2006. URL http://ideas.repec.org/p/dgr/umamet/2006046.html.

P. J. Herings and R. J. A. P. Peeters. Stationary equilibria in stochastic games: Structure, selection, and computation. Journal of Economic Theory, 118(1):32-60, 2004.

J. Hu and M. P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the 15th International Conference on Machine Learning, pages 242-250, 1999.

J. Hu and M. P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4:1039-1069, 2003.

E. Kalai and E. Lehrer. Rational learning leads to Nash equilibrium. Econometrica: Journal of the Econometric Society, pages 1019-1045, 1993.

V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. Annals of Applied Probability, pages 796-819, 2004.

H. J. Kushner and D. S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, 1978. ISBN 0-387-90341-0.

M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. ICML, 94:157-163, 1994.

M. L. Littman. Friend-or-Foe Q-learning in general-sum games. In Proceedings of the 18th International Conference on Machine Learning, pages 322-328. Morgan Kaufmann, 2001.

L. Mac Dermed and C. L. Isbell. Solving stochastic games. Advances in Neural Information Processing Systems, 22:1186-1194, 2009.

C. H. Papadimitriou. On the complexity of the parity argument and other inefficient proofs of existence. Journal of Computer and System Sciences, 48(3):498-532, 1994.

H. L. Prasad and S. Bhatnagar. General-sum stochastic games: Verifiability conditions for Nash equilibria. Automatica, 48(11):2923-2930, November 2012.

H. L. Prasad and S. Bhatnagar. A study of gradient descent schemes for general-sum stochastic games. arXiv preprint arXiv:1507.00093, 2015.

H. L. Prasad, L. A. Prashanth, and S. Bhatnagar. Two-timescale algorithms for learning Nash equilibria in general-sum stochastic games. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1371-1379, 2015.

L. S. Shapley. Stochastic games. In Proceedings of the National Academy of Sciences, volume 39, pages 1095-1100, 1953.

Y. Shoham, R. Powers, and T. Grenager. Multi-agent reinforcement learning: A critical survey. Web manuscript, 2003.

Y. Shoham, R. Powers, and T. Grenager. If multi-agent learning is the answer, what is the question? Artificial Intelligence, 171(7):365-377, 2007.

M. J. Sobel. Noncooperative stochastic games. The Annals of Mathematical Statistics, 42(6):1930-1935, 1971.

R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. ISBN 0-262-19398-1.

M. Takahashi. Equilibrium points of stochastic non-cooperative n-person games. Journal of Science of the Hiroshima University, Series A-I (Mathematics), 28(1):95-99, 1964.

J. W. Weibull. Evolutionary Game Theory. MIT Press, 1996.

C. Zhang and V. R. Lesser. Multi-agent learning with policy prediction. In AAAI, 2010.

M. Zinkevich, A. Greenwald, and M. Littman. Cyclic equilibria in Markov games. Advances in Neural Information Processing Systems, 18:1641-1648, 2006.