Probabilistic Systems with LimSup and LimInf Objectives

Krishnendu Chatterjee¹ and Thomas A. Henzinger¹,²
¹ EECS, UC Berkeley, USA
² EPFL, Switzerland
{c_krish,tah}@eecs.berkeley.edu

Abstract. We give polynomial-time algorithms for computing the values of Markov decision processes (MDPs) with limsup and liminf objectives. A real-valued reward is assigned to each state, and the value of an infinite path in the MDP is the limsup (resp. liminf) of all rewards along the path. The value of an MDP is the maximal expected value of an infinite path that can be achieved by resolving the decisions of the MDP. Using our result on MDPs, we show that turn-based stochastic games with limsup and liminf objectives can be solved in NP ∩ coNP.

1 Introduction

A turn-based stochastic game is played on a finite graph with three types of states: in player-1 states, the first player chooses a successor state from a given set of outgoing edges; in player-2 states, the second player chooses a successor state from a given set of outgoing edges; and in probabilistic states, the successor state is chosen according to a given probability distribution. The game results in an infinite path through the graph. Every such path is assigned a real value, and the objective of player 1 is to resolve her choices so as to maximize the expected value of the resulting path, while the objective of player 2 is to minimize the expected value. If the function that assigns values to infinite paths is a Borel function (in the Cantor topology on infinite paths), then the game is determined [12]: the maximal expected value achievable by player 1 is equal to the minimal expected value achievable by player 2, and it is called the value of the game.

There are several canonical functions for assigning values to infinite paths. If each state is given a reward, then the max (resp. min) functions choose the maximum (resp. minimum) of the infinitely many rewards along a path; the limsup (resp. liminf) functions choose the limsup (resp. liminf) of the infinitely many rewards; and the limavg function chooses the long-run average of the rewards. For the Borel level-1 functions max and min, as well as for the Borel level-3 function limavg, computing the value of a game is known to be in NP ∩ coNP [10]. However, for the Borel level-2 functions limsup and liminf, only special cases have been considered so far. If there are no probabilistic states (in this case, the game is called deterministic), then the game value can be computed in polynomial time using value-iteration algorithms [1]; likewise, if all states are given reward 0 or 1 (in this case, limsup is a Büchi objective, and liminf is a coBüchi objective), then the game value can be decided in NP ∩ coNP [3]. In this paper, we show that the values of general turn-based stochastic games with limsup and liminf objectives can be computed in NP ∩ coNP.

It is known that pure memoryless strategies suffice for achieving the value of turn-based stochastic games with limsup and liminf objectives [9]. A strategy is pure if the player always chooses a unique successor state (rather than a probability distribution over successor states); a pure strategy is memoryless if at every state, the player always chooses the same successor state. Hence a pure memoryless strategy for player 1 is a function from player-1 states to outgoing edges (and similarly for player 2). Since pure memoryless strategies offer polynomial witnesses, our result will follow from polynomial-time algorithms for computing the values of Markov decision processes (MDPs) with limsup and liminf objectives. We provide such algorithms.
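In a deterministic game where both players fix pure memoryless strategies, the resulting play is eventually periodic: a finite prefix followed by a cycle repeated forever. For such a "lasso" play the five canonical path values above are easy to evaluate, since the rewards occurring infinitely often are exactly those on the cycle. A minimal illustrative sketch (the function name and the prefix/cycle representation are ours, not from the paper):

```python
def lasso_values(prefix, cycle):
    """Rewards of the play are prefix followed by cycle repeated forever.
    Returns the five canonical path values of the introduction."""
    all_rewards = prefix + cycle
    return {
        "max": max(all_rewards),            # largest reward seen at all
        "min": min(all_rewards),            # smallest reward seen at all
        "limsup": max(cycle),               # largest reward seen infinitely often
        "liminf": min(cycle),               # smallest reward seen infinitely often
        "limavg": sum(cycle) / len(cycle),  # long-run average; the prefix washes out
    }
```

For example, with prefix rewards 5, 1 and cycle rewards 2, 4 the max is 5 and the min is 1, while the limsup and liminf are 4 and 2, the extremes over the cycle only.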
An MDP is the special case of a turn-based stochastic game which contains no player-1 (or no player-2) states. Using algorithms for solving MDPs with Büchi and coBüchi objectives, we give polynomial-time reductions from MDPs with limsup and liminf objectives to MDPs with max objectives. The solution of MDPs with max objectives is computable by linear programming, and the linear program for MDPs with max objectives is obtained by generalizing the linear program for MDPs with reachability objectives. This will conclude our argument.

Related work. Games with limsup and liminf objectives have been widely studied in game theory; for example, Maitra and Sudderth [11] present several results about games with limsup and liminf objectives. In particular, they show the existence of values in limsup and liminf games that are more general than turn-based stochastic games, such as concurrent games, where the two players repeatedly choose their moves simultaneously and independently, and games with infinite state spaces. Gimbert and Zielonka have studied the strategy complexity of games with limsup and liminf objectives: the sufficiency of pure memoryless strategies for deterministic games was shown in [8], and for turn-based stochastic games, in [9]. Polynomial-time algorithms for MDPs with Büchi and coBüchi objectives were presented in [5], and the solution of turn-based stochastic games with Büchi and coBüchi objectives was shown to be in NP ∩ coNP in [3]. For deterministic games with limsup and liminf objectives, polynomial-time algorithms have been known; for example, the value-iteration algorithm terminates in polynomial time [1].

2 Definitions

We consider the class of turn-based probabilistic games and some of its subclasses.

Game graphs.
A turn-based probabilistic game graph (2½-player game graph) G = ((S, E), (S_1, S_2, S_P), δ) consists of a directed graph (S, E), a partition (S_1, S_2, S_P) of the finite set S of states, and a probabilistic transition function δ: S_P → D(S), where D(S) denotes the set of probability distributions over the state space S. The states in S_1 are the player-1 states, where player 1 decides the successor state; the states in S_2 are the player-2 states, where player 2 decides the successor state; and the states in S_P are the probabilistic states, where the successor state is chosen according to the probabilistic transition function δ. We assume that for s ∈ S_P and t ∈ S, we have (s, t) ∈ E iff δ(s)(t) > 0, and we often write δ(s, t) for δ(s)(t). For technical convenience we assume that every state in the graph (S, E) has at least one outgoing edge. For a state s ∈ S, we write E(s) to denote the set {t ∈ S | (s, t) ∈ E} of possible successors.

The turn-based deterministic game graphs (2-player game graphs) are the special case of the 2½-player game graphs with S_P = ∅. The Markov decision processes (1½-player game graphs) are the special case of the 2½-player game graphs with S_1 = ∅ or S_2 = ∅. We refer to the MDPs with S_2 = ∅ as player-1 MDPs, and to the MDPs with S_1 = ∅ as player-2 MDPs.

Plays and strategies. An infinite path, or a play, of the game graph G is an infinite sequence ω = ⟨s_0, s_1, s_2, ...⟩ of states such that (s_k, s_{k+1}) ∈ E for all k ∈ ℕ. We write Ω for the set of all plays, and for a state s ∈ S, we write Ω_s ⊆ Ω for the set of plays that start from the state s.
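The definition above translates directly into a data-structure sanity check: the partition must be disjoint, every state needs at least one successor, and for probabilistic states the support of δ(s) must coincide with E(s). A minimal sketch with illustrative names (not from the paper):

```python
def check_game_graph(edges, s1, s2, sp, delta):
    """edges: dict state -> set of successors E(s);
    s1, s2, sp: the partition (S_1, S_2, S_P) of the states;
    delta: dict mapping each probabilistic state to {successor: probability}."""
    states = s1 | s2 | sp
    assert not (s1 & s2 or s1 & sp or s2 & sp), "partition must be disjoint"
    for s in states:
        # every state has at least one outgoing edge, and edges stay inside S
        assert edges[s] and edges[s] <= states
    for s in sp:
        assert abs(sum(delta[s].values()) - 1.0) < 1e-9, "delta(s) must be a distribution"
        # (s, t) in E iff delta(s)(t) > 0
        assert edges[s] == {t for t, p in delta[s].items() if p > 0}
    return True
```

With s2 empty the same structure describes a player-1 MDP, the subclass used in Section 3.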
A strategy for player 1 is a function σ: S* · S_1 → D(S) that assigns a probability distribution to every finite sequence w ∈ S* · S_1 of states ending in a player-1 state (the sequence represents a prefix of a play). Player 1 follows the strategy σ if in each player-1 move, given that the current history of the game is w ∈ S* · S_1, she chooses the next state according to the probability distribution σ(w). A strategy must prescribe only available moves, i.e., for all w ∈ S*, s ∈ S_1, and t ∈ S, if σ(w · s)(t) > 0, then (s, t) ∈ E. The strategies for player 2 are defined analogously. We denote by Σ and Π the sets of all strategies for player 1 and player 2, respectively.

Once a starting state s ∈ S and strategies σ ∈ Σ and π ∈ Π for the two players are fixed, the outcome of the game is a random walk ω_s^{σ,π} for which the probabilities of events are uniquely defined, where an event A ⊆ Ω is a measurable set of plays. For a state s ∈ S and an event A ⊆ Ω, we write Pr_s^{σ,π}(A) for the probability that a play belongs to A if the game starts from the state s and the players follow the strategies σ and π, respectively. For a measurable function f: Ω → ℝ we denote by E_s^{σ,π}[f] the expectation of the function f under the probability measure Pr_s^{σ,π}(·).

Strategies that do not use randomization are called pure. A player-1 strategy σ is pure if for all w ∈ S* and s ∈ S_1, there is a state t ∈ S such that σ(w · s)(t) = 1. A memoryless player-1 strategy does not depend on the history of the play but only on the current state; i.e., for all w, w′ ∈ S* and all s ∈ S_1 we have σ(w · s) = σ(w′ · s). A memoryless strategy can be represented as a function σ: S_1 → D(S). A pure memoryless strategy is a strategy that is both pure and memoryless. A pure memoryless strategy for player 1 can be represented as a function σ: S_1 → S.
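Fixing a pure memoryless strategy in a player-1 MDP replaces every player-1 choice by a single edge taken with probability 1, which leaves a Markov chain. A minimal sketch of this collapse (the representation and names are ours, not from the paper):

```python
def fix_strategy(s1, sp, delta, sigma):
    """s1: player-1 states; sp: probabilistic states;
    delta: dict probabilistic state -> {successor: probability};
    sigma: pure memoryless strategy as a dict S_1 -> chosen successor.
    Returns the transition function of the resulting Markov chain G_sigma."""
    chain = {s: {sigma[s]: 1.0} for s in s1}       # pure choice: one successor w.p. 1
    chain.update({s: dict(delta[s]) for s in sp})  # probabilistic states keep delta
    return chain
```

Since a pure memoryless strategy is just a finite table from S_1 to S, it is a polynomial-size witness, which is what the NP ∩ coNP argument of Section 3.4 relies on.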
We denote by Σ^PM the set of pure memoryless strategies for player 1. The pure memoryless player-2 strategies Π^PM are defined analogously. Given a pure memoryless strategy σ ∈ Σ^PM, let G_σ be the game graph obtained from G under the constraint that player 1 follows the strategy σ. The corresponding definition G_π for a player-2 strategy π ∈ Π^PM is analogous, and we write G_{σ,π} for the game graph obtained from G if both players follow the pure memoryless strategies σ and π, respectively. Observe that given a 2½-player game graph G and a pure memoryless player-1 strategy σ, the result G_σ is a player-2 MDP. Similarly, for a player-1 MDP G and a pure memoryless player-1 strategy σ, the result G_σ is a Markov chain. Hence, if G is a 2½-player game graph and the two players follow pure memoryless strategies σ and π, the result G_{σ,π} is a Markov chain.

Quantitative objectives. A quantitative objective is specified as a measurable function f: Ω → ℝ. We consider zero-sum games, i.e., games that are strictly competitive. In zero-sum games the objectives of the players are functions f and −f, respectively. We consider quantitative objectives specified as limsup and liminf objectives. These objectives are complete for the second level of the Borel hierarchy: limsup objectives are Π_2-complete, and liminf objectives are Σ_2-complete. The definitions of limsup and liminf objectives are as follows.

– Limsup objectives. Let r: S → ℝ be a real-valued reward function that assigns to every state s the reward r(s). The limsup objective lim sup assigns to every play the maximum reward that appears infinitely often in the play. Formally, for a play ω = ⟨s_0, s_1, s_2, ...⟩ we have lim sup(r)(ω) = lim sup ⟨r(s_i)⟩_{i≥0}.

– Liminf objectives.
Let r: S → ℝ be a real-valued reward function that assigns to every state s the reward r(s). The liminf objective lim inf assigns to every play the largest reward v such that the rewards appearing in the play are eventually always at least v. Formally, for a play ω = ⟨s_0, s_1, s_2, ...⟩ we have lim inf(r)(ω) = lim inf ⟨r(s_i)⟩_{i≥0}.

The objectives lim sup and lim inf are complementary in the sense that for all plays ω we have lim sup(r)(ω) = −lim inf(−r)(ω). We also define the max objectives, as they will be useful in the study of MDPs with lim sup and lim inf objectives: later we will reduce MDPs with lim sup and lim inf objectives to MDPs with max objectives. For a reward function r: S → ℝ, the max objective max assigns to every play the maximum reward that appears in the play. Observe that since S is finite, the number of different rewards appearing in a play is finite, and hence the maximum is defined. Formally, for a play ω = ⟨s_0, s_1, s_2, ...⟩ we have max(r)(ω) = max ⟨r(s_i)⟩_{i≥0}.

Büchi and coBüchi objectives. We define the qualitative variants of lim sup and lim inf objectives, namely, Büchi and coBüchi objectives. These qualitative variants will be useful in the algorithmic analysis of 2½-player games with lim sup and lim inf objectives. For a play ω, we define Inf(ω) = {s ∈ S | s_k = s for infinitely many k ≥ 0} to be the set of states that occur infinitely often in ω.

– Büchi objectives. Given a set B ⊆ S of Büchi states, the Büchi objective Büchi(B) requires that some state in B be visited infinitely often. The set of winning plays is Büchi(B) = {ω ∈ Ω | Inf(ω) ∩ B ≠ ∅}.

– co-Büchi objectives. Given a set C ⊆ S of coBüchi states, the co-Büchi objective coBüchi(C) requires that only states in C be visited infinitely often.
Thus, the set of winning plays is coBüchi(C) = {ω ∈ Ω | Inf(ω) ⊆ C}.

The Büchi and coBüchi objectives are dual in the sense that Büchi(B) = Ω \ coBüchi(S \ B). Given a set B ⊆ S, consider a boolean reward function r_B such that for all s ∈ S we have r_B(s) = 1 if s ∈ B, and 0 otherwise. Then for all plays ω we have ω ∈ Büchi(B) iff lim sup(r_B)(ω) = 1. Similarly, given a set C ⊆ S, consider a boolean reward function r_C such that for all s ∈ S we have r_C(s) = 1 if s ∈ C, and 0 otherwise. Then for all plays ω we have ω ∈ coBüchi(C) iff lim inf(r_C)(ω) = 1.

Values and optimal strategies. Given a game graph G, qualitative objectives Φ ⊆ Ω for player 1 and Ω \ Φ for player 2, and measurable functions f and −f for player 1 and player 2, respectively, we define the value functions ⟨⟨1⟩⟩_val and ⟨⟨2⟩⟩_val for players 1 and 2, respectively, as the following functions from the state space S to the set ℝ of reals: for all states s ∈ S, let

⟨⟨1⟩⟩_val^G(Φ)(s) = sup_{σ∈Σ} inf_{π∈Π} Pr_s^{σ,π}(Φ);    ⟨⟨1⟩⟩_val^G(f)(s) = sup_{σ∈Σ} inf_{π∈Π} E_s^{σ,π}[f];
⟨⟨2⟩⟩_val^G(Ω \ Φ)(s) = sup_{π∈Π} inf_{σ∈Σ} Pr_s^{σ,π}(Ω \ Φ);    ⟨⟨2⟩⟩_val^G(−f)(s) = sup_{π∈Π} inf_{σ∈Σ} E_s^{σ,π}[−f].

In other words, the values ⟨⟨1⟩⟩_val^G(Φ)(s) and ⟨⟨1⟩⟩_val^G(f)(s) give the maximal probability and expectation with which player 1 can achieve her objectives Φ and f from state s, and analogously for player 2. The strategies that achieve the values are called optimal: a strategy σ for player 1 is optimal from the state s for the objective Φ if ⟨⟨1⟩⟩_val^G(Φ)(s) = inf_{π∈Π} Pr_s^{σ,π}(Φ); and σ is optimal from the state s for f if ⟨⟨1⟩⟩_val^G(f)(s) = inf_{π∈Π} E_s^{σ,π}[f]. The optimal strategies for player 2 are defined analogously.
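On an eventually periodic play (a finite prefix followed by a repeated cycle), Inf(ω) is exactly the set of cycle states, so the stated correspondence between the qualitative objectives and boolean limsup/liminf rewards can be checked concretely. A small sketch (function names are ours, not from the paper):

```python
def in_buchi(cycle, B):
    """Büchi(B): some state of B is visited infinitely often, i.e. lies on the cycle."""
    return bool(set(cycle) & B)

def in_cobuchi(cycle, C):
    """coBüchi(C): only states of C are visited infinitely often."""
    return set(cycle) <= C

def limsup_boolean(cycle, B):
    """limsup of the 0/1 rewards r_B = largest reward occurring infinitely often."""
    return max(1 if s in B else 0 for s in cycle)

def liminf_boolean(cycle, C):
    """liminf of the 0/1 rewards r_C = smallest reward occurring infinitely often."""
    return min(1 if s in C else 0 for s in cycle)
```

For any cycle, membership in Büchi(B) coincides with limsup_boolean being 1, and membership in coBüchi(C) with liminf_boolean being 1, mirroring the duality above.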
We now state the classical determinacy results for 2½-player games with limsup and liminf objectives.

Theorem 1 (Quantitative determinacy). For all 2½-player game graphs G = ((S, E), (S_1, S_2, S_P), δ), the following assertions hold.

1. For all reward functions r: S → ℝ and all states s ∈ S, we have
   ⟨⟨1⟩⟩_val^G(lim sup(r))(s) + ⟨⟨2⟩⟩_val^G(lim inf(−r))(s) = 0;
   ⟨⟨1⟩⟩_val^G(lim inf(r))(s) + ⟨⟨2⟩⟩_val^G(lim sup(−r))(s) = 0.
2. Pure memoryless optimal strategies exist for both players from all states.

The above results can be derived from the results in [11]; a more direct proof can be obtained as follows. The existence of pure memoryless optimal strategies for MDPs with limsup and liminf objectives can be proved by extending the results known for Büchi and coBüchi objectives. The results (Theorem 3.19) of [7] proved that if for a quantitative objective f and its complement −f pure memoryless optimal strategies exist in MDPs, then pure memoryless optimal strategies also exist in 2½-player games. Hence pure memoryless determinacy follows for 2½-player games with limsup and liminf objectives.

3 The Complexity of 2½-Player Games with Limsup and Liminf Objectives

In this section we study the complexity of MDPs and 2½-player games with limsup and liminf objectives. In the next subsections we present polynomial-time algorithms for MDPs with limsup and liminf objectives by reductions to a simple linear-programming formulation, and then show that 2½-player games can be decided in NP ∩ coNP. We first present a remark and then some basic results on MDPs.

Remark 1.
Given a 2½-player game graph G with a reward function r: S → ℝ and a real constant c, consider the reward function (r + c): S → ℝ defined as follows: for s ∈ S we have (r + c)(s) = r(s) + c. Then the following assertions hold for all s ∈ S:

⟨⟨1⟩⟩_val^G(lim sup(r + c))(s) = ⟨⟨1⟩⟩_val^G(lim sup(r))(s) + c;
⟨⟨1⟩⟩_val^G(lim inf(r + c))(s) = ⟨⟨1⟩⟩_val^G(lim inf(r))(s) + c.

Hence we can shift a reward function r by a real constant c, and from the value function for the reward function (r + c) we can easily compute the value function for r. Therefore, without loss of generality, for computational purposes we assume that the reward function has positive rewards, i.e., r: S → ℝ⁺, where ℝ⁺ is the set of positive reals.

3.1 Basic results on MDPs

In this section we recall several basic properties of MDPs. We start with the definition of end components in MDPs [5,4], which play a role equivalent to closed recurrent sets in Markov chains.

End components. Given an MDP G = ((S, E), (S_1, S_P), δ), a set U ⊆ S of states is an end component if U is δ-closed (i.e., for all s ∈ U ∩ S_P we have E(s) ⊆ U) and the sub-game graph of G restricted to U (denoted G ↾ U) is strongly connected. We denote by E(G) the set of end components of an MDP G. The following lemma states that, given any strategy (memoryless or not), with probability 1 the set of states visited infinitely often along a play is an end component. This lemma allows us to derive conclusions on the (infinite) set of plays in an MDP by analyzing the (finite) set of end components of the MDP.

Lemma 1. [5,4] Given an MDP G, for all states s ∈ S and all strategies σ ∈ Σ, we have Pr_s^σ({ω | Inf(ω) ∈ E(G)}) = 1.

For an end component U ∈ E(G), consider the memoryless strategy σ_U that at a state s in U ∩ S_1 plays all edges in E(s) ∩ U uniformly at random.
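Both defining conditions of an end component are directly checkable: δ-closedness for the probabilistic states, and strong connectivity of the restricted graph. A minimal sketch that tests strong connectivity by reachability from every state (names are ours; a production version would use a linear-time SCC algorithm instead):

```python
def is_end_component(U, edges, sp):
    """U: candidate state set; edges: dict state -> set of successors E(s);
    sp: the set of probabilistic states S_P."""
    U = set(U)
    if not U:
        return False
    # delta-closed: probabilistic states in U must keep all outgoing edges inside U
    if any(not edges[s] <= U for s in U & sp):
        return False
    # strong connectivity of the restriction G|U: every state reaches every other
    def reachable_within(src):
        seen, stack = {src}, [src]
        while stack:
            for t in edges[stack.pop()] & U:
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen
    return all(reachable_within(s) == U for s in U)
```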
Given the strategy σ_U, the end component U is a closed connected recurrent set in the Markov chain obtained by fixing σ_U.

Lemma 2. Given an MDP G and an end component U ∈ E(G), the strategy σ_U ensures that for all states s ∈ U, we have Pr_s^{σ_U}({ω | Inf(ω) = U}) = 1.

Almost-sure winning states. Given an MDP G with a Büchi or a coBüchi objective Φ for player 1, we denote by W_1^G(Φ) = {s ∈ S | ⟨⟨1⟩⟩_val(Φ)(s) = 1} the set of states at which the value for player 1 is 1. These states are also referred to as the almost-sure winning states for the player, and an optimal strategy from the almost-sure winning states is referred to as an almost-sure winning strategy. The set W_1^G(Φ), for a Büchi or coBüchi objective Φ, can be computed in O(n^{3/2}) time, where n is the size of the MDP G [2].

Attractor of probabilistic states. We define a notion of attractor of probabilistic states: given an MDP G and a set U ⊆ S of states, we denote by Attr_P(U, G) the set of states from where the probabilistic player has a strategy (with a proper choice of edges) to force the game to reach U. The set Attr_P(U, G) is defined inductively as follows: T_0 = U; T_{i+1} = T_i ∪ {s ∈ S_P | E(s) ∩ T_i ≠ ∅} ∪ {s ∈ S_1 | E(s) ⊆ T_i}; and Attr_P(U, G) = ⋃_{i≥0} T_i.

We now present a lemma about MDPs with Büchi and coBüchi objectives, together with a property of end components and attractors. The first two assertions of Lemma 3 follow from Lemma 2. The last assertion follows from the fact that an end component is δ-closed (i.e., for an end component U, for all s ∈ U ∩ S_P we have E(s) ⊆ U).

Lemma 3. Let G be an MDP. Given B ⊆ S and C ⊆ S, the following assertions hold.

1. For all U ∈ E(G) such that U ∩ B ≠ ∅, we have U ⊆ W_1^G(Büchi(B)).
2. For all U ∈ E(G) such that U ⊆ C, we have U ⊆ W_1^G(coBüchi(C)).
3. For all Y ⊆ S and all end components U ∈ E(G), if X = Attr_P(Y, G), then either (a) U ∩ Y ≠ ∅ or (b) U ∩ X = ∅.

3.2 MDPs with limsup objectives

In this subsection we present a polynomial-time algorithm for MDPs with limsup objectives. For the sake of simplicity we consider bipartite MDPs.

Bipartite MDPs. An MDP G = ((S, E), (S_1, S_P), δ) is bipartite if E ⊆ (S_1 × S_P) ∪ (S_P × S_1). An MDP G can be converted into a bipartite MDP G′ by adding dummy states with a unique successor, and G′ is linear in the size of G. In the sequel, without loss of generality, we consider bipartite MDPs. The key property of bipartite MDPs that will be useful is the following: for a bipartite MDP G = ((S, E), (S_1, S_P), δ), for all U ∈ E(G) we have U ∩ S_1 ≠ ∅.

Informal description of the algorithm. We first present an algorithm that takes an MDP G with a positive reward function r: S → ℝ⁺, and computes a set S∗ and a function f∗: S∗ → ℝ⁺. The output of the algorithm will be useful in the reduction of MDPs with limsup objectives to MDPs with max objectives. Let the rewards be v_0 > v_1 > ··· > v_k. The algorithm proceeds in iterations; in iteration i we denote the MDP by G_i and its state space by S_i. At iteration i the algorithm considers the set V_i of states with reward v_i in the MDP G_i, and computes the set U_i = W_1^{G_i}(Büchi(V_i)) (i.e., the almost-sure winning set in the MDP G_i for the Büchi objective with Büchi set V_i). For all u ∈ U_i ∩ S_1 we assign f∗(u) = v_i and add the set U_i ∩ S_1 to S∗. Then the set Attr_P(U_i, G_i) is removed from the MDP G_i, and we proceed to iteration i + 1. In G_i all end components that intersect the states with reward v_i are contained in U_i (by Lemma 3, part (1)), and all end components in S_i \ U_i do not intersect Attr_P(U_i, G_i) (by Lemma 3, part (3)). This gives us the following lemma.

Lemma 4.
Let G be an MDP with a positive reward function r: S → ℝ⁺. Let f∗ be the output of Algorithm 1. For all end components U ∈ E(G) and all states u ∈ U ∩ S_1, we have max(r(U)) ≤ f∗(u).

Proof. Let U∗ = ⋃_{i=0}^k U_i (as computed in Algorithm 1). Then it follows from Lemma 3 that for all A ∈ E(G) we have A ∩ U∗ ≠ ∅. Consider A ∈ E(G) and let v_i = max(r(A)). Suppose for some j < i we have A ∩ U_j ≠ ∅. Then there is a strategy to ensure that U_j is reached with probability 1 from all states in A, and then to play an almost-sure winning strategy in U_j to ensure Büchi(r^{-1}(v_j) ∩ S_j). Then A ⊆ U_j. Hence for all u ∈ A ∩ S_1 we have f∗(u) = v_j ≥ v_i. If for all j < i we have A ∩ U_j = ∅, then we show that A ⊆ U_i: the uniform memoryless strategy σ_A (as used in Lemma 2) in G_i is a witness to prove that A ⊆ U_i. In this case for all u ∈ A ∩ S_1 we have f∗(u) = v_i = max(r(A)). The desired result follows. ∎

Algorithm 1: MDPLimSup
Input: an MDP G = ((S, E), (S_1, S_P), δ) and a positive reward function r: S → ℝ⁺.
Output: S∗ ⊆ S and f∗: S∗ → ℝ⁺.
1. Let r(S) = {v_0, v_1, ..., v_k} with v_0 > v_1 > ··· > v_k;
2. G_0 := G; S∗ := ∅;
3. for i := 0 to k do {
   3.1 U_i := W_1^{G_i}(Büchi(r^{-1}(v_i) ∩ S_i));
   3.2 for all u ∈ U_i ∩ S_1: f∗(u) := v_i;
   3.3 S∗ := S∗ ∪ (U_i ∩ S_1);
   3.4 B_i := Attr_P(U_i, G_i);
   3.5 G_{i+1} := G_i \ B_i; S_{i+1} := S_i \ B_i;
   }
4. return S∗ and f∗.

Transformation to MDPs with max objectives. Let G = ((S, E), (S_1, S_P), δ) be an MDP with a positive reward function r: S → ℝ⁺, and let S∗ and f∗ be the output of Algorithm 1. We construct an MDP Ḡ = ((S̄, Ē), (S̄_1, S̄_P), δ̄) with a reward function r̄ as follows:

– S̄ = S ∪ Ŝ∗; i.e., the set of states consists of the state space S and a copy Ŝ∗ of S∗.
– Ē = E ∪ {(s, ŝ) | s ∈ S∗, where ŝ ∈ Ŝ∗ is the copy of s} ∪ {(ŝ, ŝ) | ŝ ∈ Ŝ∗}; i.e., along with the edges E, every state s in S∗ has an edge to its copy ŝ in Ŝ∗, and all states in Ŝ∗ are absorbing states.
– S̄_1 = S_1 ∪ Ŝ∗.
– δ̄ = δ.
– r̄(s) = 0 for all s ∈ S, and r̄(ŝ) = f∗(s) for all ŝ ∈ Ŝ∗, where ŝ is the copy of s.

We refer to the above construction as the limsup conversion. The following lemma establishes the relationship between the value functions ⟨⟨1⟩⟩_val^G(lim sup(r)) and ⟨⟨1⟩⟩_val^Ḡ(max(r̄)).

Lemma 5. Let G be an MDP with a positive reward function r: S → ℝ⁺. Let Ḡ and r̄ be obtained from G and r by the limsup conversion. For all states s ∈ S, we have ⟨⟨1⟩⟩_val^G(lim sup(r))(s) = ⟨⟨1⟩⟩_val^Ḡ(max(r̄))(s).

Proof. The result is obtained from the following case analysis.

1. Let σ be a pure memoryless optimal strategy in G for the objective lim sup(r). Let 𝒞 = {C_1, C_2, ..., C_m} be the set of closed connected recurrent sets in the Markov chain obtained from G after fixing the strategy σ. Note that since we consider bipartite MDPs, for all 1 ≤ i ≤ m we have C_i ∩ S_1 ≠ ∅. Let C = ⋃_{i=1}^m C_i. We define a pure memoryless strategy σ̄ in Ḡ as follows: σ̄(s) = σ(s) if s ∈ S_1 \ C, and σ̄(s) = ŝ if s ∈ S_1 ∩ C, where ŝ ∈ Ŝ∗ is the copy of s. By Lemma 4 it follows that the strategy σ̄ ensures that for all C_i ∈ 𝒞 and all s ∈ C_i, the maximal reward reached in Ḡ is at least max(r(C_i)) with probability 1. It follows that for all s ∈ S we have ⟨⟨1⟩⟩_val^G(lim sup(r))(s) ≤ ⟨⟨1⟩⟩_val^Ḡ(max(r̄))(s).

2. Let σ̄ be a pure memoryless optimal strategy for the objective max(r̄) in Ḡ.
We fix a strategy σ in G as follows: if at a state s ∈ S∗ the strategy σ̄ chooses the edge (s, ŝ), then in G, on reaching s, the strategy σ plays an almost-sure winning strategy for the objective Büchi(r^{-1}(f∗(s))); otherwise σ follows σ̄. It follows that for all s ∈ S we have ⟨⟨1⟩⟩_val^G(lim sup(r))(s) ≥ ⟨⟨1⟩⟩_val^Ḡ(max(r̄))(s).

Thus we have the desired result. ∎

Linear programming for the max objective in Ḡ. The following linear program characterizes the value function ⟨⟨1⟩⟩_val^Ḡ(max(r̄)). For every s ∈ S̄ we have a variable x_s, and the objective function is min Σ_{s∈S̄} x_s. The set of linear constraints is as follows:

x_s ≥ 0 for all s ∈ S̄;
x_s = r̄(s) for all s ∈ Ŝ∗;
x_s ≥ x_t for all s ∈ S̄_1 and (s, t) ∈ Ē;
x_s = Σ_{t∈S̄} δ̄(s)(t) · x_t for all s ∈ S̄_P.

The correctness proof of the above linear program for characterizing the value function ⟨⟨1⟩⟩_val^Ḡ(max(r̄)) follows by extending the result for reachability objectives [6]. The key property used to prove correctness is the following: if a pure memoryless optimal strategy is fixed, then from all states in S, the set Ŝ∗ of absorbing states is reached with probability 1. This property can be proved as follows. Since r is a positive reward function, it follows that for all s ∈ S we have ⟨⟨1⟩⟩_val^G(lim sup(r))(s) > 0. Moreover, for all states s ∈ S we have ⟨⟨1⟩⟩_val^Ḡ(max(r̄))(s) = ⟨⟨1⟩⟩_val^G(lim sup(r))(s) > 0. Observe that for all s ∈ S we have r̄(s) = 0. Hence if we fix a pure memoryless optimal strategy σ̄ in Ḡ, then in the Markov chain Ḡ_σ̄ there is no closed recurrent set C such that C ⊆ S. It follows that for all states s ∈ S, in the Markov chain Ḡ_σ̄, the set Ŝ∗ is reached with probability 1.
Using the above fact and the correctness of linear programming for reachability objectives, the correctness proof of the above linear program for the objective max(r̄) in Ḡ can be obtained. This shows that the value function ⟨⟨1⟩⟩_val^G(lim sup(r)) for MDPs with reward function r can be computed in polynomial time. This gives us the following result.

Theorem 2. Given an MDP G with a reward function r, the value function ⟨⟨1⟩⟩_val^G(lim sup(r)) can be computed in polynomial time.

Algorithm 2: MDPLimInf
Input: an MDP G = ((S, E), (S_1, S_P), δ) and a positive reward function r: S → ℝ⁺.
Output: S∗ ⊆ S and f∗: S∗ → ℝ⁺.
1. Let r(S) = {v_0, v_1, ..., v_k} with v_0 > v_1 > ··· > v_k;
2. G_0 := G; S∗ := ∅;
3. for i := 0 to k do {
   3.1 U_i := W_1^{G_i}(coBüchi(⋃_{j≤i} r^{-1}(v_j) ∩ S_i));
   3.2 for all u ∈ U_i ∩ S_1: f∗(u) := v_i;
   3.3 S∗ := S∗ ∪ (U_i ∩ S_1);
   3.4 B_i := Attr_P(U_i, G_i);
   3.5 G_{i+1} := G_i \ B_i; S_{i+1} := S_i \ B_i;
   }
4. return S∗ and f∗.

3.3 MDPs with liminf objectives

In this subsection we present a polynomial-time algorithm for MDPs with liminf objectives, and then present the complexity result for 2½-player games with limsup and liminf objectives.

Informal description of the algorithm. We first present an algorithm that takes an MDP G with a positive reward function r: S → ℝ⁺, and computes a set S∗ and a function f∗: S∗ → ℝ⁺. The output of the algorithm will be useful in the reduction of MDPs with liminf objectives to MDPs with max objectives. Let the rewards be v_0 > v_1 > ··· > v_k. The algorithm proceeds in iterations; in iteration i we denote the MDP by G_i and its state space by S_i.
At iteration i the algorithm considers the set V_i of states with reward at least v_i in the MDP G_i, and computes the set U_i = W_1^{G_i}(coBüchi(V_i)) (i.e., the almost-sure winning set in the MDP G_i for the coBüchi objective with coBüchi set V_i). For all u ∈ U_i ∩ S_1 we assign f∗(u) = v_i and add the set U_i ∩ S_1 to S∗. Then the set Attr_P(U_i, G_i) is removed from the MDP G_i, and we proceed to iteration i + 1. In G_i all end components in which every reward is at least v_i are contained in U_i (by Lemma 3, part (2)), and all end components in S_i \ U_i do not intersect Attr_P(U_i, G_i) (by Lemma 3, part (3)). This gives us the following lemma.

Lemma 6. Let G be an MDP with a positive reward function r: S → ℝ⁺. Let f∗ be the output of Algorithm 2. For all end components U ∈ E(G) and all states u ∈ U ∩ S_1, we have min(r(U)) ≤ f∗(u).

Proof. Let U∗ = ⋃_{i=0}^k U_i (as computed in Algorithm 2). Then it follows from Lemma 3 that for all A ∈ E(G) we have A ∩ U∗ ≠ ∅. Consider A ∈ E(G) and let v_i = min(r(A)). Suppose for some j < i we have A ∩ U_j ≠ ∅. Then there is a strategy to ensure that U_j is reached with probability 1 from all states in A, and then to play an almost-sure winning strategy in U_j to ensure coBüchi(⋃_{l≤j} r^{-1}(v_l) ∩ S_j). Then A ⊆ U_j. Hence for all u ∈ A ∩ S_1 we have f∗(u) = v_j ≥ v_i. If for all j < i we have A ∩ U_j = ∅, then we show that A ⊆ U_i: the uniform memoryless strategy σ_A (as used in Lemma 2) in G_i is a witness to prove that A ⊆ U_i. In this case for all u ∈ A ∩ S_1 we have f∗(u) = v_i = min(r(A)). The desired result follows. ∎

Transformation to MDPs with max objectives. Let G = ((S, E), (S_1, S_P), δ) be an MDP with a positive reward function r: S → ℝ⁺, and let S∗ and f∗ be the output of Algorithm 2.
We construct an MDP Ḡ = ((S̄, Ē), (S̄_1, S̄_P), δ̄) with a reward function r̄ as follows:
– S̄ = S ∪ Ŝ*; i.e., the set of states consists of the state space S and a copy Ŝ* of S*.
– Ē = E ∪ {(s, ŝ) | s ∈ S*, ŝ ∈ Ŝ*, where ŝ is the copy of s} ∪ {(ŝ, ŝ) | ŝ ∈ Ŝ*}; along with the edges E, every state s in S* has an edge to its copy ŝ in Ŝ*, and all states in Ŝ* are absorbing states.
– S̄_1 = S_1 ∪ Ŝ*.
– δ̄ = δ.
– r̄(s) = 0 for all s ∈ S, and r̄(ŝ) = f*(s) for all ŝ ∈ Ŝ*, where ŝ is the copy of s.

We refer to the above construction as the liminf conversion. The following lemma establishes the relationship between the value functions ⟨⟨1⟩⟩^G_val(liminf(r)) and ⟨⟨1⟩⟩^Ḡ_val(max(r̄)).

Lemma 7. Let G be an MDP with a positive reward function r : S → ℝ⁺. Let Ḡ and r̄ be obtained from G and r by the liminf conversion. For all states s ∈ S, we have ⟨⟨1⟩⟩^G_val(liminf(r))(s) = ⟨⟨1⟩⟩^Ḡ_val(max(r̄))(s).

Proof. The result is obtained from the following two-case analysis.
1. Let σ be a pure memoryless optimal strategy in G for the objective liminf(r). Let C = {C_1, C_2, ..., C_m} be the set of closed connected recurrent sets in the Markov chain obtained from G by fixing the strategy σ. Since G is a bipartite MDP, it follows that for all 1 ≤ i ≤ m we have C_i ∩ S_1 ≠ ∅. Let C = ⋃_{i=1}^{m} C_i. We define a pure memoryless strategy σ̄ in Ḡ as follows:
   σ̄(s) = σ(s) if s ∈ S_1 \ C; and σ̄(s) = ŝ if s ∈ S_1 ∩ C, where ŝ ∈ Ŝ* is the copy of s.
By Lemma 6 it follows that the strategy σ̄ ensures that for all C_i ∈ C and all s ∈ C_i, the maximal reward reached in Ḡ is at least min(r(C_i)) with probability 1. It follows that for all s ∈ S we have ⟨⟨1⟩⟩^G_val(liminf(r))(s) ≤ ⟨⟨1⟩⟩^Ḡ_val(max(r̄))(s).
2. Let σ̄ be a pure memoryless optimal strategy for the objective max(r̄) in Ḡ.
We fix a strategy σ in G as follows: if at a state s ∈ S* the strategy σ̄ chooses the edge (s, ŝ), then in G, on reaching s, the strategy σ plays an almost-sure winning strategy for the objective coBüchi(⋃_{v_j ≥ f*(s)} r^{-1}(v_j)); otherwise σ follows σ̄. It follows that for all s ∈ S we have ⟨⟨1⟩⟩^G_val(liminf(r))(s) ≥ ⟨⟨1⟩⟩^Ḡ_val(max(r̄))(s).
Thus we have the desired result.

Linear programming for the max objective in Ḡ. The linear program of Subsection 3.2 characterizes the value function ⟨⟨1⟩⟩^Ḡ_val(max(r̄)). This shows that the value function ⟨⟨1⟩⟩^G_val(liminf(r)) for MDPs with reward function r can be computed in polynomial time. This gives us the following result.

Theorem 3. Given an MDP G with a reward function r, the value function ⟨⟨1⟩⟩^G_val(liminf(r)) can be computed in polynomial time.

3.4 2½-player games with limsup and liminf objectives

We now show that 2½-player games with limsup and liminf objectives can be decided in NP ∩ coNP. The pure memoryless optimal strategies (whose existence follows from Theorem 1) provide the polynomial witnesses; to obtain the desired result we need to present a polynomial-time verification procedure, in other words, polynomial-time algorithms for MDPs with limsup and liminf objectives. Since the value functions in MDPs with limsup and liminf objectives can be computed in polynomial time (Theorem 2 and Theorem 3), we obtain the following result about the complexity of 2½-player games with limsup and liminf objectives.

Theorem 4. Given a 2½-player game graph G with a reward function r, a state s, and a rational value q, the following assertions hold: (a) whether ⟨⟨1⟩⟩^G_val(limsup(r))(s) ≥ q can be decided in NP ∩ coNP; and (b) whether ⟨⟨1⟩⟩^G_val(liminf(r))(s) ≥ q can be decided in NP ∩ coNP.

Acknowledgments.
We thank Hugo Gimbert for explaining his results and pointing out relevant literature on games with limsup and liminf objectives. This research was supported in part by the NSF grants CCR-0132780, CNS-0720884, and CCR-0225610, by the Swiss National Science Foundation, and by the COMBEST project of the European Union.

References
1. K. Chatterjee and T.A. Henzinger. Value iteration. In 25 Years of Model Checking, LNCS. Springer, 2007.
2. K. Chatterjee, M. Jurdziński, and T.A. Henzinger. Simple stochastic parity games. In CSL'03, volume 2803 of LNCS, pages 100–113. Springer, 2003.
3. K. Chatterjee, M. Jurdziński, and T.A. Henzinger. Quantitative stochastic parity games. In SODA'04, pages 121–130. SIAM, 2004.
4. C. Courcoubetis and M. Yannakakis. Markov decision processes and regular events. In ICALP'90, volume 443 of LNCS, pages 336–349. Springer, 1990.
5. L. de Alfaro. Formal Verification of Probabilistic Systems. PhD thesis, Stanford University, 1997.
6. J. Filar and K. Vrieze. Competitive Markov Decision Processes. Springer, 1997.
7. H. Gimbert. Jeux positionnels. PhD thesis, Université Paris 7, 2006.
8. H. Gimbert and W. Zielonka. Games where you can play optimally without any memory. In CONCUR'05, pages 428–442. Springer, 2005.
9. H. Gimbert and W. Zielonka. Perfect information stochastic priority games. In ICALP'07, pages 850–861. Springer, 2007.
10. T.A. Liggett and S.A. Lippman. Stochastic games with perfect information and time average payoff. SIAM Review, 11:604–607, 1969.
11. A. Maitra and W. Sudderth, editors. Discrete Gambling and Stochastic Games. Springer, 1996.
12. D.A. Martin. The determinacy of Blackwell games. The Journal of Symbolic Logic, 63(4):1565–1581, 1998.