Mean-Variance Optimization in Markov Decision Processes

Shie Mannor (Corresponding Author), Department of Electrical Engineering, Technion, Haifa, ISRAEL 32000, tel ++972-4-8293284, fax ++972-4-8295757

John N. Tsitsiklis, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA, 02139

Abstract

We consider finite horizon Markov decision processes under performance measures that involve both the mean and the variance of the cumulative reward. We show that either randomized or history-based policies can improve performance. We prove that the complexity of computing a policy that maximizes the mean reward under a variance constraint is NP-hard for some cases, and strongly NP-hard for others. We finally offer pseudopolynomial exact and approximation algorithms.

Keywords: Markov processes; dynamic programming; control; complexity theory.

I. INTRODUCTION

The classical theory of Markov decision processes (MDPs) deals with the maximization of the cumulative (possibly discounted) expected reward, to be denoted by $W$. However, a risk-averse decision maker may be interested in additional distributional properties of $W$. In this paper, we focus on the case where the decision maker is interested in both the mean and the variance of the cumulative reward, and we explore the associated computational issues.

Risk aversion in MDPs is of course an old subject. In one approach, the focus is on the maximization of $E[U(W)]$, where $U$ is a concave utility function. Problems of this type can be handled by state augmentation (e.g., Bertsekas, 1995), namely, by introducing an auxiliary state variable that keeps track of the cumulative past reward. In a few special cases, e.g., with an exponential utility function, state augmentation is unnecessary, and optimal policies can be found by solving a modified Bellman equation (Chung & Sobel, 1987). Another interesting case where optimal policies can be found efficiently involves piecewise linear utility functions with a single break point; see Liu and Koenig (2005). In another approach, the objective is to optimize a so-called coherent risk measure (Artzner, Delbaen, Eber, & Heath, 1999), which turns out to be equivalent to a robust optimization problem: one assumes a family of probabilistic models and optimizes the worst-case performance over this family. In the multistage case (Riedel, 2004), problems of this type can be difficult (Le Tallec, 2007), except for some special cases (Iyengar, 2005; Nilim & El Ghaoui, 2005) that can be reduced to Markov games (Shapley, 1953).

Mean-variance optimization lacks some of the desirable properties of approaches involving coherent risk measures and sometimes leads to counterintuitive policies. Bellman's principle of optimality does not hold, and as a consequence, a decision maker who has received unexpectedly large rewards in the first stages may actively seek to incur losses in subsequent stages in order to keep the variance small. Nevertheless, mean-variance optimization is an important approach in financial decision making (e.g., Luenberger, 1997), especially for static (one-stage) problems. Consider, for example, a fund manager who is interested in the 1-year performance of the fund, as measured by the mean and variance of the return.
Assuming that the manager is allowed to undertake periodic re-balancing actions in the course of the year, one obtains a Markov decision process with mean-variance criteria.

Mean-variance optimization can also be a meaningful objective in various engineering contexts. Consider, for example, an engineering process whereby a certain material is deposited on a surface. Suppose that the primary objective is to maximize the amount deposited, but that there is also an interest in having all manufactured components be similar to each other; this secondary objective can be addressed by keeping the variance of the amount deposited small.

We note that expressions for the variance of the discounted reward for stationary policies were developed in Sobel (1982). However, these expressions are quadratic in the underlying transition probabilities, and do not lead to convex optimization problems.

Motivated by considerations such as the above, this paper deals with the computational complexity aspects of mean-variance optimization. The problem is not straightforward for various reasons. One is the absence of a principle of optimality that could lead to simple recursive algorithms. Another reason is that, as is evident from the formula $\mathrm{Var}(W) = E[W^2] - (E[W])^2$, the variance is not a linear function of the probability measure of the underlying process. Nevertheless, $E[W^2]$ and $E[W]$ are linear functions, and as such can be addressed simultaneously using methods from multicriteria or constrained Markov decision processes (Altman, 1999). Indeed, we will use such an approach in order to develop pseudopolynomial exact or approximation algorithms. On the other hand, we will also obtain various NP-hardness results, which show that there is little hope for significant improvement of our algorithms.

The rest of the paper is organized as follows. In Section II, we describe the model and our notation. We also define various classes of policies and performance objectives of interest. In Section III, we compare different policy classes and show that performance typically improves strictly as more general policies are allowed. In Section IV, we establish NP-hardness results for the policy classes we have introduced. Then, in Sections V and VI, we develop exact and approximate pseudopolynomial time algorithms. Unfortunately, such algorithms do not seem possible for some of the more restricted classes of policies, due to strong NP-completeness results established in Section IV. Finally, Section VII contains some brief concluding remarks.

II. THE MODEL

In this section, we define the model, notation, and performance objectives that we will be studying. Throughout, we focus on finite horizon problems. (Footnote: Some of the results, such as the approximation algorithms of Section VI, can be extended to the infinite horizon discounted case; this is beyond the scope of this paper.)

A. Markov Decision Processes

We consider a Markov decision process (MDP) with finite state, action, and reward spaces. An MDP is formally defined by a sextuple $M = (T, \mathcal{S}, \mathcal{A}, \mathcal{R}, p, g)$ where:
(a) $T$, a positive integer, is the time horizon;
(b) $\mathcal{S}$ is a finite collection of states, one of which is designated as the initial state;
(c) $\mathcal{A}$ is a collection of finite sets of possible actions, one set for each state;
(d) $\mathcal{R}$ is a finite subset of $\mathbb{Q}$ (the set of rational numbers), and is the set of possible values of the immediate rewards. We let $K = \max_{r \in \mathcal{R}} |r|$.
(e) $p : \{0, \ldots, T-1\} \times \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to \mathbb{Q}$ describes the transition probabilities.
In particular, $p_t(s' \mid s, a)$ is the probability that the state at time $t+1$ is $s'$, given that the state at time $t$ is $s$, and that action $a$ is chosen at time $t$.
(f) $g : \{0, \ldots, T-1\} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to \mathbb{Q}$ is a set of reward distributions. In particular, $g_t(r \mid s, a)$ is the probability that the immediate reward at time $t$ is $r$, given that the state and action at time $t$ are $s$ and $a$, respectively.

With few exceptions (e.g., for the time horizon $T$), we use capital letters to denote random variables, and lower case letters to denote ordinary variables. The process starts at the designated initial state. At every stage $t = 0, 1, \ldots, T-1$, the decision maker observes the current state $S_t$ and chooses an action $A_t$. Then, an immediate reward $R_t$ is obtained, distributed according to $g_t(\cdot \mid S_t, A_t)$, and the next state $S_{t+1}$ is chosen, according to $p_t(\cdot \mid S_t, A_t)$. Note that we have assumed that the possible values of the immediate reward and the various probabilities are all rational numbers. This is in order to address the computational complexity of various problems within the standard framework of digital computation. Finally, we will use the notation $x_{0:t}$ to indicate the tuple $(x_0, \ldots, x_t)$.

B. Policies

We will use the symbol $\pi$ to denote policies. Under a deterministic policy $\pi = (\mu_0, \ldots, \mu_{T-1})$, the action at each time $t$ is determined according to a mapping $\mu_t$ whose argument is the history $H_t = (S_{0:t}, A_{0:t-1}, R_{0:t-1})$ of the process, by letting $A_t = \mu_t(H_t)$. We let $\Pi_h$ be the set of all such history-based policies. (The subscripts are used as a mnemonic for the variables on which the action is allowed to depend.) We will also consider randomized policies. For this purpose, we assume that there is available a sequence of i.i.d. uniform random variables $U_0, U_1, \ldots, U_{T-1}$, which are independent from everything else. In a randomized policy, the action at time $t$ is determined by letting $A_t = \mu_t(H_t, U_{0:t})$. Let $\Pi_{h,u}$ be the set of all randomized policies.

In classical MDPs, it is well known that restricting to Markovian policies (policies that take into account only the current state $S_t$) results in no loss of performance. In our setting, there are two different possible "states" of interest: the original state $S_t$, or the augmented state $(S_t, W_t)$, where
$$W_t = \sum_{k=0}^{t-1} R_k,$$
(with the convention that $W_0 = 0$). Accordingly, we define the following classes of policies: $\Pi_{t,s}$ (under which $A_t = \mu_t(S_t)$) and $\Pi_{t,s,w}$ (under which $A_t = \mu_t(S_t, W_t)$), and their randomized counterparts $\Pi_{t,s,u}$ (under which $A_t = \mu_t(S_t, U_t)$) and $\Pi_{t,s,w,u}$ (under which $A_t = \mu_t(S_t, W_t, U_t)$). Notice that $\Pi_{t,s} \subset \Pi_{t,s,w} \subset \Pi_h$, and similarly for their randomized counterparts.

C. Performance Criteria

Once a policy $\pi$ and an initial state are fixed, the cumulative reward $W_T$ becomes a well-defined random variable. The performance measures of interest are its mean and variance, defined by $J^\pi = E^\pi[W_T]$ and $V^\pi = \mathrm{Var}^\pi(W_T)$, respectively. Under our assumptions (finite horizon and bounded rewards), it follows that there are finite upper bounds of $KT$ and $K^2T^2$ for $|J^\pi|$ and $V^\pi$, respectively, independent of the policy.
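To make the model and the performance criteria concrete, the following is a minimal Python sketch (our own illustration; the tabular representation, the class name `FiniteHorizonMDP`, and all other identifiers are assumptions, not prescribed by the paper) of an MDP $M = (T, \mathcal{S}, \mathcal{A}, \mathcal{R}, p, g)$ together with a Monte Carlo estimate of $(J^\pi, V^\pi)$ for a given history-based policy.

```python
import random

class FiniteHorizonMDP:
    """Tabular finite-horizon MDP M = (T, S, A, R, p, g).

    p[t][s][a] : list of (next_state, prob) pairs -- transition kernel p_t(.|s,a)
    g[t][s][a] : list of (reward, prob) pairs     -- reward distribution g_t(.|s,a)
    """
    def __init__(self, T, init_state, p, g):
        self.T, self.init_state, self.p, self.g = T, init_state, p, g

def _sample(pairs):
    # Draw one value from a list of (value, probability) pairs.
    u, acc = random.random(), 0.0
    for value, prob in pairs:
        acc += prob
        if u <= acc:
            return value
    return pairs[-1][0]

def estimate_mean_variance(mdp, policy, n_runs=100_000):
    """Monte Carlo estimate of (J^pi, V^pi) for a history-based policy.

    `policy(t, history)` returns an action; `history` is the tuple
    (s_0, a_0, r_0, ..., s_t) observed so far.
    """
    totals = []
    for _ in range(n_runs):
        s, w, history = mdp.init_state, 0.0, []
        for t in range(mdp.T):
            history.append(s)
            a = policy(t, tuple(history))
            r = _sample(mdp.g[t][s][a])
            s = _sample(mdp.p[t][s][a])
            history += [a, r]
            w += r
        totals.append(w)
    j = sum(totals) / n_runs
    v = sum((x - j) ** 2 for x in totals) / n_runs
    return j, v
```

The later sketches in this document reuse this tabular format when they need a concrete MDP representation.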
Given our interest in complexity results, we will focus on "decision" problems that admit a yes/no answer, except for Section VI. We define the following problem.

Problem MV-MDP($\Pi$): Given an MDP $M$ and rational numbers $\lambda$, $v$, does there exist a policy in the set $\Pi$ such that $J^\pi \ge \lambda$ and $V^\pi \le v$?

Clearly, an algorithm for the problem MV-MDP($\Pi$) can be combined with binary search to solve (up to any desired precision) the problem of maximizing the expected value of $W_T$ subject to an upper bound on its variance, or the problem of minimizing the variance of $W_T$ subject to a lower bound on its mean.

III. COMPARISON OF POLICY CLASSES

Our first step is to compare the performance obtained from different policy classes. We introduce some terminology. Let $\Pi$ and $\Pi'$ be two policy classes. We say that $\Pi$ is inferior to $\Pi'$ if, loosely speaking, the policy class $\Pi'$ can always match or exceed the "performance" of policy class $\Pi$, and for some instances it can exceed it strictly. Formally, $\Pi$ is inferior to $\Pi'$ if the following hold: (i) if $(M, \lambda, v)$ is a "yes" instance of MV-MDP($\Pi$), then it is also a "yes" instance of MV-MDP($\Pi'$); (ii) there exists some $(M, \lambda, v)$ which is a "no" instance of MV-MDP($\Pi$) but a "yes" instance of MV-MDP($\Pi'$). Similarly, we say that two policy classes $\Pi$ and $\Pi'$ are equivalent if every "yes" (respectively, "no") instance of MV-MDP($\Pi$) is a "yes" (respectively, "no") instance of MV-MDP($\Pi'$).

We define one more convenient term. A state $s$ is said to be terminal if it is absorbing (i.e., $p_t(s \mid s, a) = 1$, for every $t$ and $a$) and provides zero rewards (i.e., $g_t(0 \mid s, a) = 1$, for every $t$ and $a$).

A. Randomization Improves Performance

Our first observation is that randomization can improve performance. This is not surprising given that we are dealing simultaneously with two criteria, and that randomization is helpful in constrained MDPs (e.g., Altman, 1999).

Theorem 1. (a) $\Pi_{t,s}$ is inferior to $\Pi_{t,s,u}$; (b) $\Pi_{t,s,w}$ is inferior to $\Pi_{t,s,w,u}$; (c) $\Pi_h$ is inferior to $\Pi_{h,u}$.

Proof. It is clear that performance cannot deteriorate when randomization is allowed. It therefore suffices to display an instance in which randomization improves performance. Consider a one-stage MDP ($T = 1$). At time 0, we are at the initial state and there are two available actions, $a$ and $b$. The mean and variance of the resulting reward are both zero under action $a$, and both equal to 1 under action $b$. After the decision is made, the rewards are obtained and the process terminates. Thus $W_T = R_0$, the reward obtained at time 0. Consider the problem of maximizing $E[R_0]$ subject to the constraint that $\mathrm{Var}(R_0) \le 1/2$. There is only one feasible deterministic policy (choose action $a$), and it has zero expected reward. On the other hand, a randomized policy that chooses action $b$ with probability $p$ has an expected reward of $p$, and the corresponding variance satisfies
$$\mathrm{Var}(R_0) \le E[R_0^2] = p\, E[R_0^2 \mid A_0 = b] = 2p.$$
When $0 < p \le 1/4$, such a randomized policy is feasible and improves upon the deterministic one. Note that for the above instance we have $\Pi_{t,s} = \Pi_{t,s,w} = \Pi_h$ and $\Pi_{t,s,u} = \Pi_{t,s,w,u} = \Pi_{h,u}$. Hence the above example establishes all three of the claimed statements. q.e.d.
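As a quick numerical check of this example (a sketch; the two-point distribution used for action $b$ is our own choice, consistent with the stated mean 1 and variance 1), the following computes the exact mean and variance of $W_T = R_0$ under the policy that picks action $b$ with probability $p$:

```python
def mixture_mean_var(p):
    """Randomize: with prob p take action b (reward 0 or 2, each w.p. 1/2, so mean 1
    and variance 1); otherwise take action a (reward 0 surely)."""
    outcomes = [(0.0, (1 - p) + p * 0.5), (2.0, p * 0.5)]
    mean = sum(r * q for r, q in outcomes)
    second = sum(r * r * q for r, q in outcomes)
    return mean, second - mean ** 2

for p in (0.0, 0.25, 0.5):
    m, v = mixture_mean_var(p)
    print(f"p={p:.2f}: mean={m:.3f}, variance={v:.3f}, feasible={v <= 0.5}")
```

For $p = 1/4$ the variance is $2p - p^2 = 0.4375 \le 1/2$, so the randomized policy is feasible and earns expected reward $1/4$, while the only feasible deterministic policy earns 0.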
B. Information Improves Performance

We now show that in most cases, performance can improve strictly when we allow a policy to have access to more information. The only exception arises for the pair of classes $\Pi_{t,s,w,u}$ and $\Pi_{h,u}$, which we show in Section V to be equivalent (cf. Theorem 6).

Theorem 2. (a) $\Pi_{t,s}$ is inferior to $\Pi_{t,s,w}$, and $\Pi_{t,s,u}$ is inferior to $\Pi_{t,s,w,u}$. (b) $\Pi_{t,s,w}$ is inferior to $\Pi_h$.

Proof. (a) Consider the following MDP, with time horizon $T = 2$. The process starts at the initial state $s_0$, at which there are two actions. Under action $a_1$, the immediate reward is zero and the process moves to a terminal state. Under action $a_2$, the immediate reward $R_0$ is either 0 or 1, with equal probability, and the process moves to state $s_1$. At state $s_1$, there are two actions, $a_3$ and $a_4$: under action $a_3$, the immediate reward $R_1$ is equal to 0, and under action $a_4$, it is equal to 1. We are interested in the optimal value of the expected reward $E[W_2] = E[R_0 + R_1]$, subject to the constraint that the variance is less than or equal to zero (and therefore equal to zero).

Let $p$ be the probability that action $a_2$ is chosen at state $s_0$. If $p > 0$, and under any policy in $\Pi_{t,s,u}$, the reward $R_0$ at state $s_0$ has positive variance, and the reward $R_1$ at the next stage is uncorrelated with $R_0$. Hence, the variance of $R_0 + R_1$ is positive, and such a policy is not feasible; in particular, the constraint on the variance requires that $p = 0$. We conclude that the largest possible expected reward under any policy in $\Pi_{t,s,u}$ (and, a fortiori, under any policy in $\Pi_{t,s}$) is equal to zero. Consider now the following policy, which belongs to $\Pi_{t,s,w}$ and, a fortiori, to $\Pi_{t,s,w,u}$: at state $s_0$, choose action $a_2$; then, at state $s_1$, choose $a_3$ if $W_1 = R_0 = 1$, and choose $a_4$ if $W_1 = R_0 = 0$. In either case, the total reward is $R_0 + R_1 = 1$, while the variance of $R_0 + R_1$ is zero, thus ensuring feasibility. This establishes the first part of the theorem.

(b) Consider the following MDP, with time horizon $T = 3$. At state $s_0$ there is only one available action; the next state $S_1$ is either $s_1$ or $s'_1$, with probability $p$ and $1-p$, respectively, and the immediate reward $R_0$ is zero. At either state $s_1$ or $s'_1$, there is again only one available action; the next state, $S_2$, is $s_2$, and the reward $R_1$ is zero. At state $s_2$, there are two actions, $a$ and $b$. Under action $a$, the mean and variance of the resulting reward $R_2$ are both zero, and under action $b$, they are both equal to 1. Let us examine the largest possible value of $E[W_3] = E[R_2]$, subject to the constraint $\mathrm{Var}(W_3) \le 1/2$. The class $\Pi_{t,s,w}$ contains two policies, corresponding to the two deterministic choices of an action at state $s_2$; only one of them is feasible (the one that chooses action $a$), resulting in zero expected reward. However, the following policy in $\Pi_h$ has positive expected reward: choose action $b$ at state $s_2$ if and only if the state at time 1 was equal to $s_1$ (which happens with probability $p$). As long as $p$ is sufficiently small, the constraint $\mathrm{Var}(W_3) \le 1/2$ is met, and this policy is feasible. It follows that $\Pi_{t,s,w}$ is inferior to $\Pi_h$. q.e.d.
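For part (a), a small enumeration (a sketch; the state and action names follow the proof) confirms that the reward-dependent policy attains total reward 1 with probability 1, and hence zero variance:

```python
def evaluate_w_dependent_policy():
    """Enumerate the two equally likely values of R_0 in the Theorem 2(a) example.

    Policy: take a_2 at s_0; at s_1 take a_3 (reward 0) if W_1 = 1, else a_4 (reward 1).
    """
    totals = []
    for r0 in (0, 1):                 # R_0 is 0 or 1, each with probability 1/2
        r1 = 0 if r0 == 1 else 1      # a_3 gives 0, a_4 gives 1
        totals.append(r0 + r1)
    mean = sum(totals) / len(totals)
    var = sum((w - mean) ** 2 for w in totals) / len(totals)
    return mean, var                  # -> (1.0, 0.0): feasible, with positive mean

print(evaluate_w_dependent_policy())
```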
IV. COMPLEXITY RESULTS

In this section, we establish that mean-variance optimization in finite horizon MDPs is unlikely to admit polynomial time algorithms, in contrast to classical MDPs.

Theorem 3. The problem MV-MDP($\Pi$) is NP-hard, when $\Pi$ is $\Pi_{t,s,w}$, $\Pi_{t,s,w,u}$, $\Pi_h$, or $\Pi_{h,u}$.

Proof: We will actually show NP-hardness for the special case of MV-MDP($\Pi$) in which we wish to determine whether there exists a policy whose reward variance is equal to zero. (In terms of the problem definition, this corresponds to letting $\lambda = -KT$ and $v = 0$.) The proof uses a reduction from the SUBSET SUM problem: Given $n$ positive integers $r_1, \ldots, r_n$, does there exist a subset $B$ of $\{1, \ldots, n\}$ such that $\sum_{i \in B} r_i = \sum_{i \notin B} r_i$?

Given an instance $(r_1, \ldots, r_n)$ of SUBSET SUM, and for any of the policy classes of interest, we construct an instance of MV-MDP($\Pi$), with time horizon $T = n+1$, as follows. At the initial state $s_0$, there is only one available action, resulting in zero immediate reward ($R_0 = 0$). With probability 1/2, the process moves to a terminal state; with probability 1/2, the process moves (deterministically) along a sequence of states $s_1, \ldots, s_n$. At each state $s_i$ ($i = 1, \ldots, n$), there are two actions: $a_i$, which results in an immediate reward of $r_i$, and $b_i$, which results in an immediate reward of $-r_i$.

Suppose that there exists a set $B \subset \{1, \ldots, n\}$ such that $\sum_{i \in B} r_i = \sum_{i \notin B} r_i$. Consider the policy that chooses action $a_i$ at state $s_i$ if and only if $i \in B$. This policy achieves zero total reward, with probability 1, and therefore meets the zero variance constraint. Conversely, if a policy results in zero variance, then the total reward must be equal to zero, with probability 1, which implies that such a set $B$ exists. This completes the reduction. Note that this argument applies no matter which particular class of policies is being considered. q.e.d.

The above proof also applies to the policy classes $\Pi_{t,s}$ and $\Pi_{t,s,u}$. However, for these two classes, a stronger result is possible. Recall that a problem is strongly NP-hard if it remains NP-hard when restricted to instances in which the numerical part of the instance description involves "small" numbers; see Garey and Johnson (1979) for a precise definition.

Theorem 4. If $\Pi$ is either $\Pi_{t,s}$ or $\Pi_{t,s,u}$, the problem MV-MDP($\Pi$) is strongly NP-hard.

Proof. As in the proof of Theorem 3, we will prove the result for the special case of MV-MDP, in which we wish to determine whether there exists a policy under which the variance of the reward is equal to zero. The proof involves a reduction from the 3-Satisfiability problem (3SAT). An instance of 3SAT consists of $n$ Boolean variables $x_1, \ldots, x_n$, and $m$ clauses $C_1, \ldots, C_m$, with three literals per clause. Each clause is the disjunction of three literals, where a literal is either a variable or its negation. (For example, $x_2 \vee \bar{x}_4 \vee x_5$ is such a clause, where a bar stands for negation.) The question is whether there exists an assignment of truth values ("true" or "false") to the variables such that all clauses are satisfied.

Suppose that we are given an instance of 3SAT, with $n$ variables and $m$ clauses, $C_1, \ldots, C_m$. We construct an instance of MV-MDP($\Pi$) as follows.
There is an initial state $s_0$, a state $d_0$, a state $c_j$ associated with each clause $C_j$, and a state $y_i$ associated with each literal $x_i$. The actions, dynamics, and rewards are as follows:
(a) Out of state $s_0$, there is equal probability, $1/(m+1)$, of reaching any one of the states $d_0, c_1, \ldots, c_m$, independent of the action; the immediate reward is zero.
(b) State $d_0$ is a terminal state. At each state $c_j$, there are three actions available: each action selects one of the three literals in the clause, and the process moves to the state $y_i$ associated with that literal; the immediate reward is 1 if the literal appears in the clause unnegated, and $-1$ if the literal appears in the clause negated. For an example, suppose that the clause is of the form $x_2 \vee \bar{x}_4 \vee x_5$. Under the first action, the next state is $y_2$, and the reward is 1; under the second action, the next state is $y_4$ and the reward is $-1$; under the third action, the next state is $y_5$, and the reward is 1.
(c) At each state $y_i$, there are two possible actions $a_i$ and $b_i$, resulting in immediate rewards of 1 and $-1$, respectively. The process then moves to the terminal state $d_0$.

Suppose that we have a "yes" instance of 3SAT, and consider a truth assignment that satisfies all clauses. We can then construct a policy in $\Pi_{t,s}$ (and a fortiori in $\Pi_{t,s,u}$) whose total reward is zero (and therefore has zero variance) as follows. If $x_i$ is set to be true (respectively, false), we choose action $b_i$ (respectively, $a_i$) at state $y_i$. At state $c_j$ we choose an action associated with a literal that makes the clause true. Suppose that state $c_j$ is visited after the first transition, i.e., $S_1 = c_j$. If the literal associated with the selected action at $c_j$ is unnegated, e.g., the literal $x_i$, then the immediate reward is 1. Since this literal makes the clause true, it follows that the action chosen at the subsequent state, $y_i$, is $b_i$, resulting in a reward of $-1$, and a total reward of zero. The argument for the case where the literal associated with the selected action at state $c_j$ is negated is similar. It follows that the total reward is zero, with probability 1.

For the converse direction, suppose that there exists a policy in $\Pi_{t,s}$, or more generally, in $\Pi_{t,s,u}$, under which the variance of the total reward is zero. Since the total reward is equal to 0 whenever the first transition leads to state $d_0$ (which happens with probability $1/(m+1)$), it follows that the total reward must be always zero. Consider now the following truth assignment: $x_i$ is set to be true if and only if the policy chooses action $b_i$ at state $y_i$, with positive probability. Suppose that the state visited after the first transition is $c_j$. Suppose that the action chosen at state $c_j$ leads next to state $y_i$ and that the literal $x_i$ appears unnegated in clause $C_j$. Then, the reward at state $c_j$ is 1, which implies that the reward at state $y_i$ is $-1$. It follows that the action chosen at $y_i$ is $b_i$, and therefore $x_i$ has been set to be true. It follows that clause $C_j$ is satisfied. A similar argument shows that clause $C_j$ is satisfied when the literal $x_i$ associated with the chosen action at $c_j$ appears negated. In either case, we conclude that clause $C_j$ is satisfied. Since every state $c_j$ is possible at time 1, it follows that every clause is satisfied, and we have a "yes" instance of 3SAT. q.e.d.
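To illustrate the construction, here is a sketch (our own illustrative encoding, not part of the paper; the DIMACS-style clause format and all identifiers are assumptions) that builds the transition and reward tables of the reduced MDP in the tabular format sketched in Section II:

```python
from collections import defaultdict

def mdp_from_3sat(clauses, n_vars, T=3):
    """Illustrative sketch of the construction in the proof of Theorem 4.

    `clauses` is a list of 3-literal clauses in DIMACS style: literal +i stands for
    x_i and -i for its negation.  Returns transition and reward tables (p, g); the
    initial state is 's0'.
    """
    m = len(clauses)
    p = [defaultdict(dict) for _ in range(T)]
    g = [defaultdict(dict) for _ in range(T)]

    # (a) From s0 there is a single action '*': reach d0 or any clause state c_j
    #     with probability 1/(m+1) each; the immediate reward is zero.
    p[0]['s0']['*'] = [('d0', 1.0 / (m + 1))] + [(f'c{j}', 1.0 / (m + 1)) for j in range(m)]
    g[0]['s0']['*'] = [(0, 1.0)]

    # (b) d0 is terminal: absorbing, with zero reward, at every time.
    for t in range(T):
        p[t]['d0']['*'] = [('d0', 1.0)]
        g[t]['d0']['*'] = [(0, 1.0)]

    # At clause state c_j (time 1), each of the three actions picks one literal:
    # move to that literal's state y_i; reward +1 if unnegated, -1 if negated.
    for j, clause in enumerate(clauses):
        for k, lit in enumerate(clause):
            p[1][f'c{j}'][k] = [(f'y{abs(lit)}', 1.0)]
            g[1][f'c{j}'][k] = [(1 if lit > 0 else -1, 1.0)]

    # (c) At literal state y_i (time 2), action 'a' gives +1 and 'b' gives -1,
    #     and the process moves to d0.
    for i in range(1, n_vars + 1):
        p[2][f'y{i}'] = {'a': [('d0', 1.0)], 'b': [('d0', 1.0)]}
        g[2][f'y{i}'] = {'a': [(1, 1.0)], 'b': [(-1, 1.0)]}

    return p, g
```

For instance, `mdp_from_3sat([(2, -4, 5)], n_vars=5)` encodes the single clause $x_2 \vee \bar{x}_4 \vee x_5$ used as an example above.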
V. EXACT ALGORITHMS

The comparison and complexity results of the preceding two sections indicate that the policy classes $\Pi_{t,s}$, $\Pi_{t,s,w}$, $\Pi_{t,s,u}$, and $\Pi_h$ are inferior to the class $\Pi_{h,u}$, and furthermore some of them ($\Pi_{t,s}$, $\Pi_{t,s,u}$) appear to have higher complexity. Thus, there is no reason to consider them further. While the problem MV-MDP($\Pi_{h,u}$) is NP-hard, there is still a possibility for approximate or pseudopolynomial time algorithms. In this section, we focus on exact pseudopolynomial time algorithms.

Our approach involves an augmented state, defined by $X_t = (S_t, W_t)$. Let $\mathcal{X}$ be the set of all possible values of the augmented state. Let $|\mathcal{S}|$ be the cardinality of the set $\mathcal{S}$, and let $|\mathcal{R}|$ be the cardinality of the set $\mathcal{R}$. Recall also that $K = \max_{r \in \mathcal{R}} |r|$. If we assume that the immediate rewards are integers, then $W_t$ is an integer between $-KT$ and $KT$. In this case, the cardinality $|\mathcal{X}|$ of the augmented state space $\mathcal{X}$ is bounded by $|\mathcal{S}| \cdot (2KT + 1)$, which is polynomial. Without the integrality assumption, the cardinality of the set $\mathcal{X}$ remains finite, but it can increase exponentially with $T$. For this reason, we study the integer case separately in Section V-B.

A. State-Action Frequencies

In this section, we provide some results on the representation of MDPs in terms of a state-action frequency polytope, thus setting the stage for our subsequent algorithms. For any policy $\pi \in \Pi_{h,u}$, and any $x \in \mathcal{X}$, $a \in \mathcal{A}$, we define the state-action frequencies at time $t$ by
$$z^\pi_t(x, a) = P^\pi(X_t = x, A_t = a), \qquad t = 0, 1, \ldots, T-1,$$
and
$$z^\pi_t(x) = P^\pi(X_t = x), \qquad t = 0, 1, \ldots, T.$$
Let $z^\pi$ be a vector that lists all of the above defined state-action frequencies. For any family $\Pi$ of policies, let $Z(\Pi) = \{z^\pi \mid \pi \in \Pi\}$. The following result is well known (e.g., Altman, 1999). It asserts that any feasible state-action frequency vector can be attained by policies that depend only on time, the (augmented) state, and a randomization variable. Furthermore, the set of feasible state-action frequency vectors is a polyhedron, hence amenable to linear programming methods.

Theorem 5. (a) We have $Z(\Pi_{h,u}) = Z(\Pi_{t,s,w,u})$. (b) The set $Z(\Pi_{h,u})$ is a polyhedron, specified by $O(T \cdot |\mathcal{X}| \cdot |\mathcal{A}|)$ linear constraints.

Note that a certain mean-variance pair $(\lambda, v)$ is attainable by a policy in $\Pi_{h,u}$ if and only if there exists some $z \in Z(\Pi_{h,u})$ that satisfies
$$\sum_{(s,w) \in \mathcal{X}} w\, z_T(s, w) = \lambda, \qquad (1)$$
$$\sum_{(s,w) \in \mathcal{X}} w^2 z_T(s, w) = v + \lambda^2. \qquad (2)$$
Furthermore, since $Z(\Pi_{h,u}) = Z(\Pi_{t,s,w,u})$, it follows that if a pair $(\lambda, v)$ is attainable by a policy in $\Pi_{h,u}$, it is also attainable by a policy in $\Pi_{t,s,w,u}$. This establishes the following result.

Theorem 6. The policy classes $\Pi_{h,u}$ and $\Pi_{t,s,w,u}$ are equivalent.

Note that checking the feasibility of the conditions $z \in Z(\Pi_{h,u})$, (1), and (2) amounts to solving a linear programming problem, with a number of constraints proportional to the cardinality of the augmented state space $\mathcal{X}$ and, therefore, in general, exponential in $T$.
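The following sketch sets up exactly this feasibility check (our own illustration using SciPy; it assumes the `FiniteHorizonMDP` format from Section II and, to keep the augmented state space small, integer rewards): the flow constraints defining $Z(\Pi_{h,u})$ on the augmented state space together with the moment constraints (1)-(2), i.e., a test of whether $(J^\pi, V^\pi) = (\lambda, v)$ is attainable.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def mv_pair_feasible(mdp, lam, v):
    """Check whether some z in Z(Pi_{h,u}) satisfies Eqs. (1)-(2), i.e. whether a
    policy with E[W_T] = lam and E[W_T^2] = v + lam^2 exists (sketch)."""
    T = mdp.T
    # Forward closure: reachable augmented states (s, w) at each time.
    layers = [{(mdp.init_state, 0)}]
    for t in range(T):
        nxt = set()
        for (s, w) in layers[t]:
            for a in mdp.p[t][s]:
                for (s2, _), (r, _) in product(mdp.p[t][s][a], mdp.g[t][s][a]):
                    nxt.add((s2, w + r))
        layers.append(nxt)

    # Decision variables: z_t(x, a) for t < T, and z_T(x).
    var_index, n = {}, 0
    for t in range(T):
        for x in layers[t]:
            for a in mdp.p[t][x[0]]:
                var_index[(t, x, a)] = n; n += 1
    for x in layers[T]:
        var_index[(T, x)] = n; n += 1

    A_eq, b_eq = [], []
    def row(entries, rhs):
        r = np.zeros(n)
        for idx, coef in entries:
            r[idx] += coef
        A_eq.append(r); b_eq.append(rhs)

    # Initial condition: frequencies at time 0 sum to one at the initial augmented state.
    row([(var_index[(0, (mdp.init_state, 0), a)], 1.0) for a in mdp.p[0][mdp.init_state]], 1.0)

    # Flow conservation on the augmented chain, for t = 1, ..., T.
    for t in range(1, T + 1):
        for x in layers[t]:
            out = ([(var_index[(t, x, a)], 1.0) for a in mdp.p[t][x[0]]] if t < T
                   else [(var_index[(T, x)], 1.0)])
            incoming = []
            for y in layers[t - 1]:
                for a in mdp.p[t - 1][y[0]]:
                    for (s2, ps), (r, pr) in product(mdp.p[t - 1][y[0]][a], mdp.g[t - 1][y[0]][a]):
                        if (s2, y[1] + r) == x:
                            incoming.append((var_index[(t - 1, y, a)], -ps * pr))
            row(out + incoming, 0.0)

    # Moment constraints (1)-(2) on the terminal distribution.
    row([(var_index[(T, (s, w))], float(w)) for (s, w) in layers[T]], lam)
    row([(var_index[(T, (s, w))], float(w) ** 2) for (s, w) in layers[T]], v + lam ** 2)

    res = linprog(np.zeros(n), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * n, method="highs")
    return bool(res.success)
```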
B. Integer Rewards

In this section, we assume that the immediate rewards are integers, with absolute value bounded by $K$, and we show that pseudopolynomial time algorithms are possible. Recall that an algorithm is a pseudopolynomial time algorithm if its running time is polynomial in $K$ and the instance size. (This is in contrast to polynomial time algorithms, in which the running time can only grow as a polynomial of $\log K$.)

Theorem 7. Suppose that the immediate rewards are integers, with absolute value bounded by $K$. Consider the following two problems: (i) determine whether there exists a policy in $\Pi_{h,u}$ for which $(J^\pi, V^\pi) = (\lambda, v)$, where $\lambda$ and $v$ are given rational numbers; and, (ii) determine whether there exists a policy in $\Pi_{h,u}$ for which $J^\pi = \lambda$ and $V^\pi \le v$, where $\lambda$ and $v$ are given rational numbers. Then, (a) these two problems admit a pseudopolynomial time algorithm; and, (b) unless P=NP, these problems cannot be solved in polynomial time.

Proof. (a) As already discussed, these problems amount to solving a linear program. In the integer case, the number of variables and constraints is bounded by a polynomial in $K$ and the instance size. The result follows because linear programming can be solved in polynomial time. (b) This is proved by considering the special case where $\lambda = v = 0$ and the exact same argument as in the proof of Theorem 3. q.e.d.

Similar to constrained MDPs, mean-variance optimization involves two different performance criteria. Unfortunately, however, the linear programming approach to constrained MDPs does not translate into an algorithm for the problem MV-MDP($\Pi_{h,u}$). The reason is that the set $P_{MV} = \{(J^\pi, V^\pi) \mid \pi \in \Pi_{h,u}\}$ of achievable mean-variance pairs need not be convex. To bring the constrained MDP methodology to bear on our problem, instead of focusing on the pair $(J^\pi, V^\pi)$, we define $Q^\pi = E^\pi[W_T^2]$, and focus on the pair $(J^\pi, Q^\pi)$. This is now a pair of objectives that depend linearly on the state frequencies associated with the final augmented state $X_T$. Accordingly, we define $P_{MQ} = \{(J^\pi, Q^\pi) \mid \pi \in \Pi_{h,u}\}$. Note that $P_{MQ}$ is a polyhedron, because it is the image of the polyhedron $Z(\Pi_{h,u})$ under the linear mapping specified by the left-hand sides of Eqs. (1)-(2). In contrast, $P_{MV}$ is the image of $P_{MQ}$ under a nonlinear mapping: $P_{MV} = \{(\lambda, q - \lambda^2) \mid (\lambda, q) \in P_{MQ}\}$, and is not, in general, a polyhedron.

As a corollary of the above discussion, and for the case of integer rewards, we can exploit convexity to devise pseudopolynomial algorithms for problems that can be formulated in terms of the convex set $P_{MQ}$. On the other hand, because of the non-convexity of $P_{MV}$, we have not been able to devise pseudopolynomial time algorithms for the problem MV-MDP($\Pi_{h,u}$), or even the simpler problem of deciding whether there exists a policy $\pi \in \Pi_{h,u}$ that satisfies $V^\pi \le v$, for some given number $v$, except for the very special case where $v = 0$, which is the subject of our next result. For a general $v$, an approximation algorithm will be presented in the next section.

Theorem 8. (a) If there exists some $\pi \in \Pi_{h,u}$ for which $V^\pi = 0$, then there exists some $\pi' \in \Pi_{t,s,w}$ for which $V^{\pi'} = 0$. (b) Suppose that the immediate rewards are integers, with absolute value bounded by $K$. Then the problem of determining whether there exists a policy $\pi \in \Pi_{h,u}$ for which $V^\pi = 0$ admits a pseudopolynomial time algorithm.

Proof. (a) Suppose that there exists some $\pi \in \Pi_{h,u}$ for which $V^\pi = 0$. By Theorem 6, $\pi$ can be assumed, without loss of generality, to lie in $\Pi_{t,s,w,u}$. Let $\mathrm{Var}^\pi(W_T \mid U_{0:T})$ be the conditional variance of $W_T$, conditioned on the realization of the randomization variables $U_{0:T}$. We have $\mathrm{Var}^\pi(W_T) \ge E^\pi[\mathrm{Var}^\pi(W_T \mid U_{0:T})]$, which implies that there exists some $u_{0:T}$ such that $\mathrm{Var}^\pi(W_T \mid U_{0:T} = u_{0:T}) = 0$. By fixing the randomization variables to this particular $u_{0:T}$, we obtain a deterministic policy in $\Pi_{t,s,w}$ under which the reward variance is zero.

(b) If there exists a policy under which $V^\pi = 0$, then there exists an integer $k$, with $|k| \le KT$, such that, under this policy, $W_T$ is guaranteed to be equal to $k$. Thus, we only need to check, for each $k$ in the relevant range, whether there exists a policy such that $(J^\pi, V^\pi) = (k, 0)$. By Theorem 7, this can be done in pseudopolynomial time. q.e.d.

The approach in the proof of part (b) above leads to a short argument, but yields a rather inefficient (albeit pseudopolynomial) algorithm. A much more efficient and simple algorithm is obtained by realizing that the question of whether $W_T$ can be forced to be $k$, with probability 1, is just a reachability game: the decision maker picks the actions and an adversary picks the ensuing transitions and rewards (among those that have positive probability of occurring). The decision maker wins the game if it can guarantee that $W_T = k$. Such sequential games are easy to solve in time polynomial in the number of (augmented) states, decisions, and the time horizon, by a straightforward backward recursion. On the other hand, a genuinely polynomial time algorithm does not appear to be possible; indeed, the proof of Theorem 3 shows that the problem is NP-complete.
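A sketch of that backward recursion (our own illustration on the `FiniteHorizonMDP` format from Section II; integer rewards are assumed, so the target value $k$ ranges over $\{-KT, \ldots, KT\}$):

```python
from functools import lru_cache

def can_force_total_reward(mdp, k):
    """Reachability-game recursion: can the decision maker guarantee W_T = k with
    probability 1, when transitions and rewards of positive probability are adversarial?"""
    @lru_cache(maxsize=None)
    def win(t, s, w):
        if t == mdp.T:
            return w == k
        return any(
            # Every positive-probability (next state, reward) pair must stay winning.
            all(win(t + 1, s2, w + r)
                for (s2, ps) in mdp.p[t][s][a] if ps > 0
                for (r, pr) in mdp.g[t][s][a] if pr > 0)
            for a in mdp.p[t][s]
        )
    return win(0, mdp.init_state, 0)

def zero_variance_achievable(mdp, K):
    """Theorem 8(b): is there a policy with V^pi = 0?  K bounds |reward|."""
    return any(can_force_total_reward(mdp, k) for k in range(-K * mdp.T, K * mdp.T + 1))
```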
VI. APPROXIMATION ALGORITHMS

In this section, we deal with the optimization counterparts of the problem MV-MDP($\Pi_{h,u}$). We are interested in computing approximately the following two functions:
$$v^*(\lambda) = \inf_{\{\pi \in \Pi_{h,u} :\, J^\pi \ge \lambda\}} V^\pi, \qquad (3)$$
and
$$\lambda^*(v) = \sup_{\{\pi \in \Pi_{h,u} :\, V^\pi \le v\}} J^\pi. \qquad (4)$$
If the constraint $J^\pi \ge \lambda$ (respectively, $V^\pi \le v$) is infeasible, we use the standard convention $v^*(\lambda) = \infty$ (respectively, $\lambda^*(v) = -\infty$). Note that the infimum and supremum in the above definitions are both attained, because the set $P_{MV}$ of achievable mean-variance pairs is the image of the polyhedron $P_{MQ}$ under a continuous map, and is therefore compact.

We do not know how to efficiently compute or even generate a uniform approximation of either $v^*(\lambda)$ or $\lambda^*(v)$ (i.e., find a value $v'$ between $v^*(\lambda) - \epsilon$ and $v^*(\lambda) + \epsilon$, and similarly for $\lambda^*(v)$). In the following two results we consider a weaker notion of approximation that is computable in pseudopolynomial time. We discuss $v^*(\lambda)$, as the issues for $\lambda^*(v)$ are similar. For any positive $\epsilon$ and $\nu$, we will say that $\hat{v}(\cdot)$ is an $(\epsilon, \nu)$-approximation of $v^*(\cdot)$ if, for every $\lambda$,
$$v^*(\lambda - \nu) - \epsilon \le \hat{v}(\lambda) \le v^*(\lambda + \nu) + \epsilon. \qquad (5)$$
This is an approximation of the same kind as those considered in Papadimitriou and Yannakakis (2000): it returns a value $\hat{v}$ such that $(\lambda, \hat{v})$ is an element of the "$(\epsilon + \nu)$-approximate Pareto boundary" of the set $P_{MV}$. For a different view, the graph of the function $\hat{v}(\cdot)$ is within Hausdorff distance $\epsilon + \nu$ from the graph of the function $v^*(\cdot)$.
We will show how to compute an $(\epsilon, \nu)$-approximation in time which is pseudopolynomial, and polynomial in the parameters $1/\epsilon$ and $1/\nu$. We start in Section VI-A with the case of integer rewards, and build on the pseudopolynomial time algorithms of the preceding section. We then consider the case of general rewards in Section VI-B. We finally sketch an alternative algorithm in Section VI-C based on set-valued dynamic programming.

A. Integer Rewards

In this section, we prove the following result.

Theorem 9. Suppose that the immediate rewards are integers. There exists an algorithm that, given $\epsilon$, $\nu$, and $\lambda$, outputs a value $\hat{v}(\lambda)$ that satisfies (5), and which runs in time polynomial in $|\mathcal{S}|$, $|\mathcal{A}|$, $T$, $K$, $1/\epsilon$, and $1/\nu$.

Proof. Since the rewards are bounded in absolute value by $K$, we have $v^*(\lambda) = \infty$ for $\lambda > KT$ and $v^*(\lambda) = v^*(-KT)$ for $\lambda < -KT$. For this reason, we only need to consider $\lambda \in [-KT, KT]$. To simplify the presentation, we assume that $\epsilon = \nu$. We let $\delta$ be such that $\epsilon = 3\delta KT$.

The algorithm is as follows. We consider grid points $\lambda_i$ defined by $\lambda_i = -KT + (i-1)\delta$, $i = 1, \ldots, n$, where $n$ is chosen so that $\lambda_{n-1} \le KT$ and $\lambda_n > KT$. Note that $n = O(KT/\delta)$. For $i = 1, \ldots, n-1$, we calculate $\hat{q}(\lambda_i)$, the smallest possible value of $E[W_T^2]$ when $E[W_T]$ is restricted to lie in $[\lambda_i, \lambda_{i+1}]$. Formally,
$$\hat{q}(\lambda_i) = \min\big\{ q \;\big|\; \exists\, \lambda' \in [\lambda_i, \lambda_{i+1}] \ \text{s.t.}\ (\lambda', q) \in P_{MQ} \big\}.$$
We let $\hat{u}(\lambda_i) = \hat{q}(\lambda_i) - \lambda_{i+1}^2$, which can be interpreted as an estimate of the least possible variance when $E[W_T]$ is restricted to the interval $[\lambda_i, \lambda_{i+1}]$. Finally, we set
$$\hat{v}(\lambda) = \min_{i \ge k} \hat{u}(\lambda_i), \qquad \text{if } \lambda \in [\lambda_k, \lambda_{k+1}].$$
The main computational effort is in computing $\hat{q}(\lambda_i)$ for every $i$. Since $P_{MQ}$ is a polyhedron, this amounts to solving $O(KT/\delta)$ linear programming problems. Thus, the running time of the algorithm has the claimed properties.

We now prove correctness. Let $q^*(\lambda) = \min\{q \mid (\lambda, q) \in P_{MQ}\}$, and $u^*(\lambda) = q^*(\lambda) - \lambda^2$, which is the least possible variance for a given value of $\lambda$. Note that $v^*(\lambda) = \min\{u^*(\lambda') \mid \lambda' \ge \lambda\}$. We have $\hat{q}(\lambda_i) \le q^*(\lambda')$, for all $\lambda' \in [\lambda_i, \lambda_{i+1}]$. Also, $-\lambda_{i+1}^2 \le -(\lambda')^2$, for all $\lambda' \in [\lambda_i, \lambda_{i+1}]$. By adding these two inequalities, we obtain $\hat{u}(\lambda_i) \le u^*(\lambda')$, for all $\lambda' \in [\lambda_i, \lambda_{i+1}]$. Given some $\lambda$, let $k$ be such that $\lambda \in [\lambda_k, \lambda_{k+1}]$. Then,
$$\hat{v}(\lambda) = \min_{i \ge k} \hat{u}(\lambda_i) \le \min_{\lambda' \ge \lambda_k} u^*(\lambda') \le \min_{\lambda' \ge \lambda} u^*(\lambda') = v^*(\lambda),$$
so that $\hat{v}(\lambda)$ is always an underestimate of $v^*(\lambda)$.

We now prove a reverse inequality. Fix some $\lambda$ and let $k$ be such that $\lambda \in [\lambda_k, \lambda_{k+1}]$. Let $i \ge k$ be such that $\hat{v}(\lambda) = \hat{u}(\lambda_i)$. Let also $\bar{\lambda} \in [\lambda_i, \lambda_{i+1}]$ be such that $q^*(\bar{\lambda}) = \hat{q}(\lambda_i)$. Note that
$$\lambda_{i+1}^2 - \bar{\lambda}^2 \le \lambda_{i+1}^2 - \lambda_i^2 = \delta(\lambda_i + \lambda_{i+1}) \le 2\delta(KT + \delta) \le 3\delta KT. \qquad (6)$$
Then,
$$\hat{v}(\lambda) \overset{(a)}{=} \hat{u}(\lambda_i) \overset{(b)}{=} \hat{q}(\lambda_i) - \lambda_{i+1}^2 \overset{(c)}{=} q^*(\bar{\lambda}) - \lambda_{i+1}^2 \overset{(d)}{\ge} q^*(\bar{\lambda}) - \bar{\lambda}^2 - 3\delta KT \overset{(e)}{=} u^*(\bar{\lambda}) - 3\delta KT \overset{(f)}{\ge} v^*(\bar{\lambda}) - 3\delta KT \overset{(g)}{\ge} v^*(\lambda - \delta) - 3\delta KT \overset{(h)}{\ge} v^*(\lambda - \epsilon) - \epsilon.$$
In the above, (a) holds by the definition of $i$; (b) by the definition of $\hat{u}(\lambda_i)$; (c) by the definition of $\bar{\lambda}$; and (d) follows from Eq. (6). Equality (e) follows from the definition of $u^*(\cdot)$. Inequality (f) follows from the definition of $v^*(\cdot)$; and (g) is obtained because $v^*(\cdot)$ is nondecreasing and because $\bar{\lambda} \ge \lambda - \delta$. (The latter fact is seen as follows: (i) if $i > k$, then $\lambda \le \lambda_{k+1} \le \lambda_i \le \bar{\lambda}$; (ii) if $i = k$, then both $\lambda$ and $\bar{\lambda}$ belong to $[\lambda_k, \lambda_{k+1}]$, and their difference is at most $\delta$.) Inequality (h) is obtained because of the definition $\epsilon = 3\delta KT$, the observation $\delta < \epsilon$, and the monotonicity of $v^*(\cdot)$. q.e.d.
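The grid computation can be sketched as follows (our own illustration; `min_second_moment(lo, hi)` is an assumed helper returning the smallest $E[W_T^2]$ over policies with $E[W_T] \in [\text{lo}, \text{hi}]$, or infinity if that interval is infeasible, obtainable from the linear program of Section V-A by relaxing Eq. (1) to an interval and minimizing the left-hand side of Eq. (2)):

```python
import math

def approximate_v_star(mdp, lam, K, eps, min_second_moment):
    """Sketch of the grid algorithm in the proof of Theorem 9, with nu = eps."""
    T = mdp.T
    delta = eps / (3 * K * T)                    # eps = 3 * delta * K * T
    n = math.ceil(2 * K * T / delta) + 1         # last grid point exceeds K*T
    grid = [-K * T + i * delta for i in range(n + 1)]

    # u_hat[i] estimates the least variance with E[W_T] restricted to [grid[i], grid[i+1]].
    u_hat = []
    for i in range(n):
        q = min_second_moment(grid[i], grid[i + 1])
        u_hat.append(q - grid[i + 1] ** 2 if q < math.inf else math.inf)

    # v_hat(lam) = minimum over cells at or to the right of the cell containing lam.
    k = max(0, min(n - 1, int((lam + K * T) // delta)))
    return min(u_hat[k:], default=math.inf)
```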
B. General Rewards

When rewards are arbitrary, we can discretize the rewards and obtain a new MDP. The new MDP is equivalent to one with integer rewards, to which the algorithm of the preceding subsection can be applied. This is a legitimate approximation algorithm for the original problem because, as we will show shortly, the function $v^*(\cdot)$ changes very little when we discretize using a fine enough discretization.

We are given an original MDP $M = (T, \mathcal{S}, \mathcal{A}, \mathcal{R}, p, g)$ in which the rewards are rational numbers in the interval $[-K, K]$, and an approximation parameter $\epsilon$. We fix a positive number $\delta$, a discretization parameter whose value will be specified later. We then construct a new MDP $M' = (T, \mathcal{S}, \mathcal{A}, \mathcal{R}', p, g')$, in which the rewards are rounded down to an integer multiple of $\delta$. More precisely, all elements of the reward range $\mathcal{R}'$ are integer multiples of $\delta$, and for every $(t, s, a) \in \{0, 1, \ldots, T-1\} \times \mathcal{S} \times \mathcal{A}$ and any integer $n$, we have
$$g'_t(\delta n \mid s, a) = \sum_{r :\, \delta n \le r < \delta(n+1)} g_t(r \mid s, a).$$
We denote by $J$, $Q$ and by $J'$, $Q'$ the first and second moments of the total reward in the original and new MDPs, respectively. Let $\Pi_{h,u}$ and $\Pi'_{h,u}$ be the sets of (randomized, history-based) policies in $M$ and $M'$, respectively. Let $P_{MQ}$ and $P'_{MQ}$ be the associated polyhedra.

We want to argue that the mean-variance tradeoff curves for the two MDPs are close to each other. This is not entirely straightforward because the augmented state spaces (which include the possible values of the cumulative rewards $W_t$) are different for the two problems and, therefore, the sets of policies are also different. A conceptually simple but somewhat tedious approach involves an argument along the lines of Whitt (1978, 1979), generalized to the case of constrained MDPs; we outline such an argument in Section VI-C. Here, we follow an alternative approach, based on a coupling argument.

Proposition 1. The Hausdorff distance between $P_{MQ}$ and $P'_{MQ}$ is bounded above by $2KT^2\delta$. More precisely, (a) for every policy $\pi \in \Pi_{h,u}$, there exists a policy $\pi' \in \Pi'_{h,u}$ such that
$$\max\big\{ |J'^{\pi'} - J^\pi|,\; |Q'^{\pi'} - Q^\pi| \big\} \le 2KT^2\delta;$$
(b) conversely, for every policy $\pi' \in \Pi'_{h,u}$, there exists a policy $\pi \in \Pi_{h,u}$ such that the above inequality again holds.

Proof. We denote by $d(r)$ the discretized value of a reward $r$, that is, $d(r) = \max\{n\delta : n\delta \le r,\ n \in \mathbb{Z}\}$. Let us consider a third MDP $M''$ which is identical to $M'$, except that its rewards $R''_t$ are generated as follows. (We follow the convention of using a single or double prime to indicate variables associated with $M'$ or $M''$, respectively.) A random variable $R_t$ is generated according to the distribution prescribed by $g_t(r \mid s_t, a_t)$, and its value is observed by the decision maker, who then incurs the reward $R''_t = d(R_t)$. Let $P''_{MQ}$ be the polyhedron associated with $M''$.
We claim that $P''_{MQ} = P'_{MQ}$. The only difference between $M'$ and $M''$ is that the decision maker in $M''$ has access to the additional information $R_t - d(R_t)$. However, this information is inconsequential: it does not affect the future transition probabilities or reward distributions. Thus, $R_t - d(R_t)$ can only be useful as an additional randomization variable. Since $P'_{MQ}$ is the set of achievable pairs using general (history-based randomized) policies, having available an additional randomization variable does not change the polyhedron, and $P''_{MQ} = P'_{MQ}$. Thus, to complete the proof it suffices to show that the polyhedra $P_{MQ}$ and $P''_{MQ}$ are close.

Let us compare the MDPs $M$ and $M''$. The information available to the decision maker is the same for these two MDPs (since all the history of reward truncations $\{R_\tau - d(R_\tau)\}_{\tau=0}^{t-1}$ is available in $M''$ for the decision at time $t$). Therefore, for every policy in one MDP, there exists a policy for the other under which (if we define the two MDPs on a common probability space, involving common random generators) the exact same sequence of states ($S_t = S''_t$), actions ($A_t = A''_t$), and random variables $R_t$ is realized. The only difference is that the rewards are $R_t$ and $d(R_t)$, in $M$ and $M''$, respectively. Recall that $0 \le R_t - d(R_t) \le \delta$. We obtain that for every policy $\pi \in \Pi_{h,u}$, there exists a policy $\pi''$ for $M''$ for which
$$0 \le W_T - W''_T = \sum_{\tau=0}^{T-1} \big(R_\tau - d(R_\tau)\big) \le \delta T,$$
and therefore $|W_T^2 - (W''_T)^2| \le 2KT^2\delta$. Taking expectations, we obtain $|J^\pi - J''^{\pi''}| \le T\delta$ and $|Q^\pi - Q''^{\pi''}| \le 2KT^2\delta$. This completes the proof of part (a). The proof of part (b) is identical. q.e.d.

Theorem 10. There exists an algorithm that, given $\epsilon$, $\nu$, and $\lambda$, outputs a value $\hat{v}(\lambda)$ that satisfies (5), and which runs in time polynomial in $|\mathcal{S}|$, $|\mathcal{A}|$, $T$, $K$, $1/\epsilon$, and $1/\nu$.

Proof. Assume for simplicity that $\nu = \epsilon$. Given the value of $\epsilon$, let $\delta$ be such that $\epsilon/2 = 2KT^2\delta$, and construct the discretized MDP $M'$. Run the algorithm from Theorem 9 to find an $(\epsilon/2, \epsilon/2)$-approximation $\hat{v}$ for $M'$. Using Proposition 1, it is not hard to verify that this yields an $(\epsilon, \epsilon)$-approximation of $v^*(\lambda)$. q.e.d.
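A sketch of the rounding step and of the choice of $\delta$ from the proof of Theorem 10 (our own illustration; it operates on the reward tables of the tabular format from Section II, and exact rational arithmetic, e.g. `fractions.Fraction`, would avoid the floating-point caveat noted in the comments):

```python
import math

def discretize_rewards(g, delta):
    """Build g' as in Section VI-B: each reward r is replaced by d(r) = floor(r/delta)*delta,
    and probabilities of rewards falling in the same cell are merged.

    g[t][s][a] is a list of (reward, prob) pairs.  (With floating point, floor(r/delta)*delta
    may differ from the exact multiple of delta by rounding error; rationals avoid this.)
    """
    g_prime = []
    for g_t in g:
        g_prime.append({})
        for s, by_action in g_t.items():
            g_prime[-1][s] = {}
            for a, dist in by_action.items():
                rounded = {}
                for r, prob in dist:
                    d_r = math.floor(r / delta) * delta
                    rounded[d_r] = rounded.get(d_r, 0.0) + prob
                g_prime[-1][s][a] = sorted(rounded.items())
    return g_prime

def delta_for_accuracy(K, T, eps):
    """Choice of delta in the proof of Theorem 10: eps/2 = 2*K*T^2*delta."""
    return eps / (4 * K * T ** 2)
```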
C. An Exact Algorithm and its Approximation

There are two general approaches for constructing approximation algorithms. (i) One can discretize the problem, to obtain an easier one, and then apply an algorithm specially tailored to the discretized problem; this was the approach in the preceding subsection. (ii) One can design an exact (but inefficient) algorithm for the original problem and then implement the algorithm approximately. This approach will work provided the approximations do not build up excessively in the course of the algorithm. In this subsection, we elaborate on the latter approach.

We defined earlier the polyhedron $P_{MQ}$ as the set of achievable first and second moments of the cumulative reward starting at time zero at the initial state. We extend this definition by considering intermediate times and arbitrary (intermediate) augmented states. We let
$$C_t(s, w) = \big\{ (\lambda, q) : \exists\, \pi \in \Pi_{h,u} \ \text{s.t.}\ E^\pi[W_T \mid S_t = s, W_t = w] = \lambda \ \text{and}\ E^\pi[W_T^2 \mid S_t = s, W_t = w] = q \big\}. \qquad (7)$$
Clearly, $C_0(s, 0) = P_{MQ}$. Using a straightforward backwards induction, it can be shown that $C_t(\cdot, \cdot)$ satisfies the set-valued dynamic programming recursion
$$C_t(s, w) = \mathrm{conv}_{a \in \mathcal{A}} \Big\{ \sum_{s' \in \mathcal{S}} p_t(s' \mid s, a) \sum_{r \in \mathcal{R}} g_t(r \mid s, a)\, C_{t+1}(s', w + r) \Big\}, \qquad (8)$$
for every $s \in \mathcal{S}$, $w \in \mathbb{R}$, and for $t = 0, 1, 2, \ldots, T-1$, initialized with the boundary conditions
$$C_T(s, w) = \{(w, w^2)\}. \qquad (9)$$
(Footnote: If $X$ and $Y$ are subsets of a vector space and $\alpha$ a scalar, we let $\alpha X = \{\alpha x \mid x \in X\}$ and $X + Y = \{x + y \mid x \in X,\ y \in Y\}$. Furthermore, if for every $a \in \mathcal{A}$ we have a set $X_a$, then $\mathrm{conv}_{a \in \mathcal{A}}\{X_a\}$ is the convex hull of the union of these sets.)

A simple inductive proof shows that the sets $C_t(s, w)$ are polyhedra; this is because $C_T(s, w)$ is either empty or a singleton and because the sum or convex hull of finitely many polyhedra is a polyhedron. Thus, the recursion involves a finite amount of computation, e.g., by representing each polyhedron in terms of its finitely many extreme points. In the worst case, this translates into an exponential time algorithm, because of the possibly large number of extreme points. However, such an algorithm can also be implemented approximately. If we allow for the introduction of an $O(\epsilon/T)$ error at each stage (where error is measured in terms of the Hausdorff distance), we can work with approximating polyhedra that involve only $O(1/\epsilon)$ extreme points, while ending up with an $O(\epsilon)$ total error; this is because we are approximating polyhedra in the plane, as opposed to higher dimensions where the dependence on $\epsilon$ would have been worse. The details are straightforward but somewhat tedious and are omitted. On the other hand, in practice, this approach is likely to be faster than the algorithm of the preceding subsection.
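A sketch of an approximate implementation of the recursion (8)-(9) (our own illustration, not the paper's omitted construction; each set is represented by its support points in a fixed number of directions, giving an inner approximation with a bounded number of extreme points in the spirit of the remark above; the `FiniteHorizonMDP` format from Section II is assumed):

```python
import math
from functools import lru_cache

def approx_moment_sets(mdp, n_dirs=64):
    """Approximate set-valued DP for C_t(s, w).

    Each set is tracked through its support points in n_dirs fixed directions.
    Support functions add under weighted Minkowski sums and take the max over the
    union in (8), so the points computed below lie in the true C_t(s, w); their
    convex hull is an inner approximation of it.  Returns a function
    C(t, s, w) -> list of (lambda, q) support points; C(0, init, 0) approximates P_MQ.
    """
    dirs = [(math.cos(2 * math.pi * i / n_dirs), math.sin(2 * math.pi * i / n_dirs))
            for i in range(n_dirs)]

    @lru_cache(maxsize=None)
    def C(t, s, w):
        if t == mdp.T:
            return [(w, w * w)] * len(dirs)          # boundary condition (9)
        points = []
        for d in range(len(dirs)):
            best = None
            for a in mdp.p[t][s]:
                # Support point of the weighted Minkowski sum in direction dirs[d]:
                # the probability-weighted sum of the children's direction-d points.
                lam = q = 0.0
                for (s2, ps) in mdp.p[t][s][a]:
                    for (r, pr) in mdp.g[t][s][a]:
                        pl, pq = C(t + 1, s2, w + r)[d]
                        lam += ps * pr * pl
                        q += ps * pr * pq
                value = dirs[d][0] * lam + dirs[d][1] * q
                if best is None or value > best[0]:
                    best = (value, (lam, q))
            points.append(best[1])                   # conv over actions: keep the maximizer
        return points

    return C
```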
VII. CONCLUSIONS

We have shown that mean-variance optimization problems for MDPs are typically NP-hard, but sometimes admit pseudopolynomial approximation algorithms. We only considered finite horizon problems, but it is clear that the negative results carry over to their infinite horizon counterparts. Furthermore, given that the contribution of the tail of the time horizon in infinite horizon discounted problems (or in "proper" stochastic shortest path problems as in Bertsekas (1995)) can be made arbitrarily small, our approximation algorithms can also yield approximation algorithms for infinite horizon problems.

Two more problems of some interest deal with finding a policy that has the smallest possible, or the largest possible, variance. There is not much we can say here, except for the following:
(a) The smallest possible variance is attained by a deterministic policy, that is, $\min_{\pi \in \Pi_{h,u}} V^\pi = \min_{\pi \in \Pi_h} V^\pi$. This is proved using the inequality $\mathrm{Var}^\pi(W_T) \ge E^\pi[\mathrm{Var}^\pi(W_T \mid U_{0:T})]$.
(b) Variance will be maximized, in general, by a randomized policy. To see this, consider a single stage problem and two actions with deterministic rewards, equal to 0 and 1, respectively. Variance is maximized by assigning probability 1/2 to each of the actions. The variance maximization problem is equivalent to maximizing the concave function $q - \lambda^2$ subject to $(\lambda, q) \in P_{MQ}$. This is a quadratic programming problem over the polyhedron $P_{MQ}$ and therefore admits a pseudopolynomial time algorithm, when the rewards are integer.

Our results suggest several interesting directions for future research, which we briefly outline below.

First, our negative results apply to general MDPs. It would be interesting to determine whether the hardness results remain valid for specially structured MDPs. One possibly interesting special case involves multi-armed bandit problems: there are $n$ separate MDPs ("arms"); at each time step, the decision maker has to decide which MDP to activate, while the other MDPs remain inactive. Of particular interest here are index policies that compute a value ("index") for each MDP and select an MDP with maximal index; such policies are often optimal for the classical formulations (see Gittins (1979) and Whittle (1988)). Obtaining a policy that uses some sort of an index for the mean-variance problem, or alternatively proving that such a policy cannot exist, would be interesting.

Second, a number of complexity questions have been left open. We list a few of them: (a) Is there a pseudopolynomial time algorithm for computing $v^*(\lambda)$ or $\lambda^*(v)$ exactly? (b) Is there a polynomial or pseudopolynomial time algorithm that computes $v^*(\lambda)$ or $\lambda^*(v)$ within a uniform error bound $\epsilon$? (c) Is the problem of computing $\hat{v}(\lambda)$ with the properties in Eq. (5) NP-hard? (d) Is there a pseudopolynomial time algorithm for computing the smallest possible variance in the absence of any constraints on the mean cumulative reward?

Third, bias-variance tradeoffs may play an important role in speeding up certain control and learning heuristics, such as those involving control variates (Meyn, 2008). Perhaps mean-variance optimization can be used to address the exploration/exploitation tradeoff in model-based reinforcement learning, with variance reduction serving as a means to reduce the exploration time (see Sutton and Barto (1998) for a general discussion of exploration-exploitation in reinforcement learning). Of course, in light of the computational complexity of bias-variance tradeoffs, incorporating bias-variance tradeoffs in learning makes sense only if experimentation is nearly prohibitive and computation time is cheap. Such an approach could be particularly useful if a coarse, low-complexity, approximate solution of a bias-variance tradeoff problem can result in significant exploration speedup.

Fourth, we only considered mean-variance tradeoffs in this paper. However, there are other interesting and potentially useful criteria that can be used to incorporate risk into multi-stage decision making. For example, Liu and Koenig (2005) consider a utility function with a single switch. Many other risk-aware criteria have been considered in the single stage case. It would be interesting to develop a comprehensive theory for the complexity of solving multi-stage decision problems under general (monotone convex or concave) utility functions and under risk constraints. This is especially interesting for the approximation algorithms presented in Section VI.

Acknowledgments: This research was partially supported by the Israel Science Foundation (contract 890015), a Horev Fellowship, and the National Science Foundation under grant CMMI-0856063.

REFERENCES

Altman, E. (1999). Constrained Markov decision processes. Chapman and Hall.
Artzner, P., Delbaen, F., Eber, J., & Heath, D. (1999). Coherent measures of risk. Mathematical Finance, 9(3), 203-228.
Bertsekas, D. (1995). Dynamic programming and optimal control. Athena Scientific.
Chung, K., & Sobel, M. (1987). Discounted MDP's: distribution functions and exponential utility maximization. SIAM Journal on Control and Optimization, 25(1), 49-62.
Garey, M. R., & Johnson, D. S. (1979). Computers and intractability: a guide to the theory of NP-completeness. New York: W.H. Freeman.
Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B (Methodological), 41(2), 148-177.
Iyengar, G. (2005). Robust dynamic programming. Mathematics of Operations Research, 30, 257-280.
Le Tallec, Y. (2007). Robust, risk-sensitive, and data-driven control of Markov decision processes. Unpublished doctoral dissertation, Operations Research Center, MIT, Cambridge, MA.
Liu, Y., & Koenig, S. (2005). Risk-sensitive planning with one-switch utility functions: Value iteration. In Proceedings of the Twentieth AAAI Conference on Artificial Intelligence (pp. 993-999).
Luenberger, D. (1997). Investment science. Oxford University Press.
Meyn, S. P. (2008). Control techniques for complex networks. New York, NY: Cambridge University Press.
Nilim, A., & El Ghaoui, L. (2005). Robust Markov decision processes with uncertain transition matrices. Operations Research, 53(5), 780-798.
Papadimitriou, C. H., & Yannakakis, M. (2000). On the approximability of trade-offs and optimal access of web sources. In Proceedings of the 41st Symposium on Foundations of Computer Science (pp. 86-92). Washington, DC, USA.
Riedel, F. (2004). Dynamic coherent risk measures. Stochastic Processes and their Applications, 112, 185-200.
Shapley, L. (1953). Stochastic games. Proceedings of the National Academy of Sciences, 39, 1095-1100.
Sobel, M. (1982). The variance of discounted Markov decision processes. Journal of Applied Probability, 19, 794-802.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.
Whitt, W. (1978). Approximation of dynamic programs – I. Mathematics of Operations Research, 3, 231-243.
Whitt, W. (1979). Approximation of dynamic programs – II. Mathematics of Operations Research, 4, 179-185.
Whittle, P. (1988). Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25, 287-298.
