Algorithms and Bounds for Rollout Sampling Approximate Policy Iteration


Authors: Christos Dimitrakakis, Michail G. Lagoudakis

October 24, 2018

Abstract

Several approximate policy iteration schemes without value functions, which focus on policy representation using classifiers and address policy learning as a supervised learning problem, have been proposed recently. Finding good policies with such methods requires not only an appropriate classifier, but also reliable examples of best actions, covering the state space sufficiently. Up to this time, little work has been done on appropriate covering schemes and on methods for reducing the sample complexity of such methods, especially in continuous state spaces. This paper focuses on the simplest possible covering scheme (a discretized grid over the state space) and performs a sample-complexity comparison between the simplest (and previously commonly used) rollout sampling allocation strategy, which allocates samples equally at each state under consideration, and an almost as simple method, which allocates samples only as needed and requires significantly fewer samples.

1 Introduction

Supervised and reinforcement learning are two well-known learning paradigms, which have been researched mostly independently. Recent studies have investigated using mature supervised learning methods for reinforcement learning [9, 6, 10, 7]. Initial results have shown that policies can be approximately represented using multi-class classifiers, and that it is therefore possible to incorporate classification algorithms within the inner loops of several reinforcement learning algorithms [9, 6, 7]. This viewpoint allows the quantification of the performance of reinforcement learning algorithms in terms of the performance of classification algorithms [10].
While a variety of promising combinations become possible through this synergy, heretofore there have been limited practical results and widely-applicable algorithms. Herein we consider approximate policy iteration algorithms, such as those proposed by Lagoudakis and Parr [9] as well as Fern et al. [6, 7], which do not explicitly represent a value function. At each iteration, a new policy/classifier is produced using training data obtained through extensive simulation (rollouts) of the previous policy on a generative model of the process. These rollouts aim at identifying better action choices over a subset of states in order to form a set of data for training the classifier representing the improved policy.* The major limitation of these algorithms, as also indicated by Lagoudakis and Parr [9], is the large amount of rollout sampling employed at each sampled state. It is hinted, however, that great improvement could be achieved with sophisticated management of sampling. We have verified this intuition in a companion paper [4] that experimentally compared the original approach of uninformed uniform sampling with various intelligent sampling techniques. That paper employed heuristic variants of well-known algorithms for bandit problems, such as Upper Confidence Bounds [1] and Successive Elimination [5], for the purpose of managing rollouts (choosing which state to sample from is similar to choosing which lever to pull on a bandit machine). It should be noted, however, that despite the similarity, rollout management has substantial differences from standard bandit problems, and thus general bandit results are not directly applicable to our case.

* This project was partially supported by the ICIS-IAS project and the Marie Curie International Reintegration Grant MCIRG-CT-2006-044980 awarded to Michail G. Lagoudakis.
The current paper aims to offer a first theoretical insight into the rollout sampling problem. This is done through the analysis of the two simplest sample allocation methods described in [4]: firstly, the old method that simply allocates an equal, fixed number of samples at each state; and secondly, the slightly more sophisticated method of progressively sampling all states where we are not yet reasonably certain which the policy-improving action would be.

The remainder of the paper is organised as follows. Section 2 provides the necessary background, Section 3 discusses related work, and Section 4 introduces the proposed algorithms. Section 5, which contains an analysis of the proposed algorithms, is the main technical contribution.

2 Preliminaries

A Markov Decision Process (MDP) is a 6-tuple (S, A, P, R, γ, D), where S is the state space of the process, A is a finite set of actions, P is a Markovian transition model (P(s, a, s′) denotes the probability of a transition to state s′ when taking action a in state s), R is a reward function (R(s, a) is the expected reward for taking action a in state s), γ ∈ (0, 1] is the discount factor for future rewards, and D is the initial state distribution. A deterministic policy π for an MDP is a mapping π : S → A from states to actions; π(s) denotes the action choice at state s. The value V^π(s) of a state s under a policy π is the expected, total, discounted reward when the process begins in state s and all decisions at all steps are made according to π:

\[ V^\pi(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R\bigl(s_t, \pi(s_t)\bigr) \,\Big|\, s_0 = s,\ s_t \sim P \right]. \tag{1} \]

The goal of the decision maker is to find an optimal policy π* that maximises the expected, total, discounted reward from all states; in other words, V^{π*}(s) ≥ V^π(s) for all policies π and all states s ∈ S.
Policy iteration (PI) is an efficient method for deriving an optimal policy. It generates a sequence π_1, π_2, ..., π_k of gradually improving policies, which terminates when there is no change in the policy (π_k = π_{k−1}); π_k is an optimal policy. Improvement is achieved by computing V^{π_i} analytically (solving the linear Bellman equations) and the action values

\[ Q^{\pi_i}(s, a) = R(s, a) + \gamma \sum_{s'} P(s, a, s')\, V^{\pi_i}(s'), \]

and then determining the improved policy as π_{i+1}(s) = arg max_a Q^{π_i}(s, a). Policy iteration typically terminates in a small number of steps. However, it relies on knowledge of the full MDP model, exact computation and representation of the value function of each policy, and exact representation of each policy. Approximate policy iteration (API) is a family of methods which have been suggested to address the "curse of dimensionality", that is, the huge growth in complexity as the problem grows. In API, value functions and policies are represented approximately in some compact form, but the iterative improvement process remains the same. Consequently, the guarantees for monotonic improvement, optimality, and convergence are compromised. API may never converge; however, in practice it reaches good policies in only a few iterations.

2.1 Rollout estimates

Typically, API employs some representation of the MDP model to compute the value function and derive the improved policy. On the other hand, the Monte-Carlo estimation technique of rollouts provides a way of accurately estimating Q^π at any given state-action pair (s, a) without requiring an explicit MDP model or representation of the value function. Instead, a generative model of the process (a simulator) is used; such a model takes a state-action pair (s, a) and returns a reward r and a next state s′ sampled from R(s, a) and P(s, a, s′) respectively.
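The generative-model interface and the rollout estimate it supports can be sketched as follows. This is an illustrative sketch, not code from the paper: the ten-state chain MDP, the 10% action-slip probability, and all function names are made-up assumptions.

```python
import random

def simulate(s, a):
    """Made-up generative model: takes (s, a), returns (reward, next state).
    A ten-state chain where action 1 moves right, action 0 moves left,
    the action slips (reverses) with probability 0.1, and moving into or
    staying at state 9 yields reward 1."""
    if random.random() < 0.1:
        a = 1 - a
    s_next = min(9, s + 1) if a == 1 else max(0, s - 1)
    return (1.0 if s_next == 9 else 0.0), s_next

def rollout(s, a, policy, T, gamma):
    """One T-horizon rollout from s: take action a first, follow policy after."""
    total, x = 0.0, s
    for t in range(T):
        r, x = simulate(x, a if t == 0 else policy(x))
        total += (gamma ** t) * r
    return total

def q_estimate(s, a, policy, T=20, gamma=0.95, K=10):
    """Monte-Carlo estimate of Q^pi(s, a): average of K rollout returns."""
    return sum(rollout(s, a, policy, T, gamma) for _ in range(K)) / K
```

With the always-right policy `lambda s: 1`, `q_estimate(8, 1, ...)` exceeds `q_estimate(0, 0, ...)`, since the first trajectory reaches the rewarding end of the chain almost immediately.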
A rollout for the state-action pair (s, a) amounts to simulating a single trajectory of the process beginning from state s, choosing action a for the first step, and choosing actions according to the policy π thereafter, up to a certain horizon T. If we denote the sequence of rewards collected during the i-th simulated trajectory as r_t^{(i)}, t = 0, 1, ..., T−1, then the rollout estimate Q̂^{π,T}_K(s, a) of the true state-action value function Q^π(s, a) is the observed total discounted reward, averaged over all K trajectories:

\[ \hat{Q}^{\pi,T}_K(s,a) \triangleq \frac{1}{K} \sum_{i=1}^{K} \tilde{Q}^{\pi,T}_{(i)}(s,a), \qquad \tilde{Q}^{\pi,T}_{(i)}(s,a) \triangleq \sum_{t=0}^{T-1} \gamma^t r^{(i)}_t. \]

Similarly, we define Q^{π,T}(s, a) = E[ Σ_{t=0}^{T−1} γ^t r_t | a_0 = a, s_0 = s, a_t ∼ π, s_t ∼ P ] to be the actual state-action value function up to horizon T. As will be seen later, with a sufficient amount of rollouts and a long horizon T, we can create an improved policy π′ from π at any state s, without requiring a model of the MDP.

3 Related work

Rollout estimates have been used in the Rollout Classification Policy Iteration (RCPI) algorithm [9], which has yielded promising results in several learning domains. However, as stated therein, it is sensitive to the distribution of training states over the state space. For this reason it is suggested to draw states from the discounted future state distribution of the improved policy. This tricky-to-sample distribution, also used by Fern et al. [7], yields better results. One explanation advanced in those studies is the reduction of the potential mismatch between the training and testing distributions of the classifier.
However, in both cases, and irrespectively of the sampling distribution, the main drawback is the excessive computational cost due to the need for lengthy and repeated rollouts to reach a good level of accuracy in the estimation of the value function. In our preliminary experiments with RCPI, it has been observed that most of the effort is spent where the action value differences are either non-existent, or so fine that they require a prohibitive number of rollouts to identify. In this paper, we propose and analyse sampling methods to remove this performance bottleneck. By restricting the sampling distribution to the case of a uniform grid, we compare the fixed allocation algorithm (Fixed) [9, 7], whereby a large fixed amount of rollouts is used for estimating the action values in each training state, to a simple incremental sampling scheme based on counting (Count), where the amount of rollouts in each training state varies. We then derive complexity bounds, which show a clear improvement using Count that depends only on the structure of differential value functions.

We note that Fern et al. [7] presented a related analysis. While they go into considerably more depth with respect to the classifier, their results are not applicable to our framework. This is because they assume that there exists some real number ∆* > 0 which lower-bounds the amount by which the value of an optimal action under any policy exceeds the value of the nearest sub-optimal action in any state s. Furthermore, the algorithm they analyse uses a fixed number of rollouts at each sampled state. For a given minimum ∆* value over all states, they derive the necessary number of rollouts per state to guarantee an improvement step with high probability, but the algorithm offers no practical way to guarantee a high-probability improvement.
We instead derive error bounds for the fixed and counting allocation algorithms. Additionally, we consider continuous, rather than discrete, state spaces. Because of this, technically our analysis is much more closely related to that of Auer et al. [2].

4 Algorithms to reduce sampling cost

The total sampling cost depends on the balance between the number of states sampled and the number of samples per state. In the fixed allocation scheme [9, 7], the same number of K|A| rollouts is allocated to each state in a subset S of states, and all K rollouts dedicated to a single action are exhausted before moving on to the next action. Intuitively, if the desired outcome (superiority of some action) in some state can be confidently determined early, there is no need to exhaust all K|A| rollouts available in that state; the training data could be stored and the state could be removed from the pool without further examination. Similarly, if we can confidently determine that all actions are indifferent in some state, we can simply reject it without wasting any more rollouts; such rejected states could be replaced by fresh ones which might yield meaningful results. These ideas lead to the following question: can we examine all states in S collectively in some interleaved manner, by selecting each time a single state to focus on and allocating rollouts only as needed?

Algorithm 1 SampleState
  Input: state s, policy π, horizon T, discount factor γ
  for each a ∈ A do
    (s′, r) = Simulate(s, a)
    Q̃^π(s, a) = r
    x = s′
    for t = 1 to T − 1 do
      (x′, r) = Simulate(x, π(x))
      Q̃^π(s, a) = Q̃^π(s, a) + γ^t r
      x = x′
    end for
  end for
  return Q̃^π

Selecting states from the state pool could be viewed as a problem akin to a multi-armed bandit problem, where each state corresponds to an arm. Pulling a lever corresponds to sampling the corresponding state once.
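Algorithm 1 translates almost line-for-line into Python. The sketch below is illustrative: the deterministic two-action chain simulator is a made-up stand-in for the generative model, and the `(s', r)` return convention follows the pseudocode.

```python
def simulate(s, a):
    """Made-up deterministic generative model: (s, a) -> (s', r).
    Action 1 moves right on a ten-state chain; moving into or staying
    at state 9 yields reward 1."""
    s_next = min(9, s + 1) if a == 1 else max(0, s - 1)
    return s_next, (1.0 if s_next == 9 else 0.0)

def sample_state(s, policy, T, gamma, actions=(0, 1)):
    """Algorithm 1 (SampleState): one rollout per action from state s,
    returning the single-trajectory estimates Q~pi(s, a)."""
    q = {}
    for a in actions:
        x, r = simulate(s, a)        # first step uses action a
        q[a] = r
        for t in range(1, T):        # thereafter follow the policy
            x, r = simulate(x, policy(x))
            q[a] += (gamma ** t) * r
    return q
```

Each call performs exactly |A| rollouts, the minimum unit of information requested from a state; repeated calls yield independent estimates to be averaged.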
By sampling a state we mean that we perform a single rollout for each action in that state, as shown in Algorithm 1. This is the minimum amount of information we can request from a single state.¹ Thus, the problem is transformed into a variant of the classic multi-armed bandit problem. Several methods have been proposed for various versions of this problem, which could potentially be used in this context. In this paper, apart from the fixed allocation scheme presented above, we also examine a simple counting scheme.

The algorithms presented here maintain an empirical estimate ∆̂^π(s) of the marginal difference between the apparently maximal action and the second best action. This can be represented by the marginal difference in Q^π values in state s, defined as

\[ \Delta^\pi(s) = Q^\pi(s, a^*_{s,\pi}) - \max_{a \neq a^*_{s,\pi}} Q^\pi(s, a), \]

where a*_{s,π} is the action that maximises Q^π in state s: a*_{s,π} = arg max_{a∈A} Q^π(s, a). The case of multiple equivalent maximising actions can be easily handled by generalising to sets of actions in the manner of Fern et al. [7], in particular

\[ A^*_{s,\pi} \triangleq \{ a \in A : Q^\pi(s,a) \ge Q^\pi(s,a'),\ \forall a' \in A \}, \qquad V^\pi_*(s) = \max_{a \in A} Q^\pi(s,a), \]
\[ \Delta^\pi(s) = \begin{cases} V^\pi_*(s) - \max_{a \notin A^*_{s,\pi}} Q^\pi(s,a), & A^*_{s,\pi} \subset A \\ 0, & A^*_{s,\pi} = A. \end{cases} \]

However, here we discuss only the single-best-action case to simplify the exposition. The estimate ∆̂^π(s) is defined using the empirical value function Q̂^π(s, a).

¹ It is possible to also manage sampling of the actions, but herein we are only concerned with the effort saved by managing state sampling.

5 Complexity of sampling-based policy improvement

Rollout algorithms can be used for policy improvement under certain conditions. Bertsekas [3] gives several theorems for policy iteration using rollouts and an approximate value function that satisfies a consistency property. Specifically, Proposition 3.1
therein states that the one-step look-ahead policy π′, computed from the approximate value function V̂^π, has a value function which is better than the current approximation V̂^π, if

\[ \max_{a \in A} \mathbb{E}\bigl[ r_{t+1} + \gamma \hat{V}^\pi(s_{t+1}) \,\big|\, \pi', s_t = s, a_t = a \bigr] \ge \hat{V}^\pi(s) \]

for all s ∈ S. It is easy to see that an approximate value function that uses only sampled trajectories from a fixed policy π satisfies this property if we have an adequate number of samples. While this assures us that we can perform rollouts at any state in order to improve upon the given policy, it does not lend itself directly to policy iteration. That is, with no way to compactly represent the resulting rollout policy, we would be limited to performing deeper and deeper tree searches in rollouts.

In this section we shall give conditions that allow policy iteration through compact representation of rollout policies via a grid and a finite number of sampled states and sample trajectories with a finite horizon. Following this, we will analyse the complexity of the fixed sampling allocation scheme employed in [9, 7] and compare it with an oracle that needs only one sample to determine a*_{s,π} for any s ∈ S, as well as with a simple counting scheme.

5.1 Sufficient conditions

Assumption 1 (Bounded finite-dimension state space) The state space S is a compact subset of [0, 1]^d.

This assumption can be generalised to other bounded state spaces easily. However, it is necessary in order to be able to place some minimal constraints on the search.

Assumption 2 (Bounded rewards) R(s, a) ∈ [0, 1] for all a ∈ A, s ∈ S.

This assumption bounds the reward function and can also be generalised easily to other bounding intervals.

Assumption 3 (Hölder Continuity) For any policy π ∈ Π, there exist L, α ∈ [0, 1] such that for all states s, s′ ∈ S,

\[ |Q^\pi(s, a) - Q^\pi(s', a)| \le \frac{L}{2} \|s - s'\|_\infty^\alpha. \]
This assumption ensures that the value function Q^π is fairly smooth. In conjunction with Assumptions 1 and 2, it trivially follows that Q^π and ∆^π are bounded everywhere in S if they are bounded for at least one s ∈ S. Furthermore, the following holds:

Remark 5.1 Given that, by definition, Q^π(s, a*_{s,π}) ≥ ∆^π(s) + Q^π(s, a) for all a ≠ a*_{s,π}, it follows from Assumption 3 that

\[ Q^\pi(s', a^*_{s,\pi}) \ge Q^\pi(s', a), \]

for all s′ ∈ S such that ‖s − s′‖_∞ ≤ (∆^π(s)/L)^{1/α}.

This remark implies that the best action in some state s according to Q^π will also be the best action in a neighbourhood of states around s. This is a reasonable condition, as there would be no chance of obtaining a reasonable estimate of the best action in any region from a single point if Q^π could change arbitrarily fast. We assert that MDPs with a similar smoothness property on their transition distribution will also satisfy this assumption.

Finally, we need an assumption that limits the total number of rollouts that we need to take, as states with a smaller ∆^π will need more rollouts.

Assumption 4 (Measure) If µ{S} denotes the Lebesgue measure of set S, then, for any π ∈ Π, there exist M, β > 0 such that µ{s ∈ S : ∆^π(s) < ε} < Mε^β for all ε > 0.

This assumption effectively limits the number of times value-function changes lead to best-action changes, as well as the ratio of states where the action values are close. This assumption, together with the Hölder continuity assumption, imposes a certain structure on the space of value functions. We are thus guaranteed that the value function of any policy results in an improved policy which is not arbitrarily complex. This, in turn, implies that an optimal policy cannot be arbitrarily complex either.

A final difficulty is determining whether there exists some sufficient horizon T_0 beyond which it is unnecessary to go.
Unfortunately, even though for any state s for which Q^π(s, a′) > Q^π(s, a) there exists T_0(s) such that Q^{π,T}(s, a′) > Q^{π,T}(s, a) for all T > T_0(s), T_0 grows without bound as we approach a point where the best action changes. However, by selecting a fixed, sufficiently large rollout horizon, we can still behave optimally with respect to the true value function in a compact subset of S.

Lemma 5.1 For any policy π ∈ Π and ε > 0, there exists a finite T_ε > 0 and a compact subset S_ε ⊂ S such that

\[ Q^{\pi,T}(s, a^*_{s,\pi}) \ge Q^{\pi,T}(s, a) \quad \forall a \in A,\ s \in S_\varepsilon,\ T > T_\varepsilon, \]

where a*_{s,π} ∈ A is such that Q^π(s, a*_{s,π}) ≥ Q^π(s, a) for all a ∈ A.

Proof From the above assumptions it follows directly that for any ε > 0 there exists a compact set of states S_ε ⊂ S such that Q^π(s, a*_{s,π}) ≥ Q^π(s, a′) + ε for all s ∈ S_ε, with a′ = arg max_{a ≠ a*_{s,π}} Q^π(s, a). Now let x_T ≜ Q^{π,T}(s, a*_{s,π}) − Q^{π,T}(s, a′). Then x_∞ ≜ lim_{T→∞} x_T ≥ ε. For any s ∈ S_ε the limit exists, and thus by definition there exists T_ε(s) such that x_T > 0 for all T > T_ε(s). Since S_ε is compact, T_ε ≜ sup_{s∈S_ε} T_ε(s) also exists.²

This ensures that we can identify the best action within ε, using a finite rollout horizon, in most of S. Moreover, µ{S_ε} ≥ 1 − Mε^β from Assumption 4. In standard policy iteration, the improved policy π′ over π has the property that the improved action in any state is the action with the highest Q^π value in that state. However, in rollout-based policy iteration, we may only guarantee being within ε > 0 of the maximally improved policy.

² For a discount factor γ < 1 we can simply bound T_ε with log[ε(1 − γ)]/log(γ).

Algorithm 2 Oracle
  Input: n, π
  Set S to a uniform grid of n states in S.
  for s ∈ S do
    â*_{s,π} = a*_{s,π}
  end for
  return Â*_{S,π} ≜ { â*_{s,π} : s ∈ S }

Definition 5.1 (ε-improved policy) An ε-improved policy π′ derived from π satisfies

\[ \max_{a \in A} Q^\pi(s, a) - \varepsilon \le V^{\pi'}(s). \tag{2} \]

Such a policy will be said to be improving in S if V^π(s) ≤ V^{π′}(s) for all s ∈ S.

The measure of states for which there cannot be improvement is limited by Assumption 4. Finding an improved π′ for the whole of S is in fact not possible in finite time, since this requires determining the boundaries in S at which the best action changes.³ In all cases, we shall attempt to find the improving action a*_{s,π} at each state s on a uniform grid of n states, with the next policy π′(s′) taking the estimated best action â*_{s,π} for the state s closest to s′, i.e. it is a nearest-neighbour classifier.

In the remainder, we derive complexity bounds for achieving an ε-improved policy π′ from π with probability at least 1 − δ. We shall always assume that we are using a sufficiently deep rollout to cover S_ε, and only consider the number of rollouts performed. First, we shall derive the number of states we need to sample from in order to guarantee an ε-improved policy, under the assumption that at each state we have an oracle which can give us the exact Q^π values for each state we examine. Later, we shall consider sample complexity bounds for the case where we do not have an oracle, but use empirical estimates Q̂^{π,T} at each state.

5.2 The Oracle algorithm

Let B(s, ρ) denote the infinity-norm sphere of radius ρ centred at s, and consider Algorithm 2 (Oracle), which can instantly obtain the state-action value function for any point in S. The algorithm creates a uniform grid of n states, such that the distance between adjacent states is 2ρ = 1/n^{1/d}, and so can cover S with spheres B(s, ρ).
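The grid construction and the induced nearest-neighbour policy can be sketched as follows. This is an illustrative sketch: the helper names and the per-dimension parametrisation (n = n_per_dim^d) are our own assumptions.

```python
import itertools

def uniform_grid(n_per_dim, d):
    """Uniform grid over [0,1]^d with spacing 2*rho, rho = 1/(2*n_per_dim),
    so the n = n_per_dim**d points cover [0,1]^d with inf-norm spheres B(s, rho)."""
    rho = 1.0 / (2 * n_per_dim)
    axis = [rho + 2 * rho * i for i in range(n_per_dim)]
    return [tuple(p) for p in itertools.product(axis, repeat=d)]

def nearest_neighbour_policy(grid, best_action):
    """pi'(s'): act as the estimated best action of the closest grid state
    (closeness measured in the infinity norm)."""
    def pi(s):
        closest = min(grid,
                      key=lambda g: max(abs(gi - si) for gi, si in zip(g, s)))
        return best_action[closest]
    return pi
```

For example, `uniform_grid(4, 2)` gives n = 16 states covering [0, 1]^2 with ρ = 1/8; `best_action` maps each grid state to its (oracle or estimated) best action.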
Due to Assumption 3, the error in the action values of any state in the sphere B(s, ρ) of state s will be bounded by L(1/(2n^{1/d}))^α. Thus, the resulting policy will be L(1/(2n^{1/d}))^α-improved, i.e. this will be the maximum regret it will suffer over the maximally improved policy. To bound this regret by ε, it is sufficient to have n = ((1/2)(L/ε)^{1/α})^d states in the grid. The following proposition follows directly.

Proposition 5.1 Algorithm 2 results in regret ε for n = O( L^{d/α} (2ε^{1/α})^{−d} ).

³ To see this, consider S ≜ [0, 1], with some s* such that R(s, a_1) ≥ R(s, a_2) for all s ≥ s* and R(s, a_1) < R(s, a_2) for all s < s*. Finding s* requires a binary search, at best.

Furthermore, since for all s such that ∆^π(s) > Lρ^α, a*_{s,π} will be the improved action in all of B(s, ρ), π′ will be improving in a subset of S of measure at least 1 − M L^β (1/(2n^{1/d}))^{αβ}. Both the regret and the lack of complete coverage are due to the fact that we cannot estimate the best-action boundaries with arbitrary precision in finite time. When using rollout sampling, however, even if we restrict ourselves to ε improvement, we may still make an error due to both the limited number of rollouts and the finite horizon of the trajectories. In the remainder, we shall derive error bounds for two practical algorithms that employ a fixed grid with a finite number of T-horizon rollouts.

5.3 Error bounds for states

When we estimate the value function at each s ∈ S using rollouts, there is a probability that the estimated best action â*_{s,π} is not in fact the best action. For any given state under consideration, we can apply the following well-known lemma to obtain a bound on this error probability.

Lemma 5.2 (Hoeffding inequality) Let X be a random variable in [b, b + Z] with X̄ ≜ E[X], observed values X_1, ..., X_n of X, and X̂_n ≜ (1/n) Σ_{i=1}^n X_i.
Then, P(X̂_n ≥ X̄ + ε) = P(X̂_n ≤ X̄ − ε) ≤ exp(−2nε²/Z²) for any ε > 0.

Without loss of generality, consider two random variables X, Y ∈ [0, 1], with empirical means X̂_n, Ŷ_n and empirical difference ∆̂_n ≜ X̂_n − Ŷ_n > 0. Their means and difference will be denoted as X̄, Ȳ, and ∆̄ ≜ X̄ − Ȳ respectively. Note that if X̄ > Ȳ, X̂_n > X̄ − ∆̄/2 and Ŷ_n < Ȳ + ∆̄/2, then necessarily X̂_n > Ŷ_n, so P(X̂_n > Ŷ_n | X̄ > Ȳ) ≥ P(X̂_n > X̄ − ∆̄/2 ∧ Ŷ_n < Ȳ + ∆̄/2). The converse is

\[ P\bigl(\hat{X}_n < \hat{Y}_n \,\big|\, \bar{X} > \bar{Y}\bigr) \le P\bigl(\hat{X}_n < \bar{X} - \bar{\Delta}/2 \ \vee\ \hat{Y}_n > \bar{Y} + \bar{\Delta}/2\bigr) \tag{3a} \]
\[ \le P\bigl(\hat{X}_n < \bar{X} - \bar{\Delta}/2\bigr) + P\bigl(\hat{Y}_n > \bar{Y} + \bar{\Delta}/2\bigr) \tag{3b} \]
\[ \le 2 \exp\bigl(-\tfrac{n}{2} \bar{\Delta}^2\bigr). \tag{3c} \]

Now, consider â*_{s,π} such that Q̂^π(s, â*_{s,π}) ≥ Q̂^π(s, a) for all a. Setting X̂_n = Z^{−1} Q̂^π(s, â*_{s,π}) and Ŷ_n = Z^{−1} Q̂^π(s, a), where Z is a normalising constant such that Q/Z ∈ [b, b + 1], we can apply (3). Note that the bound is largest for the action a′ with value closest to that of a*_{s,π}, for which it holds that Q^π(s, a*_{s,π}) − Q^π(s, a′) = ∆^π(s). Using this fact and an application of the union bound, we conclude that for any state s, from which we have taken c(s) samples, it holds that:

\[ P\bigl[\exists\, \hat{a}^*_{s,\pi} \neq a^*_{s,\pi} : \hat{Q}^\pi(s, \hat{a}^*_{s,\pi}) \ge \hat{Q}^\pi(s, a^*_{s,\pi})\bigr] \le 2|A| \exp\left( -\frac{c(s)}{2Z^2} \Delta^\pi(s)^2 \right). \tag{4} \]

5.4 Uniform sampling: the Fixed algorithm

As we have seen in the previous section, if we employ a grid of n states, covering S with spheres B(s, ρ), where ρ = 1/(2n^{1/d}), and take action a*_{s,π} in each sphere centred at s, then the resulting policy π′ is only guaranteed to be improved within ε of the optimal improvement from π, where ε = Lρ^α. Now, we examine

Algorithm 3 Fixed
  Input: n, π, c, T, δ
  Set S to a uniform grid of n states in S.
  for s ∈ S do
    Estimate Q̂^{π,T}_c(s, a) for all a.
    if ∆̂^π(s) > Z √( 2 log(2n|A|/δ) / c ) then
      â*_{s,π} = arg max_a Q̂^π(s, a)
    else
      â*_{s,π} = π(s)
    end if
  end for
  return Â*_{S,π} ≜ { â*_{s,π} : s ∈ S }

the case where, instead of obtaining the true a*_{s,π}, we have an estimate â*_{s,π} arising from c samples from each action in each state, for a total of cn|A| samples. Algorithm 3 accepts (i.e. it sets â*_{s,π} to be the empirically highest-value action in that state) all states satisfying:

\[ \hat{\Delta}^\pi(s) \ge Z \sqrt{\frac{2 \log(2n|A|/\delta)}{c}}. \tag{5} \]

This condition ensures that the probability that Q^π(s, â*_{s,π}) < Q^π(s, a*_{s,π}) at any state, meaning that the optimally improving action is not â*_{s,π}, is at most δ. This can easily be seen by substituting the right-hand side of (5) for ε in (4). As ∆^π(s) > 0, this results in an error probability for a single state smaller than δ/n, and we can use a union bound to obtain an error probability of δ for each policy improvement step.

For each state s ∈ S that the algorithm considers, the following two cases are of interest: (a) ∆^π(s) < ε, meaning that even when we have correctly identified a*_{s,π}, we are still not improving over all of B(s, ρ); and (b) ∆^π(s) ≥ ε. While the probability of accepting the wrong action is always bounded by δ, we must also calculate the probability that we fail to accept an action at all when ∆^π(s) ≥ ε, in order to estimate the expected regret. Restating our acceptance condition as ∆̂^π(s) ≥ θ, this is given by:

\[ P[\hat{\Delta}^\pi(s) < \theta] = P[\hat{\Delta}^\pi(s) - \Delta^\pi(s) < \theta - \Delta^\pi(s)] = P[\Delta^\pi(s) - \hat{\Delta}^\pi(s) > \Delta^\pi(s) - \theta], \quad \Delta^\pi(s) > \theta. \tag{6} \]

Is ∆^π(s) > θ? Note that for ∆^π(s) > ε, if ε > θ then ∆^π(s) > θ as well. So, in order to achieve total probability δ for all state-action pairs in this case, after some calculations, we arrive at this expression for the regret:

\[ \varepsilon = \max\left\{ L \left( \frac{1}{2 n^{1/d}} \right)^{\alpha}, \ Z \sqrt{\frac{8 \log(2n|A|/\delta)}{c}} \right\}. \]
(7)

By equating the two sides, we get an expression for the minimum number of samples necessary per state:

\[ c = \frac{8 Z^2}{L^2} \, 4^\alpha \, n^{2\alpha/d} \log(2n|A|/\delta). \]

Algorithm 4 Count
  Input: n, π, C, T, δ
  Set S_0 to a uniform grid of n states in S; c_1, ..., c_n = 0.
  for k = 1, 2, ... do
    for s ∈ S_{k−1} do
      Estimate Q̂^{π,T}_{c(s)}(s, a) for all a; increment c(s)
    end for
    S_k = { s ∈ S_{k−1} : ∆̂^π(s) < Z √( 2 log(2n|A|/δ) / c(s) ) }
    if Σ_s c(s) ≥ C then
      Break.
    end if
  end for

This directly allows us to state the following proposition.

Proposition 5.2 The sample complexity of Algorithm 3 to achieve regret at most ε with probability at least 1 − δ is

\[ O\left( \varepsilon^{-2} L^{d/\alpha} \bigl(2 \varepsilon^{1/\alpha}\bigr)^{-d} \log\left( \frac{2|A|}{\delta} L^{d/\alpha} \bigl(2 \varepsilon^{1/\alpha}\bigr)^{-d} \right) \right). \]

5.5 The Count algorithm

The Count algorithm starts with a policy π and a set of states S_0, with n = |S_0|. At each iteration k, each state in S_{k−1} is sampled once. Once a state s contains a dominating action, it is removed from the search. So,

\[ S_k = \left\{ s \in S_{k-1} : \hat{\Delta}^\pi(s) < Z \sqrt{\frac{2 \log(2n|A|/\delta)}{c(s)}} \right\}. \]

Thus, the number of samples from each state is c(s) ≥ k if s ∈ S_k. We can apply similar arguments to analyse Count, by noting that the algorithm spends less time in states with higher ∆^π values. The measure assumption then allows us to calculate the number of states with large ∆^π and thus the number of samples that are needed. We have already established that there is an upper bound on the regret depending on the grid resolution: ε < Lρ^α. We proceed by forming subsets of states W_m = { s ∈ S : ∆^π(s) ∈ [2^{−m}, 2^{1−m}) }. Note that we only need to consider m < 1 + (log L + α log ρ)/log(1/2). Similarly to the previous algorithm, and due to our acceptance condition, for each state s ∈ W_m we need c(s) ≥ 2^{2m+1} Z² log(2n|A|/δ) in order to bound the total error probability by δ.
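The Count loop can be sketched as follows. This is an illustrative sketch with made-up ingredients: `noisy_q` stands in for one SampleState call, the three true gap values are hypothetical, and the threshold is the acceptance condition of equation (5).

```python
import math
import random

# Hypothetical true action values for three states; one call to noisy_q
# stands in for one SampleState call (one rollout per action).
TRUE_Q = {0: (0.9, 0.1), 1: (0.55, 0.45), 2: (0.8, 0.2)}

def noisy_q(s):
    return [q + random.uniform(-0.5, 0.5) for q in TRUE_Q[s]]

def count_allocation(states, n_actions, delta, budget, Z=1.0):
    """Count: each round, sample every still-active state once; retire a
    state as soon as its empirical gap clears the threshold in (5)."""
    n = len(states)
    sums = {s: [0.0] * n_actions for s in states}
    c = dict.fromkeys(states, 0)
    active = set(states)
    while active and sum(c.values()) < budget:
        for s in list(active):
            for a, q in enumerate(noisy_q(s)):
                sums[s][a] += q
            c[s] += 1
            means = sorted(v / c[s] for v in sums[s])
            gap = means[-1] - means[-2]          # empirical Delta^pi(s)
            if gap > Z * math.sqrt(2 * math.log(2 * n * n_actions / delta) / c[s]):
                active.discard(s)                # best action accepted
    return {s: max(range(n_actions), key=lambda a: sums[s][a]) for s in states}
```

In this toy run, states 0 and 2 (gaps 0.8 and 0.6) retire after a handful of rounds, while state 1 (gap 0.1) absorbs most of the budget, which is exactly the non-uniform allocation the analysis exploits.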
The total number of samples necessary is

\[ Z^2 \log\frac{2n|A|}{\delta} \sum_{m=0}^{\lceil (\log L + \alpha \log \rho)/\log(1/2) \rceil} |W_m| \, 2^{2m+1}. \]

A bound on |W_m| is required to bound this expression. Note that

\[ \mu\{ B(s, \rho) : \Delta^\pi(s') < \varepsilon \ \forall s' \in B(s, \rho) \} \le \mu\{ s : \Delta^\pi(s) < \varepsilon \} < M \varepsilon^\beta. \tag{8} \]

It follows that |W_m| < M 2^{β(1−m)} ρ^{−d}, and consequently

\[ \sum_{s \in S} c(s) = Z^2 \log\frac{2n|A|}{\delta} \sum_{m=0}^{\lceil (\log L + \alpha \log \rho)/\log(1/2) \rceil} M 2^{\beta(1-m)} \rho^{-d} 2^{2m+1} \le M 2^{\beta+1} 2^{\left(1 + \frac{\log L + \alpha \log \rho}{\log(1/2)}\right)(2-\beta)} 2^d Z^2 n \log\frac{2n|A|}{\delta}. \tag{9} \]

The above results directly in the following proposition:

Proposition 5.3 The sample complexity of Algorithm 4 to achieve regret at most ε with probability at least 1 − δ is

\[ O\left( L^{d/\alpha} \bigl(2 \varepsilon^{1/\alpha}\bigr)^{-d} \log\left( \frac{2|A|}{\delta} L^{d/\alpha} \bigl(2 \varepsilon^{1/\alpha}\bigr)^{-d} \right) \right). \]

We note that we are of course not able to remove the dependency on d, which is only due to the use of a grid. Nevertheless, we obtain a reduction in sample complexity of order ε^{−2} for this very simple algorithm.

6 Discussion

We have derived performance bounds for approximate policy improvement without a value function in continuous MDPs. We compared the usual approach of sampling equally from a set of candidate states to the slightly more sophisticated method of sampling from all candidate states in parallel and removing a candidate state from the set as soon as it is clear which action is best. For the second algorithm, we find an improvement of approximately ε^{−2}. Our results complement those of Fern et al. [7] for relational Markov decision processes.

However, a significant amount of future work remains. Firstly, we have assumed everywhere that T > T_ε. While this may be a relatively mild assumption for γ < 1, it is problematic for the undiscounted case, as some states would require far deeper rollouts than others to achieve regret ε.
Thus, in future work we would like to examine sample complexity in terms of the depth of rollouts as well. Secondly, we would like to extend the algorithms to increase the number of states that we look at: whenever \hat{V}^\pi(s) \approx \hat{V}^{\pi'}(s) for all s, we could increase the resolution. For example, if

\sum_{s ∈ S} P\left( \hat{V}^\pi(s) + \epsilon < \hat{V}^{\pi'}(s) \,\middle|\, V^\pi(s) > V^{\pi'}(s) \right) < \delta,

then we could increase the resolution around those states with the smallest ∆^π. This would get around the problem of having to select n.

A related point that has not been addressed herein is the choice of policy representation. The grid-based representation probably makes poor use of the available number of states. For the increased-resolution scheme outlined above, a classifier such as k-nearest-neighbour could be employed. Furthermore, regularised classifiers might effect a smoothing property on the resulting policy and allow the learning of improved policies from a set of states containing erroneous best-action choices.

As far as the state allocation algorithms are concerned, in a companion paper [4] we have compared the performance of Count and Fixed with additional allocation schemes inspired by the UCB and successive elimination algorithms. We have found that all methods outperform Fixed in practice, sometimes by an order of magnitude, with the UCB variants being the best overall. For this reason, in future work we plan to perform an analysis of such algorithms. A further extension to deeper searches, for example by managing the sampling of actions within a state, could also be performed using techniques similar to [8].

6.1 Acknowledgements

Thanks to the reviewers and to Adam Atkinson, Brendan Barnwell, Frans Oliehoek, Ronald Ortner and D. Jacob Wildstrom for comments and useful discussions.

References

[1] P. Auer, N. Cesa-Bianchi, and P. Fischer.
Finite-time analysis of the multi-armed bandit problem. Machine Learning Journal, 47(2-3):235–256, 2002.

[2] P. Auer, R. Ortner, and C. Szepesvári. Improved rates for the stochastic continuum-armed bandit problem. In Proceedings of the Conference on Computational Learning Theory (COLT), 2007.

[3] Dimitri Bertsekas. Dynamic programming and suboptimal control: From ADP to MPC. Fundamental Issues in Control, European Journal of Control, 11(4-5), 2005. From 2005 CDC, Seville, Spain.

[4] Christos Dimitrakakis and Michail Lagoudakis. Rollout sampling approximate policy iteration. Machine Learning, 72(3), September 2008.

[5] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006.

[6] A. Fern, S. Yoon, and R. Givan. Approximate policy iteration with a policy language bias. Advances in Neural Information Processing Systems, 16(3), 2004.

[7] A. Fern, S. Yoon, and R. Givan. Approximate policy iteration with a policy language bias: Solving relational Markov decision processes. Journal of Artificial Intelligence Research, 25:75–118, 2006.

[8] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of ECML-2006, 2006.

[9] Michail G. Lagoudakis and Ronald Parr. Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 424–431, Washington, DC, USA, August 2003.

[10] John Langford and Bianca Zadrozny. Relating reinforcement learning performance to classification performance. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pages 473–480, Bonn, Germany, 2005.
