Reducing Commitment to Tasks with Off-Policy Hierarchical Reinforcement Learning

Mitchell Keith Bloch
University of Michigan
2260 Hayward Street
Ann Arbor, MI 48109-2121
bazald@umich.edu

Abstract

In experimenting with off-policy temporal difference (TD) methods in hierarchical reinforcement learning (HRL) systems, we have observed unwanted on-policy learning under reproducible conditions. Here we present modifications to several TD methods that prevent unintentional on-policy learning from occurring. These modifications create a tension between exploration and learning. Traditional TD methods require commitment to finishing subtasks without exploration in order to update Q-values for early actions with high probability. One-step intra-option learning and temporal second difference traces (TSDT) do not suffer from this limitation. We demonstrate that our HRL system is efficient without commitment to completion of subtasks in a cliff-walking domain, contrary to a widespread claim in the literature that it is critical for efficiency of learning. Furthermore, decreasing commitment as exploration progresses is shown to improve both online performance and the resultant policy in the taxicab domain, opening a new avenue for research into when it is more beneficial to continue with the current subtask or to replan.

1 Introduction and Background

Hierarchical reinforcement learning (HRL) is an established solution for attacking the curse of dimensionality. Decomposing a problem into a hierarchy of tasks has a number of advantages. Firstly, knowledge about the values of different tasks and how to perform different tasks can be represented and learned independently. Secondly, there is the possibility of increasing state abstraction at each point in the decision process. Thirdly, available actions can be ignored in some subproblems, reducing the complexity of learning individual decisions. Finally, an agent can share subtasks between parts of a problem, allowing it to take advantage of regularities in the behavior demanded by the environment.

Off-policy temporal difference (TD) methods allow an agent to learn about a policy different from the one being followed. This allows an agent to learn reliably from a greater variety of exploration strategies.

We have observed that off-policy TD methods can result in inadvertent on-policy updates in the context of HRL. An example is discussed at length in section 2. Specifically, this can occur if learning is attempted in a task while a non-greedy action is being taken in a subtask. As taking non-greedy actions is fundamental to exploration, this raises the question of how best to reliably learn off-policy without requiring subtasks to converge first.

In exploring solutions to this new problem, we challenge the widespread claim that committing to completion of tasks is a critical aspect of what gives HRL an advantage over flat reinforcement learning (Kaelbling 1993; Dietterich 1998; 2000a; Ryan 2004b; 2004a). Section 3 describes our HRL system. Using corrected TD methods, and particularly a gated version of temporal second difference traces (TSDT) (Bloch 2011), we demonstrate that it is possible to learn efficiently with no commitment to completing tasks. These results are presented in section 4.1.
Furthermore, we demonstrate that it is possible to learn more efficiently with a reduction in commitment, opening a research question as to when commitment has value. These results are presented in section 4.2.

1.1 Hierarchical Reinforcement Learning

Here we briefly discuss Hierarchical Semi-Markov Q-learning (HSMQ), a close relative of the HRL system presented in section 3; all-goals updating, a technique to improve efficiency of learning; and non-hierarchical or polling execution of tasks, a technique for improving the quality of a learned hierarchical policy.

The design of HSMQ (Dietterich 2000b) keeps the goals of tasks truly independent, sacrificing guarantees of achieving hierarchical optimality. Each task is concerned only with achieving its goal as efficiently as possible, assuming an episodic structure to all tasks. Rewards occurring after a task completes do not affect learning within the task. HSMQ instead guarantees convergence to a recursively optimal policy, by which it is meant that the hierarchy can converge to a policy that is the best given the restrictions imposed by the hierarchy.

Kaelbling (1993) introduced all-goals updating, a technique which concurrently improves an agent's knowledge of how to achieve multiple goals, making much better use of the information acquired from the environment. Dietterich (1998) introduced a subset of all-goals updating, all-states updating, in which only goals that the agent is trying to achieve are updated. The changes required to implement all-states updating in an existing hierarchical reinforcement learning agent are much simpler than the changes required to implement all-goals updating but, depending on a number of factors, it may be significantly less powerful. All-goals updating requires the use of off-policy TD methods. Therefore, we view all-goals updating as one motivation for developing correct off-policy TD methods for HRL.

Both Kaelbling (1993) and Dietterich (1998) discuss non-hierarchical or polling execution of tasks. However, they describe it only as a technique for improving the performance of already-learned policies; actually learning while executing non-hierarchically is not permitted. We demonstrate a system capable of learning while executing non-hierarchically in section 3. Sutton et al. (1998) present a simple proof of why executing non-hierarchically must result in a policy that is at least as good as the original.
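For concreteness, the following is a minimal sketch of all-goals updating, assuming one tabular Q-function per task, each updated off-policy from the same shared transition. The Task class, its method names, and the tabular layout are illustrative assumptions, not details from the paper or from Kaelbling (1993).

```python
# A minimal sketch of all-goals updating: every task's Q-values are updated
# off-policy from one shared transition (s, a, s'), regardless of which task
# actually selected the action. All names here are illustrative assumptions.

class Task:
    def __init__(self, reward_fn, terminal_fn, actions_fn):
        self.q = {}                     # tabular Q: (state, action) -> value
        self.reward = reward_fn         # task-specific reward for a transition
        self.is_terminal = terminal_fn  # whether s' achieves this task's goal
        self.actions = actions_fn       # actions available to this task in s'

def all_goals_update(tasks, s, a, s_next, alpha=0.1, gamma=0.9):
    """Apply one off-policy Q-learning backup per task from a single step."""
    for task in tasks:
        r = task.reward(s, a, s_next)
        if task.is_terminal(s_next):
            target = r                  # no bootstrapping past the goal
        else:
            target = r + gamma * max(task.q.get((s_next, b), 0.0)
                                     for b in task.actions(s_next))
        old = task.q.get((s, a), 0.0)
        task.q[(s, a)] = old + alpha * (target - old)
```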
One-Step Intra-Option Learning. Sutton and Precup (Sutton and Precup 1998; Sutton, Precup, and Singh 1999) introduced one-step intra-option learning. Here we present how backups must be performed for off-policy learning. Intra-option learning is our basis for reliable learning while executing non-hierarchically, and understanding it is critical for understanding the tradeoffs between different algorithms, as described in section 2.

If an action is non-primitive, representing a subtask, and the subtask does not complete, then the backup must use the Q-value for the same action from the successor state. To use a Q-value for a different action would conflate the issues of deciding how to behave and learning different behaviors.

  $Q(s, a) \xleftarrow{\alpha} r + \gamma Q(s', a)$    (1)

Provided that the learning rate, α, is sufficiently low, and that actions are sampled adequately over time, Q(s, a) should converge to the true value of the expected return for following action a to completion and then behaving optimally from that point on.

If an action is primitive or successfully terminates a corresponding subtask, then the backup must use the Q-value for the highest-valued action from the successor state (if the agent is attempting to learn off-policy).

  $Q(s, a) \xleftarrow{\alpha} r + \gamma V(s')$    (2)

Provided that α is sufficiently low, and that actions are sampled adequately over time, Q(s, a) should converge to the true value of the expected return for executing action a and then behaving optimally from that point on.

Successful termination of the task at hand (rather than simply a subtask) demands special care unless the agent enters an absorbing state. An absorbing state is a state where there are no actions available to the agent, and therefore has an estimated return of 0. If an absorbing state is not guaranteed, the task should ignore any expected future return for the state, taking only the immediate reward into account.

  $Q(s, a) \xleftarrow{\alpha} r$    (3)

Provided that α is sufficiently low, and that actions are sampled adequately over time, Q(s, a) should converge to the true value of the expected terminal reward for executing action a. Of course, the task may not always terminate when the agent takes action a in a non-deterministic domain. This possibility does not affect the reliability of convergence, given an appropriate α.
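For concreteness, here is a minimal sketch of the three backup rules of equations (1)-(3), assuming a tabular Q stored as a Python dict keyed by (state, action). The helper names, the dispatch arguments, and the constants are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of the intra-option backup rules, equations (1)-(3).
# The dict layout and argument names are illustrative assumptions.

ALPHA = 0.1  # learning rate (illustrative value)
GAMMA = 0.9  # discount factor (illustrative value)

def v(q, s, actions):
    """V(s): the value of the highest-valued action available in s."""
    return max(q.get((s, b), 0.0) for b in actions)

def backup(q, s, a, r, s_next=None, actions_next=None,
           subtask_continues=False):
    if s_next is None:
        # Eq. (3): the task itself terminated with no absorbing state, so
        # only the immediate reward enters the target.
        target = r
    elif subtask_continues:
        # Eq. (1): the non-primitive action a did not complete, so bootstrap
        # from the same action's Q-value in the successor state.
        target = r + GAMMA * q.get((s_next, a), 0.0)
    else:
        # Eq. (2): a was primitive or its subtask terminated, so bootstrap
        # from the greedy value V(s') for an off-policy update.
        target = r + GAMMA * v(q, s_next, actions_next)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + ALPHA * (target - old)
```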
Temporal Second Difference Traces. Bloch (2011) introduced temporal second difference traces (TSDT), an alternative to Watkins' Q(λ) with a number of advantages. Storing local δs for intra-option learning, TSDT can do off-policy backups after non-greedy actions have been taken. This can make it significantly more powerful, particularly in deterministic domains. In doing backups similarly to intra-option learning, and updating earlier backups as an agent explores, TSDT conceptually approximates Dyna-Q with sample backups but without the need for a model.

2 Difficulty Learning Off-Policy

Here we present a three-armed bandit problem, designed to demonstrate a basic problem with existing off-policy temporal difference (TD) methods in the context of HRL. Additionally, we present modifications to popular TD methods to ensure correct off-policy updates.

Actions A, B, and C yield 1, 10, and 100 reward respectively. All three actions result in immediate termination of the episode. It is trivial to develop a flat agent to learn the domain, as depicted in figure 1, but developing a hierarchy should not create additional difficulties. However, it turns out that the trivial hierarchical reinforcement learning (HRL) agent depicted in figure 2 causes problems for traditional off-policy temporal difference methods.

The agent depicted in figure 1 simply chooses action A, B, or C and then terminates. Learning with α = 1 and exploring with a fixed exploration strategy (for example epsilon-greedy), choosing a non-greedy action 10% of the time, it is guaranteed to converge to the optimal policy as time goes to infinity. There is little incentive to develop a hierarchical agent for this domain. Regardless, one would not naively expect a hierarchy to make it more difficult to converge to an optimal policy. However, this is exactly what happens with the hierarchy depicted in figure 2 unless special care is taken. Conceptually, it may be helpful to think of the subtask consisting of a choice between actions A and C as a more complex task than action B. The subtask is preferable to action B only if the subtask is executed expertly.

Figure 1: A trivial flat agent for the three-armed bandit problem can choose between actions A, B, and C.

Figure 2: Arbitrary hierarchy which happens to behave interestingly in the toy domain (Root chooses between action B and a subtask that chooses between actions A and C).

As depicted in figure 3, a naive implementation of Q-learning will cause significantly decreased performance in our bandit problem given the hierarchy in figure 2. If the effect of exploration in the subtask is not accounted for, the root task will regularly mislearn the value of the subtask, effectively doing a somewhat on-policy backup. Action B will be preferred to the superior subtask 50% of the time, because the backup in the root node uses the reward received from the subtask rather than the reward it would have received had the subtask executed greedily.

A task which is exploring, thereby taking a non-greedy action, is not trying to accomplish its goal. It is trying to gain information. In an off-policy backup, we must then exclude the effects of exploration and block backups that would otherwise occur at higher levels of the hierarchy.

The solution for Watkins' Q(λ) (and Q-learning when using all-states updating) is to skip the backup and to just clear the trace whenever a non-greedy action is being taken in any subtask. This enables reliable off-policy learning in HRL systems. Unfortunately, this entails a dilemma as to whether to restrict subtasks to greedy policies while trying to learn at a given level of the hierarchy, or to throw out all potential learning when non-greedy actions are taken at lower levels of the hierarchy. The former option is rather onerous, preventing the agent from learning at all levels of the hierarchy concurrently. The latter option, unfortunately, makes it much less likely that Q-values near the beginning of a subtask will be updated in supertasks if exploration is equally likely throughout the execution of the subtask. A hybrid approach may be possible in which the benefit of one approach could be dynamically weighed against the other, but this could be challenging.

The solution for one-step intra-option learning is simply to skip the backup whenever a non-greedy action is being taken in any subtask. Given the local nature of its backups, this does not suffer from the same dilemma of having to choose between greedy policies for subtasks or the prospect of throwing out potential learning. All that is necessary for convergence is that subtasks not starve their supertasks for updates by constantly choosing non-greedy actions. This happens to be ensured by the standard requirements that action selection be both non-starving and greedy in the limit with infinite experience (GLIE). The downside to one-step intra-option learning is that it is relatively slow, taking as many episodes to learn the value of a task as there are steps in the task. By comparison, Q(λ) using all-states updating can learn the value of a task in just one episode under ideal circumstances.
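The contrast between the two fixes can be sketched as follows, assuming the hierarchy reports an exploring_in_subtask flag each step. All names are illustrative assumptions, and neither function is the paper's exact implementation; the point is only where each method skips work when a subtask explores.

```python
# Sketches of the corrected off-policy updates. Names are illustrative.

def watkins_q_lambda_step(q, trace, s, a, r, s_next, actions_next,
                          exploring_in_subtask,
                          alpha=0.1, gamma=0.9, lam=0.9):
    """Corrected Watkins' Q(lambda): on exploration anywhere below, skip
    the backup and clear the whole trace, discarding potential learning."""
    if exploring_in_subtask:
        trace.clear()
        return
    greedy = max(q.get((s_next, b), 0.0) for b in actions_next)
    delta = r + gamma * greedy - q.get((s, a), 0.0)
    trace[(s, a)] = 1.0                     # replacing trace
    for key in list(trace):
        q[key] = q.get(key, 0.0) + alpha * delta * trace[key]
        trace[key] *= gamma * lam

def intra_option_step(q, s, a, r, s_next, actions_next,
                      exploring_in_subtask, alpha=0.1, gamma=0.9):
    """Corrected one-step intra-option learning: just skip this one backup;
    there is no trace to throw away."""
    if exploring_in_subtask:
        return
    target = r + gamma * max(q.get((s_next, b), 0.0) for b in actions_next)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (target - q.get((s, a), 0.0))
```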
The solution for temporal second difference traces (TSDT) is simply to skip entering the backup into the trace whenever a non-greedy action is being taken in any subtask. These gaps in the traces may reduce the flow of information, but keeping the rest of the trace intact is likely to allow faster learning than one-step intra-option learning so long as there exists the possibility of returning to a previously visited state. Additionally, as backups are done locally as in one-step intra-option learning, TSDT does not suffer from the dilemma we've introduced for Q(λ). This makes TSDT ideal for use in our off-policy HRL system as detailed in the following section. Like Q(λ), TSDT is capable of propagating reward from the end of a task all the way back to the beginning in a single episode. In the worst case, reward will propagate back at least as efficiently as when using one-step intra-option learning.

Figure 3: Online performance of agents following a fixed epsilon-greedy exploration strategy in a bandit task (mean cumulative suboptimality vs. step number for Flat RL, Fixed HRL, and Naive HRL). Fixed HRL performs worse than Flat RL only because of the combined effect of exploration in both the root task and the subtask. Naive HRL, however, has an incorrect policy 50% of the time.

3 Off-Policy Hierarchical Reinforcement Learning (OPHRL)

The execution of an agent learning with OPHRL, as described in algorithm 1, takes the form of a traditional hierarchical reinforcement learning algorithm, but executes in a non-hierarchical or polling fashion while learning. At each step of execution, there is no commitment at any level of the hierarchy to continue with the same subtask that was being run in the previous step. This gives one-step intra-option learning and temporal second difference traces (TSDT) certain advantages over Watkins' Q(λ), as described in section 2. OPHRL additionally supports arbitrary reward rejection and transformation functions on a per-task basis, allowing solution of the hierarchical credit assignment problem. OPHRL will converge to the true value function for a task regardless of whether exploration is decreased at all, so long as the exploration policy is non-starving for all state-action pairs.

3.1 Not Committing to Tasks

Taking big steps of exploration has been cited as a significant advantage of hierarchical reinforcement learning (Dietterich 1998). It has even been suggested that committing to completing all tasks that an agent chooses to begin is necessary for hierarchical reinforcement learning to offer advantages over flat reinforcement learning (Ryan 2004a).
Algorithm 1: Off-Policy Hierarchical Reinforcement Learning (OPHRL) using one-step intra-option learning. OPHRL(Root) is called each step. Functions rejectReward and transformReward are task-specific.

Ensure: Q initialized arbitrarily, e.g., Q_i(s, a) = 0, for all tasks i, all s ∈ S_i^+, and all a ∈ A_i(s)
1:  function OPHRL(Task i)
2:    Observe s
3:    Choose a from A_i(s)  {non-starving}
4:    if isTask(a) then
5:      r, s', exploringInSubtask ⇐ OPHRL(a)
6:    else
7:      Take primitive action a
8:      Observe reward, r, and next state, s'
9:    end if
10:   if not exploringInSubtask and not rejectReward(Task i, ...) then
11:     r' ⇐ transformReward(Task i, ...)
12:     if s' ∉ S_i^+ then  {Completed task i}
13:       $Q_i(s, a) \xleftarrow{\alpha} r'$
14:     else if i ∈ A(s') then  {Task i can continue}
15:       if isTask(a) and a ∈ A(s') then  {Can continue a}
16:         $Q_i(s, a) \xleftarrow{\alpha} r' + \gamma Q_i(s', a)$
17:       else  {Completed subtask/primitive a}
18:         $Q_i(s, a) \xleftarrow{\alpha} r' + \gamma V_i(s')$
19:       end if
20:     end if
21:   end if
22:   return r, s', [Q_i(s, a) < V_i(s) or exploringInSubtask]
23: end function

It is worth noting that Ryan (2004a) acknowledges that it is conceivable that an algorithm without a requirement of commitment could be developed. OPHRL avoids this commitment and, in doing so, has a rather different structure than previous hierarchical reinforcement learning algorithms.

Though commitment to tasks is not an integral part of OPHRL, it is trivial to modify OPHRL to support commitment to tasks. In fact, there is likely some value in doing so. However, folding the question of whether big steps of exploration should be taken into the dilemma of exploration versus exploitation is preferable to restricting agents to big steps. It is well known that an agent can perform better when it is not forced to complete tasks it begins (Sutton, Precup, and Singh 1999). Furthermore, as agents can choose to continue with a subtask until completion, an agent that can abandon a subtask before it completes is guaranteed to be able to explore at least as effectively as an agent that cannot. Whether it is possible to do better, in general, is a difficult question.

3.2 Credit Assignment Problem

Cases exist in which rewards do not apply to certain levels of the hierarchy. It could be that a subtask hasn't learned that certain actions are only legal in a subset of the state space. It could be that a supertask misplanned and a large negative reward is received due to no fault of the given task.

Dietterich (1998) addressed the hierarchical credit assignment problem by transforming the reward function. However, these rewards must be rejected outright when using one-step intra-option learning or TSDT. A 0 reward will not affect a sum as calculated at the end of a task, but it may cause instability when using TD methods which update in an immediate, local fashion. There is still value in applying a transformation to the reward before doing a backup for any given task. One can eliminate some cases where recursive optimality does not imply hierarchical optimality, as outlined in Dietterich (2000a). Additionally, by increasing the reward for successful termination, it is possible to allow a greedy policy to guarantee convergence to an optimal policy in some cases in which it may otherwise get stuck in a local minimum.

Additionally, as identified by Ryan (2004b), hierarchies can be constructed such that a subset of the state space is never explored within a given task if supertasks are never acceptable in those states. In the ordinary implementation of OPHRL, commitment can be required to learn even a recursively optimal policy. Therefore, choosing to continue a task to completion, even if supertasks no longer support the execution of the task, may be warranted.
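As an illustration of the two task-specific hooks in Algorithm 1, here are hypothetical rejectReward and transformReward implementations. Their signatures and the example policies are assumptions; the paper deliberately leaves both functions task-specific.

```python
# Hypothetical per-task reward hooks matching the rejectReward and
# transformReward calls in Algorithm 1. Signatures are assumptions.

def reject_reward(task, reward, info):
    """Return True to discard this reward entirely for the given task,
    e.g. a large penalty caused by a supertask's misplanning, which is
    no fault of this task (section 3.2)."""
    return info.get("fault_of_supertask", False)

def transform_reward(task, reward, completed_task):
    """Reshape the reward for this task before the backup, e.g. mapping a
    subtask's terminal reward of -1 to 0 as in section 4.2, or boosting
    the reward for successful termination to help a greedy policy escape
    a local minimum."""
    if completed_task and reward == -1.0:
        return 0.0
    return reward
```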
3.3 Gated Temporal Second Difference Traces

As described in section 2, it is important to avoid incorporating rewards from exploration when attempting to learn off-policy. As it turns out, TSDT is a temporal difference method almost ideally suited to operating under this limitation. However, it can benefit from further modification.

One key observation that grants TSDT its power is that actions that appear to be exploratory when they're taken may later turn out to be the best choice. A non-greedy choice can turn out to be quite good, enabling the flow of information to be turned on. The same issue appears in the detection of exploration in subtasks. In fact, TSDT suffers from the converse problem: an action appearing to be optimal in a subtask may later turn out to be suboptimal. An entry can persist in the trace after it turns out that the subtask was exploring at that time. This problem will disappear as subtasks converge to their optimal value functions, but it could pose a serious problem for non-episodic or long-running tasks.

The gated temporal second difference trace (GTSDT) resolves this issue by storing in the trace the information necessary to reassess the optimality of decisions taken in subtasks. Rather than excluding entries from the trace entirely when subtasks are behaving suboptimally, all entries are stored in the trace so that they may be allowed to update whenever subtasks appear to have behaved optimally. Entries in a trace can potentially become blocked and unblocked many times before the estimated value functions for the corresponding subtasks converge on their true value functions.
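A speculative sketch of the gating idea: each trace entry records the subtask decision that produced it, so its blocked status can be re-evaluated as subtask Q-values change rather than decided once. The entry layout and function names are assumptions, not the paper's implementation.

```python
# A speculative sketch of gating in GTSDT. All names are assumptions.

from dataclasses import dataclass
from typing import Any

@dataclass
class TraceEntry:
    state: Any
    action: Any
    delta: float          # locally stored TD error, as in TSDT
    subtask_state: Any    # the subtask decision to reassess later
    subtask_action: Any

def entry_unblocked(entry, subtask_q, subtask_actions):
    """An entry may contribute to updates only while the recorded subtask
    decision currently looks greedy; otherwise it stays in the trace,
    blocked, and may unblock later as the subtask's Q-values change."""
    best = max(subtask_q.get((entry.subtask_state, b), 0.0)
               for b in subtask_actions)
    chosen = subtask_q.get((entry.subtask_state, entry.subtask_action), 0.0)
    return chosen >= best
```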
Figure 4: A shorter 10x2 cliff-walking domain.

4 Experimental Results

4.1 Cliff-Walking

We examine a 100x2 cliff-walking domain, a longer version of the domain depicted in figure 4. Four deterministic move actions can be attempted from each of the 199 non-terminal states. All actions result in a reward of −1 except for the terminal actions, which yield 200 for success and −200 for failure.

Agents. A flat reinforcement learning agent simply decides between the four move actions from each state. We have constructed a hierarchical agent, depicted in figure 5, which chooses between a subtask which attempts to solve the traditional cliff-walking task by getting to the bottom-right corner, and a subtask which attempts to terminate by jumping off the cliff. Given terminal rewards of 200 and −200, solving the traditional task is always preferred by an optimal policy. The problem then is for the agent to efficiently learn both how to solve the traditional cliff-walking task and that solving it is always preferable to jumping off the cliff.

All hierarchical agents have no commitment to completing subtasks and explore with a fixed epsilon-greedy strategy, ε = 0.1. The flat agent and all subtasks explore with Boltzmann exploration, T = 0.5. All agents use all-goals updating to speed learning.

Results. Figure 6 demonstrates that all hierarchical agents perform strictly worse than the flat agent, as expected. Both Fixed Q(0) and GTSDT learn quite well, but Naive Q(0) does not converge. GTSDT is able to learn more effectively than Fixed Q(0), primarily because Q-values corresponding to states far from the goal cannot be updated frequently given the lack of commitment to completing subtasks.

Figure 5: HRL cliff-walking agent (Root chooses between the Solve and Jump subtasks, which use the North, South, East, and West primitives).

Figure 6: Online performance of agents exploring a 100x2 cliff-walking domain (reward suboptimality per step vs. step number for Flat RL, C(0) GTSDT, C(0) Fixed Q(0), and C(0) Naive Q(0)).

4.2 The Taxicab Domain

In the taxicab domain (Dietterich 1998), an agent is tasked with the problem of picking up a passenger and delivering him to his destination in as few steps as possible. The environment is a 5x5 grid world. There are four cells which serve as possible starting locations and possible destinations for the passenger. There is a refueling station near the middle of the map. Additionally, there are six impassable walls (or 26 counting the walls surrounding the map).

There are seven actions available to an agent at all times. Attempting to move north, south, east, or west automatically results in the taxi moving one cell in that direction unless there is a wall in the way, in which case the move action is ignored and the taxi remains in place. Fuel decreases by 1 unless the move action is ignored. Pickup always results in the passenger being picked up if the taxi does not have the passenger and is at the passenger's starting location. Putdown always results in the passenger being put down if the taxi has the passenger and is at the destination. Refuel always sets the amount of fuel to 12 if the taxi is at the refueling station.

Each of the seven actions takes 1 unit of time. Move, pickup, putdown, and refuel actions each yield a reward of −1 except in the following cases. Refuel, pickup, and putdown each yield a reward of −10 instead if the action is impossible when attempted. Move yields an additional reward of −20 if it causes fuel to drop below 0, resulting in failure of the trial. Putdown yields an additional reward of 20 if it causes the passenger to arrive at his destination, resulting in the successful termination of the trial.
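These reward rules are concrete enough to sketch as a single function. The action names, flags, and return convention below are illustrative assumptions; only the numeric rewards come from the text above.

```python
# A minimal sketch of the taxicab reward rules just described.
# Only the numeric rewards are from the text; names are assumptions.

def taxicab_reward(action, action_impossible, fuel_after, delivered):
    """Return (reward, episode_over) for one primitive step."""
    reward = -1.0                     # every action costs one step
    if action_impossible and action in ("pickup", "putdown", "refuel"):
        reward = -10.0                # illegal pickup/putdown/refuel
    if action in ("north", "south", "east", "west") and fuel_after < 0:
        return reward - 20.0, True    # ran out of fuel: trial fails
    if action == "putdown" and delivered:
        return reward + 20.0, True    # passenger delivered: success
    return reward, False
```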
Agents. The hierarchy depicted in figure 8 is used with a few modifications from the version for HSMQ/MAXQ (Dietterich 1998). Q-values are shared between subtasks based on destinations. Rewards that are transformed to 0 in HSMQ are rejected instead. Terminal rewards for subtasks are transformed from −1 to 0. Subtasks explore greedily. Finally, all tasks use all-states updating.

Figure 7: Taxicab grid world environment (landmark cells R, G, B, and Y, and refueling station F).

Figure 8: Hierarchical agent for the taxicab domain, as described in Dietterich (1998) (Root chooses among Get, Put, and Fillup, which invoke Pickup, Putdown, Refuel, and Navigate(t); Navigate(t) uses the North, South, East, and West primitives).

Results. Here we test fixed Q(0), fixed one-step intra-option learning (OSIO), and gated temporal second difference traces (GTSDT) while exploring with full commitment to completing subtasks. Additionally, all three algorithms are tested with a reduction in commitment from 1 to 0 over the course of the 100,000 episodes. In the latter case, the cooling rate for Boltzmann exploration is increased from 0.999947 to 0.999924.

In terms of the policies learned after 100,000 steps, GTSDT does better than Fixed Q(0), which does better than Fixed OSIO, regardless of the level of commitment to completing subtasks. In terms of online performance, depicted in figure 9, GTSDT is always on top, but Fixed OSIO does better than Fixed Q(0) if commitment is reduced significantly.

Linearly reducing commitment from 1 to 0 results in better policies for all three algorithms. Furthermore, online performance improves for both GTSDT and Fixed OSIO while decreasing only slightly for Fixed Q(0). Fixed Q(0) and Fixed OSIO with reduction in commitment are omitted from figure 9 for space reasons.

Figure 9: Online performance of agents exploring the taxicab domain (reward suboptimality per step vs. step number for C(1-0) GTSDT, C(1) GTSDT, C(1) Fixed Q(0), and C(1) Fixed OSIO).

5 Discussion and Future Directions

We identified a significant difficulty in attempting to learn off-policy in hierarchical learning systems. Solutions for Q-learning and Watkins' Q(λ) are not ideal, requiring the discarding of potential learning, but one-step intra-option learning and temporal second difference traces handle the problem more gracefully.

We have demonstrated that it is possible for hierarchical reinforcement learning systems to learn efficiently without a commitment constraint, contrary to a claim in the literature that commitment should be critical for efficient learning. Furthermore, we demonstrated that a reduction in commitment can actually help temporal difference methods learn more quickly.

The approach we explored for reducing commitment is somewhat ad hoc. It would be interesting to investigate more sophisticated exploration strategies capable of deciding whether or not commitment is valuable at any given time, as opposed to assuming that commitment is most valuable at the start of exploration.

Acknowledgments

I would like to thank Professor John Laird and the University of Michigan for their support. I would like to thank the Soar Group, including Professor John Laird, Jon Voigt, Nate Derbinsky, Nick Gorski, Justin Li, Bob Marinier, Shiwali Mohan, Miller Tinkerhess, Yongjia Wang, Sam Wintermute, Joseph Xu, and Mark Yong, for helping me to refine the presentation of these ideas. Additionally, I would like to thank Professor Satinder Singh for meeting with me to discuss some of his past work.

References

Bloch, M. K. 2011. Temporal second difference traces. arXiv:cs.LG/1104.4664.

Dietterich, T. G. 1998. The MAXQ method for hierarchical reinforcement learning. In ICML, 118-126.

Dietterich, T. G. 2000a. Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artif. Intell. Res. (JAIR) 13:227-303.

Dietterich, T. G. 2000b. An overview of MAXQ hierarchical reinforcement learning. In SARA, 26-44.

Kaelbling, L. P. 1993. Learning to achieve goals. In IJCAI, 1094-1099.

Ryan, M. 2004a. Hierarchical decision making. Chapter 8 of Handbook of Learning and Approximate Dynamic Programming, edited by Jennie Si, Andrew G. Barto, Warren B. Powell, and Donald Wunsch II. Series on Computational Intelligence. IEEE Press.

Ryan, M. R. K. 2004b. Hierarchical reinforcement learning: a hybrid approach. Ph.D. Dissertation, University of New South Wales, Australia. Supervisor: Claude Sammut.

Sutton, R. S., and Precup, D. 1998. Intra-option learning about temporally abstract actions. In Proceedings of the Fifteenth International Conference on Machine Learning, 556-564. Morgan Kaufmann.

Sutton, R. S.; Singh, S. P.; Precup, D.; and Ravindran, B. 1998. Improved switching among temporally abstract actions. In NIPS, 1066-1072.

Sutton, R. S.; Precup, D.; and Singh, S. P. 1999. Between MDPs and Semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell. 112(1-2):181-211.