Decision-Theoretic Planning with non-Markovian Rewards
Journal of Artificial Intelligence Research 25 (2006) 17-74. Submitted 12/04; published 01/06.

Decision-Theoretic Planning with non-Markovian Rewards

Sylvie Thiébaux (Sylvie.Thiebaux@anu.edu.au)
Charles Gretton (Charles.Gretton@anu.edu.au)
John Slaney (John.Slaney@anu.edu.au)
David Price (David.Price@anu.edu.au)
National ICT Australia & The Australian National University, Canberra, ACT 0200, Australia

Froduald Kabanza (kabanza@usherbrooke.ca)
Département d'Informatique, Université de Sherbrooke, Sherbrooke, Québec J1K 2R1, Canada

Abstract

A decision process in which rewards depend on history rather than merely on the current state is called a decision process with non-Markovian rewards (NMRDP). In decision-theoretic planning, where many desirable behaviours are more naturally expressed as properties of execution sequences rather than as properties of states, NMRDPs form a more natural model than the commonly adopted fully Markovian decision process (MDP) model. While the more tractable solution methods developed for MDPs do not directly apply in the presence of non-Markovian rewards, a number of solution methods for NMRDPs have been proposed in the literature. These all exploit a compact specification of the non-Markovian reward function in temporal logic, to automatically translate the NMRDP into an equivalent MDP which is solved using efficient MDP solution methods. This paper presents nmrdpp (Non-Markovian Reward Decision Process Planner), a software platform for the development and experimentation of methods for decision-theoretic planning with non-Markovian rewards. The current version of nmrdpp implements, under a single interface, a family of methods based on existing as well as new approaches which we describe in detail. These include dynamic programming, heuristic search, and structured methods.
Using nmrdpp, we compare the methods and identify certain problem features that affect their performance. nmrdpp's treatment of non-Markovian rewards is inspired by the treatment of domain-specific search control knowledge in the TLPlan planner, which it incorporates as a special case. In the First International Probabilistic Planning Competition, nmrdpp was able to compete and perform well in both the domain-independent and hand-coded tracks, using search control knowledge in the latter.

© 2006 AI Access Foundation. All rights reserved.

1. Introduction

1.1 The Problem

Markov decision processes (MDPs) are now widely accepted as the preferred model for decision-theoretic planning problems (Boutilier, Dean, & Hanks, 1999). The fundamental assumption behind the MDP formulation is that not only the system dynamics but also the reward function are Markovian. Therefore, all information needed to determine the reward at a given state must be encoded in the state itself.

This requirement is not always easy to meet for planning problems, as many desirable behaviours are naturally expressed as properties of execution sequences (see e.g., Drummond, 1989; Haddawy & Hanks, 1992; Bacchus & Kabanza, 1998; Pistore & Traverso, 2001). Typical cases include rewards for the maintenance of some property, for the periodic achievement of some goal, for the achievement of a goal within a given number of steps of the request being made, or even simply for the very first achievement of a goal which becomes irrelevant afterwards. For instance, consider a health care robot which assists elderly or disabled people by achieving simple goals such as reminding them to do important tasks (e.g. taking a pill), entertaining them, checking or transporting objects for them (e.g.
checking the stove's temperature or bringing coffee), escorting them, or searching (e.g. for glasses or for the nurse) (Cesta et al., 2003). In this domain, we might want to reward the robot for making sure a given patient takes his pill exactly once every 8 hours (and penalise it if it fails to prevent the patient from doing this more than once within this time frame!), we may reward it for repeatedly visiting all rooms in the ward in a given order and reporting any problem it detects, it may also receive a reward once for each patient's request answered within the appropriate time-frame, etc. Another example is the elevator control domain (Koehler & Schuster, 2000), in which an elevator must get passengers from their origin to their destination as efficiently as possible, while attempting to satisfy a range of other conditions such as providing priority services to critical customers. In this domain, some trajectories of the elevator are more desirable than others, which makes it natural to encode the problem by assigning rewards to those trajectories.

A decision process in which rewards depend on the sequence of states passed through rather than merely on the current state is called a decision process with non-Markovian rewards (NMRDP) (Bacchus, Boutilier, & Grove, 1996). A difficulty with NMRDPs is that the most efficient MDP solution methods do not directly apply to them. The traditional way to circumvent this problem is to formulate the NMRDP as an equivalent MDP, whose states result from augmenting those of the original NMRDP with extra information capturing enough history to make the reward Markovian. Handcrafting such an MDP can however be very difficult in general. This is exacerbated by the fact that the size of the MDP impacts the effectiveness of many solution methods.
Therefore, there has been interest in automating the translation into an MDP, starting from a natural specification of non-Markovian rewards and of the system's dynamics (Bacchus et al., 1996; Bacchus, Boutilier, & Grove, 1997). This is the problem we focus on.

1.2 Existing Approaches

When solving NMRDPs in this setting, the central issue is to define a non-Markovian reward specification language and a translation into an MDP adapted to the class of MDP solution methods and representations we would like to use for the type of problems at hand. More precisely, there is a tradeoff between the effort spent in the translation, e.g. in producing a small equivalent MDP without many irrelevant history distinctions, and the effort required to solve it. Appropriate resolution of this tradeoff depends on the type of representations and solution methods envisioned for the MDP. For instance, structured representations and solution methods which have some ability to ignore irrelevant information may cope with a crude translation, while state-based (flat) representations and methods will require a more sophisticated translation producing an MDP as small as feasible.

Both of the two previous proposals within this line of research rely on past linear temporal logic (PLTL) formulae to specify the behaviours to be rewarded (Bacchus et al., 1996, 1997). A nice feature of PLTL is that it yields a straightforward semantics of non-Markovian rewards, and lends itself to a range of translations from the crudest to the finest. The two proposals adopt very different translations adapted to two very different types of solution methods and representations.
The first (Bacchus et al., 1996) targets classical state-based solution methods such as policy iteration (Howard, 1960) which generate complete policies at the cost of enumerating all states in the entire MDP. Consequently, it adopts an expensive translation which attempts to produce a minimal MDP. By contrast, the second translation (Bacchus et al., 1997) is very efficient but crude, and targets structured solution methods and representations (see e.g., Hoey, St-Aubin, Hu, & Boutilier, 1999; Boutilier, Dearden, & Goldszmidt, 2000; Feng & Hansen, 2002), which do not require explicit state enumeration.

1.3 A New Approach

The first contribution of this paper is to provide a language and a translation adapted to another class of solution methods which have proven quite effective in dealing with large MDPs, namely anytime state-based heuristic search methods such as LAO* (Hansen & Zilberstein, 2001), LRTDP (Bonet & Geffner, 2003), and their ancestors (Barto, Bradtke, & Singh, 1995; Dean, Kaelbling, Kirman, & Nicholson, 1995; Thiébaux, Hertzberg, Shoaff, & Schneider, 1995). These methods typically start with a compact representation of the MDP based on probabilistic planning operators, and search forward from an initial state, constructing new states by expanding the envelope of the policy as time permits. They may produce an approximate and even incomplete policy, but explicitly construct and explore only a fraction of the MDP. Neither of the two previous proposals is well-suited to such solution methods, the first because the cost of the translation (most of which is performed prior to the solution phase) annihilates the benefits of anytime algorithms, and the second because the size of the MDP obtained is an obstacle to the applicability of state-based methods.
Since here both the cost of the translation and the size of the MDP it results in will severely impact the quality of the policy obtainable by the deadline, we need an appropriate resolution of the tradeoff between the two.

Our approach has the following main features. The translation is entirely embedded in the anytime solution method, to which full control is given as to which parts of the MDP will be explicitly constructed and explored. While the MDP obtained is not minimal, it is of the minimal size achievable without stepping outside of the anytime framework, i.e., without enumerating parts of the state space that the solution method would not necessarily explore. We formalise this relaxed notion of minimality, which we call blind minimality in reference to the fact that it does not require any lookahead (beyond the fringe). This is appropriate in the context of anytime state-based solution methods, where we want the minimal MDP achievable without expensive pre-processing.

When the rewarding behaviours are specified in PLTL, there does not appear to be a way of achieving a relaxed notion of minimality as powerful as blind minimality without a prohibitive translation. Therefore, instead of PLTL, we adopt a variant of future linear temporal logic (FLTL) as our specification language, which we extend to handle rewards. While the language has a more complex semantics than PLTL, it enables a natural translation into a blind-minimal MDP by simple progression of the reward formulae. Moreover, search control knowledge expressed in FLTL (Bacchus & Kabanza, 2000) fits particularly nicely in this framework, and can be used to dramatically reduce the fraction of the search space explored by the solution method.
1.4 A New System

Our second contribution is nmrdpp, the first reported implementation of NMRDP solution methods. nmrdpp is designed as a software platform for their development and experimentation under a common interface. Given a description of the actions in a domain, nmrdpp lets the user play with and compare various encoding styles for non-Markovian rewards and search control knowledge, various translations of the resulting NMRDP into an MDP, and various MDP solution methods. While solving the problem, it can be made to record a range of statistics about the space and time behaviour of the algorithms. It also supports the graphical display of the MDPs and policies generated.

While nmrdpp's primary interest is in the treatment of non-Markovian rewards, it is also a competitive platform for decision-theoretic planning with purely Markovian rewards. In the First International Probabilistic Planning Competition, nmrdpp was able to enrol in both the domain-independent and hand-coded tracks, attempting all problems featuring in the contest. Thanks to its use of search control knowledge, it scored second place in the hand-coded track, which featured probabilistic variants of blocks world and logistics problems. More surprisingly, it also scored second in the domain-independent subtrack consisting of all problems that were not taken from the blocks world and logistics domains. Most of these latter problems had not been released to the participants prior to the competition.

1.5 A New Experimental Analysis

Our third contribution is an experimental analysis of the factors that affect the performance of NMRDP solution methods.
Using nmrdpp, we compared their behaviours under the influence of parameters such as the structure and degree of uncertainty in the dynamics, the type of rewards and the syntax used to describe them, reachability of the conditions tracked, and relevance of rewards to the optimal policy. We were able to identify a number of general trends in the behaviours of the methods and provide advice concerning which are best suited in certain circumstances. Our experiments also led us to rule out one of the methods as systematically underperforming, and to identify issues with the claim of minimality made by one of the PLTL approaches.

1.6 Organisation of the Paper

The paper is organised as follows. Section 2 begins with background material on MDPs, NMRDPs, and existing approaches. Section 3 describes our new approach and Section 4 presents nmrdpp. Sections 5 and 6 report our experimental analysis of the various approaches. Section 7 explains how we used nmrdpp in the competition. Section 8 concludes with remarks about related and future work. Appendix B gives the proofs of the theorems. Most of the material presented is compiled from a series of recent conference and workshop papers (Thiébaux, Kabanza, & Slaney, 2002a, 2002b; Gretton, Price, & Thiébaux, 2003a, 2003b). Details of the logic we use to represent rewards may be found in our 2005 paper (Slaney, 2005).

2. Background

2.1 MDPs, NMRDPs, Equivalence

We start with some notation and definitions. Given a finite set S of states, we write S* for the set of finite sequences of states over S, and Sω for the set of possibly infinite state sequences. Where 'Γ' stands for a possibly infinite state sequence in Sω and i is a natural number, by 'Γ_i' we mean the state of index i in Γ, and by 'Γ(i)' we mean the prefix ⟨Γ_0, ..., Γ_i⟩ ∈ S* of Γ.
Γ;Γ′ denotes the concatenation of Γ ∈ S* and Γ′ ∈ Sω.

2.1.1 MDPs

A Markov decision process of the type we consider is a 5-tuple ⟨S, s_0, A, Pr, R⟩, where S is a finite set of fully observable states, s_0 ∈ S is the initial state, A is a finite set of actions (A(s) denotes the subset of actions applicable in s ∈ S), {Pr(s, a, ·) | s ∈ S, a ∈ A(s)} is a family of probability distributions over S such that Pr(s, a, s′) is the probability of being in state s′ after performing action a in state s, and R : S → ℝ is a reward function such that R(s) is the immediate reward for being in state s. It is well known that such an MDP can be compactly represented using dynamic Bayesian networks (Dean & Kanazawa, 1989; Boutilier et al., 1999) or probabilistic extensions of traditional planning languages (see e.g., Kushmerick, Hanks, & Weld, 1995; Thiébaux et al., 1995; Younes & Littman, 2004).

A stationary policy for an MDP is a function π : S → A, such that π(s) ∈ A(s) is the action to be executed in state s. The value V_π of the policy at s_0, which we seek to maximise, is the sum of the expected future rewards over an infinite horizon, discounted by how far into the future they occur:

V_π(s_0) = lim_{n→∞} E[ Σ_{i=0}^{n} β^i R(Γ_i) | π, Γ_0 = s_0 ]

where 0 ≤ β < 1 is the discount factor controlling the contribution of distant rewards.
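As an illustration of this value definition, V_π is the fixed point of the corresponding Bellman equation V_π(s) = R(s) + β Σ_{s′} Pr(s, π(s), s′) V_π(s′), which can be computed by simple iteration. The two-state MDP, rewards, and policy below are a toy example of our own, not taken from the paper:

```python
# Policy evaluation by iterating the discounted Bellman backup
# V(s) <- R(s) + beta * sum_s' Pr(s, pi(s), s') * V(s').
# States, rewards, and policy here are illustrative only.

S = ["s0", "s1"]
R = {"s0": 0.0, "s1": 1.0}            # state-based (Markovian) reward
pi = {"s0": "a", "s1": "c"}           # a fixed stationary policy
Pr = {                                # Pr[(s, a)] = {s': probability}
    ("s0", "a"): {"s0": 0.9, "s1": 0.1},
    ("s1", "c"): {"s1": 1.0},
}
beta = 0.9

V = {s: 0.0 for s in S}
for _ in range(1000):                 # iterate to (near) the fixed point
    V = {s: R[s] + beta * sum(p * V[t] for t, p in Pr[(s, pi[s])].items())
         for s in S}

# s1 is absorbing and rewarded, so V(s1) = 1/(1 - beta) = 10
print(round(V["s1"], 3))
```

Since s1 loops to itself under c, its value is the geometric series Σ β^i = 1/(1−β) = 10, which the iteration converges to.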
In the initial state s_0, p is false and two actions are possible: a causes a transition to s_1 with probability 0.1, and no change with probability 0.9, while for b the transition probabilities are 0.5. In state s_1, p is true, and actions c and d ("stay" and "go") lead to s_1 and s_0 respectively with probability 1. A reward is received the first time p is true, but not subsequently. That is, the rewarded state sequences are: ⟨s_0, s_1⟩, ⟨s_0, s_0, s_1⟩, ⟨s_0, s_0, s_0, s_1⟩, ⟨s_0, s_0, s_0, s_0, s_1⟩, etc.

Figure 1: A Simple NMRDP

2.1.2 NMRDPs

A decision process with non-Markovian rewards is identical to an MDP except that the domain of the reward function is S*. The idea is that if the process has passed through state sequence Γ(i) up to stage i, then the reward R(Γ(i)) is received at stage i. Figure 1 gives an example. Like the reward function, a policy for an NMRDP depends on history, and is a mapping from S* to A. As before, the value of policy π is the expectation of the discounted cumulative reward over an infinite horizon:

V_π(s_0) = lim_{n→∞} E[ Σ_{i=0}^{n} β^i R(Γ(i)) | π, Γ_0 = s_0 ]

For a decision process D = ⟨S, s_0, A, Pr, R⟩ and a state s ∈ S, we let D̃(s) stand for the set of state sequences rooted at s that are feasible under the actions in D, that is:

D̃(s) = {Γ ∈ Sω | Γ_0 = s and ∀i ∃a ∈ A(Γ_i) Pr(Γ_i, a, Γ_{i+1}) > 0}.

Note that the definition of D̃(s) does not depend on R and therefore applies to both MDPs and NMRDPs.

2.1.3 Equivalence

The clever algorithms developed to solve MDPs cannot be directly applied to NMRDPs.
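The reward of the Figure 1 example depends only on whether s_1 appears earlier in the prefix, which makes it easy to state as a function of sequence prefixes. A minimal sketch (state names as in the figure):

```python
# Non-Markovian reward of the Figure 1 example: a sequence prefix Γ(i)
# is rewarded exactly when p becomes true (state s1) for the first time.

def reward(prefix):                       # prefix = (state_0, ..., state_i)
    return 1.0 if prefix[-1] == "s1" and "s1" not in prefix[:-1] else 0.0

print(reward(("s0", "s1")))               # first achievement of p -> 1.0
print(reward(("s0", "s0", "s1")))         # still the first time   -> 1.0
print(reward(("s0", "s1", "s1")))         # p already held before  -> 0.0
```

No single state carries enough information to compute this reward, which is exactly why the reward function's domain must be S* rather than S.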
One way of dealing with this problem is to translate the NMRDP into an equivalent MDP with an expanded state space (Bacchus et al., 1996). The expanded states in this MDP (e-states, for short) augment the states of the NMRDP by encoding additional information sufficient to make the reward history-independent. For instance, if we only want to reward the very first achievement of goal g in an NMRDP, the states of an equivalent MDP would carry one extra bit of information recording whether g has already been true. An e-state can be seen as labelled by a state of the NMRDP (via the function τ in Definition 1 below) and by history information. The dynamics of NMRDPs being Markovian, the actions and their probabilistic effects in the MDP are exactly those of the NMRDP. The following definition, adapted from that given by Bacchus et al. (1996), makes this concept of equivalent MDP precise. Figure 2 gives an example.
Figure 2: An MDP equivalent to the NMRDP in Figure 1. τ(s′_0) = τ(s′_2) = s_0; τ(s′_1) = τ(s′_3) = s_1. The initial state is s′_0. State s′_1 is rewarded; the other states are not.

Definition 1 MDP D′ = ⟨S′, s′_0, A′, Pr′, R′⟩ is equivalent to NMRDP D = ⟨S, s_0, A, Pr, R⟩ if there exists a mapping τ : S′ → S such that:¹

1. τ(s′_0) = s_0.
2. For all s′ ∈ S′, A′(s′) = A(τ(s′)).
3. For all s_1, s_2 ∈ S, if there is a ∈ A(s_1) such that Pr(s_1, a, s_2) > 0, then for all s′_1 ∈ S′ such that τ(s′_1) = s_1, there exists a unique s′_2 ∈ S′, τ(s′_2) = s_2, such that for all a′ ∈ A′(s′_1), Pr′(s′_1, a′, s′_2) = Pr(s_1, a′, s_2).
4. For any feasible state sequences Γ ∈ D̃(s_0) and Γ′ ∈ D̃′(s′_0) such that τ(Γ′_i) = Γ_i for all i, we have: R′(Γ′_i) = R(Γ(i)) for all i.

Items 1–3 ensure that there is a bijection between feasible state sequences in the NMRDP and feasible e-state sequences in the MDP. Therefore, a stationary policy for the MDP can be reinterpreted as a non-stationary policy for the NMRDP. Furthermore, item 4 ensures that the two policies have identical values, and that consequently, solving an NMRDP optimally reduces to producing an equivalent MDP and solving it optimally (Bacchus et al., 1996):

Proposition 1 Let D be an NMRDP, D′ an equivalent MDP for it, and π′ a policy for D′. Let π be the function defined on the sequence prefixes Γ(i) ∈ D̃(s_0) by π(Γ(i)) = π′(Γ′_i), where for all j ≤ i, τ(Γ′_j) = Γ_j. Then π is a policy for D such that V_π(s_0) = V_{π′}(s′_0).
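Item 4 of the definition can be spot-checked on the Figure 2 construction: the four e-states add one bit of history ("has p already been rewarded?") to the two NMRDP states. The e-state names and τ follow the figure; the particular trajectory checked below is our own choice of one feasible run:

```python
# Sanity check of Definition 1, item 4, on the Figure 2 e-states:
# along a mapped trajectory, the Markovian e-state reward R' must equal
# the history-based reward R of the original NMRDP at every stage.

tau = {"s0'": "s0", "s1'": "s1", "s2'": "s0", "s3'": "s1"}
R_e = {"s0'": 0.0, "s1'": 1.0, "s2'": 0.0, "s3'": 0.0}   # only s1' rewarded

def R_seq(prefix):               # NMRDP reward: first time the run is in s1
    return 1.0 if prefix[-1] == "s1" and "s1" not in prefix[:-1] else 0.0

# one feasible e-state trajectory: stay (a), reach p (a), stay (c), go (d)
traj = ["s0'", "s0'", "s1'", "s3'", "s2'"]
for i in range(len(traj)):
    prefix = tuple(tau[e] for e in traj[: i + 1])   # Γ(i) is the τ-image
    assert R_e[traj[i]] == R_seq(prefix)            # item 4 holds at stage i
print("rewards agree at every stage of this trajectory")
```

The extra bit is visible in the trajectory: s′_1 and s′_3 both map to s_1, but only s′_1 (the "not yet rewarded" copy) carries the reward.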
1. Technically, the definition allows the sets of actions A and A′ to be different, but any action in which they differ must be inapplicable in reachable states in the NMRDP and in all e-states in the equivalent MDP. For all practical purposes, A and A′ can be seen as identical.

2.2 Existing Approaches

Both existing approaches to NMRDPs (Bacchus et al., 1996, 1997) use a temporal logic of the past (PLTL) to compactly represent non-Markovian rewards and exploit this compact representation to translate the NMRDP into an MDP amenable to off-the-shelf solution methods. However, they target different classes of MDP representations and solution methods, and consequently, adopt different styles of translations.

Bacchus et al. (1996) target state-based MDP representations. The equivalent MDP is first generated entirely – this involves the enumeration of all e-states and all transitions between them. Then, it is solved using traditional dynamic programming methods such as value or policy iteration. Because these methods are extremely sensitive to the number of states, attention is paid to producing a minimal equivalent MDP (with the least number of states). A first simple translation which we call pltlsim produces a large MDP which can be post-processed for minimisation before being solved. Another, which we call pltlmin, directly results in a minimal MDP, but relies on an expensive pre-processing phase.

The second approach (Bacchus et al., 1997), which we call pltlstr, targets structured MDP representations: the transition model, policies, reward and value functions are represented in a compact form, e.g. as trees or algebraic decision diagrams (ADDs) (Hoey et al., 1999; Boutilier et al., 2000).
For instance, the probability of a given proposition (state variable) being true after the execution of an action is specified by a tree whose internal nodes are labelled with the state variables on whose previous values the given variable depends, whose arcs are labelled by the possible previous values (⊤ or ⊥) of these variables, and whose leaves are labelled with probabilities. The translation amounts to augmenting the structured MDP with new temporal variables tracking the relevant properties of state sequences, together with the compact representation of (1) their dynamics, e.g. as trees over the previous values of relevant variables, and (2) the non-Markovian reward function in terms of the variables' current values. Then, structured solution methods such as structured policy iteration or the SPUDD algorithm are run on the resulting structured MDP. Neither the translation nor the solution methods explicitly enumerates the states.

We now review these approaches in some detail. The reader is referred to the respective papers for additional information.

2.2.1 Representing Rewards with PLTL

The syntax of PLTL, the language chosen to represent rewarding behaviours, is that of propositional logic, augmented with the operators ⊖ (previously) and S (since) (see Emerson, 1990). Whereas a classical propositional logic formula denotes a set of states (a subset of S), a PLTL formula denotes a set of finite sequences of states (a subset of S*). A formula without temporal modality expresses a property that must be true of the current state, i.e., the last state of the finite sequence. ⊖f specifies that f holds in the previous state (the state one before the last). f_1 S f_2 requires f_2 to have been true at some point in the sequence, and, unless that point is the present, f_1 to have held ever since.
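Such a tree-structured conditional probability can be sketched very compactly. The encoding below (nested tuples for internal nodes, floats for leaves) and the particular tree are our own illustration, not taken from the paper:

```python
# A structured CPT of the kind used by pltlstr: the probability that
# proposition p holds after an action, as a decision tree over the
# *previous* values of state variables. The tree itself is made up.

# node = ("var", subtree_if_true, subtree_if_false) | probability leaf
tree = ("p", 1.0,                 # if p already held, it persists
        ("q", 0.8, 0.1))          # otherwise it depends on q's old value

def prob_true(tree, prev):        # prev: previous values of the variables
    while not isinstance(tree, float):
        var, hi, lo = tree
        tree = hi if prev[var] else lo
    return tree

print(prob_true(tree, {"p": False, "q": True}))   # -> 0.8
```

The point of the representation is that the tree mentions only the variables the proposition actually depends on, so its size need not grow with the full state space.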
More formally, the modelling relation |= stating whether a formula f holds of a finite sequence Γ(i) is defined recursively as follows:

• Γ(i) |= p iff p ∈ Γi, for p ∈ P, the set of atomic propositions
• Γ(i) |= ¬f iff Γ(i) ⊭ f
• Γ(i) |= f1 ∧ f2 iff Γ(i) |= f1 and Γ(i) |= f2
• Γ(i) |= ⊖f iff i > 0 and Γ(i−1) |= f
• Γ(i) |= f1 S f2 iff ∃j ≤ i, Γ(j) |= f2 and ∀k, j < k ≤ i, Γ(k) |= f1

From S, one can define the useful operators ♦- f ≡ ⊤ S f, meaning that f has been true at some point, and ⊟f ≡ ¬♦-¬f, meaning that f has always been true. E.g., g ∧ ¬⊖♦- g denotes the set of finite sequences ending in a state where g is true for the first time in the sequence. Other useful abbreviations are ⊖^k (k times ago), for k iterations of the ⊖ modality, ♦-^k f for ∨_{i=1}^{k} ⊖^i f (f was true at some of the last k steps), and ⊟^k f for ∧_{i=1}^{k} ⊖^i f (f was true at all of the last k steps).

Non-Markovian reward functions are described with a set of pairs (f_i : r_i), where f_i is a PLTL reward formula and r_i is a real, with the semantics that the reward assigned to a sequence in S* is the sum of the r_i's for which that sequence is a model of f_i. Below, we let F denote the set of reward formulae f_i in the description of the reward function.

Bacchus et al. (1996) give a list of behaviours which it might be useful to reward, together with their expression in PLTL. For instance, where f is an atemporal formula, (f : r) rewards with r units the achievement of f whenever it happens. This is a Markovian reward. In contrast, (♦- f : r) rewards every state following (and including) the achievement of f, while (f ∧ ¬⊖♦- f : r) only rewards the first occurrence of f. (f ∧ ⊟^k ¬f : r) rewards the occurrence of f at most once every k steps.
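The recursive semantics above transcribes directly into code. The following is a minimal Python sketch (ours, not part of the nmrdpp system); the tuple encoding of formulae and the names `holds` and `reward` are our own conventions:

```python
# Formulae as nested tuples: ('prop','p'), ('not',f), ('and',f1,f2),
# ('prev',f) for ⊖f, ('since',f1,f2) for f1 S f2.
# A finite sequence Γ(i) is a list of states, each a frozenset of the
# atomic propositions true in that state.

def holds(seq, i, f):
    """Γ(i) |= f, following the recursive definition above."""
    op = f[0]
    if op == 'prop':
        return f[1] in seq[i]
    if op == 'not':
        return not holds(seq, i, f[1])
    if op == 'and':
        return holds(seq, i, f[1]) and holds(seq, i, f[2])
    if op == 'prev':        # ⊖f: i > 0 and f holds of Γ(i-1)
        return i > 0 and holds(seq, i - 1, f[1])
    if op == 'since':       # ∃j ≤ i with f2 at j, and f1 at all k, j < k ≤ i
        return any(holds(seq, j, f[2]) and
                   all(holds(seq, k, f[1]) for k in range(j + 1, i + 1))
                   for j in range(i + 1))
    raise ValueError(op)

def reward(seq, i, pairs):
    """Sum of the r over pairs (f : r) whose formula models Γ(i)."""
    return sum(r for f, r in pairs if holds(seq, i, f))

# For example, the reward formula q ∧ ⊖⊖p:
f = ('and', ('prop', 'q'), ('prev', ('prev', ('prop', 'p'))))
seq = [frozenset({'p'}), frozenset({'p'}), frozenset({'q'})]
print(reward(seq, 2, [(f, 1.0)]))   # 1.0: q holds now and p held two steps ago
```

The derived operators ♦- f ≡ ⊤ S f and ⊟f would follow by adding a case for a constant-⊤ formula.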
(⊖^n ¬⊖⊥ : r) rewards the nth state, independently of its properties. (⊖^2 f1 ∧ ⊖f2 ∧ f3 : r) rewards the occurrence of f1 immediately followed by f2 and then f3.

In reactive planning, so-called response formulae, which describe that the achievement of f is triggered by a condition (or command) c, are particularly useful. These can be written as (f ∧ ♦- c : r) if every state in which f is true following the first issue of the command is to be rewarded. Alternatively, they can be written as (f ∧ ⊖(¬f S c) : r) if only the first occurrence of f is to be rewarded after each command. It is common to only reward the achievement of f within k steps of the trigger; we write for example (f ∧ ♦-^k c : r) to reward all such states in which f holds.

From a theoretical point of view, it is known (Lichtenstein, Pnueli, & Zuck, 1985) that the behaviours representable in PLTL are exactly those corresponding to star-free regular languages. Non-star-free behaviours such as (pp)* (reward an even number of states all containing p) are therefore not representable. Nor, of course, are non-regular behaviours such as p^n q^n (e.g. reward taking equal numbers of steps to the left and right). We shall not speculate here on how severe a restriction this is for the purposes of planning.

2.2.2 Principles Behind the Translations

All three translations into an MDP (pltlsim, pltlmin, and pltlstr) rely on the equivalence f1 S f2 ≡ f2 ∨ (f1 ∧ ⊖(f1 S f2)), with which we can decompose temporal modalities into a requirement about the last state Γi of a sequence Γ(i), and a requirement about the prefix Γ(i−1) of the sequence. More precisely, given state s and a formula f, one can compute in 2^O(||f||) a new formula Reg(f, s) called the regression of f through s.
Regression has the property that, for i > 0, f is true of a finite sequence Γ(i) ending with Γi = s iff Reg(f, s) is true of the prefix Γ(i−1). That is, Reg(f, s) represents what must have been true previously for f to be true now. Reg is defined as follows:

• Reg(p, s) = ⊤ if p ∈ s and ⊥ otherwise, for p ∈ P
• Reg(¬f, s) = ¬Reg(f, s)
• Reg(f1 ∧ f2, s) = Reg(f1, s) ∧ Reg(f2, s)
• Reg(⊖f, s) = f
• Reg(f1 S f2, s) = Reg(f2, s) ∨ (Reg(f1, s) ∧ (f1 S f2))

For instance, take a state s in which p holds and q does not, and take f = (⊖¬q) ∧ (p S q), meaning that q must have been false one step ago, but that it must have held at some point in the past and that p must have held since q last did. Reg(f, s) = ¬q ∧ (p S q); that is, for f to hold now, at the previous stage q had to be false and the p S q requirement still had to hold. When p and q are both false in s, then Reg(f, s) = ⊥, indicating that f cannot be satisfied, regardless of what came earlier in the sequence.

For notational convenience, where X is a set of formulae we write X̄ for X ∪ {¬x | x ∈ X}.

Now the translations exploit the PLTL representation of rewards as follows. Each expanded state (e-state) in the generated MDP can be seen as labelled with a set Ψ ⊆ S̄ub(F) of subformulae of the reward formulae in F (and their negations). The subformulae in Ψ must be (1) true of the paths leading to the e-state, and (2) sufficient to determine the current truth of all reward formulae in F, as this is needed to compute the current reward. Ideally the Ψs should also be (3) small enough to enable just that, i.e. they should not contain subformulae which draw history distinctions that are irrelevant to determining the reward at one point or another.
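Reg likewise admits a direct transcription. Below is a Python sketch (ours), reusing the tuple encoding of formulae; the constructors `conj` and `disj` apply the usual ⊤/⊥ simplifications, so that unsatisfiable regressions collapse to ⊥:

```python
# Tuple-encoded PLTL formulae, plus the constants ('true',) and ('false',).
TRUE, FALSE = ('true',), ('false',)

def neg(f):
    return TRUE if f == FALSE else FALSE if f == TRUE else ('not', f)

def conj(a, b):
    if FALSE in (a, b): return FALSE
    if a == TRUE: return b
    if b == TRUE: return a
    return ('and', a, b)

def disj(a, b):
    if TRUE in (a, b): return TRUE
    if a == FALSE: return b
    if b == FALSE: return a
    return ('or', a, b)

def reg(f, s):
    """Reg(f, s): what must have held of the prefix for f to hold now."""
    op = f[0]
    if op in ('true', 'false'): return f
    if op == 'prop':  return TRUE if f[1] in s else FALSE
    if op == 'not':   return neg(reg(f[1], s))
    if op == 'and':   return conj(reg(f[1], s), reg(f[2], s))
    if op == 'or':    return disj(reg(f[1], s), reg(f[2], s))
    if op == 'prev':  return f[1]                      # Reg(⊖f, s) = f
    if op == 'since':                                  # the S-equivalence
        return disj(reg(f[2], s), conj(reg(f[1], s), f))
    raise ValueError(op)

# The worked example: f = (⊖¬q) ∧ (p S q), in a state where p holds, q doesn't.
f = ('and', ('prev', ('not', ('prop', 'q'))),
            ('since', ('prop', 'p'), ('prop', 'q')))
print(reg(f, {'p'}))    # the tuple for ¬q ∧ (p S q), as in the text
print(reg(f, set()))    # ('false',): f can no longer be satisfied
```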
Note however that in the worst case, the number of distinctions needed, even in the minimal equivalent MDP, may be exponential in ||F||. This happens for instance with the formula ⊖^k f, which requires k additional bits of information memorising the truth of f over the last k steps.

2.2.3 pltlsim

For the choice of the Ψs, Bacchus et al. (1996) consider two cases. In the simple case, which we call pltlsim, an MDP obeying properties (1) and (2) is produced by simply labelling each e-state with the set of all subformulae in S̄ub(F) which are true of the sequence leading to that e-state. This MDP is generated forward, starting from the initial e-state labelled with s0 and with the set Ψ0 ⊆ S̄ub(F) of all subformulae which are true of the sequence ⟨s0⟩. The successors of any e-state labelled by NMRDP state s and subformula set Ψ are generated as follows: each of them is labelled by a successor s′ of s in the NMRDP and by the set of subformulae {ψ′ ∈ S̄ub(F) | Ψ |= Reg(ψ′, s′)}.

For instance, consider the NMRDP shown in Figure 3. The set F = {q ∧ ⊖⊖p} consists of a single reward formula. The set S̄ub(F) consists of all subformulae of this reward formula,

2. The size ||f|| of a reward formula is measured as its length, and the size ||F|| of a set of reward formulae F is measured as the sum of the lengths of the formulae in F.

[Figure 3 shows the states of the NMRDP described below and the transition probabilities of actions a and b.] In the initial state, both p and q are false. When p is false, action a independently sets p and q to true with probability 0.8. When both p and q are false, action b sets q to true with probability 0.8. Both actions have no effect otherwise. A reward is obtained whenever q ∧ ⊖⊖p.
The optimal policy is to apply b until q gets produced, making sure to avoid the state on the left-hand side, then to apply a until p gets produced, and then to apply a or b indifferently forever.

Figure 3: Another Simple NMRDP

and their negations, that is S̄ub(F) = {p, q, ⊖p, ⊖⊖p, q ∧ ⊖⊖p, ¬p, ¬q, ¬⊖p, ¬⊖⊖p, ¬(q ∧ ⊖⊖p)}. The equivalent MDP produced by pltlsim is shown in Figure 4.

2.2.4 pltlmin

Unfortunately, the MDPs produced by pltlsim are far from minimal. Although they could be post-processed for minimisation before invoking the MDP solution method, the above expansion may still constitute a serious bottleneck. Therefore, Bacchus et al. (1996) consider a more complex two-phase translation, which we call pltlmin, capable of producing an MDP also satisfying property (3). Here, a preprocessing phase iterates over all states in S, and computes, for each state s, a set l(s) of subformulae, where the function l is the solution of the fixpoint equation l(s) = F ∪ {Reg(ψ′, s′) | ψ′ ∈ l(s′), s′ is a successor of s}. Only subformulae in l(s) will be candidates for inclusion in the sets labelling the respective e-states labelled with s. That is, the subsequent expansion phase will be as above, but taking Ψ0 ⊆ l(s0) and Ψ′ ⊆ l(s′) instead of Ψ0 ⊆ S̄ub(F) and Ψ′ ⊆ S̄ub(F). As the subformulae in l(s) are exactly those that are relevant to the way feasible execution sequences starting from e-states labelled with s are rewarded, this leads the expansion phase to produce a minimal equivalent MDP. Figure 5 shows the equivalent MDP produced by pltlmin for the NMRDP example in Figure 3, together with the function l from which the labels are built.
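The fixpoint equation for l can be computed by a simple iteration to stability. The sketch below (ours, self-contained: it repeats the regression rules for the connectives needed) runs it on the four states of the NMRDP of Figure 3, with successor sets read off the figure; the constants ⊤ and ⊥ are dropped since they carry no history:

```python
TRUE, FALSE = ('true',), ('false',)

def conj(a, b):
    if FALSE in (a, b): return FALSE
    if a == TRUE: return b
    if b == TRUE: return a
    return ('and', a, b)

def disj(a, b):
    if TRUE in (a, b): return TRUE
    if a == FALSE: return b
    if b == FALSE: return a
    return ('or', a, b)

def reg(f, s):                      # regression, as in Section 2.2.2
    op = f[0]
    if op == 'prop':  return TRUE if f[1] in s else FALSE
    if op == 'and':   return conj(reg(f[1], s), reg(f[2], s))
    if op == 'prev':  return f[1]
    if op == 'since': return disj(reg(f[2], s), conj(reg(f[1], s), f))
    return f

# States of Figure 3 and their non-zero-probability successors.
E, P, Q, PQ = (frozenset(s) for s in ('', 'p', 'q', 'pq'))
succ = {E: [E, P, Q, PQ], P: [P], Q: [Q, PQ], PQ: [PQ]}

rwd = ('and', ('prop', 'q'), ('prev', ('prev', ('prop', 'p'))))  # q ∧ ⊖⊖p
F = {rwd}

l = {s: set(F) for s in succ}   # l(s) = F ∪ {Reg(ψ', s') | ψ' ∈ l(s')}
changed = True
while changed:
    changed = False
    for s in succ:
        new = set(F)
        for s2 in succ[s]:
            for psi in l[s2]:
                r = reg(psi, s2)
                if r not in (TRUE, FALSE):
                    new.add(r)
        if new != l[s]:
            l[s], changed = new, True

print(len(l[P]))   # 1: l({p}) contains only the reward formula
```

The result agrees with the function l shown with Figure 5: l({p}) = {q ∧ ⊖⊖p}, while the three other states get {q ∧ ⊖⊖p, ⊖p, p}.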
Observe how this MDP is smaller than the pltlsim MDP: once we reach the state on the left-hand side in which p is true and q is false, there is no point in tracking the values of subformulae, because q cannot become true and so the reward formula cannot either. This is reflected by the fact that l({p}) only contains the reward formula.

In the worst case, computing l requires a space, and a number of iterations through S, exponential in ||F||. Hence the question arises of whether the gain during the expansion phase is worth the extra complexity of the preprocessing phase. This is one of the questions our experimental analysis in Section 5 will try to answer.

[Figure 4 shows the equivalent MDP produced by pltlsim for the NMRDP of Figure 3. Each e-state is labelled with the subformulae of S̄ub(F) true of the sequences leading to it, namely: f1: p, f2: q, f3: ⊖p, f4: ⊖⊖p, f5: q ∧ ⊖⊖p, f6: ¬p, f7: ¬q, f8: ¬⊖p, f9: ¬⊖⊖p, f10: ¬(q ∧ ⊖⊖p). Only the e-state labelled {f1, f2, f3, f4, f5} has reward 1; all others have reward 0.]

Figure 4: Equivalent MDP Produced by pltlsim

[Figure 5 shows the equivalent MDP produced by pltlmin. The function l is given by: l({}) = {q ∧ ⊖⊖p, ⊖p, p}, l({p}) = {q ∧ ⊖⊖p}, l({q}) = {q ∧ ⊖⊖p, ⊖p, p}, l({p, q}) = {q ∧ ⊖⊖p, ⊖p, p}. The following formulae label the e-states: f1: q ∧ ⊖⊖p, f2: ⊖p, f3: p, f4: ¬(q ∧ ⊖⊖p), f5: ¬⊖p, f6: ¬p.]

Figure 5: Equivalent MDP Produced by pltlmin

2.2.5 pltlstr

The pltlstr translation can be seen as a symbolic version of pltlsim. The set T of added temporal variables contains the purely temporal subformulae PTSub(F) of the reward formulae in F, to which the ⊖ modality is prepended (unless already there): T = {⊖ψ | ψ ∈ PTSub(F), ψ ≠ ⊖ψ′} ∪ {⊖ψ | ⊖ψ ∈ PTSub(F)}. By repeatedly applying the equivalence f1 S f2 ≡ f2 ∨ (f1 ∧ ⊖(f1 S f2)) to any subformula in PTSub(F), we can express its current value, and hence that of the reward formulae, as a function of the current values of formulae in T and state variables, as required by the compact representation of the transition and reward models.

For our NMRDP example in Figure 3, the set of purely temporal variables is PTSub(F) = {⊖p, ⊖⊖p}, and T is identical to PTSub(F). Figure 6 shows some of the ADDs forming part of the symbolic MDP produced by pltlstr: the ADDs describing the dynamics of the temporal variables, i.e., the ADDs describing the effects of the actions a and b on their respective values, and the ADD describing the reward.

[Figure 6 shows three ADDs: (1) the dynamics of ⊖p, (2) the dynamics of ⊖⊖p, and (3) the reward; prv (previously) stands for ⊖.]

Figure 6: ADDs Produced by pltlstr
As a more complex illustration, consider this example (Bacchus et al., 1997) in which F = {♦-(p S (q ∨ ⊖r))} ≡ {⊤ S (p S (q ∨ ⊖r))}. We have that PTSub(F) = {⊤ S (p S (q ∨ ⊖r)), p S (q ∨ ⊖r), ⊖r}, and so the set of temporal variables used is T = {t1: ⊖(⊤ S (p S (q ∨ ⊖r))), t2: ⊖(p S (q ∨ ⊖r)), t3: ⊖r}. Using the equivalences, the reward can be decomposed and expressed by means of the propositions p, q and the temporal variables t1, t2, t3 as follows:

⊤ S (p S (q ∨ ⊖r)) ≡ (p S (q ∨ ⊖r)) ∨ ⊖(⊤ S (p S (q ∨ ⊖r)))
                   ≡ (q ∨ ⊖r) ∨ (p ∧ ⊖(p S (q ∨ ⊖r))) ∨ t1
                   ≡ (q ∨ t3) ∨ (p ∧ t2) ∨ t1

As with pltlsim, the underlying MDP produced by pltlstr is far from minimal – the encoded history features do not even vary from one state to the next. However, size is not as problematic as with state-based approaches, because structured solution methods do not enumerate states and are able to dynamically ignore some of the variables that become irrelevant during policy construction. For instance, when solving the MDP, they may be able to determine that some temporal variables have become irrelevant because the situation they track, although possible in principle, is too costly to be realised under a good policy. This dynamic analysis of rewards contrasts with pltlmin's static analysis (Bacchus et al., 1996), which must encode enough history to determine the reward at all reachable future states under any policy.

One question that arises is that of the circumstances under which this analysis of irrelevance by structured solution methods, especially its dynamic aspects, is really effective. This is another question our experimental analysis will try to address.
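The decomposition above can be mechanised: walk the formula, allocate one variable t_i for the previous value of each purely temporal subformula encountered, and apply the S-equivalence at every 'since'. A Python sketch (ours; the tuple encoding and the `Decomposer` name are our own conventions):

```python
def conj(a, b):
    if a == ('true',): return b
    if b == ('true',): return a
    return ('and', a, b)

def disj(a, b):
    return ('or', a, b)

class Decomposer:
    def __init__(self):
        self.vars = {}                 # ψ  ->  name of the variable t = ⊖ψ

    def var_for(self, psi):
        if psi not in self.vars:
            self.vars[psi] = 't%d' % (len(self.vars) + 1)
        return ('var', self.vars[psi])

    def value(self, f):
        """Current value of f over propositions and temporal variables."""
        op = f[0]
        if op in ('true', 'prop'):
            return f
        if op == 'and':
            return conj(self.value(f[1]), self.value(f[2]))
        if op == 'or':
            return disj(self.value(f[1]), self.value(f[2]))
        if op == 'prev':               # value of ⊖ψ is the variable for ψ
            return self.var_for(f[1])
        if op == 'since':              # f1 S f2 ≡ f2 ∨ (f1 ∧ ⊖(f1 S f2))
            t = self.var_for(f)
            return disj(self.value(f[2]), conj(self.value(f[1]), t))
        raise ValueError(op)

# F = {⊤ S (p S (q ∨ ⊖r))}, as in the example above.
inner = ('since', ('prop', 'p'), ('or', ('prop', 'q'), ('prev', ('prop', 'r'))))
f = ('since', ('true',), inner)
d = Decomposer()
v = d.value(f)
print(d.vars)   # t1 = ⊖(⊤ S (p S (q ∨ ⊖r))), t2 = ⊖(p S (q ∨ ⊖r)), t3 = ⊖r
```

The result v is structurally (q ∨ t3) ∨ (p ∧ t2) ∨ t1, matching the hand derivation.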
3. fltl: A Forward-Looking Approach

As noted in Section 1 above, the two key issues facing approaches to NMRDPs are how to specify the reward functions compactly and how to exploit this compact representation to automatically translate an NMRDP into an equivalent MDP amenable to the chosen solution method. Accordingly, our goals are to provide a reward function specification language and a translation that are adapted to anytime state-based solution methods. After a brief reminder of the relevant features of these methods, we consider these two goals in turn. We describe the syntax and semantics of the language, the notion of formula progression for the language which will form the basis of our translation, the translation itself, its properties, and its embedding into the solution method. We call our approach fltl. We finish the section with a discussion of the features that distinguish fltl from existing approaches.

3.1 Anytime State-Based Solution Methods

The main drawback of traditional dynamic programming algorithms such as policy iteration (Howard, 1960) is that they explicitly enumerate all states that are reachable from s0 in the entire MDP. There has been interest in other state-based solution methods, which may produce incomplete policies, but only enumerate a fraction of the states that policy iteration requires. Let E(π) denote the envelope of policy π, that is, the set of states that are reachable (with non-zero probability) from the initial state s0 under the policy. If π is defined at all s ∈ E(π), we say that the policy is complete, and that it is incomplete otherwise. The set of states in E(π) at which π is undefined is called the fringe of the policy. The fringe states are taken to be absorbing, and their value is heuristic.
A common feature of anytime state-based algorithms is that they perform a forward search, starting from s0 and repeatedly expanding the envelope of the current policy one step forward by adding one or more fringe states. When provided with admissible heuristic values for the fringe states, they eventually converge to the optimal policy without necessarily needing to explore the entire state space. In fact, since planning operators are used to compactly represent the state space, they may not even need to construct more than a small subset of the MDP before returning the optimal policy. When interrupted before convergence, they return a possibly incomplete but often useful policy.

These methods include the envelope expansion algorithm (Dean et al., 1995), which deploys policy iteration on judiciously chosen larger and larger envelopes, using each successive policy to seed the calculation of the next. The more recent LAO* algorithm (Hansen & Zilberstein, 2001), which combines dynamic programming with heuristic search, can be viewed as a clever implementation of a particular case of the envelope expansion algorithm, where fringe states are given admissible heuristic values, where policy iteration is run up to convergence between envelope expansions, and where the clever implementation only runs policy iteration on the states whose optimal value can actually be affected when a new fringe state is added to the envelope. Another example is a backtracking forward search in the space of (possibly incomplete) policies rooted at s0 (Thiébaux et al., 1995), which is performed until interrupted, at which point the best policy found so far is returned.
Real-time dynamic programming (RTDP) (Barto et al., 1995) is another popular anytime algorithm, which is to MDPs what learning real-time A* (Korf, 1990) is to deterministic domains, and which has asymptotic convergence guarantees. The RTDP envelope is made up of sample paths which are visited with a frequency determined by the current greedy policy and the transition probabilities in the domain. RTDP can be run on-line, or off-line for a given number of steps or until interrupted. A variant called LRTDP (Bonet & Geffner, 2003) incorporates mechanisms that focus the search on states whose value has not yet converged, resulting in convergence speedup and finite-time convergence guarantees. The fltl translation we are about to present targets these anytime algorithms, although it could also be used with more traditional methods such as value and policy iteration.

3.2 Language and Semantics

Compactly representing non-Markovian reward functions reduces to compactly representing the behaviours of interest, where by behaviour we mean a set of finite sequences of states (a subset of S*), e.g. the set {⟨s0, s1⟩, ⟨s0, s0, s1⟩, ⟨s0, s0, s0, s1⟩, . . .} in Figure 1. Recall that the reward is issued at the end of any prefix Γ(i) in that set. Once behaviours are compactly represented, it is straightforward to represent non-Markovian reward functions as mappings from behaviours to real numbers – we shall defer looking at this until Section 3.6.

To represent behaviours compactly, we adopt a version of future linear temporal logic (FLTL) (see Emerson, 1990), augmented with a propositional constant '$', intended to be read 'The behaviour we want to reward has just happened' or 'The reward is received now'.
The language $FLTL begins with a set of basic propositions P giving rise to literals:

L ::= P | ¬P | ⊤ | ⊥ | $

where ⊤ and ⊥ stand for 'true' and 'false', respectively. The connectives are classical ∧ and ∨, and the temporal modalities ○ (next) and U (weak until), giving formulae:

F ::= L | F ∧ F | F ∨ F | ○F | F U F

Our 'until' is weak: f1 U f2 means f1 will be true from now on until f2 is, if ever. Unlike the more commonly used strong 'until', this does not imply that f2 will eventually be true. It allows us to define the useful operator □ (always): □f ≡ f U ⊥ (f will always be true from now on). We also adopt the notations ○^k f for k iterations of the ○ modality (f will be true in exactly k steps), ♦^k f for ∨_{i=1}^{k} ○^i f (f will be true within the next k steps), and □^k f for ∧_{i=1}^{k} ○^i f (f will be true throughout the next k steps).

Although negation officially occurs only in literals, i.e., the formulae are in negation normal form (NNF), we allow ourselves to write formulae involving it in the usual way, provided that they have an equivalent in NNF. Not every formula has such an equivalent, because there is no such literal as ¬$ and because eventualities ('f will be true some time') are not expressible. These restrictions are deliberate. If we were to use our notation and logic to theorise about the allocation of rewards, we would indeed need the means to say when rewards are not received, or to express features such as liveness ('always, there will be a reward eventually'); but in fact we are using them only as a mechanism for ensuring that rewards are given where they should be, and for this restricted purpose eventualities and the negated dollar are not needed. In fact, including them would create technical difficulties in relating formulae to the behaviours they represent.
The semantics of this language is similar to that of FLTL, with an important difference: because the interpretation of the constant $ depends on the behaviour B we want to reward (whatever that is), the modelling relation |= must be indexed by B. We therefore write (Γ, i) |=_B f to mean that formula f holds at the i-th stage of an arbitrary sequence Γ ∈ S^ω, relative to behaviour B. Defining |=_B is the first step in our description of the semantics:

• (Γ, i) |=_B $ iff Γ(i) ∈ B
• (Γ, i) |=_B ⊤
• (Γ, i) ⊭_B ⊥
• (Γ, i) |=_B p, for p ∈ P, iff p ∈ Γi
• (Γ, i) |=_B ¬p, for p ∈ P, iff p ∉ Γi
• (Γ, i) |=_B f1 ∧ f2 iff (Γ, i) |=_B f1 and (Γ, i) |=_B f2
• (Γ, i) |=_B f1 ∨ f2 iff (Γ, i) |=_B f1 or (Γ, i) |=_B f2
• (Γ, i) |=_B ○f iff (Γ, i + 1) |=_B f
• (Γ, i) |=_B f1 U f2 iff ∀k ≥ i, if (∀j, i ≤ j ≤ k, (Γ, j) ⊭_B f2) then (Γ, k) |=_B f1

Note that except for the subscript B and for the first rule, this is just the standard FLTL semantics, and that therefore $-free formulae keep their FLTL meaning. As with FLTL, we say Γ |=_B f iff (Γ, 0) |=_B f, and |=_B f iff Γ |=_B f for all Γ ∈ S^ω.

The modelling relation |=_B can be seen as specifying when a formula holds, on which reading it takes B as input. Our next and final step is to use the |=_B relation to define, for a formula f, the behaviour B_f that it represents, and for this we must rather assume that f holds, and then solve for B. For instance, let f be □(p → $), i.e., we get rewarded every time p is true. We would like B_f to be the set of all finite sequences ending with a state containing p. For an arbitrary f, we take B_f to be the set of prefixes that have to be rewarded if f is to hold in all sequences:

Definition 2  B_f ≡ ∩{B | |=_B f}

To understand Definition 2, recall that B contains prefixes at the end of which we get a reward and $ evaluates to true.
Since f is supposed to describe the way rewards will be received in an arbitrary sequence, we are interested in behaviours B which make $ true in such a way as to make f hold without imposing constraints on the evolution of the world. However, there may be many behaviours with this property, so we take their intersection,³ ensuring that B_f will only reward a prefix if it has to, because that prefix is in every behaviour satisfying f. In all but pathological cases (see Section 3.4), this makes B_f coincide with the (set-inclusion) minimal behaviour B such that |=_B f. The reason for this 'stingy' semantics, making rewards minimal, is that f does not actually say that rewards are allocated to more prefixes than are required for its truth. For instance, □(p → $) says only that a reward is given every time p is true, even though a more generous distribution of rewards would be consistent with it.

3.3 Examples

It is intuitively clear that many behaviours can be specified by means of $FLTL formulae. While there is no simple way in general to translate between past and future tense expressions,⁴ all of the examples used to illustrate PLTL in Section 2.2 above are expressible naturally in $FLTL, as follows.

The classical goal formula g saying that a goal p is rewarded whenever it happens is easily expressed: □(p → $). As already noted, B_g is the set of finite sequences of states such that p holds in the last state. If we only care that p is achieved once and get rewarded at each state from then on, we write □(p → □$). The behaviour that this formula represents is the set of finite state sequences having at least one state in which p holds. By contrast, the formula ¬p U (p ∧ $) stipulates only that the first occurrence of p is rewarded (i.e. it specifies the behaviour in Figure 1).
To reward the occurrence of p at most once every k steps, we write □((○^{k+1} p ∧ □^k ¬p) → ○^{k+1} $). For response formulae, where the achievement of p is triggered by the command c, we write □(c → □(p → $)) to reward every state in which p is true following the first issue of the command. To reward only the first occurrence of p after each command, we write □(c → ○(¬p U (p ∧ $))). As for bounded variants, for which we only reward goal achievement within k steps of the trigger command, we write for example □(c → □^k(p → $)) to reward all such states in which p holds.

It is also worth noting how to express simple behaviours involving past tense operators. To stipulate a reward if p has always been true, we write $ U ¬p. To say that we are rewarded if p has been true since q was, we write □(q → ($ U ¬p)). Finally, we often find it useful to reward the holding of p until the occurrence of q. The neatest expression for this is ¬q U ((¬p ∧ ¬q) ∨ (q ∧ $)).

3.4 Reward Normality

$FLTL is therefore quite expressive. Unfortunately, it is rather too expressive, in that it contains formulae which describe "unnatural" allocations of rewards. For instance, they may make rewards depend on future behaviours rather than on the past, or they may

3. If there is no B such that |=_B f, which is the case for any $-free f which is not a logical theorem, then B_f is ∩∅ – i.e. S*, following normal set-theoretic conventions. This limiting case does no harm, since $-free formulae do not describe the attribution of rewards.

4. It is an open question whether the set of representable behaviours is the same for $FLTL as for PLTL, that is, star-free regular languages. Even if the behaviours were the same, there is little hope that a practical translation from one to the other exists.
leave open a choice as to which of several behaviours is to be rewarded.⁵ An example of dependence on the future is p → ○$, which stipulates a reward now if p is going to hold next. We call such formulae reward-unstable. What reward-stability of f amounts to is that whether a particular prefix needs to be rewarded in order to make f true does not depend on the future of the sequence. An example of an open choice of which behaviour to reward is □(p → $) ∨ □(¬p → $), which says we should either reward all achievements of the goal p or reward achievements of ¬p, but does not determine which. We call such formulae reward-indeterminate. What reward-determinacy of f amounts to is that the set of behaviours modelling f, i.e. {B | |=_B f}, has a unique minimum. If it does not, B_f is insufficient (too small) to make f true.

In investigating $FLTL (Slaney, 2005), we examine the notions of reward-stability and reward-determinacy in depth, and motivate the claim that formulae that are both reward-stable and reward-determinate – we call them reward-normal – are precisely those that capture the notion of "no funny business". This is the intuition that we ask the reader to note, as it will be needed in the rest of the paper. Just for reference then, we define:

Definition 3  f is reward-normal iff for every Γ ∈ S^ω and every B ⊆ S*, Γ |=_B f iff for every i, if Γ(i) ∈ B_f then Γ(i) ∈ B.

The property of reward-normality is decidable (Slaney, 2005). In Appendix A we give some simple syntactic constructions guaranteed to result in reward-normal formulae. While reward-abnormal formulae may be interesting, for present purposes we restrict attention to reward-normal ones. Indeed, we stipulate as part of our method that only reward-normal formulae should be used to represent behaviours.
Naturally, all formulae in Section 3.3 are normal.

3.5 $FLTL Formula Progression

Having defined a language to represent behaviours to be rewarded, we now turn to the problem of computing, given a reward formula, a minimum allocation of rewards to states actually encountered in an execution sequence, in such a way as to satisfy the formula. Because we ultimately wish to use anytime solution methods which generate state sequences incrementally via forward search, this computation is best done on the fly, while the sequence is being generated. We therefore devise an incremental algorithm based on a model-checking technique normally used to check whether a state sequence is a model of an FLTL formula (Bacchus & Kabanza, 1998). This technique is known as formula progression because it 'progresses' or 'pushes' the formula through the sequence.

Our progression technique is shown in Algorithm 1. In essence, it computes the modelling relation |=_B given in Section 3.2. However, unlike the definition of |=_B, it is designed to be useful when states in the sequence become available one at a time, in that it defers the evaluation of the part of the formula that refers to the future to the point where the next state becomes available. Let s be a state, say Γi, the last state of the sequence prefix Γ(i) that has been generated so far, and let b be a boolean, true iff Γ(i) is in the behaviour B to be rewarded. Let the $FLTL formula f describe the allocation of rewards over all possible futures.

5. These difficulties are inherent in the use of linear-time formalisms in contexts where the principle of directionality must be enforced. They are shared for instance by formalisms developed for reasoning about actions, such as the Event Calculus and LTL action theories (see e.g., Calvanese, De Giacomo, & Vardi, 2002).
Then the progression of f through s given b, written Prog(b, s, f), is a new formula which will describe the allocation of rewards over all possible futures of the next state, given that we have just passed through s. Crucially, the function Prog is Markovian, depending only on the current state and the single boolean value b. Note that Prog is computable in time linear in the length of f, and that for $-free formulae, it collapses to FLTL formula progression (Bacchus & Kabanza, 1998), regardless of the value of b. We assume that Prog incorporates the usual simplification for sentential constants ⊥ and ⊤: f ∧ ⊥ simplifies to ⊥, f ∧ ⊤ simplifies to f, etc.

Algorithm 1: $FLTL Progression

    Prog(true, s, $)     = ⊤
    Prog(false, s, $)    = ⊥
    Prog(b, s, ⊤)        = ⊤
    Prog(b, s, ⊥)        = ⊥
    Prog(b, s, p)        = ⊤ if p ∈ s, ⊥ otherwise
    Prog(b, s, ¬p)       = ⊤ if p ∉ s, ⊥ otherwise
    Prog(b, s, f1 ∧ f2)  = Prog(b, s, f1) ∧ Prog(b, s, f2)
    Prog(b, s, f1 ∨ f2)  = Prog(b, s, f1) ∨ Prog(b, s, f2)
    Prog(b, s, ◯f)       = f
    Prog(b, s, f1 U f2)  = Prog(b, s, f2) ∨ (Prog(b, s, f1) ∧ f1 U f2)

    Rew(s, f)   = true iff Prog(false, s, f) = ⊥
    $Prog(s, f) = Prog(Rew(s, f), s, f)

The fundamental property of Prog is the following. Where b ⇔ (Γ(i) ∈ B):

Property 1: (Γ, i) |=_B f iff (Γ, i+1) |=_B Prog(b, Γ_i, f).

Proof: See Appendix B.

Like |=_B, the function Prog seems to require B (or at least b) as input, but of course when progression is applied in practice we only have f and one new state of Γ at a time, and what we really want to do is compute the appropriate B, namely that represented by f. So, similarly as in Section 3.2, we now turn to the second step, which is to use Prog to decide on the fly whether a newly generated sequence prefix Γ(i) is in B_f and so should be allocated a reward.
This is the purpose of the functions $Prog and Rew, also given in Algorithm 1. Given Γ and f, the function $Prog defines an infinite sequence of formulae ⟨f_0, f_1, ...⟩ in the obvious way:

    f_0     = f
    f_{i+1} = $Prog(Γ_i, f_i)

To decide whether a prefix Γ(i) of Γ is to be rewarded, Rew first tries progressing the formula f_i through Γ_i with the boolean flag set to 'false'. If that gives a consistent result, we need not reward the prefix and we continue without rewarding Γ(i), but if the result is ⊥ then we know that Γ(i) must be rewarded in order for Γ to satisfy f. In that case, to obtain f_{i+1} we must progress f_i through Γ_i again, this time with the boolean flag set to the value 'true'. To sum up, the behaviour corresponding to f is {Γ(i) | Rew(Γ_i, f_i)}.

To illustrate the behaviour of $FLTL progression, consider the formula f = ¬p U (p ∧ $), stating that a reward will be received the first time p is true. Let s be a state in which p holds; then Prog(false, s, f) = ⊥ ∨ (⊥ ∧ ¬p U (p ∧ $)) ≡ ⊥. Therefore, since the formula has progressed to ⊥, Rew(s, f) is true and a reward is received. $Prog(s, f) = Prog(true, s, f) = ⊤ ∨ (⊥ ∧ ¬p U (p ∧ $)) ≡ ⊤, so the reward formula fades away and will not affect subsequent progression steps. If, on the other hand, p is false in s, then Prog(false, s, f) = ⊥ ∨ (⊤ ∧ ¬p U (p ∧ $)) ≡ ¬p U (p ∧ $). Therefore, since the formula has not progressed to ⊥, Rew(s, f) is false and no reward is received. $Prog(s, f) = Prog(false, s, f) = ¬p U (p ∧ $), so the reward formula persists as is for subsequent progression steps.

The following theorem states that under weak assumptions, rewards are correctly allocated by progression:

Theorem 1: Let f be reward-normal, and let ⟨f_0, f_1, ...⟩ be the result of progressing it through the successive states of a sequence Γ using the function $Prog. Then, provided no f_i is ⊥, for all i: Rew(Γ_i, f_i) iff Γ(i) ∈ B_f.

Proof: See Appendix B.

The premise of the theorem is that f never progresses to ⊥. Indeed, if f_i = ⊥ for some i, it means that even rewarding Γ(i) does not suffice to make f true, so something must have gone wrong: at some earlier stage, the boolean Rew was made false where it should have been made true. The usual explanation is that the original f was not reward-normal. For instance, ◯p → $, which is reward-unstable, progresses to ⊥ in the next state if p is true there: regardless of Γ_0, f_0 = ◯p → $ ≡ ◯¬p ∨ $, Rew(Γ_0, f_0) = false, and f_1 = ¬p, so if p ∈ Γ_1 then f_2 = ⊥. However, other (admittedly bizarre) possibilities exist: for example, although ◯p → $ is reward-unstable, its substitution instance ◯⊤ → $, which also progresses to ⊥ in a few steps, is logically equivalent to $ and is reward-normal.

If the progression method were to deliver the correct minimal behaviour in all cases (even in all reward-normal cases) it would have to backtrack on the choice of values for the boolean flags. In the interest of efficiency, we choose not to allow backtracking. Instead, our algorithm raises an exception whenever a reward formula progresses to ⊥, and informs the user of the sequence which caused the problem. The onus is thus placed on the domain modeller to select sensible reward formulae so as to avoid possible progression to ⊥. It should be noted that in the worst case, detecting reward-normality cannot be easier than the decision problem for $FLTL, so it is not to be expected that there will be a simple syntactic criterion for reward-normality.
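Algorithm 1 translates almost line-for-line into code. The sketch below is our own illustration (the tuple encoding of formulae and all function names are ours, not the paper's); it implements Prog, Rew and $Prog with the ⊤/⊥ simplifications assumed above, and reproduces the ¬p U (p ∧ $) example:

```python
# A minimal sketch of $FLTL progression (Algorithm 1). Formulae are tuples:
# ('top',), ('bot',), ('dollar',), ('atom', p), ('not', p),
# ('and', f1, f2), ('or', f1, f2), ('next', f), ('until', f1, f2).
# The tuple encoding and helper names are ours, not the paper's.

TOP, BOT, DOLLAR = ('top',), ('bot',), ('dollar',)

def conj(f1, f2):
    # simplification for sentential constants: f ∧ ⊥ = ⊥, f ∧ ⊤ = f
    if BOT in (f1, f2): return BOT
    if f1 == TOP: return f2
    if f2 == TOP: return f1
    return ('and', f1, f2)

def disj(f1, f2):
    if TOP in (f1, f2): return TOP
    if f1 == BOT: return f2
    if f2 == BOT: return f1
    return ('or', f1, f2)

def prog(b, s, f):
    """Prog(b, s, f): push f through state s (a set of atoms), flag b."""
    tag = f[0]
    if tag == 'dollar': return TOP if b else BOT
    if tag in ('top', 'bot'): return f
    if tag == 'atom': return TOP if f[1] in s else BOT
    if tag == 'not': return TOP if f[1] not in s else BOT
    if tag == 'and': return conj(prog(b, s, f[1]), prog(b, s, f[2]))
    if tag == 'or': return disj(prog(b, s, f[1]), prog(b, s, f[2]))
    if tag == 'next': return f[1]
    if tag == 'until':  # Prog(b,s,f2) ∨ (Prog(b,s,f1) ∧ f1 U f2)
        return disj(prog(b, s, f[2]), conj(prog(b, s, f[1]), f))
    raise ValueError(f)

def rew(s, f):
    """Rew(s, f) = true iff Prog(false, s, f) = ⊥."""
    return prog(False, s, f) == BOT

def dollar_prog(s, f):
    """$Prog(s, f) = Prog(Rew(s, f), s, f)."""
    return prog(rew(s, f), s, f)

# f = ¬p U (p ∧ $): reward the first time p holds.
f = ('until', ('not', 'p'), ('and', ('atom', 'p'), DOLLAR))
print(rew({'p'}, f), dollar_prog({'p'}, f))       # True ('top',): reward, formula fades
print(rew(set(), f), dollar_prog(set(), f) == f)  # False True: no reward, persists
```

Progressing a state sequence is then just a matter of iterating dollar_prog, collecting a reward whenever rew returns true.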
In practice, however, commonsense precautions such as avoiding making rewards depend explicitly on future tense expressions suffice to keep things normal in all routine cases. For a generous class of syntactically recognisable reward-normal formulae, see Appendix A.

3.6 Reward Functions

With the language defined so far, we are able to compactly represent behaviours. The extension to a non-Markovian reward function is straightforward. We represent such a function by a set⁶ φ ⊆ $FLTL × ℝ of formulae associated with real-valued rewards. We call φ a reward function specification. Where formula f is associated with reward r in φ, we write '(f : r) ∈ φ'. The rewards are assumed to be independent and additive, so that the reward function R_φ represented by φ is given by:

Definition 4: R_φ(Γ(i)) = Σ_{(f:r)∈φ} {r | Γ(i) ∈ B_f}

E.g., if φ is {¬p U (p ∧ $) : 5.2, □(q → $) : 7.3}, we get a reward of 5.2 the first time that p holds, a reward of 7.3 each time that q holds, a reward of 12.5 when both conditions are met, and 0 otherwise.

Again, we can progress a reward function specification φ to compute the reward at all stages i of Γ. As before, progression defines a sequence ⟨φ_0, φ_1, . . .
⟩ of reward function specifications, with φ_{i+1} = RProg(Γ_i, φ_i), where RProg is the function that applies $Prog to all formulae in a reward function specification:

    RProg(s, φ) = {($Prog(s, f) : r) | (f : r) ∈ φ}

Then the total reward received at stage i is simply the sum of the real-valued rewards granted by the progression function to the behaviours represented by the formulae in φ_i:

    Σ_{(f:r)∈φ_i} {r | Rew(Γ_i, f)}

By proceeding that way, we get the expected analog of Theorem 1, which states that progression correctly computes non-Markovian reward functions:

Theorem 2: Let φ be a reward-normal⁷ reward function specification, and let ⟨φ_0, φ_1, ...⟩ be the result of progressing it through the successive states of a sequence Γ using the function RProg. Then, provided (⊥ : r) ∉ φ_i for any i,

    Σ_{(f:r)∈φ_i} {r | Rew(Γ_i, f)} = R_φ(Γ(i)).

Proof: Immediate from Theorem 1.

3.7 Translation into MDP

We now exploit the compact representation of a non-Markovian reward function as a reward function specification to translate an NMRDP into an equivalent MDP amenable to state-based anytime solution methods. Recall from Section 2 that each e-state in the MDP is labelled by a state of the NMRDP and by history information sufficient to determine the immediate reward. In the case of a compact representation as a reward function specification φ_0, this additional information can be summarised by the progression of φ_0 through the sequence of states passed through. So an e-state will be of the form ⟨s, φ⟩, where s ∈ S is a state, and φ ⊆ $FLTL × ℝ is a reward function specification (obtained by progression). Two e-states ⟨s, φ⟩ and ⟨t, ψ⟩ are equal if s = t, the immediate rewards are the same, and the results of progressing φ and ψ through s are semantically equivalent.⁸

6. Strictly speaking, a multiset, but for convenience we represent it as a set, with the rewards for multiple occurrences of the same formula in the multiset summed.
7. We extend the definition of reward-normality to reward function specifications in the obvious way, by requiring that all reward formulae involved be reward-normal.

Definition 5: Let D = ⟨S, s_0, A, Pr, R⟩ be an NMRDP, and φ_0 be a reward function specification representing R (i.e., R_{φ_0} = R, see Definition 4). We translate D into the MDP D' = ⟨S', s'_0, A', Pr', R'⟩ defined as follows:

1. S' ⊆ S × 2^{$FLTL × ℝ}
2. s'_0 = ⟨s_0, φ_0⟩
3. A'(⟨s, φ⟩) = A(s)
4. If a ∈ A'(⟨s, φ⟩), then Pr'(⟨s, φ⟩, a, ⟨s', φ'⟩) = Pr(s, a, s') if φ' = RProg(s, φ), and 0 otherwise. If a ∉ A'(⟨s, φ⟩), then Pr'(⟨s, φ⟩, a, •) is undefined.
5. R'(⟨s, φ⟩) = Σ_{(f:r)∈φ} {r | Rew(s, f)}
6. For all s' ∈ S', s' is reachable under A' from s'_0.

Item 1 says that the e-states are labelled by a state and a reward function specification. Item 2 says that the initial e-state is labelled with the initial state and with the original reward function specification. Item 3 says that an action is applicable in an e-state if it is applicable in the state labelling it. Item 4 explains how successor e-states and their probabilities are computed. Given an action a applicable in an e-state ⟨s, φ⟩, each successor e-state will be labelled by a successor state s' of s via a in the NMRDP and by the progression of φ through s. The probability of that e-state is Pr(s, a, s') as in the NMRDP. Note that the cost of computing Pr' is linear in that of computing Pr and in the sum of the lengths of the formulae in φ.
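To illustrate items 4 and 5, the reward-function-specification machinery (RProg and the reward sum of Definition 5) can be sketched in a few lines. This is our own self-contained toy implementation (the tuple encoding and names are ours): it reuses the progression rules of Algorithm 1, with a case for □ added via the standard unfolding □f ≡ f ∧ ◯□f, and runs the example specification {¬p U (p ∧ $) : 5.2, □(q → $) : 7.3} of Section 3.6:

```python
# Sketch of reward function specification progression (RProg) and the
# reward R' of Definition 5, item 5. Formulae are tuples as in Algorithm 1;
# '□f' is progressed by Prog(b, s, □f) = Prog(b, s, f) ∧ □f. Encoding is ours.

TOP, BOT, DOLLAR = ('top',), ('bot',), ('dollar',)

def conj(a, b):
    if BOT in (a, b): return BOT
    return b if a == TOP else (a if b == TOP else ('and', a, b))

def disj(a, b):
    if TOP in (a, b): return TOP
    return b if a == BOT else (a if b == BOT else ('or', a, b))

def prog(b, s, f):
    tag = f[0]
    if tag == 'dollar': return TOP if b else BOT
    if tag in ('top', 'bot'): return f
    if tag == 'atom': return TOP if f[1] in s else BOT
    if tag == 'not': return TOP if f[1] not in s else BOT
    if tag == 'and': return conj(prog(b, s, f[1]), prog(b, s, f[2]))
    if tag == 'or': return disj(prog(b, s, f[1]), prog(b, s, f[2]))
    if tag == 'until': return disj(prog(b, s, f[2]), conj(prog(b, s, f[1]), f))
    if tag == 'always': return conj(prog(b, s, f[1]), f)
    raise ValueError(f)

def rew(s, f):                      # Rew(s, f)
    return prog(False, s, f) == BOT

def rprog(s, phi):                  # RProg(s, φ) = {($Prog(s, f) : r) | (f : r) ∈ φ}
    return [(prog(rew(s, f), s, f), r) for f, r in phi]

def reward(s, phi):                 # R'(⟨s, φ⟩) = Σ {r | Rew(s, f)}
    return sum(r for f, r in phi if rew(s, f))

# φ = {¬p U (p ∧ $) : 5.2, □(q → $) : 7.3}
phi = [(('until', ('not', 'p'), ('and', ('atom', 'p'), DOLLAR)), 5.2),
       (('always', ('or', ('not', 'q'), DOLLAR)), 7.3)]
print(reward({'p', 'q'}, phi))      # both conditions met: 5.2 + 7.3
phi1 = rprog({'p', 'q'}, phi)
print(reward({'q'}, phi1))          # the one-off 5.2 reward has faded; 7.3 remains
```

Starting from φ_0 and repeatedly applying rprog while summing reward reproduces Definition 4's allocation along any state sequence.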
Item 5 has been motivated before (see Section 3.6). Finally, since items 1–5 leave open the choice of many MDPs differing only in the unreachable states they contain, item 6 excludes all such irrelevant extensions. It is easy to show that this translation leads to an equivalent MDP, as defined in Definition 1. Obviously, the function τ required for Definition 1 is given by τ(⟨s, φ⟩) = s, and then the proof is a matter of checking conditions.

In our practical implementation, the labelling is one step ahead of that in the definition: we label the initial e-state with RProg(s_0, φ_0) and compute the current reward and the current reward specification label by progression of predecessor reward specifications through the current state rather than through the predecessor states. As will be apparent below, this has the potential to reduce the number of states in the generated MDP.

8. Care is needed over the notion of 'semantic equivalence'. Because rewards are additive, determining equivalence may involve arithmetic as well as theorem proving. For example, the reward function specification {(p → $ : 3), (q → $ : 2)} is equivalent to {((p ∧ q) → $ : 5), ((p ∧ ¬q) → $ : 3), ((¬p ∧ q) → $ : 2)} although there is no one-one correspondence between the formulae in the two sets.

Figure 7 shows the equivalent MDP produced for the $FLTL version of our NMRDP example in Figure 3. Recall that for this example, the PLTL reward formula was q ∧ ⊖⊖p. In $FLTL, the allocation of rewards is described by □((p ∧ ◯◯q) → ◯◯$). The figure also
shows the relevant formulae labelling the e-states, obtained by progression of this reward formula. Note that without progressing one step ahead, there would be 3 e-states with state {p} on the left-hand side, labelled with {f1}, {f1, f2}, and {f1, f2, f3}, respectively.

[Figure 7: Equivalent MDP produced by fltl. Each e-state is labelled by an NMRDP state, a subset of the formulae below, and its immediate reward (all rewards are 0 except one e-state labelled with state {p, q} and f1, f2, f3, which has reward 1); edges are the actions a and b with their transition probabilities. The formulae labelling the e-states are: f1: □((p ∧ ◯◯q) → ◯◯$); f2: ◯(q → $); f3: q → $.]

3.8 Blind Minimality

The size of the MDP obtained, i.e. the number of e-states it contains, is a key issue for us, as it has to be amenable to state-based solution methods. Ideally, we would like the MDP to be of minimal size. However, we do not know of a method building the minimal equivalent MDP incrementally, adding parts as required by the solution method. And since in the worst case even the minimal equivalent MDP can be larger than the NMRDP by a factor exponential in the length of the reward formulae (Bacchus et al., 1996), constructing it entirely would nullify the interest of anytime solution methods. However, as we now explain, Definition 5 leads to an equivalent MDP exhibiting a relaxed notion of minimality, which is amenable to incremental construction.

By inspection, we may observe that wherever an e-state ⟨s, φ⟩ has a successor ⟨s', φ'⟩ via action a, this means that in order to succeed in rewarding the behaviours described in φ by means of execution sequences that start by going from s to s' via a, it is necessary that the future starting with s' succeeds in rewarding the behaviours described in φ'.
If ⟨s, φ⟩ is in the minimal equivalent MDP, and if there really are such execution sequences succeeding in rewarding the behaviours described in φ, then ⟨s', φ'⟩ must also be in the minimal MDP. That is, construction by progression can only introduce e-states which are a priori needed. Note that an e-state that is a priori needed may not really be needed: there may in fact be no execution sequence using the available actions that exhibits a given behaviour. For instance, consider the response formula □(p → (◯^k q → ◯^k $)), i.e., every time trigger p is true, we will be rewarded k steps later provided q is true then. Obviously, whether p is true at some stage affects the way future states should be rewarded. However, if the transition relation happens to have the property that k steps from a state satisfying p, no state satisfying q can be reached, then a posteriori p is irrelevant, and there was no need to label e-states differently according to whether p was true or not – observe an occurrence of this in the example in Figure 7, and how this leads fltl to produce an extra state at the bottom left of the figure. To detect such cases, we would have to look perhaps quite deep into feasible futures, which we cannot do while constructing the e-states on the fly. Hence the relaxed notion, which we call blind minimality, does not always coincide with absolute minimality.

We now formalise the difference between true and blind minimality. For this purpose, it is convenient to define some functions ρ and μ mapping e-states e to functions from S* to ℝ, intuitively assigning rewards to sequences in the NMRDP starting from τ(e). Recall from Definition 1 that τ maps each e-state of the MDP to the underlying NMRDP state.

Definition 6: Let D be an NMRDP. Let S' be the set of e-states in an equivalent MDP D' for D.
Let e be any reachable e-state in S'. Let Γ'(i) be a sequence of e-states in D̃'(s'_0) such that Γ'_i = e. Let Γ(i) be the corresponding sequence in D̃(s_0) obtained under τ, in the sense that, for each j ≤ i, Γ_j = τ(Γ'_j). Then for any Δ ∈ S*, we define:

    ρ(e): Δ ↦ R(Γ(i−1); Δ) if Δ_0 = Γ_i, and 0 otherwise
    μ(e): Δ ↦ R(Γ(i−1); Δ) if Δ ∈ D̃(Γ_i), and 0 otherwise

For any unreachable e-state e, we define both ρ(e)(Δ) and μ(e)(Δ) to be 0 for all Δ.

Note carefully the difference between ρ and μ. The former describes the rewards assigned to all continuations of a given state sequence, while the latter confines rewards to feasible continuations. Note also that ρ and μ are well-defined despite the indeterminacy in the choice of Γ'(i), since by clause 4 of Definition 1, all such choices lead to the same values for R.

Theorem 3: Let S' be the set of e-states in an equivalent MDP D' for D = ⟨S, s_0, A, Pr, R⟩. D' is minimal iff every e-state in S' is reachable and S' contains no two distinct e-states s'_1 and s'_2 with τ(s'_1) = τ(s'_2) and μ(s'_1) = μ(s'_2).

Proof: See Appendix B.

Blind minimality is similar, except that, since there is no looking ahead, no distinction can be drawn between feasible trajectories and others in the future of s:

Definition 7: Let S' be the set of e-states in an equivalent MDP D' for D = ⟨S, s_0, A, Pr, R⟩. D' is blind minimal iff every e-state in S' is reachable and S' contains no two distinct e-states s'_1 and s'_2 with τ(s'_1) = τ(s'_2) and ρ(s'_1) = ρ(s'_2).

Theorem 4: Let D' be the translation of D as in Definition 5. D' is a blind minimal equivalent MDP for D.

Proof: See Appendix B.
The size difference between the blind-minimal and the minimal MDP will depend on the precise interaction between rewards and dynamics for the problem at hand, making theoretical analyses difficult and experimental results rather anecdotal. However, our experiments in Sections 5 and 6 will show that from a computation time point of view, it is often preferable to work with the blind-minimal MDP than to invest in the overhead of computing the truly minimal one.

Finally, recall that syntactically different but semantically equivalent reward function specifications define the same e-state. Therefore, neither minimality nor blind minimality can be achieved in general without an equivalence check at least as complex as theorem proving for LTL. In practical implementations, we avoid theorem proving in favour of embedding (fast) formula simplification in our progression and regression algorithms. This means that in principle we only approximate minimality and blind minimality, but this appears to be enough for practical purposes.

3.9 Embedded Solution/Construction

Blind minimality is essentially the best achievable with anytime state-based solution methods, which typically extend their envelope one step forward without looking deeper into the future. Our translation into a blind-minimal MDP can be trivially embedded in any of these solution methods. This results in an 'on-line construction' of the MDP: the method entirely drives the construction of those parts of the MDP which it feels the need to explore, and leaves the others implicit. If time is short, a suboptimal or even incomplete policy may be returned, but only a fraction of the state and expanded state spaces might be constructed.
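The on-line construction regime can be pictured as a search loop that expands e-states only when the solution method asks for them. The following schematic sketch is our own illustration, not nmrdpp code; it stubs the RProg-based successor computation of Definition 5 with a toy successor function:

```python
# Schematic sketch of embedded construction: the solution method drives
# expansion, and unexplored parts of the MDP stay implicit. 'successors'
# maps an e-state to its successor e-states; here we stub it with a toy
# graph in place of the RProg-based expansion of Definition 5.

from collections import deque

def expand_on_demand(s0, successors, budget):
    """Construct only the part of the MDP reached within 'budget' expansions."""
    graph, frontier, seen = {}, deque([s0]), {s0}
    while frontier and budget > 0:
        e = frontier.popleft()        # the solution method picks the next e-state
        graph[e] = successors(e)      # expand it: its successors become known
        budget -= 1
        for e2 in graph[e]:
            if e2 not in seen:
                seen.add(e2)
                frontier.append(e2)
    return graph                      # e-states seen but not expanded stay on the fringe

# Toy e-state space: with a budget of 2, only part of it is ever built.
toy = {'A': ['B', 'C'], 'B': ['D'], 'C': ['A'], 'D': []}
partial = expand_on_demand('A', toy.__getitem__, budget=2)
print(sorted(partial))   # ['A', 'B']: 'C' and 'D' remain implicit
```

With a large enough budget the loop degenerates to complete expansion; with a small one, only the fragment the method actually visits is materialised.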
Note that the solution method should raise an exception as soon as one of the reward formulae progresses to ⊥, i.e., as soon as an expanded state ⟨s, φ⟩ is built such that (⊥ : r) ∈ φ, since this acts as a detector of unsuitable reward function specifications.

To the extent enabled by blind minimality, our approach allows for a dynamic analysis of the reward formulae, much as in pltlstr (Bacchus et al., 1997). Indeed, only the execution sequences feasible under a particular policy actually explored by the solution method contribute to the analysis of rewards for that policy. Specifically, the reward formulae generated by progression for a given policy are determined by the prefixes of the execution sequences feasible under this policy. This dynamic analysis is particularly useful, since relevance of reward formulae to particular policies (e.g. the optimal policy) cannot be detected a priori.

The forward-chaining planner TLPlan (Bacchus & Kabanza, 2000) introduced the idea of using FLTL to specify domain-specific search control knowledge and formula progression to prune unpromising sequential plans (plans violating this knowledge) from deterministic search spaces. This has been shown to provide enormous time gains, leading TLPlan to win the 2002 planning competition hand-tailored track. Because our approach is based on progression, it provides an elegant way to exploit search control knowledge, yet in the context of decision-theoretic planning. Here this results in a dramatic reduction of the fraction of the MDP to be constructed and explored, and therefore in substantially better policies by the deadline. We achieve this as follows. We specify, via a $-free formula c_0, properties which we know must be verified by paths feasible under promising policies.
Then we simply progress c_0 alongside the reward function specification, making e-states triples ⟨s, φ, c⟩ where c is a $-free formula obtained by progression. To prevent the solution method from applying an action that leads to the control knowledge being violated, the action applicability condition (item 3 in Definition 5) becomes: a ∈ A'(⟨s, φ, c⟩) iff a ∈ A(s) and c ≠ ⊥ (the other changes are straightforward). For instance, the effect of the control knowledge formula □(p → ◯q) is to remove from consideration any feasible path in which p is not followed by q. This is detected as soon as violation occurs, when the formula progresses to ⊥. Although this paper focuses on non-Markovian rewards rather than dynamics, it should be noted that $-free formulae can also be used to express non-Markovian constraints on the system's dynamics, which can be incorporated in our approach exactly as we do for the control knowledge.

3.10 Discussion

Existing approaches (Bacchus et al., 1996, 1997) advocate the use of PLTL over a finite past to specify non-Markovian rewards. In the PLTL style of specification, we describe the past conditions under which we get rewarded now, while with $FLTL we describe the conditions on the present and future under which future states will be rewarded. While the behaviours and rewards may be the same in each scheme, the naturalness of thinking in one style or the other depends on the case. Letting the kids have a strawberry dessert because they have been good all day fits naturally into a past-oriented account of rewards, whereas promising that they may watch a movie if they tidy their room (indeed, making sense of the whole notion of promising) goes more naturally with $FLTL.
One advantage of the PLTL formulation is that it trivially enforces the principle that present rewards do not depend on future states. In $FLTL, this responsibility is placed on the domain modeller. The best we can offer is an exception mechanism to recognise mistakes when their effects appear, or syntactic restrictions. On the other hand, the greater expressive power of $FLTL opens the possibility of considering a richer class of decision processes, e.g. with uncertainty as to which rewards are received (the dessert or the movie) and when (some time next week, before it rains).

At any rate, we believe that $FLTL is better suited than PLTL to solving NMRDPs using anytime state-based solution methods. While the pltlsim translation could be easily embedded in such a solution method, it loses the structure of the original formulae when considering subformulae individually. Consequently, the expanded state space easily becomes exponentially bigger than the blind-minimal one. This is problematic with the solution methods we consider, because size severely affects their performance in solution quality. The pre-processing phase of pltlmin uses PLTL formula regression to find sets of subformulae as potential labels for possible predecessor states, so that the subsequent generation phase builds an MDP representing all and only the histories which make a difference to the way actually feasible execution sequences should be rewarded. Not only does this recover the structure of the original formula, but in the best case, the MDP produced is exponentially smaller than the blind-minimal one. However, the prohibitive cost of the pre-processing phase makes it unsuitable for anytime solution methods.
We do not consider that any method based on PLTL and regression will achieve a meaningful relaxed notion of minimality without a costly pre-processing phase. fltl is an approach based on $FLTL and progression which does precisely that, letting the solution method resolve the tradeoff between quality and cost in a principled way, intermediate between the two extreme suggestions above.

The structured representation and solution methods targeted by Bacchus et al. (1997) differ from the anytime state-based solution methods fltl primarily aims at, in particular in that they do not require explicit state enumeration at all. Here, non-minimality is not as problematic as with the state-based approaches. In virtue of the size of the MDP produced, the pltlstr translation is, as pltlsim, clearly unsuitable for anytime state-based methods.⁹ In another sense, too, fltl represents a middle way, combining the advantages conferred by state-based and structured approaches, e.g. by pltlmin on one side, and pltlstr on the other. From the former, fltl inherits a meaningful notion of minimality. As with the latter, approximate solution methods can be used and can perform a restricted dynamic analysis of the reward formulae. In particular, formula progression enables even state-based methods to exploit some of the structure in '$FLTL space'. However, the gap between blind and true minimality indicates that progression alone is insufficient to always fully exploit that structure. There is a hope that pltlstr is able to take advantage of the full structure of the reward function, but also a possibility that it will fail to exploit even as much structure as fltl, as efficiently. An empirical comparison of the three approaches is needed to answer this question and identify the domain features favoring one over the other.

4. NMRDPP

The first step towards a decent comparison of the different approaches is to have a framework that includes them all. The Non-Markovian Reward Decision Process Planner, nmrdpp, is a platform for the development and experimentation of approaches to NMRDPs. It provides an implementation of the approaches we have described in a common framework, within a single system, and with a common input language. nmrdpp is available on-line, see http://rsise.anu.edu.au/~charlesg/nmrdpp. It is worth noting that Bacchus et al. (1996, 1997) do not report any implementation of their approaches.

4.1 Input language

The input language enables the specification of actions, initial states, rewards, and search control knowledge. The format for the action specification is essentially the same as in the SPUDD system (Hoey et al., 1999). The reward specification is one or more formulae, each associated with a name and a real number. These formulae are in either PLTL or $FLTL. Control knowledge is given in the same language as that chosen for the reward. Control knowledge formulae will have to be verified by any sequence of states feasible under the generated policies. Initial states are simply specified as part of the control knowledge or as explicit assignments to propositions.

9. It would be interesting, on the other hand, to use pltlstr in conjunction with symbolic versions of such methods, e.g. symbolic LAO* (Feng & Hansen, 2002) or symbolic RTDP (Feng, Hansen, & Zilberstein, 2003).

    action flip
        heads (0.5)
    endaction

    action tilt
        heads (heads (0.9) (0.1))
    endaction

    heads = ff

    [first, 5.0]? heads and ~prv (pdi heads)
    [seq, 1.0]? (prv^2 heads) and (prv heads) and ~heads

Figure 8: Input for the Coin Example. prv (previously) stands for ⊖ and pdi (past diamond) stands for ♦−.
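To make the reward semantics of Figure 8 concrete, the two PLTL reward formulae can be evaluated by hand over a fixed state sequence. The following check is our own illustration (it evaluates the PLTL conditions by explicit look-back over the sequence, not via nmrdpp's machinery):

```python
# Hand evaluation of the two PLTL reward formulae from Figure 8 over a
# fixed state sequence. States are booleans: True = heads, False = tails.
# This is an illustrative check of the reward semantics only; it is not
# part of nmrdpp, and the function names are ours.

def first_heads_reward(seq, i):
    """[first, 5.0]: heads and ~prv (pdi heads) -- heads now, never heads before."""
    return 5.0 if seq[i] and not any(seq[:i]) else 0.0

def seq_reward(seq, i):
    """[seq, 1.0]: (prv^2 heads) and (prv heads) and ~heads -- heads, heads, tails."""
    return 1.0 if i >= 2 and seq[i - 2] and seq[i - 1] and not seq[i] else 0.0

def rewards(seq):
    return [first_heads_reward(seq, i) + seq_reward(seq, i) for i in range(len(seq))]

# Tails, heads, heads, tails, heads: the first head earns 5.0 at stage 1,
# and the pattern heads, heads, tails completes at stage 3, earning 1.0.
print(rewards([False, True, True, False, True]))  # [0.0, 5.0, 0.0, 1.0, 0.0]
```

The same allocation is what progression-based translation must reproduce on every sequence feasible in the coin NMRDP.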
For instance, consider a simple example consisting of a coin showing either heads or tails (¬heads). There are two actions that can be performed. The flip action changes the coin to show heads or tails with a 50% probability. The tilt action changes it with 10% probability, otherwise leaving it as it is. The initial state is tails. We get a reward of 5.0 for the very first head (this is written heads ∧ ¬⊖♦− heads in PLTL) and a reward of 1.0 each time we achieve the sequence heads, heads, tails (⊖² heads ∧ ⊖heads ∧ ¬heads in PLTL). In our input language, this NMRDP is described as shown in Figure 8.

4.2 Common framework

The common framework underlying nmrdpp takes advantage of the fact that NMRDP solution methods can, in general, be divided into the distinct phases of preprocessing, expansion, and solving. The first two are optional. For pltlsim, preprocessing simply computes the set Sub(F) of subformulae of the reward formulae. For pltlmin, it also includes computing the labels l(s) for each state s. For pltlstr, preprocessing involves computing the set T of temporal variables as well as the ADDs for their dynamics and for the rewards. fltl does not require any preprocessing. Expansion is the optional generation of the entire equivalent MDP prior to solving. Whether or not off-line expansion is sensible depends on the MDP solution method used. If state-based value or policy iteration is used, then the MDP needs to be expanded anyway. If, on the other hand, an anytime search algorithm or structured method is used, it is definitely a bad idea. In our experiments, we often used expansion solely for the purpose of measuring the size of the generated MDP. Solving the MDP can be done using a number of methods.
Curr ently , nmrdpp pr o vides implemen tations of classica l d ynamic p rogramming methods, namely state-based v alue and p olicy iteratio n (Ho w ard, 1960), of heuristic searc h metho ds: state-base d LA O* (Hansen & Zilb erstein, 2001 ) using either v alue or p olicy iteration as a subr outine, and of one structured metho d, namely SP UDD (Ho ey et al., 1999). Prime candidates fo r future d ev elopmen ts are (L)R TDP (Bonet & Geffner, 2003), symb olic LAO* (F eng & Hansen, 2002), and sym b olic R TDP (F eng et al., 2003). 44 Decision-Theoretic Planning with non-Markovian Rew ards > load World (’coin’) load coin NMRDP > prep roces s(’sPltl’) pl tlstr prepro ces sing > star tCPUt imer > spud d(0.9 9, 0.000 1) so lve MDP with SPUDD( β , ǫ ) > stop CPUti mer > read CPUti mer rep ort solving time 1.2200 0 > iter ation Count rep ort n umber of iterations 1277 > disp layDo t(valueToDot) display ADD of v alue function Expected value heads (prv heads) (prv heads) (prv (prv pdi heads)) (prv (prv pdi heads)) (prv^2 heads) (prv pdi heads) 18.87 23.87 18.62 23.62 (prv pdi heads) 18.25 23.15 19.25 24.15 > disp layDo t(policyToDot) display p olicy Optimal policy heads (prv heads) flip tilt > prep roces s(’mPltl’) pl tlmin prepro ces sing > expa nd completely expand MDP > doma inSta teSize rep ort MDP size 6 > prin tDoma in ("") | ’show- domain .rb’ display postcript rendering of MDP - Reward=0 flip(0.5) tilt(0.9) heads Reward=5 flip(0.5) tilt(0.1) heads Reward=0 flip(0.5) tilt(0.9) - Reward=0 flip(0.5) tilt(0.1) flip(0.5) tilt(0.9) - Reward=1 flip(0.5) tilt(0.1) tilt(0.9) flip(0.5) heads Reward=0 tilt(0.1) flip(0.5) flip(0.5) tilt(0.9) flip(0.5) tilt(0.1) flip(0.5) tilt(0.9) flip(0.5) tilt(0.1) > valI t(0.9 9, 0.000 1) so lve MDP with VI( β , ǫ ) > iter ation Count rep ort n umber of iterations 1277 > getP olicy output po licy (textual) ... 
Figure 9: Sample Session

4.3 Approaches covered

Altogether, the various types of preprocessing, the choice of whether to expand, and the MDP solution methods give rise to quite a number of NMRDP approaches, including, but not limited to, those previously mentioned (see e.g. pltlstr(a) below). Not all combinations are possible. E.g., state-based preprocessing variants are incompatible with structured solution methods (the converse is possible in principle, however). Also, there is at present no structured form of preprocessing for $FLTL formulae.

pltlstr(a) is an example of an interesting variant of pltlstr, which we obtain by considering additional preprocessing, whereby the state space is explored (without explicitly enumerating it) to produce a BDD representation of the e-states reachable from the start state. This is done by starting with a BDD representing the start e-state, and repeatedly applying each action. Non-zero probabilities are converted to ones and the result "or-ed" with the last result. When no action adds any reachable e-states to this BDD, we can be sure it represents the reachable e-state space. This is then used as additional control knowledge to restrict the search. It should be noted that without this phase pltlstr makes no assumptions about the start state, and thus is left at a possible disadvantage. Similar structured reachability analysis techniques have been used in the symbolic implementation of LAO* (Feng & Hansen, 2002). However, an important aspect of what we do here is that temporal variables are also included in the BDD.

4.4 The nmrdpp System

nmrdpp is controlled by a command language, which is read either from a file or interactively.
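The reachability pre-phase of pltlstr(a) described in Section 4.3 can be sketched as follows, with explicit sets of e-states standing in for BDDs and a hypothetical successor function per action; in nmrdpp the union is a BDD "or" and no state is ever enumerated explicitly.

```python
# Illustrative sketch of the reachability fixpoint behind PLTLSTR(A),
# with Python sets standing in for BDDs.

def reachable(start, actions):
    """Least fixpoint: repeatedly apply every action to the frontier,
    adding all successors reachable with non-zero probability
    (probabilities dropped), until no action contributes a new e-state."""
    reached, frontier = {start}, {start}
    while frontier:
        successors = set()
        for s in frontier:
            for succ in actions:          # succ(s) -> set of successor e-states
                successors.update(succ(s))
        frontier = successors - reached   # only genuinely new e-states
        reached |= frontier
    return reached

# e.g. for the coin: flip can reach both faces, tilt flips or stays
flip = lambda s: {True, False}
tilt = lambda s: {s, not s}
coin_reachable = reachable(False, [flip, tilt])
```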
The command language provides commands for the different phases (preprocessing, expansion, solution) of the methods, commands to inspect the resulting policy and value functions, e.g. with rendering via DOT (AT&T Labs-Research, 2000), as well as supporting commands for timing and memory usage. A sample session, where the coin NMRDP is successively solved with pltlstr and pltlmin, is shown in Figure 9.

nmrdpp is implemented in C++, and makes use of a number of supporting libraries. In particular, it relies heavily on the CUDD package for manipulating ADDs (Somenzi, 2001): action specification trees are converted into and stored as ADDs by the system, and moreover the structured algorithms rely heavily on CUDD for ADD computations. The state-based algorithms make use of the MTL – Matrix Template Library – for matrix operations. MTL takes advantage of modern processor features such as MMX and SSE and provides efficient sparse matrix operations. We believe that our implementations of MDP solution methods are comparable with the state of the art. For instance, we found that our implementation of SPUDD is comparable in performance (within a factor of 2) to the reference implementation (Hoey et al., 1999). On the other hand, we believe that the data structures used for regression and progression of temporal formulae could be optimised.

5. Experimental Analysis

We are faced with three substantially different approaches that are not easy to compare, as their performance will depend on domain features as varied as the structure in the transition model, the type, syntax, and length of the temporal reward formula, the presence of rewards unreachable or irrelevant to the optimal policy, the availability of good heuristics and control-knowledge, etc., and on the interactions between these factors.
In this section, we report an experimental investigation into the influence of some of these factors and try to answer the questions raised previously:¹⁰

1. Is the dynamics of the domain the predominant factor affecting performance?
2. Is the type of reward a major factor?
3. Is the syntax used to describe rewards a major factor?
4. Is there an overall best method?
5. Is there an overall worst method?
6. Does the preprocessing phase of pltlmin pay, compared to pltlsim?
7. Does the simplicity of the fltl translation compensate for blind-minimality, or does the benefit of true minimality outweigh the cost of pltlmin preprocessing?
8. Are the dynamic analyses of rewards in pltlstr and fltl effective?
9. Is one of these analyses more powerful, or are they rather complementary?

In some cases but not all, we were able to identify systematic patterns. The results in this section were obtained using a Pentium4 2.6GHz GNU/Linux 2.4.20 machine with 500MB of RAM.

5.1 Preliminary Remarks

Clearly, fltl and pltlstr(a) have great potential for exploiting domain-specific heuristics and control-knowledge; pltlmin less so. To avoid obscuring the results, we therefore refrained from incorporating these features in the experiments. When running LAO*, the heuristic value of a state was the crudest possible (the sum of all reward values in the problem). Performance results should be interpreted in this light – they do not necessarily reflect the practical abilities of the methods that are able to exploit these features.

We begin with some general observations. One question raised above was whether the gain during the PLTL expansion phase is worth the expensive preprocessing performed by pltlmin, i.e. whether pltlmin typically outperforms pltlsim. We can definitively answer this question: up to pathological exceptions, preprocessing pays.
We found that expansion was the bottleneck, and that post-hoc minimisation of the MDP produced by pltlsim did not help much. pltlsim is therefore of little or no practical interest, and we decided not to report results on its performance, as it is often an order of magnitude worse than that of pltlmin. Unsurprisingly, we also found that pltlstr would typically scale to larger state spaces, inevitably leading it to outperform state-based methods. However, this effect is not uniform: structured solution methods sometimes impose excessive memory requirements which make them uncompetitive in certain cases, for example where ⊖^n f, for large n, features as a reward formula.

10. Here is an executive summary of the answers for the executive reader: 1. no, 2. yes, 3. yes, 4. pltlstr and fltl, 5. pltlsim, 6. yes, 7. yes and no, respectively, 8. yes, 9. no and yes, respectively.

5.2 Domains

Experiments were performed on four hand-coded domains (propositions + dynamics) and on random domains. Each hand-coded domain has n propositions p_i, and a dynamics which makes every state possible and eventually reachable from the initial state in which all propositions are false. The first two such domains, spudd-linear and spudd-expon, were discussed by Hoey et al. (1999); the two others are our own. The intention of spudd-linear was to take advantage of the best-case behaviour of SPUDD. For each proposition p_i, it has an action a_i which sets p_i to true and all propositions p_j, 1 ≤ j < i, to false. spudd-expon was used by Hoey et al. (1999) to demonstrate the worst-case behaviour of SPUDD. For each proposition p_i, it has an action a_i which sets p_i to true only when all propositions p_j, 1 ≤ j < i, are true (and sets p_i to false otherwise), and sets the latter propositions to false.
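Our reading of the spudd-expon dynamics just described can be sketched as follows (states as tuples of booleans; the encoding is an illustrative choice, not the system's):

```python
# Sketch of the spudd-expon action a_i described above: p_i becomes
# true only when all p_j, j < i, are true, and those p_j are reset,
# which forces exponentially long action sequences to set all p_i.

def apply_a(state, i):
    """Apply action a_i (0-based index) to a tuple of booleans."""
    s = list(state)
    s[i] = all(state[:i])       # precondition: all earlier propositions true
    for j in range(i):
        s[j] = False            # earlier propositions are reset
    return tuple(s)
```

Starting from the all-false state, reaching the all-true state requires revisiting the earlier propositions again and again, which is exactly the worst case for SPUDD.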
The third domain, called on/off, has one "turn-on" and one "turn-off" action per proposition. The "turn-on-p_i" action only probabilistically succeeds in setting p_i to true when p_i was false. The turn-off action is similar. The fourth domain, called complete, is a fully connected reflexive domain. For each proposition p_i there is an action a_i which sets p_i to true with probability i/(n+1) (and to false otherwise) and p_j, j ≠ i, to true or false with probability 0.5. Note that a_i can cause a transition to any of the 2^n states.

Random domains of size n also involve n propositions. The method for generating their dynamics is detailed in Appendix C. Let us just summarise by saying that we are able to generate random dynamics exhibiting a given degree of "structure" and a given degree of uncertainty. Lack of structure essentially measures the bushiness of the internal part of the ADDs representing the actions, and uncertainty measures the bushiness of their leaves.

5.3 Influence of Dynamics

The interaction between dynamics and reward certainly affects the performance of the different approaches, though not so strikingly as other factors such as the reward type (see below). We found that under the same reward scheme, varying the degree of structure or uncertainty did not generally change the relative success of the different approaches. For instance, Figures 10 and 11 show the average run time of the methods as a function of the degree of structure, resp. degree of uncertainty, for random problems of size n = 6 and reward ⊖^n ¬⊖⊤ (the state encountered at stage n is rewarded, regardless of its properties¹¹). Run-time increases slightly with both degrees, but there is no significant change in relative performance. These are typical of the graphs we obtain for other rewards. Clearly, counterexamples to this observation exist.
These are most notable in cases of extreme dynamics, for instance with the spudd-expon domain. Although for small values of n, such as n = 6, pltlstr approaches are faster than the others in handling the reward ⊖^n ¬⊖⊤ for virtually any type of dynamics we encountered, they perform very poorly with that reward on spudd-expon. This is explained by the fact that only a small fraction of spudd-expon states are reachable in the first n steps. After n steps, fltl immediately recognises that reward is of no consequence, because the formula has progressed to ⊤. pltlmin discovers this fact only after expensive preprocessing. pltlstr, on the other hand, remains concerned by the prospect of reward, just as pltlsim would.

11. ○^n $ in $FLTL.

[Figure 10: Changing the Degree of Structure — average CPU time (sec) vs. degree of structure (0: structured … 1: unstructured), for FLTL, PLTLSTRUCT, PLTLMIN, PLTLSTRUCT(A)]

[Figure 11: Changing the Degree of Uncertainty — average CPU time (sec) vs. degree of uncertainty (0: certain … 1: uncertain), for FLTL, PLTLSTRUCT, PLTLMIN, PLTLSTRUCT(A)]

5.4 Influence of Reward Types

The type of reward appears to have a stronger influence on performance than dynamics. This is unsurprising, as the reward type significantly affects the size of the generated MDP: certain rewards only make the size of the minimal equivalent MDP increase by a constant number of states or a constant factor, while others make it increase by a factor exponential in the length of the formula. Table 1 illustrates this. The third column reports the size of the minimal equivalent MDP induced by the formulae on the left-hand side.¹² A legitimate question is whether there is a direct correlation between size increase and (in)appropriateness of the different methods.
For instance, we might expect the state-based methods to do particularly well in conjunction with reward types inducing a small MDP and otherwise badly in comparison with structured methods.

type                             formula                                    size          fastest     slowest
first time all p_i's             (∧_{i=1}^n p_i) ∧ (¬⊖♦- ∧_{i=1}^n p_i)     O(1)·||S||    pltlstr(a)  pltlmin
p_i's in sequence from start     (∧_{i=1}^n ⊖^i p_i) ∧ ⊖^n ¬⊖⊤              O(n)·||S||    fltl        pltlstr
two consecutive p_i's            ∨_{i=1}^{n-1} (⊖p_i ∧ p_{i+1})             O(n)·||S||    pltlstr     fltl
all p_i's n times ago            ⊖^n ∧_{i=1}^n p_i                          O(2^n)·||S||  pltlstr     pltlmin

Table 1: Influence of Reward Type on MDP Size and Method Performance

12. The figures are not necessarily valid for non-completely-connected NMRDPs. Unfortunately, even for completely connected domains, there does not appear to be a much cheaper way to determine the MDP size than to generate it and count states.

[Figure 12: Changing the Syntax — average CPU time (sec) vs. n, averaged over all approaches, with prvIn and with prvOut]

Interestingly, this is not always the case. For instance, in Table 1, whose last two columns report the fastest and slowest methods over the range of hand-coded domains where 1 ≤ n ≤ 12, the first row contradicts that expectation. Moreover, although pltlstr is fastest in the last row, for larger values of n (not represented in the table), it aborts through lack of memory, unlike the other methods. The most obvious observation arising out of these experiments is that pltlstr is nearly always the fastest – until it runs out of memory. Perhaps the most interesting results are those in the second row, which expose the inability of methods based on PLTL to deal with rewards specified as long sequences of events.
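The contrast can be made concrete with a toy progression function for a minimal $FLTL-like fragment (literals, conjunction, a next operator, and the reward marker $). This is an illustrative simplification, not nmrdpp's actual progression; in particular, formulae that progress to falsity are handled naively here.

```python
# Toy one-step progression for a tiny $FLTL-like fragment. Formulas are
# nested tuples: ('lit', name, polarity), ('and', f, g), ('next', f),
# plus the constants '$' (reward marker), 'true', 'false'.

TRUE, FALSE, DOLLAR = 'true', 'false', '$'

def progress(f, state):
    """Progress f through `state` (a set of true propositions).
    Returns (rewarded_now, residual formula for the successor state)."""
    if f == DOLLAR:
        return True, TRUE                  # reward falls due at this step
    if f in (TRUE, FALSE):
        return False, f
    kind = f[0]
    if kind == 'lit':
        _, name, polarity = f
        return False, TRUE if (name in state) == polarity else FALSE
    if kind == 'next':
        return False, f[1]                 # obligation moves one step ahead
    if kind == 'and':
        r1, g1 = progress(f[1], state)
        r2, g2 = progress(f[2], state)
        if FALSE in (g1, g2):
            return False, FALSE            # conjunction falsified (simplified)
        g = g1 if g2 == TRUE else g2 if g1 == TRUE else ('and', g1, g2)
        return r1 or r2, g
```

Progressing the coin sequence reward heads; heads; ¬heads consumes one event per step, keeping the order implicit in the residual formula, whereas a PLTL subformula set would have to reconstruct that order by reasoning over labels.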
In converting the reward formula to a set of subformulae, they lose information about the order of events, which then has to be recovered laboriously by reasoning. $FLTL progression, in contrast, takes the events one at a time, preserving the relevant structure at each step. Further experimentation led us to observe that all PLTL-based algorithms perform poorly where reward is specified using formulae of the form ⊖^k f, ♦-_k f, and ⊟_k f (f has been true k steps ago, within the last k steps, or at all of the last k steps).

5.5 Influence of Syntax

Unsurprisingly, we find that the syntax used to express rewards, which affects the length of the formula, has a major influence on the run time. A typical example of this effect is captured in Figure 12. This graph demonstrates how re-expressing prvOut ≡ ⊖^n (∧_{i=1}^n p_i) as prvIn ≡ ∧_{i=1}^n ⊖^n p_i, thereby creating n times more temporal subformulae, alters the running time of all PLTL methods. fltl is affected too, as $FLTL progression requires two iterations through the reward formula. The graph represents the averages of the running times over all the methods, for the complete domain.

Our most serious concern in relation to the PLTL approaches is their handling of reward specifications containing multiple reward elements. Most notably, we found that pltlmin does not necessarily produce the minimal equivalent MDP in this situation. To demonstrate, we consider the set of reward formulae {f_1, f_2, …, f_n}, each associated with the same real value r.

[Figure 13: Effect of Multiple Rewards on MDP Size — state count/2^n vs. n, for PLTLMIN and FLTL]

[Figure 14: Effect of Multiple Rewards on Run Time — total CPU time (sec) vs. n, for FLTL, PLTLSTRUCT, PLTLMIN, PLTLSTRUCT(A)]
Given this, PLTL approaches will distinguish unnecessarily between past behaviours which lead to identical future rewards. This may occur when the reward at an e-state is determined by the truth value of f_1 ∨ f_2. This formula does not necessarily require e-states that distinguish between the cases in which {f_1 ≡ ⊤, f_2 ≡ ⊥} and {f_1 ≡ ⊥, f_2 ≡ ⊤} hold; however, given the above specification, pltlmin makes this distinction. For example, taking f_i = ⊖p_i, Figure 13 shows that fltl leads to an MDP whose size is at most 3 times that of the NMRDP. In contrast, the relative size of the MDP produced by pltlmin is linear in n, the number of rewards and propositions. These results are obtained with all hand-coded domains except spudd-expon. Figure 14 shows the run-times as a function of n for complete. fltl dominates and is only overtaken by pltlstr(a) for large values of n, when the MDP becomes too large for explicit exploration to be practical. To obtain the minimal equivalent MDP using pltlmin, a bloated reward specification of the form {(⊖ ∨_{i=1}^n (p_i ∧ ∧_{j=1, j≠i}^n ¬p_j) : r), …, (⊖ ∧_{i=1}^n p_i : n·r)} is necessary, which, by virtue of its exponential length, is not an adequate solution.

5.6 Influence of Reachability

All approaches claim to have some ability to ignore variables which are irrelevant because the condition they track is unreachable:¹³ pltlmin detects them through preprocessing, pltlstr exploits the ability of structured solution methods to ignore them, and fltl ignores them when progression never exposes them. However, given that the mechanisms for avoiding irrelevance are so different, we expect corresponding differences in their effects.
On experimental investigation, we found that the differences in performance are best illustrated by looking at response formulae, which assert that if a trigger condition c is reached then a reward will be received upon achievement of the goal g in, resp. within, k steps. In PLTL, this is written g ∧ ⊖^k c, resp. g ∧ ♦-_k c, and in $FLTL, □(c → ○^k (g → $)), resp. □(c → ○_{≤k} (g → $)).

When the goal is unreachable, PLTL approaches perform well. As it is always false, the goal g does not lead to behavioural distinctions. On the other hand, while constructing the MDP, fltl considers the successive progressions of ○^k (g → $) without being able to detect that it is unreachable until it actually fails to happen. This is exactly what the blindness of blind minimality amounts to. Figure 15 illustrates the difference in performance as a function of the number n of propositions involved in the spudd-linear domain, when the reward is of the form g ∧ ⊖^n c, with g unreachable.

fltl shines when the trigger is unreachable. Since c never happens, the formula will always progress to itself, and the goal, however complicated, is never tracked in the generated MDP. In this situation PLTL approaches still consider ⊖^k c and its subformulae, only to discover, after expensive preprocessing for pltlmin, after reachability analysis for pltlstr(a), and never for pltlstr, that these are irrelevant. This is illustrated in Figure 16, again with spudd-linear and a reward of the form g ∧ ⊖^n c, with c unreachable.

5.7 Dynamic Irrelevance

Earlier we claimed that one advantage of pltlstr and fltl over pltlmin and pltlsim is that the former perform a dynamic analysis of rewards capable of detecting irrelevance of variables to particular policies, e.g. to the optimal policy. Our experiments confirm this claim.
However, as for reachability, whether the goal or the triggering condition in a response formula becomes irrelevant plays an important role in determining whether a pltlstr or fltl approach should be taken: pltlstr is able to dynamically ignore the goal, while fltl is able to dynamically ignore the trigger.

[Figure 15: Response Formula with Unachievable Goal — total CPU time (sec) vs. n, for FLTL, PLTLSTRUCT, PLTLMIN, PLTLSTRUCT(A)]

[Figure 16: Response Formula with Unachievable Trigger — total CPU time (sec) vs. n, for FLTL, PLTLSTRUCT, PLTLMIN, PLTLSTRUCT(A)]

This is illustrated in Figures 17 and 18. In both figures, the domain considered is on/off with n = 6 propositions, and the response formula is g ∧ ⊖^n c as before, here with both g and c achievable. This response formula is assigned a fixed reward. To study the effect of dynamic irrelevance of the goal, in Figure 17, achievement of ¬g is rewarded by the value r (i.e. we have (¬g : r) in PLTL). In Figure 18, on the other hand, we study the effect of dynamic irrelevance of the trigger, and achievement of ¬c is rewarded by the value r. Both figures show the run time of the methods as r increases. Achieving the goal, resp. the trigger, is made less attractive as r increases, up to the point where the response formula becomes irrelevant under the optimal policy. When this happens, the run-time of pltlstr, resp. fltl, exhibits an abrupt but durable improvement.

13. Here we sometimes speak of conditions and goals being 'reachable' or 'achievable' rather than 'feasible', although they may be temporally extended. This is to keep in line with conventional vocabulary, as in the phrase 'reachability analysis'.
The figures show that fltl is able to pick up irrelevance of the trigger, while pltlstr is able to exploit irrelevance of the goal. As expected, pltlmin, whose analysis is static, does not pick up either and performs consistently badly. Note that in both figures, pltlstr progressively takes longer to compute as r increases, because value iteration requires additional iterations to converge.

[Figure 17: Response Formula with Unrewarding Goal — total CPU time (sec) vs. r, for PLTLMIN, FLTL, PLTLSTRUCT, PLTLSTRUCT(A)]

[Figure 18: Response Formula with Unrewarding Trigger — average CPU time (sec) vs. r, for PLTLMIN, PLTLSTRUCT, FLTL, PLTLSTRUCT(A)]

5.8 Summary

In our experiments with artificial domains, we found pltlstr and fltl preferable to state-based PLTL approaches in most cases. If one insists on using the latter, we strongly recommend preprocessing. fltl is the technique of choice when the reward requires tracking a long sequence of events or when the desired behaviour is composed of many elements with identical rewards. For response formulae, we advise the use of pltlstr if the probability of reaching the goal is low or achieving the goal is very costly, and conversely, we advise the use of fltl if the probability of reaching the triggering condition is low or if reaching it is very costly. In all cases, attention should be paid to the syntax of the reward formulae and in particular to minimising its length. Indeed, as could be expected, we found the syntax of the formulae and the type of non-Markovian reward they encode to be a predominant factor in determining the difficulty of the problem, much more so than the features of the Markovian dynamics of the domain.

6. A Concrete Example

Our experiments have so far focused on artificial problems and have aimed at characterising the strengths and weaknesses of the various approaches. We now look at a concrete example in order to give a sense of the size of more interesting problems that these techniques can solve. Our example is derived from the Miconic elevator classical planning benchmark (Koehler & Schuster, 2000). An elevator must get a number of passengers from their origin floor to their destination. Initially, the elevator is at some arbitrary floor and no passenger is served nor has boarded the elevator. In our version of the problem, there is one single action which causes the elevator to service a given floor, with the effect that the unserved passengers whose origin is the serviced floor board the elevator, while the boarded passengers whose destination is the serviced floor unboard and become served. The task is to plan the elevator movement so that all passengers are eventually served.¹⁴

There are two variants of Miconic. In the 'simple' variant, a reward is received each time a passenger becomes served. In the 'hard' variant, the elevator also attempts to provide a range of priority services to passengers with special requirements: many passengers will prefer travelling in a single direction (either up or down) to their destination, certain passengers might be offered non-stop travel to their destination, and finally, passengers with disabilities or young children should be supervised inside the elevator by some other passenger (the supervisor) assigned to them. Here we omit the VIP and conflicting group services present in the original hard Miconic problem, as the reward formulae for those do not create additional difficulties.
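The single service action just described can be sketched as follows; the encoding of passengers as (origin, destination, boarded, served) tuples is our own illustrative choice, not the system's.

```python
# Sketch of the single Miconic action: servicing a floor makes waiting
# passengers board and delivers boarded passengers at their destination.

def service(floor, passengers):
    """Passengers are (origin, destination, boarded, served) tuples.
    Boarded passengers whose destination is `floor` unboard and become
    served; unserved passengers whose origin is `floor` board."""
    out = []
    for origin, dest, boarded, served in passengers:
        if boarded and dest == floor:
            boarded, served = False, True
        elif not served and not boarded and origin == floor:
            boarded = True
        out.append((origin, dest, boarded, served))
    return out
```

For example, a passenger travelling from floor 0 to floor 2 boards when floor 0 is serviced and becomes served when floor 2 is serviced.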
Our formulation of the problem makes use of the same propositions as the PDDL description of Miconic used in the 2000 International Planning Competition: dynamic propositions record the floor the elevator is currently at and whether passengers are served or boarded, and static propositions record the origin and destination floors of passengers, as well as the categories (non-stop, direct-travel, supervisor, supervised) the passengers fall in. However, our formulation differs from the PDDL description in two interesting ways. Firstly, since we use rewards instead of goals, we are able to find a preferred solution even when all goals cannot simultaneously be satisfied. Secondly, because priority services are naturally described in terms of non-Markovian rewards, we are able to use the same action description for both the simple and hard versions, whereas the PDDL description of hard Miconic requires additional actions (up, down) and complex preconditions to monitor the satisfaction of priority service constraints.

The reward schemes for Miconic can be encapsulated through four different types of reward formula.

1. In the simple variant, a reward is received the first time each passenger P_i is served:

   PLTL: ServedP_i ∧ ⊖⊟¬ServedP_i
   $FLTL: ¬ServedP_i U (ServedP_i ∧ $)

2. Next, a reward is received each time a non-stop passenger P_i is served in one step after boarding the elevator:

   PLTL: NonStopP_i ∧ ⊖⊖¬BoardedP_i ∧ ⊖⊖¬ServedP_i ∧ ServedP_i
   $FLTL: □((NonStopP_i ∧ ¬BoardedP_i ∧ ¬ServedP_i ∧ ○○ServedP_i) → ○○$)

3. Then, a reward is received each time a supervised passenger P_i is served while having been accompanied at all times inside the elevator by his supervisor¹⁵ P_j:

   PLTL: SupervisedP_i ∧ SupervisorP_jP_i ∧ ServedP_i ∧ ⊖⊟¬ServedP_i ∧ ⊟(BoardedP_i → BoardedP_j)
   $FLTL: ¬ServedP_i U ((BoardedP_i ∧ SupervisedP_i ∧ ¬(BoardedP_j ∧ SupervisorP_jP_i) ∧ ¬ServedP_i) ∨ (ServedP_i ∧ $))

4. Finally, a reward is received each time a direct-travel passenger P_i is served while having travelled only in one direction since boarding, e.g., in the case of going up:

   PLTL: DirectP_i ∧ ServedP_i ∧ ⊖¬ServedP_i ∧ ((∨_j ∨_{k>j} (AtFloor_k ∧ ⊖AtFloor_j)) S (BoardedP_i ∧ ⊖¬BoardedP_i))
   $FLTL: □((DirectP_i ∧ BoardedP_i) → (¬ServedP_i U ((¬(∨_j ∨_{k>j} (AtFloor_j ∧ ○AtFloor_k)) ∧ ¬ServedP_i) ∨ (ServedP_i ∧ $))))

   and similarly in the case of going down.

14. We have experimented with stochastic variants of Miconic where passengers have some small probability of disembarking at the wrong floor. However, we find it more useful to present results for the deterministic version since it is closer to the Miconic deterministic planning benchmark and since, as we have shown before, rewards have a far more crucial impact than dynamics on the relative performance of the methods.

Experiments in this section were run on a Dual Pentium4 3.4GHz GNU/Linux 2.6.11 machine with 1GB of RAM. We first experimented with the simple variant, giving a reward of 50 each time a passenger is first served. Figure 19 shows the CPU time taken by the various approaches to solve random problems with an increasing number n of floors and passengers, and Figure 20 shows the number of states expanded when doing so. Each data point corresponds to just one random problem. To be fair with the structured approach, we ran pltlstr(a), which is able to exploit reachability from the start state.
A first observation is that although pltlstr(a) does best for small values of n, it quickly runs out of memory. pltlstr(a) and pltlsim both need to track formulae of the form ⊖⊟¬ServedP_i while pltlmin does not, and we conjecture that this is why they run out of memory earlier. A second observation is that attempts at PLTL minimisation do not pay very much here. While pltlmin has reduced memory requirements because it tracks fewer subformulae, the size of the MDP it produces is identical to the size of the pltlsim MDP and larger than that of the fltl MDP. This size increase is due to the fact that PLTL approaches label differently e-states in which the same passengers are served, depending on who has just become served (for those passengers, the reward formula is true at the e-state). In contrast, our fltl implementation with progression one step ahead labels all these e-states with the reward formulae relevant to the passengers that still need to be served, the other formulae having progressed to ⊤. The gain in number of expanded states materialises into run-time gains, resulting in fltl eventually taking the lead.

[Figure 19: Simple Miconic – Run Time — total CPU time (sec) vs. n, for FLTL, PLTLSIM, PLTLMIN, PLTLSTR(A)]

[Figure 20: Simple Miconic – Number of Expanded States — state count/2^n vs. n, for FLTL and PLTLSIM/PLTLMIN]

Our second experiment illustrates the benefits of using an even extremely simple admissible heuristic in conjunction with fltl.

15. To understand the $FLTL formula, observe that we get a reward iff (BoardedP_i ∧ SupervisedP_i) → (BoardedP_j ∧ SupervisorP_jP_i) holds until ServedP_i becomes true, and recall that the formula ¬q U ((¬p ∧ ¬q) ∨ (q ∧ $)) rewards the holding of p until the occurrence of q.
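A minimal sketch of the kind of admissible fringe heuristic used in this second experiment — 50 per still-unserved passenger, discounted once; the state encoding (a list of served flags) is an assumption for illustration:

```python
# Sketch of a simple admissible fringe heuristic for the Miconic
# experiments: each still-unserved passenger can contribute at most 50,
# and at the earliest one step from now, hence a single discount.

def fringe_heuristic(served_flags, beta=0.99, max_reward=50.0):
    """Optimistic value estimate for a fringe state."""
    unserved = sum(1 for served in served_flags if not served)
    return beta * max_reward * unserved
```

Because the estimate never underestimates the remaining achievable reward, LAO* guided by it can safely prune floors at which no passenger is waiting and which are not the destination of a boarded passenger.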
Our heuristic is applicable to discounted stochastic shortest path problems, and discounts rewards by the shortest time in the future in which they are possible. Here it simply amounts to assigning a fringe state a value of 50 times the number of still-unserved passengers (discounted once), and results in avoiding floors at which no passenger is waiting and which are not the destination of a boarded passenger. Figures 21 and 22 compare the run time and number of states expanded by fltl when used in conjunction with value iteration (valIt) to when it is used in conjunction with an LAO* search informed by the above heuristic (LAO*(h)). Uninformed LAO* (LAO*(u), i.e. LAO* with a heuristic of 50·n at each node) is also included as a reference point to show the overhead induced by heuristic search. As can be seen from the graphs, the heuristic search generates significantly fewer states and this eventually pays in terms of run time.

[Figure 21: Effect of a Simple Heuristic on Run Time — total CPU time (sec) vs. n, for FLTL-LAO(h), FLTL-LAO(u), FLTL-valIt]

[Figure 22: Effect of a Simple Heuristic on the Number of Expanded States — state count/2^n vs. n, for FLTL-LAO(h) and FLTL-valIt/FLTL-LAO(u)]

In our final experiment, we considered the hard variant, giving a reward of 50 as before for service (1), a reward of 2 for non-stop travel (2), a reward of 5 for appropriate supervision (3), and a reward of 10 for direct travel (4). Regardless of the number n of floors and passengers, problems only feature a single non-stop traveller, a third of passengers require supervision, and only half the passengers care about travelling direct. CPU time and number of states expanded are shown in Figures 23 and 24, respectively.
As in the simple case, pltlsim and pltlstr quickly run out of memory. Formulae of types (2) and (3) create too many additional variables for these approaches to track, and the problem does not seem to exhibit enough structure to help pltlstr. fltl remains the fastest. Here, this does not seem to be so much due to the size of the generated MDP, which is just slightly below that of the pltlmin MDP, but rather to the overhead incurred by minimisation. Another observation arising from this experiment is that only very small instances can be handled in comparison to the classical planning version of the problem solved by state-of-the-art optimal classical planners. For example, at the 2000 International Planning Competition, the PropPlan planner (Fourman, 2000) optimally solved instances of hard Miconic with 20 passengers and 40 floors in about 1000 seconds on a much less powerful machine.

[Figure 23: Hard Miconic, Run Time (total CPU time in seconds vs. n, for fltl, pltlsim, pltlmin, pltlstr(a))]

[Figure 24: Hard Miconic, Number of Expanded States (state count / 2^n vs. n, for fltl, pltlsim, pltlmin)]

7. nmrdpp in the Probabilistic Planning Competition

We now report on the behaviour of nmrdpp in the probabilistic track of the 4th International Planning Competition (IPC-4). Since the competition did not feature non-Markovian rewards, our original motivation in taking part was to further compare the solution methods implemented in nmrdpp in a Markovian setting. This objective largely underestimated the challenges raised by merely getting a planner ready for a competition, especially when that competition is the first of its kind.
In the end, we decided that successfully preparing nmrdpp to attempt all problems in the competition using one solution method (and possibly search control knowledge) would be an honorable result. The most crucial problem we encountered was the translation of PPDDL (Younes & Littman, 2004), the probabilistic variant of PDDL used as input language for the competition, into nmrdpp's ADD-based input language. While translating PPDDL into ADDs is possible in theory, devising a translation which is practical enough for the needs of the competition (small number of variables; small, quickly generated, and easily manipulable ADDs) is another matter. mtbdd, the translator kindly made available to participants by the competition organisers, was not always able to achieve the required efficiency. At other times, the translation was quick but nmrdpp was unable to use the generated ADDs efficiently. Consequently, we implemented a state-based translator on top of the PDDL parser as a backup, and opted for a state-based solution method, since it did not rely on ADDs and could operate with both translators. The version of nmrdpp entered in the competition did the following:

1. Attempt to get a translation into ADDs using mtbdd, and if that proves infeasible, abort it and rely on the state-based translator instead.

2. Run fltl expansion of the state space, taking search control knowledge into account when available. Break after 10 minutes if not complete.

3. Run value iteration to convergence. Failing to achieve any useful result (e.g. because expansion was not complete enough to even reach a goal state), go back to step 2.

4. Run as many of the 30 trials as possible in the remaining time,[16] following the generated policy where defined, and falling back on the non-deterministic search control policy when available.
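The four steps above amount to a simple anytime control loop. A sketch (all function arguments are hypothetical stubs standing in for nmrdpp components, not actual nmrdpp APIs):

```python
import time

def run_competition_entry(problem, deadline, expand, value_iterate, execute_trial):
    """Paraphrase of the competition-mode loop: expand (step 2),
    solve (step 3), retry on failure, then run trials (step 4)."""
    policy = None
    while policy is None and time.time() < deadline:
        # Step 2: FLTL expansion of the state space, capped at 10 minutes.
        mdp = expand(problem, budget=600)
        # Step 3: value iteration; assumed to return None if nothing
        # useful was obtained (e.g. expansion never reached a goal state).
        policy = value_iterate(mdp)
    # Step 4: run as many of the 30 trials as the remaining time allows.
    results = []
    for _ in range(30):
        if time.time() >= deadline:
            break
        results.append(execute_trial(policy))
    return results
```

The loop structure makes the anytime trade-off explicit: expansion and solving are retried until either a usable policy emerges or the 15-minute budget is exhausted.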
With Step 1 we were trying to maximise the instances in which the original ADD-based nmrdpp version could be run intact. In Step 3, it was decided not to use LAO* because, when run with no good heuristic, it often incurs a significant overhead compared to value iteration. The problems featured in the competition can be classified into goal-based and reward-based problems. In goal-based problems, a (positive) reward is only received when a goal state is reached. In reward-based problems, action performance may also incur a (usually negative) reward. Another, orthogonal distinction can be made between problems from domains that were not communicated in advance to the participants and those from domains that were. The latter consisted of variants of blocks world and logistics (or box world) problems, and gave the participating planners an opportunity to exploit knowledge of the domain, much as in the hand-coded deterministic planning track. We decided to enroll nmrdpp in a control-knowledge mode and in a domain-independent mode. The only difference between the two modes is that the first uses FLTL search control knowledge written for the known domains as additional input. Our main concern in writing the control knowledge was to achieve a reasonable compromise between the size and effectiveness of the formulae.

[16] On each given problem, planners had 15 minutes to run whatever computation they saw as appropriate (including parsing, pre-processing, and policy generation if any), and execute 30 trial runs of the generated policy from an initial state to a goal state.
For the blocks world domain, in which the two actions pickup-from and putdown-to had a 25% chance of dropping the block onto the table, the control knowledge we used encoded a variant of the well-known GN1 near-optimal strategy for deterministic blocks world planning (Slaney & Thiébaux, 2001): whenever possible, try putting a clear block in its goal position, otherwise put an arbitrary clear block on the table. Because blocks get dropped on the table whenever an action fails, and because the success probabilities and rewards are identical across actions, optimal policies for the problem are essentially made up of optimal sequences of actions for the deterministic blocks world, and there was little need for a more sophisticated strategy.[17] In the colored blocks world domain, where several blocks can share the same color and the goal only refers to the color of the blocks, the control knowledge selected an arbitrary goal state of the non-colored blocks world consistent with the colored goal specification, and then used the same strategy as for the non-colored blocks world. The performance of this strategy depends entirely on the goal state selected and can therefore be arbitrarily bad. Logistics problems from IPC-2 distinguish between airports and other locations within a city; trucks can drive between any two locations in a city, and planes can fly between any two airports. In contrast, the box world only features cities, some of which have an airport and some of which are only accessible by truck. A priori, the map of the truck and plane connections is arbitrary. The goal is to get packages from their city of origin to their city of destination. Moving by truck has a 20% chance of resulting in reaching one of the three cities closest to the departure city rather than the intended one.
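The GN1-style strategy described above can be sketched as follows (our own minimal reconstruction for illustration; the actual control knowledge was written as FLTL formulae, not code). States map each block to what it sits on, either 'table' or another block:

```python
# Sketch of the GN1 near-optimal blocks world strategy: put a clear block
# in its goal position when that position is ready, otherwise offload an
# arbitrary clear block to the table. (Illustrative reconstruction.)

def clear_blocks(on):
    """Blocks with nothing on top of them."""
    return [b for b in on if b not in on.values()]

def in_position(b, on, goal):
    """A block is in position if it and everything under it match the goal."""
    if on[b] != goal[b]:
        return False
    return on[b] == 'table' or in_position(on[b], on, goal)

def gn1_action(on, goal):
    if all(in_position(b, on, goal) for b in on):
        return None  # goal configuration reached
    clear = clear_blocks(on)
    # Constructive move: put a clear block into its (ready) goal position.
    for b in clear:
        dest = goal[b]
        if on[b] != dest and (dest == 'table' or
                              (dest in clear and in_position(dest, on, goal))):
            return ('move', b, dest)
    # Otherwise: put some clear block on the table.
    for b in clear:
        if on[b] != 'table':
            return ('move', b, 'table')
    return None
```

Each call returns the next move; iterating it from any state reaches the goal configuration, mirroring the deterministic strategy the FLTL control formulae encoded.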
The size of the box world search space turned out to be quite challenging for nmrdpp. Therefore, when writing search control knowledge, we gave up any optimality consideration and favored maximal pruning. We were helped by the fact that the box world generator produces problems with the following structure. Cities are divided into clusters, each of which contains at least one airport city. Furthermore, each cluster has at least one Hamiltonian circuit which trucks can follow. The control knowledge we used forced all planes but one, and all trucks but one in each cluster, to be idle. In each cluster, the truck allowed to move could only attempt driving along the chosen Hamiltonian circuit, picking up and dropping parcels as it went. The planners participating in the competition are shown in Table 2. Planners E, G2, J1, and J2 are domain-specific: either they are tuned for blocks and box worlds, or they use domain-specific search control knowledge, or they learn from examples. The other participating planners are domain-independent.

[17] More sophisticated near-optimal strategies for deterministic blocks world exist (see Slaney & Thiébaux, 2001), but are much more complex to encode and might have caused time performance problems.
Part.  Description                                           Reference
C      symbolic LAO*                                         (Feng & Hansen, 2002)
E*     first-order heuristic search in the fluent calculus   (Karabaev & Skvortsova, 2005)
G1     nmrdpp without control knowledge                      this paper
G2*    nmrdpp with control knowledge                         this paper
J1*    interpreter of hand-written classy policies           (Fern et al., 2004)
J2*    learns classy policies from random walks              (Fern et al., 2004)
J3     version of ff replanning upon failure                 (Hoffmann & Nebel, 2001)
P      mgpt: lrtdp with automatically extracted heuristics   (Bonet & Geffner, 2005)
Q      ProbaProp: conformant probabilistic planner           (Onder et al., 2006)
R      structured reachability analysis and structured PI    (Teichteil-Königsbuch & Fabiani, 2005)

Table 2: Competition Participants. Domain-specific planners are starred.

dom    bw-c-nr         bw-nc-nr   bx-nr         expl-bw   hanoise   zeno      tire-nr
prob   5    8    11    8          5-10   10-10  11        5-3       1-2-3-7   30-4      total
G2*    100  100  100   100        100    100                                            600
J1*    100  100  100   100        100    100                                            600
J2*    100  100  100   100        100    67                                             567
E*     100  100  100   100                                                              400
J3     100  100  100   100        100    100    9         —         —         23        632
G1                                              —         50        100       30        180
R                                 3             57        90        30                  177
P      100                                                53                            153
C      100                                                ?                             ≥100
Q      3                                                  23                            26

Table 3: Results for Goal-Based Problems. Domain-specific planners are starred. Entries are the percentage of runs in which the goal was reached. A blank indicates that the planner was unable to attempt the problem. A — indicates that the planner attempted the problem but was never able to achieve the goal. A ? indicates that the result is unavailable (due to a bug in the evaluation software, a couple of the results initially announced were found to be invalid).
dom    bw-c-r           bw-nc-r                           bx-r                  file   tire-r
prob   5    8    11     5    8    11   15   18   21       5-10   10-10  10-15   30-4   30-4    total
J1*    497  487  481    494  489  480  470  462  458      419    317    129                    5183
G2*    495  486  480    495  490  480  468  352  286      438    376    —                      4846
E*     496  492  486    495  490                                                               2459
J2*    497  486  482    495  490  480  468  —    455      376    —      —                      4229
J3     496  487  482    494  490  481  —    —    459      425    346    279     36     —       4475
P      494  488  466    397  184  —                                             58     —       2087
C      495                                                       ?                             ≥495
G1     495                                                                      —      —       495
R      494                                                                                     494
Q      180                                                                      11             191

Table 4: Results for Reward-Based Problems. Domain-specific planners are starred. Entries are the average reward achieved over the 30 runs. A blank indicates that the planner was unable to attempt the problem. A — indicates that the planner attempted the problem but did not achieve a strictly positive reward. A ? indicates that the result is unavailable.

Tables 3 and 4 show the results of the competition, which we extracted from the competition overview paper (Younes, Littman, Weissmann, & Asmuth, 2005) and from the competition web site http://www.cs.rutgers.edu/~mlittman/topics/ipc04-pt/. The first of those tables concerns the goal-based problems and the second the reward-based problems. The entries in the tables represent the goal-achievement percentage or average reward achieved by the various planner versions (left column) on the various problems (top two rows). Planners in the top part of the tables are domain-specific. Problems from the known domains lie on the left-hand side of the tables. The colored blocks world problems are bw-c-nr (goal-based version) and bw-c-r (reward version) with 5, 8, and 11 blocks. The non-colored blocks world problems are bw-nc-nr (goal-based version) with 8 blocks, and bw-nc-r (reward-based version) with 5, 8, 11, 15, 18, and 21 blocks.
The box world problems are bx-nr (goal-based) and bx-r (reward-based), with 5 or 10 cities and 10 or 15 boxes. Problems from the unknown domains lie on the right-hand side of the tables. They comprise: expl-bw, an exploding version of the 11-block blocks world problem in which putting down a block may destroy the object it is put on; zeno, a probabilistic variant of a zeno travel domain problem from IPC-3 with 1 plane, 2 persons, 3 cities and 7 fuel levels; hanoise, a probabilistic variant of the tower of Hanoi problem with 5 disks and 3 rods; file, a problem of putting 30 files in 5 randomly chosen folders; and tire, a variant of the tire world problem with 30 cities and spare tires at 4 of them, where the tire may go flat while driving. Our planner nmrdpp, in its G1 or G2 version, was able to attempt all problems, achieving a strictly positive reward in all but 4 of them. Not even ff (J3), the competition's overall winner, was able to successfully attempt that many problems. nmrdpp performed particularly well on goal-based problems, achieving the goal in 100% of the runs except in expl-bw, hanoise, and tire-nr (note that for these three problems, the goal achievement probability of the optimal policy does not exceed 65%). No other planner outperformed nmrdpp on that scale. As pointed out before, ff behaves well on the probabilistic versions of blocks and box world because the optimal policies are very close to those for the deterministic problem; Hoffmann (2002) analyses the reasons why the ff heuristic works well for traditional planning benchmarks such as blocks world and logistics. On the other hand, ff is unable to solve the unknown problems, which have a different structure and require more substantial probabilistic reasoning, although these problems are easily solved by a number of participating planners.
As expected, there is a large discrepancy between the version of nmrdpp allowed to use search control (G2) and the domain-independent version (G1). While the latter performs okay on the unknown goal-based domains, it is not able to solve any of the known ones. In fact, except for ff, none of the participating domain-independent planners were able to solve these problems. In the reward-based case, nmrdpp with control knowledge behaves well on the known problems. Only the human-encoded policies (J1) performed better. Without control knowledge, nmrdpp is unable to scale on those problems, while other participants such as ff and mgpt are. Furthermore, nmrdpp appears to perform poorly on the two unknown problems. In both cases, this might be due to the fact that it fails to generate an optimal policy: suboptimal policies easily have a high negative score in these domains (see Younes et al., 2005). For tire-r, we know that nmrdpp did indeed generate a suboptimal policy. Additionally, it could be that nmrdpp was unlucky with the sampling-based policy evaluation process: in tire-r in particular, there was a high variance between the costs of the various trajectories in the optimal policy. Altogether, the competition results suggest that control knowledge is likely to be essential when solving larger problems (Markovian or not) with nmrdpp, and that, as has been observed with deterministic planners, approaches making use of control knowledge are quite powerful.

8. Conclusion, Related, and Future Work

In this paper, we have examined the problem of solving decision processes with non-Markovian rewards.
We have described existing approaches which exploit a compact representation of the reward function to automatically translate the NMRDP into an equivalent process amenable to MDP solution methods. The computational model underlying this framework can be traced back to work on the relationship between linear temporal logic and automata in the areas of automated verification and model-checking (Vardi, 2003; Wolper, 1987). While remaining in this framework, we have proposed a new representation of non-Markovian reward functions and a translation into MDPs aimed at making the best possible use of state-based anytime heuristic search as the solution method. Our representation extends future linear temporal logic to express rewards. Our translation has the effect of embedding model-checking in the solution method. It results in an MDP of the minimal size achievable without stepping outside the anytime framework, and consequently in better policies by the deadline. We have described nmrdpp, a software platform that implements such approaches under a common interface, and which proved a useful tool in their experimental analysis. Both the system and the analysis are the first of their kind. We were able to identify a number of general trends in the behaviours of the methods and to provide advice as to which are best suited to certain circumstances. For obvious reasons, our analysis has focused on artificial domains. Additional work should examine a wider range of domains of more practical interest, to see what form these results take in that context. Ultimately, we would like our analysis to help nmrdpp automatically select the most appropriate method.
Unfortunately, because of the difficulty of translating between PLTL and $FLTL, it is likely that nmrdpp would still have to maintain both a PLTL and a $FLTL version of the reward formulae. A detailed comparison of our approach to solving NMRDPs with existing methods (Bacchus et al., 1996, 1997) can be found in Sections 3.10 and 5. Two important aspects of future work would help take the comparison further. One is to settle the question of the appropriateness of our translation to structured solution methods. Symbolic implementations of the solution methods we consider, e.g. symbolic LAO* (Feng & Hansen, 2002), as well as formula progression in the context of symbolic state representations (Pistore & Traverso, 2001), could be investigated for that purpose. The other is to take advantage of the greater expressive power of $FLTL to consider a richer class of decision processes, for instance with uncertainty as to which rewards are received and when. Many extensions of the language are possible: adding eventualities, unrestricted negation, first-class reward propositions, quantitative time, etc. Of course, dealing with them via progression without backtracking is another matter. We should investigate the precise relationship between our line of work and recent work on planning for temporally extended goals in non-deterministic domains. Of particular interest are 'weak' temporally extended goals such as those expressible in the Eagle language (Dal Lago et al., 2002), and temporally extended goals expressible in π-CTL* (Baral & Zhao, 2004). Eagle enables the expression of attempted reachability and maintenance goals of the form "try-reach p" and "try-maintain p", which add to the goals "do-reach p" and "do-maintain p" already expressible in CTL.
The idea is that the generated policy should make every attempt at satisfying proposition p. Furthermore, Eagle includes recovery goals of the form "g1 fail g2", meaning that goal g2 must be achieved whenever goal g1 fails, and cyclic goals of the form "repeat g", meaning that g should be achieved cyclically until it fails. The semantics of these goals is given in terms of variants of Büchi tree automata with preferred transitions. Dal Lago et al. (2002) present a planning algorithm based on symbolic model-checking which generates policies achieving those goals. Baral and Zhao (2004) describe π-CTL*, an alternative framework for expressing a subset of Eagle goals and a variety of others. π-CTL* is a variant of CTL* which allows for formulae involving two types of path quantifiers: quantifiers tied to the paths feasible under the generated policy, as is usual, but also quantifiers more generally tied to the paths feasible under any of the domain actions. Baral and Zhao (2004) do not present any planning algorithm. It would be very interesting to know whether Eagle and π-CTL* goals can be encoded as non-Markovian rewards in our framework. An immediate consequence would be that nmrdpp could be used to plan for them. More generally, we would like to examine the respective merits of non-deterministic planning for temporally extended goals and decision-theoretic planning with non-Markovian rewards. In the pure probabilistic setting (no rewards), recent related research includes work on planning and controller synthesis for probabilistic temporally extended goals expressible in probabilistic temporal logics such as CSL or PCTL (Younes & Simmons, 2004; Baier et al., 2004). These logics enable expressing statements about the probability of the policy satisfying a given temporal goal exceeding a given threshold.
For instance, Younes and Simmons (2004) describe a very general probabilistic planning framework, involving concurrency, continuous time, and temporally extended goals, rich enough to model generalised semi-Markov processes. The solution algorithms are not directly comparable to those presented here. Another exciting future work area is the investigation of temporal logic formalisms for specifying heuristic functions for NMRDPs, or more generally for search problems with temporally extended goals. Good heuristics are important to some of the solution methods we are targeting, and surely their value ought to depend on history. The methods we have described could be applicable to the description and processing of such heuristics. Related to this is the problem of extending search control knowledge to fully operate in the presence of temporally extended goals, rewards, and stochastic actions. A first issue is that branching or probabilistic logics such as CTL or PCTL variants should be preferred to FLTL when describing search control knowledge, because when stochastic actions are involved, search control often needs to refer to some of the possible futures and even to their probabilities.[18] Another major problem is that the GOALP modality, which is the key to the specification of reusable search control knowledge, is interpreted with respect to a fixed reachability goal[19] (Bacchus & Kabanza, 2000), and as such is not applicable to domains with temporally extended goals, let alone rewards.

[18] We would not argue, on the other hand, that CTL is necessary for representing non-Markovian rewards.
Kabanza and Thiébaux (2005) present a first approach to search control in the presence of temporally extended goals in deterministic domains, but much remains to be done for a system like nmrdpp to be able to support a meaningful extension of GOALP. Finally, let us mention that related work in the area of databases uses a similar approach to pltlstr to extend a database with auxiliary relations containing sufficient information to check temporal integrity constraints (Chomicki, 1995). The issues are somewhat different from those raised by NMRDPs: as there is only ever one sequence of databases, what matters is more the size of these auxiliary relations than avoiding making redundant distinctions.

Acknowledgements

Many thanks to Fahiem Bacchus, Rajeev Goré, Marco Pistore, Ron van der Meyden, Moshe Vardi, and Lenore Zuck for useful discussions and comments, as well as to the anonymous reviewers and to David Smith for their thorough reading of the paper and their excellent suggestions. Sylvie Thiébaux, Charles Gretton, John Slaney, and David Price thank National ICT Australia for its support. NICTA is funded through the Australian Government's Backing Australia's Ability initiative, in part through the Australian Research Council. Froduald Kabanza is supported by the Canadian Natural Sciences and Engineering Research Council (NSERC).

Appendix A. A Class of Reward-Normal Formulae

The existing decision procedure (Slaney, 2005) for determining whether a formula is reward-normal is guaranteed to terminate finitely, but involves the construction and comparison of automata and is rather intricate in practice. It is therefore useful to give a simple syntactic characterisation of a set of constructors for obtaining reward-normal formulae, even though not all such formulae are so constructible.
We say that a formula is material iff it contains no $ and no temporal operators; that is, the material formulae are the boolean combinations of atoms. We consider four operations on behaviours representable by formulae of $FLTL. Firstly, a behaviour may be delayed for a specified number of timesteps. Secondly, it may be made conditional on a material trigger. Thirdly, it may be started repeatedly until a material termination condition is met. Fourthly, two behaviours may be combined to form their union. These operations are easily realised syntactically by corresponding operations on formulae. Where m is any material formula:

delay[f] = ○f
cond[m, f] = m → f
loop[m, f] = f U m
union[f1, f2] = f1 ∧ f2

We have shown (Slaney, 2005) that the set of reward-normal formulae is closed under delay, cond (for any material m), loop (for any material m) and union, and also that the closure of {$} under these operations represents a class of behaviours closed under intersection and concatenation as well as union. Many familiar reward-normal formulae are obtainable from $ by applying the four operations. For example, □(p → $) is loop[⊥, cond[p, $]]. Sometimes a paraphrase is necessary. For example, □((p ∧ ○q) → ○$) is not of the required form because of the ○ in the antecedent of the conditional, but the equivalent □(p → ○(q → $)) is loop[⊥, cond[p, delay[cond[q, $]]]]. Other cases are not so easy. An example is the formula ¬p U (p ∧ $), which stipulates a reward the first time p happens and which is not at all of the form suggested. To capture the same behaviour using the above operations requires a formula like (p → $) ∧ (○(p → $) U p).

[19] Where f is an atemporal formula, GOALP(f) is true iff f is true of all goal states.
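To make the four operations concrete, the following sketch (our own illustration, not part of nmrdpp) realises them on formula syntax trees and uses one step of $FLTL progression to check that loop[⊥, cond[p, $]] behaves like □(p → $): whenever p holds, a reward is due and the formula regenerates itself, while withholding the reward progresses it to ⊥.

```python
# Formulae as nested tuples; True/False stand for top/bottom, '$' is the
# reward marker, and plain strings are atoms. (Illustrative sketch only.)

def delay(f):       return ('X', f)         # delay[f]      = Xf
def cond(m, f):     return ('->', m, f)     # cond[m, f]    = m -> f
def loop(m, f):     return ('U', f, m)      # loop[m, f]    = f U m
def union(f1, f2):  return ('and', f1, f2)  # union[f1, f2] = f1 & f2

def prog(b, state, f):
    """One step of $FLTL progression: b says whether a reward is due now,
    `state` is the set of atoms true in the current state. (Antecedents of
    '->' are assumed material, so their negation stays boolean.)"""
    if f is True or f is False: return f
    if f == '$':                return b
    if isinstance(f, str):      return f in state
    op = f[0]
    if op == 'X':   return f[1]
    if op == '->':  return lor(lnot(prog(b, state, f[1])), prog(b, state, f[2]))
    if op == 'and': return land(prog(b, state, f[1]), prog(b, state, f[2]))
    if op == 'or':  return lor(prog(b, state, f[1]), prog(b, state, f[2]))
    if op == 'U':   # g U h == h or (g and X(g U h))
        g, h = f[1], f[2]
        return lor(prog(b, state, h), land(prog(b, state, g), f))
    raise ValueError(f)

# Simplifying connectives so progressed formulae stay small.
def land(x, y):
    if x is False or y is False: return False
    if x is True: return y
    if y is True: return x
    return ('and', x, y)

def lor(x, y):
    if x is True or y is True: return True
    if x is False: return y
    if y is False: return x
    return ('or', x, y)

def lnot(x):
    if x is True:  return False
    if x is False: return True
    return ('not', x)
```

For f = loop(False, cond('p', '$')), progressing through a state where p holds with the reward granted returns f unchanged, while withholding the reward yields ⊥, exactly the allocation discipline the appendix relies on.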
Appendix B. Proofs of Theorems

Property 1  Where b ⇔ (Γ(i) ∈ B), (Γ, i) ⊨_B f iff (Γ, i+1) ⊨_B Prog(b, Γ_i, f).

Proof: Induction on the structure of f. There are several base cases, all fairly trivial. If f = ⊤ or f = ⊥ there is nothing to prove, as these progress to themselves and hold everywhere and nowhere respectively. If f = p, then if f holds in Γ_i it progresses to ⊤, which holds in Γ_{i+1}, while if f does not hold in Γ_i it progresses to ⊥, which does not hold in Γ_{i+1}. The case f = ¬p is similar. In the last base case, f = $. Then the following are equivalent:

(Γ, i) ⊨_B f
Γ(i) ∈ B
b
Prog(b, Γ_i, f) = ⊤
(Γ, i+1) ⊨_B Prog(b, Γ_i, f)

Induction case 1: f = g ∧ h. The following are equivalent:

(Γ, i) ⊨_B f
(Γ, i) ⊨_B g and (Γ, i) ⊨_B h
(Γ, i+1) ⊨_B Prog(b, Γ_i, g) and (Γ, i+1) ⊨_B Prog(b, Γ_i, h)   (by the induction hypothesis)
(Γ, i+1) ⊨_B Prog(b, Γ_i, g) ∧ Prog(b, Γ_i, h)
(Γ, i+1) ⊨_B Prog(b, Γ_i, f)

Induction case 2: f = g ∨ h. Analogous to case 1.

Induction case 3: f = ○g. Trivial by inspection of the definitions.

Induction case 4: f = g U h. Then f is logically equivalent to h ∨ (g ∧ ○(g U h)), which by cases 1, 2 and 3 holds at stage i of Γ for behaviour B iff Prog(b, Γ_i, f) holds at stage i+1.

Theorem 1  Let f be reward-normal, and let ⟨f_0, f_1, ...⟩ be the result of progressing it through the successive states of a sequence Γ. Then, provided no f_i is ⊥, for all i, Rew(Γ_i, f_i) iff Γ(i) ∈ B_f.

Proof: First, by the definition of reward-normality, if f is reward-normal then Γ ⊨_B f iff for all i, if Γ(i) ∈ B_f then Γ(i) ∈ B. Next, if Γ ⊨_B f, then progressing f through Γ according to B (that is, letting each b_i be true iff Γ(i) ∈ B) cannot lead to a contradiction, because by Property 1 progression is truth-preserving.
It remains, then, to show that if Γ ⊭_B f then progressing f through Γ according to B must eventually lead to ⊥. The proof of this is by induction on the structure of f and, as usual, the base case in which f is a literal (an atom, a negated atom, or ⊤, ⊥ or $) is trivial.

Case f = g ∧ h. Suppose Γ ⊭_B f. Then either Γ ⊭_B g or Γ ⊭_B h, so by the induction hypothesis either g or h eventually progresses to ⊥, and hence so does their conjunction.

Case f = g ∨ h. Suppose Γ ⊭_B f. Then both Γ ⊭_B g and Γ ⊭_B h, so by the induction hypothesis each of g and h eventually progresses to ⊥. Suppose without loss of generality that g does not progress to ⊥ before h does. Then at some point g has progressed to some formula g′ and f has progressed to g′ ∨ ⊥, which simplifies to g′. Since g′ also eventually progresses to ⊥, so does f.

Case f = ○g. Suppose Γ ⊭_B f. Let Γ = Γ_0; ∆ and let B′ = {γ | Γ_0; γ ∈ B}. Then ∆ ⊭_{B′} g, so by the induction hypothesis g progressed through ∆ according to B′ eventually reaches ⊥. But the progression of f through Γ according to B is exactly the same after the first step, so that too leads to ⊥.

Case f = g U h. Suppose Γ ⊭_B f. Then there is some j such that (Γ, j) ⊭_B g and for all i ≤ j, (Γ, i) ⊭_B h. We proceed by induction on j. In the base case, j = 0, and both Γ ⊭_B g and Γ ⊭_B h, whence by the main induction hypothesis both g and h eventually progress to ⊥. Thus h ∨ (g ∧ f′) eventually progresses to ⊥ for any f′, and in particular for f′ = f, establishing the base case. For the induction case, suppose Γ ⊨_B g (and of course Γ ⊭_B h). Since f is equivalent to h ∨ (g ∧ ○f), and Γ ⊭_B f, Γ ⊭_B h and Γ ⊨_B g, clearly Γ ⊭_B ○f. Where ∆ and B′ are as in the previous case, therefore, ∆ ⊭_{B′} f and the failure occurs at stage j−1 of ∆.
Therefore the hypothesis of the induction on j applies, and f progressed through ∆ according to B′ eventually goes to ⊥, and so f progressed through Γ according to B goes similarly to ⊥.

Theorem 3  Let S′ be the set of e-states in an equivalent MDP D′ for D = ⟨S, s_0, A, Pr, R⟩. D′ is minimal iff every e-state in S′ is reachable and S′ contains no two distinct e-states s′_1 and s′_2 with τ(s′_1) = τ(s′_2) and µ(s′_1) = µ(s′_2).

Proof: The proof is by construction of the canonical equivalent MDP D_c. Let the set of finite prefixes of state sequences in D̃(s_0) be partitioned into equivalence classes, where Γ1(i) ≡ Γ2(j) iff Γ1_i = Γ2_j and for all ∆ ∈ S* such that Γ1(i); ∆ ∈ D̃(s_0), R(Γ1(i); ∆) = R(Γ2(j); ∆). Let [Γ(i)] denote the equivalence class of Γ(i). Let E be the set of these equivalence classes. Let A be the function that takes each [Γ(i)] in E to A(Γ_i). For each Γ(i) and ∆(j), and for each a ∈ A([Γ(i)]), let T([Γ(i)], a, [∆(j)]) be Pr(Γ_i, a, s) if [∆(j)] = [Γ(i); ⟨s⟩]. Otherwise let T([Γ(i)], a, [∆(j)]) = 0. Let R([Γ(i)]) be R(Γ(i)). Then note the following four facts:

1. Each of the functions A, T and R is well-defined.

2. D_c = ⟨E, [⟨s_0⟩], A, T, R⟩ is an equivalent MDP for D with τ([Γ(i)]) = Γ_i.

3. For any equivalent MDP D′′ of D there is a mapping from a subset of the states of D′′ onto E.

4. D′ satisfies the condition that every e-state in S′ is reachable and S′ contains no two distinct e-states s′_1 and s′_2 with τ(s′_1) = τ(s′_2) and µ(s′_1) = µ(s′_2) iff D_c is isomorphic to D′.

What fact 1 above amounts to is that if Γ1(i) ≡ Γ2(j) then it does not matter which of the two sequences is used to define A, T and R of their equivalence class.
In the cases of A and T, this is simply the fact that Γ1_i = Γ2_j. In the case of R, it is the special case ∆ = ⟨Γ1_i⟩ of the equality of rewards over extensions.

Fact 2 is a matter of checking that the four conditions of Definition 1 hold. Of these, conditions 1 (τ([⟨s_0⟩]) = s_0) and 2 (A([Γ(i)]) = A(Γ_i)) hold trivially by the construction. Condition 4 says that for any feasible state sequence Γ ∈ D̃(s_0), we have R([Γ(i)]) = R(Γ(i)) for all i. This too is given by the construction. Condition 3 states: for all s_1, s_2 ∈ S, if there is a ∈ A(s_1) such that Pr(s_1, a, s_2) > 0, then for all Γ(i) ∈ D̃(s_0) such that Γ_i = s_1, there exists a unique [∆(j)] ∈ E with ∆_j = s_2 such that for all a ∈ A([Γ(i)]), T([Γ(i)], a, [∆(j)]) = Pr(s_1, a, s_2). Suppose Pr(s_1, α, s_2) > 0, Γ(i) ∈ D̃(s_0) and Γ_i = s_1. Then the required ∆(j) is Γ(i); ⟨s_2⟩, and of course A([Γ(i)]) = A(Γ_i), so the required condition reads: [Γ(i); ⟨s_2⟩] is the unique element X of E with τ(X) = s_2 such that for all a ∈ A(Γ_i), T([Γ(i)], a, X) = Pr(s_1, a, s_2). To establish existence, we need that if a ∈ A(Γ_i) then T([Γ(i)], a, [Γ(i); ⟨s_2⟩]) = Pr(Γ_i, a, s_2), which is immediate from the definition of T above. To establish uniqueness, suppose that τ(X) = s_2 and T([Γ(i)], a, X) = Pr(s_1, a, s_2) for all actions a ∈ A(Γ_i). Since Pr(s_1, α, s_2) > 0, the transition probability from [Γ(i)] to X is nonzero for some action, so by the definition of T, X can only be [Γ(i); ⟨s_2⟩].

Fact 3 is readily observed. Let M be any equivalent MDP for D. For any states s_1 and s_2 of D, and any state X of M such that τ(X) = s_1, there is at most one state Y of M with τ(Y) = s_2 such that some action a ∈ A(s_1) gives a nonzero probability of transition from X to Y.
This follows from the uniqueness part of condition 3 of Definition 1 together with the fact that the transition function is a probability distribution (it sums to 1). Therefore for any given finite state sequence Γ(i) there is at most one state of M reached from the start state of M by following Γ(i). Therefore M induces an equivalence relation ≈_M on S*: Γ(i) ≈_M ∆(j) iff they lead to the same state of M (the sequences which are not feasible in M may all be regarded as equivalent under ≈_M). Each reachable state of M thus has associated with it a nonempty equivalence class of finite sequences of states of D. Working through the definitions, we may observe that ≈_M is a sub-relation of ≡ (if Γ(i) ≈_M ∆(j) then Γ(i) ≡ ∆(j)). Hence the function that takes the equivalence class under ≈_M of each feasible sequence Γ(i) to [Γ(i)] induces a mapping h (an epimorphism, in fact) from the reachable subset of the states of M onto E.

To establish Fact 4, it must be shown that in the case of D′ the mapping can be reversed, i.e. that each equivalence class [Γ(i)] in D_c corresponds to exactly one element of D′. Suppose not, for contradiction. Then there exist sequences Γ1(i) and Γ2(j) in D̃(s_0) such that Γ1(i) ≡ Γ2(j) but, on following the two sequences from s′_0, we arrive at two different elements s′_1 and s′_2 of D′ with τ(s′_1) = Γ1_i = Γ2_j = τ(s′_2) but with µ(s′_1) ≠ µ(s′_2). Therefore there exists a sequence ∆(k) ∈ D̃(s) such that R(Γ1(i − 1); ∆(k)) ≠ R(Γ2(j − 1); ∆(k)). But this contradicts the condition for Γ1(i) ≡ Γ2(j).

Theorem 3 follows immediately from facts 1–4.

Theorem 4 Let D′ be the translation of D as in Definition 5. D′ is a blind minimal equivalent MDP for D.
Proof: Reachability of all the e-states is obvious, as they are constructed only when reached. Each e-state is a pair ⟨s, φ⟩ where s is a state of D and φ is a reward function specification. In fact, s = τ(⟨s, φ⟩) and φ determines a distribution of rewards over all continuations of the sequences that reach ⟨s, φ⟩. That is, for all ∆ in S* such that ∆_0 = s, the reward for ∆ is Σ_{(f : r) ∈ φ} {r | ∆ ∈ B_f}. If D′ is not blind minimal, then there exist distinct e-states ⟨s, φ⟩ and ⟨s, φ′⟩ for which this sum is the same for all ∆. But this makes φ and φ′ semantically equivalent, contradicting the supposition that they are distinct.

Appendix C. Random Problem Domains

Random problem domains are produced by first creating a random action specification defining the domain dynamics. Some of the experiments we conducted²⁰ also involved producing, in a second step, a random reward specification that had desired properties in relation to the generated dynamics.

The random generation of the domain dynamics takes as parameters the number n of propositions in the domain and the number of actions to be produced, and starts by assigning some effects to each action such that each proposition is affected by exactly one action. For example, if we have 5 actions and 14 propositions, the first 4 actions may affect 3 propositions each, the 5th only 2, and the affected propositions are all different. Once each action has some initial effects, we continue to add more effects one at a time, until a sufficient proportion of the state space is reachable (see the "proportion reachable" parameter below).
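As a concrete illustration of this first step, the even partition of propositions among actions might look as follows. This is a sketch only: the function name and the round-robin dealing strategy are assumptions for illustration, not the actual nmrdpp generator.

```python
import random

def assign_initial_effects(n_props, n_actions, rng=None):
    """Partition propositions among actions so that each proposition
    is affected by exactly one action, as evenly as possible."""
    rng = rng or random.Random(0)
    props = list(range(n_props))
    rng.shuffle(props)
    effects = {a: [] for a in range(n_actions)}
    # Deal the shuffled propositions out round-robin: with 5 actions
    # and 14 propositions, four actions get 3 propositions and one gets 2.
    for idx, p in enumerate(props):
        effects[idx % n_actions].append(p)
    return effects

effects = assign_initial_effects(14, 5)
assert sorted(p for ps in effects.values() for p in ps) == list(range(14))
assert all(len(ps) in (2, 3) for ps in effects.values())
```

Any dealing scheme works here, as long as every proposition is assigned to exactly one action; further effects are then layered on top as described next.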
Each additional effect is generated by picking a random action and a random proposition, and producing a random decision diagram according to the "uncertainty" and "structure" parameters below.

The Uncertainty parameter is the probability of a leaf node carrying a value other than zero or one. An uncertainty of 1 will result in all leaf nodes having random values drawn from a uniform distribution; an uncertainty of 0 will result in all leaf nodes having the values 0 or 1 with equal probability.

The Structure (or influence) parameter is the probability of a decision diagram containing a particular proposition. An influence of 1 will result in all decision diagrams including all propositions (and thus very unlikely to have significant structure), while 0 will result in decision diagrams that do not depend on the values of propositions at all.

The Proportion Reachable parameter is a lower bound on the proportion of the entire 2^n state space that is reachable from the start state. The algorithm adds behaviour until this lower bound is reached. A value of 1 will result in the algorithm running until the actions are sufficient to allow the entire state space to be reachable.

A reward specification can be produced with regard to the generated dynamics such that a specified number of the rewards are reachable and a specified number are unreachable. First, a decision diagram is produced to represent which states are reachable and which are not, given the domain dynamics. Next, a random path is taken from the root of this decision diagram to a true terminal if we are generating an attainable reward, or to a false terminal if we are producing an unattainable reward.

20. None of those are included in this paper, however.
The propositions encountered on this path, both negated and not, form a conjunction that is the reward formula. This process is repeated until the desired number of reachable and unreachable rewards is obtained.
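The path-to-terminal construction can be sketched with a toy decision-diagram representation. The Node class and both function names below are illustrative stand-ins (the actual system manipulates decision diagrams through a dedicated package such as CUDD); the sketch only shows how following a random root-to-terminal path yields a conjunction of literals.

```python
import random

class Node:
    """A toy decision-diagram node testing one proposition; `lo`/`hi`
    are child Nodes or the Boolean terminals True/False."""
    def __init__(self, prop, lo, hi):
        self.prop, self.lo, self.hi = prop, lo, hi

def reaches(node, target):
    """True iff the `target` terminal occurs somewhere below `node`."""
    if not isinstance(node, Node):
        return node is target
    return reaches(node.lo, target) or reaches(node.hi, target)

def random_path_to(root, target, rng=None):
    """Walk randomly from the root to a `target` terminal (True for an
    attainable reward, False for an unattainable one), collecting the
    (proposition, polarity) literals tested along the way. Their
    conjunction is the generated reward formula."""
    rng = rng or random.Random(1)
    literals, node = [], root
    while isinstance(node, Node):
        # Only follow branches from which the target terminal is reachable.
        options = [b for b in (False, True)
                   if reaches(node.hi if b else node.lo, target)]
        branch = rng.choice(options)
        literals.append((node.prop, branch))
        node = node.hi if branch else node.lo
    return literals

# Reachability diagram for a toy 2-proposition domain: a state is
# reachable iff p0 is true and p1 is false.
dd = Node("p0", False, Node("p1", True, False))
assert random_path_to(dd, True) == [("p0", True), ("p1", False)]
```

Taking target=False instead yields a conjunction satisfied only by unreachable states, i.e. an unattainable reward formula.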