Decision-Theoretic Planning: Structural Assumptions and Computational Leverage
Journal of Artificial Intelligence Research 11 (1999) 1-94. Submitted 09/98; published 07/99.

Craig Boutilier (cebly@cs.ubc.ca), Department of Computer Science, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
Thomas Dean (tld@cs.brown.edu), Department of Computer Science, Brown University, Box 1910, Providence, RI, 02912, USA
Steve Hanks (hanks@cs.washington.edu), Department of Computer Science and Engineering, University of Washington, Seattle, WA, 98195, USA

Abstract

Planning under uncertainty is a central problem in the study of automated sequential decision making, and has been addressed by researchers in many different fields, including AI planning, decision analysis, operations research, control theory and economics. While the assumptions and perspectives adopted in these areas often differ in substantial ways, many planning problems of interest to researchers in these fields can be modeled as Markov decision processes (MDPs) and analyzed using the techniques of decision theory. This paper presents an overview and synthesis of MDP-related methods, showing how they provide a unifying framework for modeling many classes of planning problems studied in AI. It also describes structural properties of MDPs that, when exhibited by particular classes of problems, can be exploited in the construction of optimal or approximately optimal policies or plans. Planning problems commonly possess structure in the reward and value functions used to describe performance criteria, in the functions used to describe state transitions and observations, and in the relationships among features used to describe states, actions, rewards, and observations. Specialized representations, and algorithms employing these representations, can achieve computational leverage by exploiting these various forms of structure.
Certain AI techniques, in particular those based on the use of structured, intensional representations, can be viewed in this way. This paper surveys several types of representations for both classical and decision-theoretic planning problems, and planning algorithms that exploit these representations in a number of different ways to ease the computational burden of constructing policies or plans. It focuses primarily on abstraction, aggregation and decomposition techniques based on AI-style representations.

1. Introduction

Planning using decision-theoretic notions to represent domain uncertainty and plan quality has recently drawn considerable attention in artificial intelligence (AI).[1] Decision-theoretic planning (DTP) is an attractive extension of the classical AI planning paradigm because it allows one to model problems in which actions have uncertain effects, in which the decision maker has incomplete information about the world, in which factors such as resource consumption lead to solutions of varying quality, and in which there may not be an absolute or well-defined "goal" state. Roughly, the aim of DTP is to form courses of action (plans or policies) that have high expected utility, rather than plans that are guaranteed to achieve certain goals. When AI planning is viewed as a particular approach to solving sequential decision problems of this type, the connections between DTP and models used in other fields of research, such as decision analysis, economics and operations research (OR), become more apparent.

[1] See, for example, the recent texts (Dean, Allen, & Aloimonos, 1995; Dean & Wellman, 1991; Russell & Norvig, 1995) and the research reported in (Hanks, Russell, & Wellman, 1994).

© 1999 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.
At a conceptual level, most sequential decision problems can be viewed as instances of Markov decision processes (MDPs), and we will use the MDP framework to make the connections explicit. Much recent research on DTP has explicitly adopted the MDP framework as an underlying model (Barto, Bradtke, & Singh, 1995; Boutilier & Dearden, 1994; Boutilier, Dearden, & Goldszmidt, 1995; Dean, Kaelbling, Kirman, & Nicholson, 1993; Koenig, 1991; Simmons & Koenig, 1995; Tash & Russell, 1994), allowing existing results and algorithms for solving MDPs (e.g., from the field of OR) to be adapted and applied to planning problems. In doing so, however, this work has departed from the traditional definition of the "planning problem" in the AI planning community; one goal of this paper is to make explicit the connection between these two lines of work.

Adopting the MDP framework as a model for posing and solving planning problems has illuminated a number of interesting connections among techniques for solving decision problems, drawing on work from AI planning, reasoning under uncertainty, decision analysis and OR. One of the most interesting insights to emerge from this body of work is that many DTP problems exhibit considerable structure, and thus can be solved using special-purpose methods that recognize and exploit that structure. In particular, the use of feature-based representations to describe problems, as is the typical practice in AI, often highlights the problem's special structure and allows it to be exploited computationally with little effort.

There are two general impediments to the more widespread acceptance of MDPs within AI as a general model of planning. The first is the absence of explanations of the MDP model that make the connections to current planning research explicit, at either the conceptual or computational level.
This may be due in large part to the fact that MDPs have been developed and studied primarily in OR, where the dominant concerns are, naturally, rather different. One aim of this paper is to make the connections clear: we provide a brief description of MDPs as a conceptual model for planning that emphasizes the connection to AI planning, and explore the relationship between MDP solution algorithms and AI planning algorithms. In particular, we emphasize that most AI planning models can be viewed as special cases of MDPs, and that classical planning algorithms have been designed to exploit the problem characteristics associated with these cases.

The second impediment is skepticism among AI researchers regarding the computational adequacy of MDPs as a planning model: can the techniques scale to solve planning problems of reasonable size? One difficulty with solution techniques for MDPs is the tendency to rely on explicit, state-based problem formulations. This can be problematic in AI planning, since state spaces grow exponentially with the number of problem features. State space size and dimensionality are of somewhat lesser concern in OR and decision analysis. In these fields, an operations researcher or decision analyst will often hand-craft a model that ignores certain problem features deemed irrelevant, or will define other features that summarize a wide class of problem states. In AI, the emphasis is on the automatic solution of problems posed by users who lack the expertise of a decision analyst. Thus, assuming a well-crafted, compact state space is often not appropriate. In this paper we show how specialized representations and algorithms from AI planning and problem solving can be used to design efficient MDP solution techniques.
In particular, AI planning methods assume a certain structure in the state space, in the actions (or operators), and in the specification of a goal or other success criteria. Representations and algorithms have been designed that make the problem structure explicit and exploit that structure to solve problems effectively. We demonstrate how this same process of identifying structure, making it explicit, and exploiting it algorithmically can be brought to bear in the solution of MDPs.

This paper has several objectives. First, it provides an overview of DTP and MDPs suitable for readers familiar with traditional AI planning methods and makes connections with this work. Second, it describes the types of structure that can be exploited and how AI representations and methods facilitate computationally effective planning with MDPs. As such, it is a suitable introduction to AI methods for those familiar with the classical presentation of MDPs. Finally, it surveys recent work on the use of MDPs in AI and suggests directions for further research in this regard, and should therefore be of interest to researchers in DTP.

1.1 General Problem Definition

Roughly speaking, the class of problems we consider are those involving systems whose dynamics can be modeled as stochastic processes, where the actions of the decision maker, referred to here as the agent, can influence the system's behavior. The system's current state and the choice of action jointly determine a probability distribution over the system's possible next states. The agent prefers to be in certain system states (e.g., goal states) over others, and therefore must determine a course of action, also called a "plan" or "policy" in this paper, that is likely to lead to these target states, possibly avoiding undesirable states along the way.
The agent may not know the system's state exactly in making its decision on how to act, however; it may have to rely on incomplete and noisy sensors and be forced to base its choice of action on a probabilistic estimate of the state.

To help illustrate the types of problems in which we are interested, consider the following example. Imagine that we have a robot agent designed to help someone (the "user") in an office environment (see Figure 1). There are three activities it might undertake: picking up the user's mail, getting coffee, or tidying up the user's research lab. The robot can move from location to location and perform various actions that tend to achieve certain target states (e.g., bringing coffee to the user on demand, or maintaining a minimal level of tidiness in the lab). We might associate a certain level of uncertainty with the effects of the robot's actions (e.g., when it tries to move to an adjacent location it might succeed 90% of the time and fail to move at all the other 10% of the time). The robot might have incomplete access to the true state of the system in that its sensors might supply it with incomplete information (it cannot tell whether mail is available for pickup if it is not in the mail room) and incorrect information (even when in the mail room its sensors occasionally fail to detect the presence of mail).

Figure 1: A decision-theoretic planning problem (locations: Hallway, My Office, Coffee, Mailroom, Lab).

Finally, the performance of the robot might be measured in various ways: do its actions guarantee that a goal will be achieved? Do they maximize some objective function defined over possible effects of its actions? Do they achieve a goal state with sufficient probability while avoiding "disastrous" states with near certainty? The stipulation of optimal or acceptable behavior is an important part of the problem specification.
The types of problems that can be captured using this general framework include classical (goal-oriented, deterministic, complete knowledge) planning problems and extensions such as conditional and probabilistic planning problems, as well as other more general problem formulations.

The discussion to this point has assumed an extensional representation of the system's states, one in which each state is explicitly named. In AI research, intensional representations are more common. An intensional representation is one in which states or sets of states are described using sets of multi-valued features. The choice of an appropriate set of features is an important part of the problem design. These features might include the current location of the robot, the presence or absence of mail, and so on. The performance metric is also typically expressed intensionally. Figure 2 serves as a reference for our example problem, which we use throughout the paper. It lists the basic features used to describe the states of the system, the actions available to the robot and the exogenous events that might occur, together with an intuitive description of the features, actions and events.

The remainder of the paper is organized as follows. In Section 2, we present the MDP framework in the abstract, introducing basic concepts and terminology and noting the relationship between this abstract model and the classical AI planning problem. Section 3 surveys common solution techniques (algorithms based on dynamic programming for general MDP problems and search algorithms for planning problems) and points out the relationship between problem assumptions and solution techniques. Section 4 turns from algorithms to representations, showing various ways in which the structured representations commonly used by AI algorithms can be used to represent MDPs compactly as well.
Figure 2: Elements of the robot domain.

Features         | Denoted     | Description
Location         | Loc(M), etc.| Location of robot. Five possible locations: mailroom (M), coffee room (C), user's office (O), hallway (H), laboratory (L)
Tidiness         | T(0), etc.  | Degree of lab tidiness. Five possible values: from 0 (messiest) to 4 (tidiest)
Mail present     | M, ¬M       | Is there mail in the user's mailbox? True (M) or False (¬M)
Robot has mail   | RHM, ¬RHM   | Does the robot have mail in its possession?
Coffee request   | CR, ¬CR     | Is there an outstanding (unfulfilled) request for coffee by the user?
Robot has coffee | RHC, ¬RHC   | Does the robot have coffee in its possession?

Actions          | Denoted | Description
Move clockwise   | Clk     | Move to adjacent location (clockwise direction)
Counterclockwise | CClk    | Move to adjacent location (counterclockwise direction)
Tidy lab         | Tidy    | If the robot is in the lab, the degree of tidiness is increased by 1
Pickup mail      | PUM     | If the robot is in the mailroom and there is mail present, the robot takes the mail (RHM becomes true and M becomes false)
Get coffee       | GetC    | If the robot is in the coffee room, it gets coffee (RHC becomes true)
Deliver mail     | DelM    | If the robot is in the office and has mail, it hands the mail to the user (RHM becomes false)
Deliver coffee   | DelC    | If the robot is in the office and has coffee, it hands the coffee to the user (RHC and CR both become false)

Events           | Denoted | Description
Mail arrival     | ArrM    | Mail arrives, causing M to become true
Request coffee   | ReqC    | User issues coffee request, causing CR to become true
Untidy the lab   | Mess    | The lab becomes messier (one degree less tidy)

Section 5 surveys some recent work on abstraction, aggregation and problem decomposition methods, and shows the connection to more traditional AI methods such as goal regression.
This last section demonstrates that representational and computational methods from AI planning can be used in the solution of general MDPs. Section 5 also points out additional ways in which this type of computational leverage might be developed in the future.

2. Markov Decision Processes: Basic Problem Formulation

In this section we introduce the MDP framework and make explicit the relationship between this model and classical AI planning models. We are interested in controlling a stochastic dynamical system: a system that at any point in time can be in one of a number of distinct states, and in which the system's state changes over time in response to events. An action is a particular kind of event instigated by an agent in order to change the system's state. We assume that the agent has control over what actions are taken and when, though the effects of taking an action might not be perfectly predictable. In contrast, exogenous events are not under the agent's control, and their occurrence may be only partially predictable. This abstract view of an agent is consistent both with the "AI" view, where the agent is an autonomous decision maker, and the "control" view, where a policy is determined ahead of time, programmed into a device, and executed without further deliberation.

2.1 States and State Transitions

We define a state to be a description of the system at a particular point in time. How one defines states can vary with particular applications, some notions being more natural than others. However, it is common to assume that the state captures all information relevant to the agent's decision-making process. We assume a finite state space S = {s_1, ..., s_N} of possible system states.[2]
In most cases the agent will not have complete information about the current state; this uncertainty or incomplete information can be captured using a probability distribution over the states in S.

A discrete-time stochastic dynamical system consists of a state space and probability distributions governing possible state transitions: how the next state of the system depends on past states. These distributions constitute a model of how the system evolves over time in response to actions and exogenous events, reflecting the fact that the effects of actions and events may not be perfectly predictable even if the prevailing state is known. Although we are generally concerned with how the agent chooses an appropriate course of action, for the remainder of this section we assume that the agent's course of action is fixed, concentrating on the problem of predicting the system's state after the occurrence of a predetermined sequence of actions. We discuss the action selection problem in the next section.

We assume the system evolves in stages, where the occurrence of an event marks the transition from one stage t to the next stage t+1. Since events define changes in stage, and since events often (but not necessarily) cause state transitions, we often equate stage transitions with state transitions. Of course, it is possible for an event to occur but leave the system in the same state. The system's progression through stages is roughly analogous to the passage of time. The two are identical if we assume that some action (possibly a no-op) is taken at each stage, and that every action takes unit time to complete. We can thus speak loosely as if stages correspond to units of time, and we refer to T interchangeably as the set of all stages and the set of all time points.[3]
We can model uncertainty by regarding the system's state at some stage t as a random variable S_t that takes values from S. An assumption of "forward causality" requires that the variable S_t does not depend directly on the value of any future variable S_k (k > t). Roughly, it requires that we model our system such that the past history "directly" determines the distribution over current states, whereas knowledge of future states can influence the estimate of the current state only indirectly, by providing evidence on what the current state may have been so as to lead to these future states. Figure 3(a) shows a graphical perspective on a discrete-time, stochastic dynamical system. The nodes are random variables denoting the state at a particular time, and the arcs indicate the direct probabilistic dependence of states on previous states. To describe this system completely we must also supply the conditional distributions Pr(S_t | S_0, S_1, ..., S_{t-1}) for all times t.

States should be thought of as descriptions of the system being modeled, so the question arises of how much detail about the system is captured in a state description. More detail implies more information about the system, which in turn often translates into better predictions of future behavior.

[2] Most of the discussion in this paper also applies to cases where the state space is countably infinite. See (Puterman, 1994) for a discussion of infinite and continuous-state problems.
[3] While we do not deal with such topics here, there is a considerable literature in the OR community on continuous-time Markov decision processes (Puterman, 1994).

Figure 3: A general stochastic process (a), a Markov chain (b), and a stationary Markov chain (c).
Of course, more detail also implies a larger set S, which can increase the computational cost of decision making.

It is commonly assumed that a state contains enough information to predict the next state. In other words, any information about the history of the system relevant to predicting its future is captured explicitly in the state itself. Formally, this assumption, the Markov assumption, says that knowledge of the present state renders information about the past irrelevant to making predictions about the future:

    Pr(S_{t+1} | S_t, S_{t-1}, ..., S_0) = Pr(S_{t+1} | S_t)

Markovian models can be represented graphically using a structure like that in Figure 3(b), reflecting the fact that the present state is sufficient to predict future state evolution.[4] Finally, it is common to assume that the effects of an event depend only on the prevailing state, and not the stage or time at which the event occurs.[5] If the distribution predicting the next state is the same regardless of stage, the model is said to be stationary and can be represented schematically using just two stages, as in Figure 3(c). In this case only a single conditional distribution is required. In this paper we generally restrict our attention to discrete-time, finite-state, stochastic dynamical systems with the Markov property, commonly called Markov chains. Furthermore, most of our discussion is restricted to stationary chains.

To complete the model we must provide a probability distribution over initial states, reflecting the probability of being in any state at stage 0. This distribution can be represented as a real-valued (row) vector of size N = |S| (one entry for each state). We denote this vector P_0 and use p_0^i to denote its i-th entry, that is, the probability of starting in state s_i.

We can represent a T-stage nonstationary Markov chain with T transition matrices, each of size N x N, where matrix P^t captures the transition probabilities governing the system as it moves from stage t to stage t+1. Each matrix consists of probabilities p_ij^t, where p_ij^t = Pr(S_{t+1} = s_j | S_t = s_i). If the process is stationary, the transition matrix is the same at all stages and one matrix (whose entries are denoted p_ij) will suffice. Given an initial distribution over states P_0, the probability distribution over states after n stages is P_0 P^0 P^1 ... P^{n-1}, which reduces to P_0 P^n in the stationary case.

A stationary Markov process can also be represented using a state-transition diagram as in Figure 4. Here nodes correspond to particular states and the stage is not represented explicitly. Arcs denote possible transitions (those with non-zero probability) and are labeled with the transition probabilities p_ij = Pr(S_{t+1} = s_j | S_t = s_i); the arc from node i to node j is labeled with p_ij if p_ij > 0.[6] The size of such a diagram is at least O(N) and at most O(N^2), depending on the number of arcs. This is a useful representation when the transition graph is relatively sparse, for example, when most states have immediate transitions to only a few neighbors.

Figure 4: A state-transition diagram (seven states with arcs labeled by transition probabilities).

[4] It is worth mentioning that the Markov property applies to the particular model and not to the system itself. Indeed, any non-Markovian model of a system (of finite order, i.e., whose dynamics depend on at most the k previous states for some k) can be converted to an equivalent though larger Markov model. In control theory, this is called conversion to state form (Luenberger, 1979).
[5] Of course, this is also a statement about model detail, saying that the state carries enough information to make the stage irrelevant to predicting transitions.
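The vector-matrix mechanics above are easy to see in code. The following minimal sketch (the three-state chain and its probabilities are invented for this illustration, not taken from the paper's example) computes the n-stage distribution P_0 P^n by repeated vector-matrix products and samples a single trajectory from the chain:

```python
import random

# Illustrative stationary Markov chain with 3 states; the numbers below
# are made up for this sketch.
P = [
    [0.5, 0.5, 0.0],
    [0.0, 0.8, 0.2],
    [0.0, 0.0, 1.0],   # state 2 transitions only to itself (absorbing)
]
P0 = [1.0, 0.0, 0.0]   # initial distribution: start in state 0

def step(dist, P):
    """One stage of evolution: the row vector `dist` times matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def distribution_after(P0, P, n):
    """Distribution over states after n stages of a stationary chain: P0 P^n."""
    dist = P0
    for _ in range(n):
        dist = step(dist, P)
    return dist

def sample_trajectory(P, start, n, seed=0):
    """Sample one length-n trajectory from the chain."""
    rng = random.Random(seed)
    state, traj = start, [start]
    for _ in range(n):
        state = rng.choices(range(len(P)), weights=P[state])[0]
        traj.append(state)
    return traj

print(distribution_after(P0, P, 4))   # a proper distribution (sums to 1)
print(sample_trajectory(P, 0, 4))     # e.g., a path that may reach state 2
```

The same loop handles a nonstationary chain by supplying a different matrix at each call to `step`.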
Example 2.1. To illustrate these notions, imagine that the robot in Figure 1 is executing the policy of moving counterclockwise repeatedly. We restrict our attention to two variables, location Loc and presence of mail M, giving a state space of size 10. We suppose that the robot always moves to the adjacent location with probability 1.0. In addition, mail can arrive at the mailroom with probability 0.2 at any time (independent of the robot's location), causing the variable M to become true. Once M becomes true, the robot cannot move to a state where M is false, since the action of moving does not influence the presence of mail. The state-transition diagram for this example is illustrated in Figure 5. The transition matrix is also shown.

[6] It is important to note that the nodes here do not represent random variables as in the earlier figures.

The structure of a Markov chain is occasionally of interest to us in planning. A subset C ⊆ S is closed if p_ij = 0 for all i ∈ C and j ∉ C. It is a proper closed set if no proper subset of C enjoys this property. We sometimes refer to proper closed sets as recurrent classes of states. If a closed set consists of a single state, then that state is called an absorbing state. Once an agent enters a closed set or absorbing state, it remains there
forever with probability 1. In the example above (Figure 5), the set of states where M holds forms a recurrent class. There are no absorbing states in the example, but should we program the robot to stay put whenever it is in the state ⟨M, Loc(O)⟩, then this would be an absorbing state in the altered chain. Finally, we say a state is transient if it does not belong to a recurrent class. In Figure 5, each state where ¬M holds is transient: eventually (with probability 1), the agent leaves the state and never returns, since there is no way to remove mail once it arrives.

Figure 5: The state-transition diagram and transition matrix for a moving robot. States s1-s5 are the five locations (in counterclockwise order) with no mail present; states s6-s10 are the same locations with mail present.

        s1   s2   s3   s4   s5   s6   s7   s8   s9   s10
  s1   0.0  0.8  0.0  0.0  0.0  0.0  0.2  0.0  0.0  0.0
  s2   0.0  0.0  0.8  0.0  0.0  0.0  0.0  0.2  0.0  0.0
  s3   0.0  0.0  0.0  0.8  0.0  0.0  0.0  0.0  0.2  0.0
  s4   0.0  0.0  0.0  0.0  0.8  0.0  0.0  0.0  0.0  0.2
  s5   0.8  0.0  0.0  0.0  0.0  0.2  0.0  0.0  0.0  0.0
  s6   0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
  s7   0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0
  s8   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
  s9   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0
  s10  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0

2.2 Actions

Markov chains can be used to describe the evolution of a stochastic system, but they do not capture the fact that an agent can choose to perform actions that alter the state of the system. A key element of MDPs is the set of actions available to the decision maker. When an action is performed in a particular state, the state changes stochastically in response to the action.
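These structural notions are mechanically checkable. The sketch below rebuilds the Figure 5 chain (using 0-based indices of our own choosing: states 0-4 for the five locations with no mail, 5-9 for the same locations with mail; `is_closed` is a helper introduced here, not from the paper) and tests sets of states for closedness:

```python
# Counterclockwise-moving robot from Figure 5: move succeeds with certainty,
# mail arrives with probability 0.2 per stage and can never be removed.
N = 10
P = [[0.0] * N for _ in range(N)]
for i in range(5):
    j = (i + 1) % 5            # next location, counterclockwise
    P[i][j] = 0.8              # move, no mail arrives
    P[i][j + 5] = 0.2          # move, mail arrives
    P[i + 5][j + 5] = 1.0      # mail already present: it stays present

def is_closed(C, P, eps=1e-12):
    """C is closed if no state in C has positive probability of leaving C."""
    return all(P[i][j] <= eps for i in C for j in range(len(P)) if j not in C)

mail_states = set(range(5, 10))
print(is_closed(mail_states, P))        # True: the mail states form a closed set
print(is_closed(set(range(5)), P))      # False: mail arrival leaves the no-mail states

# An absorbing state is a closed set of size one; this chain has none.
print([s for s in range(N) if is_closed({s}, P)])   # []
```

Detecting the recurrent classes themselves amounts to finding the bottom strongly connected components of the transition graph, which standard graph algorithms handle.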
We assume that the agent takes some action at each stage of the process, and then the system changes state accordingly. At each stage t of the process and each state s, the agent has available a set of actions A_s^t. This is called the feasible set for s at stage t. To describe the effects of a ∈ A_s^t, we must supply the state-transition distribution Pr(S_{t+1} | S_t = s, A_t = a) for all actions a, states s, and stages t. Unlike the case of a Markov chain, the terms Pr(S_{t+1} | S_t = s, A_t = a) are not true conditional distributions, but rather a family of distributions parameterized by S_t and A_t, since the probability of A_t is not part of the model. We retain this notation, however, for its suggestive nature.

We often assume that the feasible set of actions is the same for all stages and states, in which case the set of actions is A = {a_1, ..., a_K} and each can be executed at any time. This contrasts with the AI planning practice of assigning preconditions to actions, defining the states in which they can meaningfully be executed. Our model takes the view that any action can be executed (or "attempted") in any state. If the action has no effect when executed in some state, or its execution leads to disastrous effects, this can be noted in the action's transition matrix. Action preconditions are often a computational convenience rather than a representational necessity: they can make the planning process more efficient by identifying states in which the planner should not even consider selecting that action.
Preconditions can be represented in MDPs by relaxing the assumption that the set of feasible actions is the same for all states. To illustrate planning concepts below, however, we sometimes assume actions do have preconditions.

We again restrict our attention to stationary processes, which in this case means that the effects of each action depend only on the state and not on the stage. Our transition matrices thus take the form p_ij^k = Pr(S_{t+1} = s_j | S_t = s_i, A_t = a_k), capturing the probability that the system moves to state s_j when a_k is executed in state s_i. In stationary models an action is fully described by a single N x N transition matrix P_k. It is important to note that the transition matrix for an action includes not only the direct effects of executing the action but also the effects of any exogenous events that might occur at the same stage.[7]

Figure 6: The transition matrix for Clk and the induced transition diagram for a two-action policy.

Example 2.2. The example in Figure 5 can be extended so the agent has two available actions: moving clockwise and moving counterclockwise.
The transition matrix for CClk (with the assumption that mail arrives with probability 0.2) is shown in Figure 5. The matrix for Clk appears on the left in Figure 6. Suppose the agent fixes its behavior so that it moves clockwise in locations M and C and counterclockwise in locations H, O and L (we address below how the agent might come to know its location so that it can actually implement this behavior). This defines the Markov chain illustrated in the transition diagram on the right in Figure 6. □

2.3 Exogenous Events

Exogenous events are those events that stochastically cause state transitions, much like actions, but beyond the control of the decision maker. These might correspond to the evolution of a natural process or the action of another agent. Notice that the effect of the action CClk in Figure 5 "combines" the effects of the robot's action with that of the exogenous event of mail arrival: state-transition probabilities incorporate both the motion of the robot (causing a change in location) and the possible change in mail status due to mail arrival. For the purposes of decision making, it is precisely this combined effect that is important when predicting the distribution over possible states resulting when an action is taken. We call such models of actions implicit-event models, since the effects of the exogenous event are folded into the transition probabilities associated with the action. However, it is often natural to view these transitions as comprised of these two separate events, each having its own effect on the state.

7. It is possible to assess the effects of actions and exogenous events separately, then combine them into a single transition matrix in certain cases (Boutilier & Puterman, 1995). We discuss this later in this section.
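Returning to Example 2.2: fixing one action per state collapses the per-action transition matrices into a single Markov chain, row by row. A minimal sketch in Python (the two 3-state matrices and the policy below are hypothetical illustrations, not the Clk/CClk matrices of the figures):

```python
# Sketch: fixing an action choice per state turns an MDP into a Markov chain.
# The two 3-state action matrices here are hypothetical, not the running example.

P = {
    "a": [[0.0, 0.8, 0.2],
          [0.0, 1.0, 0.0],
          [1.0, 0.0, 0.0]],
    "b": [[1.0, 0.0, 0.0],
          [0.5, 0.0, 0.5],
          [0.0, 0.0, 1.0]],
}

policy = {0: "a", 1: "b", 2: "a"}  # state index -> chosen action

# Row i of the induced chain is row i of the chosen action's matrix.
chain = [P[policy[i]][i] for i in range(3)]

# Each row of the induced chain is still a probability distribution.
for row in chain:
    assert abs(sum(row) - 1.0) < 1e-9

print(chain[1])  # row for state 1, taken from action "b"
```

This row-selection construction is exactly how the two-action policy of Example 2.2 induces the transition diagram on the right of Figure 6.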
More generally, we often think of transitions as determined by the effects of the agent's chosen action and those of certain exogenous events beyond the agent's control, each of which may occur with a certain probability. When the effects of actions are decomposed in this fashion, we call the action model an explicit-event model. Specifying a transition function for an action and zero or more exogenous events is not generally easy, for actions and events can interact in complex ways. For instance, consider specifying the effect of action PUM (pickup mail) at a state where no mail is present but there is the possibility of "simultaneous" mail arrival (i.e., during the "same unit" of discrete time). If the event ArrM occurs, does the robot obtain the newly arrived mail, or does the mail remain in the mailbox? Intuitively, this depends on whether the mail arrived before or after the pickup was completed (albeit within the same time quantum). The state transition in this case can be viewed as the composition of two transitions, where the precise description of the composition depends on the ordering of the agent's action and the exogenous event. If mail arrives first, the transition might be s → s′ → s″, where s′ is a state where mail is waiting and s″ is a state where no mail is waiting and the robot is holding mail; but if the pickup action is completed first, the transition would be s → s → s′ (i.e., PUM has no effect, then mail arrives and remains in the box). The picture is more complicated if the actions and events can truly occur simultaneously over some interval; in this case the resulting transition need not be a composition of the individual transitions. As an example, if the robot lifts the side of a table on which a glass of water is situated, the water will spill; similarly if an exogenous event causes the other side to be raised.
But if the action and event occur simultaneously, the result is qualitatively different (the water is not spilled). Thus, the "interleaving" semantics described above is not always appropriate. Because of such complications, modeling exogenous events and their combination with actions or other events can be approached in many ways, depending on the modeling assumptions one is willing to make. Generally, we specify three types of information. First, we provide transition probabilities for all actions and events under the assumption that these occur in isolation; these are standard transition matrices. The transition matrix in Figure 5 can be decomposed into the two matrices shown in Figure 7, one for Clk and one for ArrM.8 Second, for each exogenous event, we must specify its probability of occurrence. Since this can vary with the state, we generally require a vector of length N indicating the probability of occurrence at each state. The occurrence vector for ArrM would be

  [0.2  0.2  0.2  0.2  0.2  0.0  0.0  0.0  0.0  0.0]

8. The fact that these individual matrices are deterministic is an artifact of the example. In general, the actions and events will each be represented using genuinely stochastic matrices.
Figure 7: The transition matrices for an action and exogenous event in an explicit-event model.

  Action Clk:
         s1   s2   s3   s4   s5   s6   s7   s8   s9   s10
   s1   0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
   s2   0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
   s3   0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
   s4   0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0
   s5   1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
   s6   0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
   s7   0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0
   s8   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
   s9   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0
   s10  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0

  Event ArrM:
         s1   s2   s3   s4   s5   s6   s7   s8   s9   s10
   s1   0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
   s2   0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
   s3   0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0
   s4   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
   s5   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0
   s6   0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
   s7   0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0
   s8   0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0
   s9   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0
   s10  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0

where we assume, for illustration, that mail arrives only when none is present.9 The final requirement is a combination function that describes how to "compose" the transitions of an action with any subset of event transitions. As indicated above, this can be very complex, sometimes almost unrelated to the individual action and event transitions. However, under certain assumptions combination functions can be specified reasonably concisely.
One way of modeling the composition of transitions is to assume an interleaving semantics of the type alluded to above. In this case, one needs to specify the probability that the action and events that take place occur in a specific order. For instance, one might assume that each event occurs at some time within the discrete time unit according to some continuous distribution (e.g., an exponential distribution with a given rate). With this information, the probability of any particular ordering of transitions, given that certain events occur, can be computed, as can the resulting distribution over possible next states. In the example above, the probabilities of the (composed) transitions s_1 → s_2 → s_3 and s_1 → s_1 → s_2 would be given by the probabilities with which mail arrived first or last, respectively. In certain cases, the probability of this ordering is not needed. To illustrate another combination function, assume that the action always occurs before the exogenous events. Furthermore, assume that events are commutative: (a) for any initial state s and any pair of events e_1 and e_2, the distribution that results from applying event sequence e_1 e_2 to s is identical to that obtained from the sequence e_2 e_1; and (b) the occurrence probabilities at intermediate states are identical. Intuitively, the set of events in our domain, ArrM, ReqC and Mess, has this property. Under these conditions the combined transition distribution for any action a is computed by considering the probability of any subset of events and applying that subset in any order to the distribution associated with a. Generally, we can construct an implicit-event model from the various components of the explicit-event model; thus, the "natural" specification can be converted to the form usually used by MDP solution algorithms.
Under the two assumptions above, for instance, we can form an implicit-event transition matrix Pr(s_i, a, s_j) for any action a, given the matrix Pr_a(s_i, s_j) for a (which assumes no event occurrences), the matrices Pr_e(s_i, s_j) for events e, and the occurrence vector Pr_e(s_i) for each event e.9 The effective transition matrix for event e is defined as follows:

  \widehat{\Pr}_e(s_i, s_j) = \Pr_e(s_i)\Pr_e(s_i, s_j) + \begin{cases} 1 - \Pr_e(s_i) & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}

This equation captures the event transition probabilities with the probability of event occurrence factored in. If we let E and E′ denote the diagonal matrices with entries E_{kk} = Pr_e(s_k) and E′_{kk} = 1 − Pr_e(s_k), then \widehat{\Pr}_e = E \Pr_e + E′. Under the assumptions above, the implicit-event matrix Pr(s_i, a, s_j) for action a is then given by

  \Pr = \widehat{\Pr}_{e_1} \cdots \widehat{\Pr}_{e_n} \Pr_a

for any ordering of the n possible events. Naturally, different procedures for constructing implicit-event matrices will be required given different assumptions about action and event interaction. Whether such implicit models are constructed or specified directly without explicit mention of the exogenous events, we will always assume unless stated otherwise that action transition matrices take into account the effects of exogenous events as well, and thus represent the agent's best information about what will happen if it takes a particular action.

9. The probabilities of different events may be correlated (possibly at particular states). If this is the case, then it is necessary to specify occurrence probabilities for subsets of events. We will treat event occurrence probabilities as independent for ease of exposition.
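The effective-matrix construction above can be sketched directly. In this sketch (plain Python lists; the 2-state event, occurrence vector, and identity action are hypothetical), effective() folds the occurrence probability into an event's transition matrix, and the implicit-event matrix is the product of the effective event matrices with the action matrix:

```python
# Sketch of the implicit-event construction under the action-first and
# commutativity assumptions above. The 2-state numbers are hypothetical.

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def effective(event_matrix, occ):
    """Effective event matrix: occ(i)*Pr_e(i,j), plus (1 - occ(i)) on the diagonal."""
    n = len(event_matrix)
    return [[occ[i] * event_matrix[i][j] + ((1 - occ[i]) if i == j else 0.0)
             for j in range(n)] for i in range(n)]

# Hypothetical event: flips state 0 to state 1, occurring with probability
# 0.2 in state 0 and never in state 1.
Pr_e = [[0.0, 1.0], [0.0, 1.0]]
occ = [0.2, 0.0]
Pr_a = [[1.0, 0.0], [0.0, 1.0]]  # identity action, so the product order is moot

Pr_e_hat = effective(Pr_e, occ)
implicit = matmul(Pr_e_hat, Pr_a)  # one effective event matrix times the action matrix
print(implicit)
```

With several commutative events, the last line would chain one matmul() per effective event matrix before multiplying by the action matrix, mirroring the displayed product.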
2.4 Observations

Although the effects of an action can depend on any aspect of the prevailing state, the choice of action can depend only on what the agent can observe about the current state and remember about its prior observations. We model the agent's observational or sensing capabilities by introducing a finite set of observations O = {o_1, ..., o_H}. The agent receives an observation from this set at each stage prior to choosing its action at that stage. We can model this observation as a random variable O^t whose value is taken from O. The probability that a particular O^t is generated can depend on: the state of the system at t−1; the action taken at t−1; and the state of the system at t, after taking the action at t−1 and after the effects of any exogenous events at t−1 are realized, but before the action at t is taken. We let Pr(O^t = o_h | S^{t−1} = s_i, A^{t−1} = a_k, S^t = s_j) be the probability that the agent observes o_h at stage t given that it performs a_k in state s_i and ends up in state s_j. As with actions, we assume that observational distributions are stationary (independent of the stage), using p^h_{ijk} = Pr(o_h | s_i, a_k, s_j) to denote this quantity. We can view the probabilistic dependencies among state, action and observation variables as a graph in which the time-indexed variables are shown as nodes and one variable is directly probabilistically dependent on another if there is an edge from the latter to the former; see Figure 8. This model allows a wide variety of assumptions about the agent's sensing capabilities. At one extreme are fully observable MDPs (FOMDPs), in which the agent knows exactly what state it is in at each stage t.
We model this case by letting O = S and setting

  \Pr(o_h \mid s_i, a_k, s_j) = \begin{cases} 1 & \text{if } o_h = s_j \\ 0 & \text{otherwise} \end{cases}

Figure 8: Graph showing the dependency relationships among states, actions and observations at different times (diagram not reproduced).

In the example above, this means the robot always knows its exact location and whether or not mail is waiting in the mailbox, even if it is not in the mailroom when the mail arrives. The agent thus receives perfect feedback about the results of its actions and the effects of exogenous events: it has noisy effectors but complete, noise-free, and "instantaneous" sensors. Most recent AI research that adopts the MDP framework explicitly assumes full observability. At the other extreme we might consider non-observable systems (NOMDPs) in which the agent receives no information about the system's state during execution. We can model this case by letting O = {o}. Here the same observation is reported at each stage, revealing no information about the state, so that Pr(s_j | s_i, a_k, o) = Pr(s_j | s_i, a_k). In these open-loop systems, the agent receives no useful feedback about the results of its actions: the agent has noisy effectors and no sensors. In this case an agent chooses its actions according to a plan consisting of a sequence of actions executed unconditionally. In effect, the agent is relying on its predictive model to determine good action choices before execution time. Traditionally, AI planning work has implicitly made the assumption of non-observability, often coupled with an omniscience assumption: that the agent knows the initial state with certainty, can predict the effects of its actions perfectly, and can precisely predict the occurrence of any exogenous events and their effects. Under these circumstances, the agent can predict the exact outcome of any plan, thus obviating the need for observation.
Such an agent can build a straight-line plan, a sequence of actions to be performed without feedback, that is as good as any plan whose execution might depend on information gathered at execution time. These two extremes are special cases of the general observation model described above, which allows the agent to receive incomplete or noisy information about the system state (i.e., partially observable MDPs, or POMDPs). For example, the robot might be able to determine its location exactly, but might not be able to determine whether mail arrives unless it is in the mailroom. Furthermore, its "mail" sensors might occasionally report inaccurately, leading to an incorrect belief as to whether there is mail waiting.

Example 2.3 Suppose the robot has a "checkmail" action that does not change the system state but generates an observation that is influenced by the presence of mail, provided the robot is in the mailroom at the time the action is performed. If the robot is not in the mailroom, the sensor always reports "no mail." A noisy "checkmail" sensor can be described by a probability distribution like the one shown in Figure 9. We can view these error probabilities as the probability of "false positives" (0.05) and "false negatives" (0.08). □

Figure 9: Observation probabilities for checking mailbox.

                  Pr(Obs = mail)   Pr(Obs = nomail)
   Loc(M), M          0.92              0.08
   Loc(M), ¬M         0.05              0.95
   ¬Loc(M), M         0.00              1.00
   ¬Loc(M), ¬M        0.00              1.00

2.5 System Trajectories and Observable Histories

We use the terms trajectory and history interchangeably to describe the system's behavior during the course of a problem-solving episode, or perhaps some initial segment thereof.
The complete system history is the sequence of states, actions, and observations generated from stage 0 to some time point of interest, and can be of finite or infinite length. Complete histories can be represented by a (possibly infinite) sequence of tuples of the form

  ⟨⟨S^0, O^0, A^0⟩, ⟨S^1, O^1, A^1⟩, ..., ⟨S^T, O^T, A^T⟩⟩

We can define two alternative notions of history that contain less complete information. For some arbitrary stage t we define the observable history as the sequence

  ⟨⟨O^0, A^0⟩, ..., ⟨O^{t−1}, A^{t−1}⟩⟩

where O^0 is the observation of the initial state. The observable history at stage t comprises all information available to the agent about its history when it chooses its action at stage t. A third type of trajectory is the system trajectory, which is the sequence

  ⟨⟨S^0, A^0⟩, ..., ⟨S^{t−1}, A^{t−1}⟩, S^t⟩

describing the system's behavior in "objective" terms, independent of the agent's particular view of the system. In evaluating an agent's performance, we will generally be interested in the system trajectory. An agent's policy must be defined in terms of the observable history, since the agent does not have access to the system trajectory, except in the fully observable case, when the two are equivalent.

2.6 Reward and Value

The problem facing the decision maker is to select an action to be performed at each stage of the decision problem, making this decision on the basis of the observable history. The agent still needs some way to judge the quality of a course of action. This is done by defining a value function V(·) mapping the set of system histories H_S into the reals; that is, V : H_S → ℝ.10 The agent prefers system history h to h′ just in case V(h) > V(h′).

Figure 10: Decision process with rewards and action costs (diagram not reproduced).
Thus, the agent judges its behavior to be good or bad depending on its effect on the underlying system trajectory. Generally, the agent cannot predict with certainty which system trajectory will occur, and can at best generate a probability distribution over the possible trajectories caused by its actions. In that case, it computes the expected value of each candidate course of action and chooses a policy that maximizes that quantity. Just as with system dynamics, specifying a value function over arbitrary trajectories can be cumbersome and unintuitive. It is therefore important to identify structure in the value function that can lead to a more parsimonious representation. Two assumptions about value functions commonly made in the MDP literature are time-separability and additivity. A time-separable value function is defined in terms of more primitive functions that can be applied to component states and actions. The reward function R : S → ℝ associates a reward with being in a state s. Costs can be assigned to taking actions by defining a cost function C : S × A → ℝ that associates a cost with performing an action a in state s. Rewards are added to the value function, while costs are subtracted.11 A value function is time-separable if it is a "simple combination" of the rewards and costs accrued at each stage. "Simple combination" means that value is taken to be a function of costs and rewards at each stage, where the costs and rewards can depend on the stage t, but the function that combines these must be independent of the stage; most commonly it is a linear combination or a product.12 A value function is additive if the combination function is a sum of the reward and cost function values accrued over the history's stages. The addition of rewards and action costs in a system with time-separable value can be viewed graphically as shown in Figure 10.
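An additive, time-separable value function of the kind just described can be sketched as a per-stage sum; the reward and cost tables below are hypothetical, not drawn from the running example:

```python
# Sketch: an additive, time-separable value function assembled from a
# reward function R on states and a cost function C on state-action pairs.
# R and C here are hypothetical illustrations.

R = {"s1": 10.0, "s2": 0.0}
C = {("s1", "stay"): 0.0, ("s1", "go"): 1.0, ("s2", "go"): 1.0}

def additive_value(history):
    """history: list of (state, action) pairs; value = sum over stages of R - C."""
    return sum(R[s] - C[(s, a)] for s, a in history)

print(additive_value([("s1", "stay"), ("s2", "go"), ("s1", "go")]))
```

Because the combining function (here, a plain sum) does not depend on the stage, the same code evaluates histories of any length, which is what makes time-separable criteria convenient computationally.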
The assumption of time-separability is restrictive. In our example, there might be certain goals involving temporal deadlines (have the workplace tidy as soon as possible after 9:00 tomorrow morning) and maintenance (do not allow mail to sit in the mailroom undelivered for more than 10 minutes) that require value functions that are non-separable given our current representation of the state. Note, however, that separability, like the Markov property, is a property of a particular representation. We could add additional information to the state in our example: the clock time, the interval of time between 9:00 and the time at which tidiness is achieved, the length of time mail sits in the mailroom before the robot picks it up, and so on. With this additional information we could re-establish a time-separable value function, but at the expense of an increase in the number of states and a more ad hoc and cumbersome action representation.13

10. Technically, the set of histories of interest also depends on the horizon chosen, as described below.

11. The term "reward" is somewhat of a misnomer in that the reward could be negative, in which case "penalty" might be a better word. Likewise, "costs" can be either positive (punitive) or negative (beneficial). Thus, they admit great flexibility in defining value functions.

12. See (Luenberger, 1973) for a more precise definition of time-separability.

2.7 Horizons and Success Criteria

In order to evaluate a particular course of action, we need to specify how long (in how many stages) it will be executed. This is known as the problem's horizon. In finite-horizon problems, the agent's performance is evaluated over a fixed, finite number of stages T.
Commonly, our aim is to maximize the total expected reward associated with a course of action; we therefore define the (finite-horizon) value of any length-T history h as (Bellman, 1957):

  V(h) = \sum_{t=0}^{T-1} \{R(s^t) - C(s^t, a^t)\} + R(s^T)

An infinite-horizon problem, on the other hand, requires that the agent's performance be evaluated over an infinite trajectory. In this case the total reward may be unbounded, meaning that any policy could be arbitrarily good or bad if it were executed for long enough. It may then be necessary to adopt a different means of evaluating a trajectory. The most common is to introduce a discount factor, ensuring that rewards or costs accrued at later stages are counted less than those accrued at earlier stages. The value function for an expected total discounted reward problem is defined as follows (Bellman, 1957; Howard, 1960):

  V(h) = \sum_{t=0}^{\infty} \beta^t (R(s^t) - C(s^t, a^t))

where β is a fixed discount rate (0 ≤ β < 1). This formulation is a particularly simple and elegant way to ensure a bounded measure of value over an infinite horizon, though it is important to verify that discounting is in fact appropriate. Economic justifications are often provided for discounted models: a reward earned sooner is worth more than one earned later, provided the reward can somehow be invested. Discounting can also be suitable for modeling a process that terminates with probability 1 − β at any point in time (e.g., a robot that can break down), in which case discounted models correspond to expected total reward over a finite but uncertain horizon. For these reasons, discounting is sometimes used for finite-horizon problems as well. Another technique for dealing with infinite-horizon problems is to evaluate a trajectory based on the average reward accrued per stage, or gain.
The gain of a history is defined as

  g(h) = \lim_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n} \{R(s^t) - C(s^t, a^t)\}

Refinements of this criterion have also been proposed (Puterman, 1994). Sometimes the problem itself ensures that total reward over any infinite trajectory is bounded, and thus the expected total reward criterion is well-defined. Consider the case, common in AI planners, in which the agent's task is to bring the system to a goal state. A positive reward is received only when the goal is reached, all actions incur a non-negative cost, and when a goal is reached the system enters an absorbing state in which no further rewards or costs are accrued. As long as the goal can be reached with certainty, this situation can be formulated as an infinite-horizon problem where total reward is bounded for any desired trajectory (Bertsekas, 1987; Puterman, 1994). In general, such problems cannot be formulated as (fixed) finite-horizon problems unless an a priori bound on the number of steps needed to reach the goal can be established. These problems are sometimes called indefinite-horizon problems: from a practical point of view, the agent will continue to execute actions for some finite number of stages, but the exact number cannot be determined ahead of time.

13. See (Bacchus, Boutilier, & Grove, 1996, 1997), however, for a systematic approach to handling certain types of history-dependent reward functions.

2.8 Solution Criteria

To complete our definition of the planning problem we need to specify what constitutes a solution to the problem. Here again we see a split between explicit MDP formulations and work in the AI planning community.
Classical MDP problems are generally stated as optimization problems: given a value function, a horizon, and an evaluation metric (e.g., expected total reward, expected total discounted reward, expected average reward per stage), the agent seeks a behavioral policy that maximizes the objective function. Work in AI often seeks satisficing solutions to such problems. In the planning literature, it is generally taken that any plan that satisfies the goal is equally preferred to any other plan that satisfies the goal, and that any plan that satisfies the goal is preferable to any plan that does not.14 In a probabilistic framework, we might seek the plan that satisfies the goal with maximum probability (an optimization), but this can lead to situations in which the optimal plan has infinite length if the system state is not fully observable. The satisficing alternative (Kushmerick, Hanks, & Weld, 1995) is to seek any plan that satisfies the goal with a probability exceeding a given threshold.

Example 2.4 We extend our running example to demonstrate an infinite-horizon, fully observable, discounted reward situation. We begin by adding one new dimension to the state description, the boolean variable RHM (does the robot have mail), giving us a system with 20 states. We also provide the agent with two additional actions: PUM (pickup mail) and DelM (deliver mail), as described in Figure 2. We can now reward the agent in such a way that mail delivery is encouraged: we associate a reward of 10 with each state in which RHM and M are both false and 0 with all other states. If actions have no cost, the agent gets a total reward of 20 for this six-stage system trajectory:

  ⟨Loc(M), ¬M, ¬RHM⟩, Stay, ⟨Loc(M), M, ¬RHM⟩, PUM, ⟨Loc(M), ¬M, RHM⟩, Clk, ⟨Loc(H), ¬M, RHM⟩, Clk, ⟨Loc(O), ¬M, RHM⟩, DelM, ⟨Loc(O), ¬M, ¬RHM⟩
14. Though see (Haddawy & Hanks, 1998; Williamson & Hanks, 1994) for a restatement of planning as an optimization problem.

If we assign an action cost of 1 for each action except Stay (which has 0 cost), the total reward becomes 16. If we use a discount rate of 0.9 to discount future rewards and costs, this initial segment of an infinite-horizon history would contribute

  10 + 0.9(-1) + 0.81(-1) + 0.729(-1) + 0.6561(-1) + 0.59049(-1 + 10) = 12.2

to the total value of the trajectory (as subsequently extended). Furthermore, we can establish a bound on the total expected value of this trajectory. In the best case, all subsequent stages will yield a reward of 10, so the expected total discounted reward is bounded by

  12.2 + 0.9^6(10) + 0.9^7(10) + \cdots = 12.2 + 10(0.9^6) \sum_{i=0}^{\infty} 0.9^i < 66

A similar effect on behavior can be achieved by penalizing states (i.e., having negative rewards) in which either M or RHM is true. □

2.9 Policies

We have mentioned policies (or courses of action, or plans) informally to this point, and now provide a precise definition. The decision problem facing an agent can be viewed most generally as deciding which action to perform given the current observable history. We define a policy to be a mapping from the set of observable histories H_O to actions, that is, π : H_O → A. Intuitively, the agent executes action

  a^t = π(⟨⟨o^0, a^0⟩, ..., ⟨o^{t−1}, a^{t−1}⟩, o^t⟩)

at stage t if it has performed the actions a^0, ..., a^{t−1} and made observations o^0, ..., o^{t−1} at earlier stages, and has just made observation o^t at the current stage. A policy π induces a distribution Pr(h | π) over the set of system histories H_S; this probability distribution depends on the initial distribution P_0.
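The discounted arithmetic of Example 2.4 can be checked mechanically. The per-stage reward-minus-cost values below follow our reading of the example (Stay is free at stage 0, each later stage incurs a unit action cost, and stage 5 accrues the delivery reward alongside the cost of the next action in the extended history):

```python
# Checking the arithmetic of Example 2.4: per-stage R(s_t) - C(s_t, a_t)
# values for the six-stage segment, discounted at beta = 0.9.
beta = 0.9
stage_values = [10 - 0, 0 - 1, 0 - 1, 0 - 1, 0 - 1, 10 - 1]
contribution = sum(beta ** t * v for t, v in enumerate(stage_values))
print(round(contribution, 1))  # 12.2

# Best case thereafter: reward 10 at every later stage, giving the bound
bound = contribution + 10 * beta ** 6 / (1 - beta)
print(bound < 66)  # True
```

The closed form 10 * beta**6 / (1 - beta) is just the geometric series 10(0.9^6) Σ 0.9^i from the displayed bound.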
We define the expected value of a policy π to be:

  EV(\pi) = \sum_{h \in H_S} V(h) \Pr(h \mid \pi)

We would like the agent to adopt a policy that either maximizes this expected value or, in a satisficing context, has an acceptably high expected value. The general form of a policy, depending as it does on an arbitrary observation history, can lead to very complicated policies and policy-construction algorithms. In special cases, however, assumptions about observability and the structure of the value function can result in optimal policies that have a much simpler form. In the case of a fully observable MDP with a time-separable value function, the optimal action at any stage can be computed using only information about the current state and the stage: that is, we can restrict policies to have the simpler form π : S × T → A without danger of acting suboptimally. This is due to the fact that full observability allows the state to be observed completely, and the Markov assumption renders prior history irrelevant. In the non-observable case, the observational history contains only vacuous observations and the agent must choose its actions using only knowledge of its previous actions and the stage; however, since π incorporates previous actions, it takes the form π : T → A. This form of policy corresponds to a linear, unconditional sequence of actions ⟨a_1, a_2, ..., a_T⟩, or a straight-line plan in AI nomenclature.15

2.10 Model Summary: Assumptions, Problems, and Computational Complexity

This concludes our exposition of the MDP model for planning under uncertainty. Its generality allows us to capture a wide variety of the problem classes that are currently being studied in the literature.
In this section we review the basic components of the model, describe problems commonly studied in the DTP literature with respect to this model, and summarize known complexity results for each. In Section 3, we describe some of the specialized computational techniques used to solve problems in each of these problem classes.

2.10.1 Model Summary and Assumptions

The MDP model consists of the following components:

- The state space S, a finite or countable set of states. We generally make the Markov assumption, which requires that each state convey all information necessary to predict the effects of all actions and events, independent of any further information about system history.

- The set of actions A. Each action a_k is represented by a transition matrix of size |S| × |S| representing the probability p^k_{ij} that performing action a_k in state s_i will move the system into state s_j. We assume throughout that the action model is stationary, meaning that transition probabilities do not vary with time. The transition matrix for an action is generally assumed to account for any exogenous events that might occur at the stage at which the action is executed.

- The set of observation variables O. This is the set of "messages" sent to the agent after an action is performed, that provide execution-time information about the current system state. With each action a_k and pair of states s_i, s_j such that p^k_{ij} > 0, we associate a distribution over possible observations: p^{km}_{ij} denotes the probability of obtaining observation o_m given that action a_k was taken in s_i and resulted in a transition to state s_j.

- The value function V. The value function maps a state history into a real number such that V(h_1) ≥ V(h_2) just in case the agent considers history h_1 at least as good as h_2.
A state history records the progression of states the system assumes, along with the actions performed. Assumptions such as time-separability and additivity are common for V. In particular, we generally use a reward function R and a cost function C when defining value.

- The horizon T. This is the number of stages over which the state histories should be evaluated using V.

- An optimality criterion. This provides a criterion for evaluating potential solutions to planning problems.

[15] Many algorithms in the AI literature produce a partially ordered sequence of actions. These plans do not, however, involve conditional or nondeterministic execution. Rather, they represent the fact that any linear sequence consistent with the partial order will solve the problem. Thus, a partially ordered plan is a concise representation for a particular set of straight-line plans.

2.10.2 Common Planning Problems

We can use this general framework to classify various problems commonly studied in the planning and decision-making literature. In each case below, we note the modeling assumptions that define the problem class.

Planning Problems in the OR/Decision Sciences Tradition

Fully Observable Markov Decision Processes (FOMDPs): There is an extremely large body of research studying FOMDPs, and we present the basic algorithmic techniques in some detail in the next section. The most commonly used formulation of FOMDPs assumes full observability and stationarity, and uses as its optimality criterion the maximization of expected total reward over a finite horizon, maximization of expected total discounted reward over an infinite horizon, or minimization of the expected cost to reach a goal state. FOMDPs were introduced by Bellman (1957) and have been studied in depth in the fields of decision analysis and OR, including the seminal work of Howard (1960).
Recent texts on FOMDPs include (Bertsekas, 1987) and (Puterman, 1994). Average-reward optimality has also received attention in this literature (Blackwell, 1962; Howard, 1960; Puterman, 1994). In the AI literature, discounted or total reward models have been most popular as well (Barto et al., 1995; Dearden & Boutilier, 1997; Dean, Kaelbling, Kirman, & Nicholson, 1995; Koenig, 1991), though the average-reward criterion has been proposed as more suitable for modeling AI planning problems (Boutilier & Puterman, 1995; Mahadevan, 1994; Schwartz, 1993).

Partially Observable Markov Decision Processes (POMDPs): POMDPs are closer than FOMDPs to the general model of decision processes we have described. POMDPs have generally been studied with the assumption of stationarity and optimality criteria identical to those of FOMDPs, though the average-reward criterion has not been widely considered. As we discuss below, a POMDP can be viewed as a FOMDP with a state space consisting of the set of probability distributions over S. These probability distributions represent states of belief: the agent can "observe" its state of belief about the system although it does not have exact knowledge of the system state itself. POMDPs have been widely studied in OR and control theory (Astrom, 1965; Lovejoy, 1991b; Smallwood & Sondik, 1973; Sondik, 1978), and have drawn increasing attention in AI circles (Cassandra, Kaelbling, & Littman, 1994; Hauskrecht, 1998; Littman, 1996; Parr & Russell, 1995; Simmons & Koenig, 1995; Thrun, Fox, & Burgard, 1998; Zhang & Liu, 1997). Influence diagrams (Howard & Matheson, 1984; Shachter, 1986) are a popular model for decision making in AI and are, in fact, a structured representational method for POMDPs (see Section 4.3).
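This belief-state view can be made concrete with a small sketch. The update below is the standard Bayes filter over a finite state set; the particular encoding (nested lists indexed by action and state) is our illustrative assumption, not notation from the paper:

```python
def belief_update(b, T, O, a, o):
    """Bayes-filter update of a belief state b after performing action a
    and receiving observation o.

    b: list of probabilities P(s) over states 0..n-1
    T: T[a][i][j] = Pr(next state j | state i, action a)
    O: O[a][j][o] = Pr(observation o | action a led to state j)
    Returns the new (normalized) belief, i.e. the "state" of the
    equivalent fully observable belief-state MDP.
    """
    n = len(b)
    # unnormalized: b'(j) = O[a][j][o] * sum_i T[a][i][j] * b(i)
    new = [O[a][j][o] * sum(T[a][i][j] * b[i] for i in range(n))
           for j in range(n)]
    z = sum(new)  # Pr(o | b, a); assumed nonzero (observation is possible)
    return [x / z for x in new]
```

The normalizer z is the probability of the observation itself, so the same computation also yields the distribution over successor beliefs needed to treat the POMDP as a belief-state FOMDP.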
Planning Problems in the AI Tradition

Classical Deterministic Planning: The classical AI planning model assumes deterministic actions: any action a_k taken in any state s_i has at most one successor s_j. The other important assumptions are non-observability and that value is determined by reaching a goal state: any plan that leads to a goal state is preferred to any that does not. Often there is a preference for shorter plans; this can be represented by using a discount factor to "encourage" faster goal achievement or by assigning a cost to actions. Reward is associated only with transitions to goal states, which are absorbing. Action costs are typically ignored, except as noted above. In classical models it is usually assumed that the initial state is known with certainty. This contrasts with the general specification of MDPs above, which does not assume knowledge of, or even distributional information about, the initial state. Policies are defined to be applicable no matter what state (or distribution over states) one finds oneself in: action choices are defined for every possible state or history. Knowledge of the initial state and determinism allow optimal straight-line plans to be constructed, with no loss in value associated with non-observability, but unpredictable exogenous events and uncertain action effects cannot be modeled consistently if these assumptions are adopted. For an overview of early classical planning research and the variety of approaches adopted, see (Allen, Hendler, & Tate, 1990) as well as Yang's (1998) recent text.

Optimal Deterministic Planning: A separate body of work retains the classical assumptions of complete information and determinism, but tries to recast the planning problem as an optimization that relaxes the implicit assumption of "achieve the goal at all costs."
At the same time, these methods use the same sorts of representations and algorithms applied to satisficing planning.

Haddawy and Hanks (1998) present a multi-attribute utility model for planners that keeps the explicit information about the initial state and goals, but allows preferences to be stated about the partial satisfaction of the goals as well as the cost of the resources consumed in satisfying them. The model also allows the expression of preferences over phenomena like temporal deadlines and maintenance intervals that are difficult to capture using a time-separable additive value function. Williamson (1996) (see also Williamson & Hanks, 1994) implements this model by extending a classical planning algorithm to solve the resulting optimization problem. Haddawy and Suwandi (1994) also implement this model in a complete decision-theoretic framework. Their model of planning, refinement planning, differs somewhat from the generative model discussed in this paper. In their model the set of all possible plans is pre-stored in an abstraction hierarchy, and the problem solver's job is to find in the hierarchy the optimal choice of concrete actions for a particular problem. Perez and Carbonell's (1994) work also incorporates cost information into the classical planning framework, but maintains the split between a classical satisficing planner and additional cost information provided in the utility model. The cost information is used to learn search-control rules that allow the classical planner to generate low-cost goal-satisfying plans.

Conditional Deterministic Planning: The classical planning assumption of omniscience can be relaxed somewhat by allowing the state of some aspects of the world to be unknown.
The agent is thus in a situation where it is certain that the system is in one of a particular set of states, but does not know which one. Unknown truth values can be included in the initial state specification, and taking actions can cause a proposition to become unknown as well.

Actions can provide the agent with information while the plan is being executed: conditional planners introduce the idea of actions providing runtime information about the prevailing state, distinguishing between an action that makes proposition P true and an action that will tell the agent whether P is true when the action is executed. An action can have both causal and informational effects, simultaneously changing the world and reporting on the value of one or more propositions. This second sort of information is not useful at planning time except that it allows steps in the plan to be executed conditionally, depending on the runtime information provided by prior information-producing steps. The value of such actions lies in the fact that different courses of action may be appropriate under different conditions: these informational effects allow runtime selection of actions based on the observations produced, much like the general POMDP model. Examples of conditional planners in the classical framework include early work by Warren (1976) and the more recent CNLP (Peot & Smith, 1992), Cassandra (Pryor & Collins, 1993), Plinth (Goldman & Boddy, 1994), and UWL (Etzioni, Hanks, Weld, Draper, Lesh, & Williamson, 1992) systems.

Probabilistic Planning Without Feedback: A direct probabilistic extension of the classical planning problem can be stated as follows (Kushmerick et al., 1995): take as input (a) a probability distribution over initial states, (b) stochastic actions (explicit or implicit transition matrices), (c) a set of goal states, and (d) a probability success threshold \theta.
The objective is to produce a plan that reaches any goal state with probability at least \theta, given the initial state distribution. No provision is made for execution-time observation; thus straight-line plans are the only form of policy possible. This is a restricted case of the infinite-horizon NOMDP problem, one in which actions incur no cost and goal states offer positive reward and are absorbing. It is also a special case in that the objective is to find a satisficing policy rather than an optimal one.

Probabilistic Planning With Feedback: Draper et al. (1994a) have proposed an extension of the probabilistic planning problem in which actions provide feedback, using exactly the observation model described in Section 2.4. Again, the problem is posed as that of building a plan that leaves the system in a goal state with sufficient probability. But a plan is no longer a simple sequence of actions: it can contain conditionals and loops whose execution depends on the observations generated by sensing actions. This problem is a restricted case of the general POMDP problem: absorbing goal states and cost-free actions are used, and the objective is to find any policy (conditional plan) that leaves the system in a goal state with sufficient probability.

Comparing the Frameworks: Task-oriented Versus Process-oriented Problems

It is useful at this point to pause and contrast the types of problems considered in the classical planning literature with those typically studied within the MDP framework. Although problems in the AI planning literature have emphasized a goal-pursuit or "one-shot" view of problem solving, in some cases viewing the problem as an infinite-horizon decision problem results in a more satisfying formulation. Consider our running example involving the office robot.
It is simply not possible to model the problem of responding to coffee requests, mail arrival, and keeping the lab tidy as a strict goal-satisfaction problem while capturing the possible nuances of intuitively optimal behavior. The primary difficulty is that no explicit and persistent goal states exist. If we were simply to require that the robot attain a state where the lab is tidy, no mail awaits, and no unfilled coffee requests exist, no "successful" plan could anticipate possible system behavior after a goal state was reached. The possible occurrence of exogenous events after goal achievement requires that the robot bias its methods for achieving its goals in a way that best suits the expected course of subsequent events. For instance, if coffee requests are very likely at any point in time and unmet requests are highly penalized, the robot should situate itself in the coffee room in order to satisfy an anticipated future request quickly. Most realistic decision scenarios involve both task-oriented and process-oriented behavior, and problem formulations that take both into account will provide more satisfying models for a wider range of situations.

2.10.3 The Complexity of Policy Construction

We have now defined the planning problem in several different ways, each having a different set of assumptions about the state space, system dynamics and actions (deterministic or stochastic), observability (full, partial, or none), value function (time-separable, goal only, goal rewards and action costs, partially satisfiable goals with temporal deadlines), planning horizon (finite, infinite, or indefinite), and optimality criterion (optimal or satisficing solutions). Each set of assumptions puts the corresponding problem in a particular complexity class, which defines worst-case time and space bounds on any representation and algorithm for solving that problem.
Here we summarize known complexity results for each of the problem classes defined above.

Fully Observable Markov Decision Processes: Fully observable MDPs (FOMDPs) with time-separable, additive value functions can be solved in time polynomial in the size of the state space, the number of actions, and the size of the inputs.[16] The most common algorithms for solving FOMDPs are value iteration and policy iteration, which are described in the next section. Both finite-horizon and discounted infinite-horizon problems require a polynomial amount of computation per iteration, O(|S|^2 |A|) and O(|S|^2 |A| + |S|^3) respectively, and converge in a polynomial number of iterations (with factor 1/(1 - \beta) in the discounted case). On the other hand, these problems have been shown to be P-complete (Papadimitriou & Tsitsiklis, 1987), which means that an efficient parallel solution algorithm is unlikely.[17] The space required to store the policy for an n-stage finite-horizon problem is O(|S| n). For most interesting classes of infinite-horizon problems, specifically those involving discounted models with time-separable additive reward, the optimal policy can be shown to be stationary, and the policy can be stored in O(|S|) space. Bear in mind that these are worst-case bounds. In many cases, better time bounds and more compact representations can be found. Sections 4 and 5 explore ways to represent and solve these problems more efficiently.

[16] More precisely, the maximum number of bits required to represent any of the transition probabilities or costs.

[17] See (Littman, Dean, & Kaelbling, 1995) for a summary of these complexity results.

Partially Observable Markov Decision Processes: POMDPs are notorious for their computational difficulty.
As mentioned above, a POMDP can be viewed as a FOMDP with an infinite state space consisting of probability distributions over S, each distribution representing the agent's state of belief at a point in time (Astrom, 1965; Smallwood & Sondik, 1973). The problem of finding an optimal policy for a POMDP with the objective of maximizing expected total reward or expected total discounted reward over a finite horizon T has been shown to be exponentially hard both in |S| and in T (Papadimitriou & Tsitsiklis, 1987). The problem of finding a policy that maximizes or approximately maximizes the expected discounted total reward over an infinite horizon is shown to be undecidable (Madani, Condon, & Hanks, 1999).

Even restricted cases of the POMDP problem are computationally difficult in the worst case. Littman (1996) considers the special case of boolean rewards: determining whether there is an infinite-horizon policy with nonzero total reward, given that the rewards associated with all states are non-negative. He shows that the problem is EXPTIME-complete if the transitions are stochastic, and PSPACE-hard if the transitions are deterministic.

Deterministic Planning: Recall that the classical planning problem is defined quite differently from the MDP problems above: the agent has no ability to observe the state but has perfect predictive powers, knowing the initial state and the effects of all actions with certainty. In addition, rewards come only from reaching a goal state, and any plan that achieves the goal suffices. Planning problems are typically defined in terms of a set P of boolean features or propositions: a complete assignment of truth values to features describes exactly one state, and a partial assignment of truth values describes a set of states. A set of propositions P induces a state space of size 2^|P|.
Thus, the space required to represent a planning problem using a feature-based representation can be exponentially smaller than that required by a flat representation for the same problem (see Section 4).

The ability to represent planning problems compactly has a dramatic impact on worst-case complexity. Bylander (1994) shows that the deterministic planning problem without observation is PSPACE-complete. Roughly speaking, this means that at worst planning time will increase exponentially with P and A, and further, that the size of a solution plan can grow exponentially with the problem size. These results hold even when the action space A is severely restricted. For example, the planning problem is NP-complete even in cases where each action is restricted to one precondition feature and one postcondition feature. Conditional and optimal planning are PSPACE-complete as well. These results are for inputs that are generally more compact (generally exponentially so) than those in terms of which the complexity of the FOMDP and POMDP problems are phrased.

Probabilistic Planning: In probabilistic goal-oriented planning, as for POMDPs, we typically search for a solution in a space of probability distributions over states (or over formulas that describe states). Even the simplest problem in probabilistic planning, one that admits no observability, is undecidable at worst (Madani et al., 1999). The intuition is that even though the set of states is finite, the set of distributions over those states is not, and at worst the agent may have to search an infinite number of plans before being able to determine whether or not a solution exists. An algorithm can be guaranteed to find a solution plan eventually if one exists, but cannot be guaranteed to terminate in finite time if there is no solution plan.
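Note that the hardness lies in the search over plans, not in evaluating a single candidate: the success probability of one fixed straight-line plan can be computed by pushing the initial distribution through the transition matrices and summing the mass on goal states. A minimal sketch, with a hypothetical encoding of the inputs:

```python
def plan_success_prob(b0, plan, T, goals):
    """Probability that a fixed straight-line plan ends in a goal state.

    b0:    initial probability distribution over states 0..n-1
    plan:  sequence of action indices
    T:     T[a][i][j] = Pr(state j | state i, action a)
    goals: set of goal-state indices
    """
    b = list(b0)
    for a in plan:
        n = len(b)
        # propagate the distribution one step: b'(j) = sum_i b(i) T[a][i][j]
        b = [sum(b[i] * T[a][i][j] for i in range(n)) for j in range(n)]
    return sum(b[j] for j in goals)
```

A candidate plan is satisficing just in case this value is at least the threshold \theta; per the undecidability result above, it is the unbounded search over plans that cannot be guaranteed to terminate.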
Conditional probabilistic planning is a generalization of the non-observable probabilistic planning problem, and thus is undecidable as well. It is interesting to note a connection between conditional probabilistic planning and POMDPs. The actions and observations of the two problems have equivalent expressive power, but the reward structure of the conditional probabilistic planning problem is quite restrictive: goal states have positive rewards, all other states have no reward, and goal states are absorbing. Since we cannot put an a priori bound on the length of a solution plan, conditional probabilistic planning must be viewed as an infinite-horizon problem where the objective is to maximize total expected undiscounted reward. Note, however, that since goal states are absorbing, we can guarantee that total expected reward will be non-negative and bounded, even over an infinite horizon. Technically this means that the conditional probabilistic planning problem is a restricted case of an infinite-horizon positive-bounded problem (Puterman, 1994, Section 7.2). We can therefore conclude that the problem of solving an arbitrary infinite-horizon undiscounted positive-bounded POMDP is also undecidable. The more commonly studied problem is the infinite-horizon POMDP with a criterion of maximizing expected discounted total reward; finding optimal or near-optimal solutions to that problem is also undecidable, as noted above.

2.10.4 Conclusion

We end this section by noting again that these results are algorithm-independent and describe worst-case behavior. In effect, they indicate how badly any algorithm can be made to perform on an "arbitrarily unfortunate" problem instance. The more interesting question is whether we can build representations, techniques, and algorithms that typically perform well on problem instances that typically arise in practice.
This concern leads us to examine the problem characteristics with an eye toward exploiting the restrictions placed on the states and actions, on observability, and on the value function and optimality criterion. We begin with algorithmic techniques that focus on the value function, particularly those that take advantage of time-separability and goal orientation. Then in the following section we explore complementary techniques for building compact problem representations.

3. Solution Algorithms: Dynamic Programming and Search

In this section we review standard algorithms for solving the problems described above in terms of the "unstructured" or "flat" problem representations. As noted in the analysis above, fully observable Markov decision processes (FOMDPs) are by far the most widely studied models in this general class of stochastic sequential decision problems. We begin by describing techniques for solving FOMDPs, focusing on techniques that exploit structure in the value function like time-separability and additivity.

3.1 Dynamic Programming Approaches

Suppose we are given a fully observable MDP with a time-separable, additive value function. In other words, we are given the state space S, action space A, a transition matrix Pr(s' | s, a) for each action a, a reward function R, and a cost function C. We start with the problem of finding the policy that maximizes expected total reward for some fixed, finite horizon T. Suppose we are given a policy \pi such that \pi(s, t) is the action to be performed by the agent in state s with t stages remaining to act (for 0 < t <= T).[18] Bellman (1957) shows that the expected value of such a policy at any state can be computed using the set of t-stage-to-go value functions V_\pi^t.
We define V_\pi^0(s) to be R(s), and then define:

V_\pi^t(s) = R(s) + C(\pi(s, t)) + \sum_{s' \in S} \Pr(s' \mid \pi(s, t), s) \, V_\pi^{t-1}(s')     (1)

This definition of the value function for \pi makes its dependence on the initial state clear. We say a policy \pi is optimal if V_\pi^T(s) >= V_{\pi'}^T(s) for all policies \pi' and all s \in S. The optimal T-stage-to-go value function, denoted V^T, is simply the value function of any optimal T-horizon policy. Bellman's principle of optimality (Bellman, 1957) forms the basis of the stochastic dynamic programming algorithms used to solve MDPs, establishing the following relationship between the optimal value function at the t-th stage and the optimal value function at the previous stage:

V^t(s) = R(s) + \max_{a \in A} \{ C(a) + \sum_{s' \in S} \Pr(s' \mid a, s) \, V^{t-1}(s') \}     (2)

3.1.1 Value Iteration

Equation 2 forms the basis of the value iteration algorithm for finite-horizon problems. Value iteration begins with the value function V^0 = R, and uses Equation 2 to compute in sequence the value functions for longer time intervals, up to the horizon T. Any action that maximizes the right-hand side of Equation 2 can be chosen as the policy element \pi(s, t). The resulting policy is optimal for the T-stage, fully observable MDP, and indeed for any shorter horizon t < T. It is important to note that a policy describes what should be done at every stage and for every state of the system, even if the agent cannot reach certain states given the system's initial configuration and its available actions. We return to this point below.

Example 3.1 Consider a simplified version of the robot example, in which we have four state variables M, CR, RHC, and RHM (movement to various locations is ignored), and four actions GetC, PUM, DelC, and DelM. Actions GetC and PUM make RHC and RHM, respectively, true with certainty.
Action DelM, when RHM holds, makes both M and RHM false with probability 1.0; DelC makes both CR and RHC false with probability 0.3, leaving the state unchanged with probability 0.7. A reward of 3 is associated with ¬CR (the outstanding coffee request has been filled) and a reward of 1 with ¬M (no mail awaits delivery). The reward for any state is the sum of the rewards for each objective satisfied in that state. Figure 11 shows the optimal 0-stage, 1-stage and 2-stage value functions for various states, along with the optimal choice of action at each state-stage pairing (the values for any "state" with missing variables hold for all instantiations of those variables).

    State                         V^0   V^1   pi(.,1)   V^2    pi(.,2)
    s0 = <M, ¬RHM, CR, ¬RHC>      0     0     any       1      PUM
    s1 = <M, RHM, CR, ¬RHC>       0     1     DelM      2      DelM
    s2 = <M, ¬RHM, CR, RHC>       0     0.9   DelC      2.43   DelC
    s3 = <M, RHM, CR, RHC>        0     1     DelM      2.9    DelM
    s4 = <¬M, CR, RHC>            1     2.9   DelC      5.43   DelC
    s5 = <¬M, CR, ¬RHC>           1     2     any       3.9    GetC
    s6 = <M, RHM, ¬CR>            3     7     DelM      11     DelM
    s7 = <M, ¬RHM, ¬CR>           3     6     any       10     PUM
    s8 = <¬M, ¬CR>                4     8     any       12     any

    Figure 11: Finite-horizon optimal value and policy (¬X denotes that variable X is false).

Note that V^0(s) is simply R(s) for each state s. To illustrate the application of Equation 2, first consider the calculation of V^1(s_3). The robot has the choice of delivering coffee or delivering mail, and the expected value of each option, with one stage remaining, is given by:

EV^1(s_3, DelC) = 0.3 V^0(s_6) + 0.7 V^0(s_3) = 0.9
EV^1(s_3, DelM) = 1.0 V^0(s_4) = 1.0

Thus \pi(s_3, 1) = DelM, and V^1(s_3) is the value of this maximizing choice. Notice that the robot, with one action to perform, will aim for the "lesser" objective ¬M due to the risk of failure inherent in delivering coffee.

[18] Recall that for FOMDPs other aspects of history are not relevant.
With two stages remaining at the same state, the robot will again deliver mail, but will with certainty move to s_4 with one stage to go, where it will attempt to deliver coffee (\pi(s_4, 1) = DelC). To illustrate the effects a fixed finite horizon can have on policy choice, note that \pi(s_0, 2) = PUM. With two stages remaining and the choice of getting mail or coffee, the robot will get mail because subsequent delivery (in the last stage) is guaranteed to succeed, whereas subsequent coffee delivery may fail. However, if we compute \pi(s_0, 3), we see:

EV^3(s_0, GetC) = 1.0 V^2(s_2) = 2.43
EV^3(s_0, PUM) = 1.0 V^2(s_1) = 2.0

With three stages to go, the robot will instead retrieve coffee at s_0. Once it has coffee, it has two chances at successful delivery. The expected value of this course of action is greater than that of (guaranteed) mail delivery. Note that three stages does not allow sufficient time to try to achieve both objectives at s_0. In fact, the larger reward associated with coffee delivery ensures that with any greater number of stages remaining, the robot should focus first on coffee retrieval and delivery, and then attempt mail retrieval and delivery once coffee delivery is successfully completed. □

Often we are faced with tasks that do not have a fixed finite horizon. For example, we may want our robot to perform the tasks of keeping the lab tidy, picking up mail whenever it arrives, responding to coffee requests, and so on. There is no fixed time horizon associated with these tasks; they are to be performed as the need arises. Such problems are best modeled as infinite-horizon problems. We consider here the problem of building a policy that maximizes the discounted sum of expected rewards over an infinite horizon.[19]
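Before turning to the infinite-horizon case, the finite-horizon recurrence (Equation 2) and the numbers of Example 3.1 can be checked with a short sketch. The dynamics below are reconstructed from the example's description, with action costs taken to be zero:

```python
from itertools import product

ACTIONS = ["GetC", "PUM", "DelC", "DelM"]

def reward(s):
    m, rhm, cr, rhc = s
    return (0 if cr else 3) + (0 if m else 1)   # 3 for no request, 1 for no mail

def transitions(s, a):
    """Return a list of (probability, next_state) pairs."""
    m, rhm, cr, rhc = s
    if a == "GetC":
        return [(1.0, (m, rhm, cr, True))]
    if a == "PUM":
        return [(1.0, (m, True, cr, rhc))]
    if a == "DelM" and rhm:
        return [(1.0, (False, False, cr, rhc))]
    if a == "DelC" and rhc:
        return [(0.3, (m, rhm, False, False)), (0.7, s)]
    return [(1.0, s)]                            # action has no effect

def value_iteration(T):
    """Compute the T-stage-to-go value function and greedy policy (Eq. 2)."""
    states = list(product([False, True], repeat=4))   # (M, RHM, CR, RHC)
    V = {s: float(reward(s)) for s in states}         # V^0 = R
    pi = {}
    for t in range(1, T + 1):
        newV = {}
        for s in states:
            q = {a: sum(p * V[s2] for p, s2 in transitions(s, a))
                 for a in ACTIONS}
            best = max(q, key=q.get)
            newV[s] = reward(s) + q[best]
            pi[(s, t)] = best
        V = newV
    return V, pi
```

Running value_iteration(2) reproduces the Figure 11 entries, e.g. a 2-stage-to-go value of 2.43 with greedy action DelC at s_2, and 2.9 with DelM at s_3.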
Howard (1960) showed that there always exists an optimal stationary policy for such problems. Intuitively, this is the case because no matter what stage the process is in, there are still an infinite number of stages remaining; so the optimal action at any state is independent of the stage. We can therefore restrict our attention to policies that choose the same action for a state regardless of the stage of the process. Under this restriction, the policy will have the same size |S| regardless of the number of stages over which the policy is executed: the policy has the form \pi : S \to A. In contrast, optimal policies for finite-horizon problems are generally nonstationary, as illustrated in Example 3.1. Howard also shows that the value of policy \pi satisfies the following recurrence (where \beta is the discount factor):

V_\pi(s) = R(s) + C(\pi(s)) + \beta \sum_{s' \in S} \Pr(s' \mid \pi(s), s) \, V_\pi(s')     (3)

and that the optimal value function V* satisfies:

V*(s) = R(s) + \max_{a \in A} \{ C(a) + \beta \sum_{s' \in S} \Pr(s' \mid a, s) \, V*(s') \}     (4)

The value of a fixed policy \pi can be evaluated using the method of successive approximation, which is almost identical to the procedure described in Equation 1 above. We begin with an arbitrary assignment of values to V^0(s), then define:

V^t(s) = R(s) + C(\pi(s)) + \beta \sum_{s' \in S} \Pr(s' \mid \pi(s), s) \, V^{t-1}(s')     (5)

The sequence of functions V^t converges linearly to the true value function V_\pi. One can also alter the value-iteration algorithm slightly so that it builds optimal policies for infinite-horizon discounted problems. The algorithm starts with a value function V^0 that assigns an arbitrary value to each s \in S. Given the value estimate V^t(s) for each state s, V^{t+1}(s) is calculated as:

V^{t+1}(s) = R(s) + \max_{a \in A} \{ C(a) + \beta \sum_{s' \in S} \Pr(s' \mid a, s) \, V^t(s') \}     (6)

The sequence of functions V^t converges linearly to the optimal value function V*(s).
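A minimal sketch of this discounted value-iteration loop (Equation 6) on a generic flat MDP; the dictionary encoding and the stopping threshold are illustrative assumptions, not from the paper:

```python
def infinite_horizon_vi(S, A, R, C, P, beta, eps=1e-6):
    """Discounted value iteration (Equation 6).

    S: list of states; A: list of actions
    R[s]: state reward; C[a]: action cost (may be negative)
    P[(a, s)]: dict mapping s' -> Pr(s' | a, s)
    beta: discount factor in (0, 1)
    Returns (V, pi): approximate optimal values and a greedy policy.
    """
    V = {s: 0.0 for s in S}                        # arbitrary V^0
    while True:
        newV = {}
        for s in S:
            newV[s] = R[s] + max(
                C[a] + beta * sum(p * V[s2] for s2, p in P[(a, s)].items())
                for a in A)
        delta = max(abs(newV[s] - V[s]) for s in S)
        V = newV
        if delta < eps * (1 - beta) / (2 * beta):  # Bellman-error stopping test
            break
    pi = {s: max(A, key=lambda a: C[a] + beta *
                 sum(p * V[s2] for s2, p in P[(a, s)].items()))
          for s in S}
    return V, pi
```

The stopping test shown is the usual Bellman-error bound guaranteeing the returned values are within eps of optimal; Puterman (1994) discusses such criteria in detail.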
After some finite number of iterations n, the choice of maximizing action for each s forms an optimal policy \pi, and V^n approximates its value.[20]

[19] This is by far the most commonly studied problem in the literature, though it is argued in (Boutilier & Puterman, 1995; Mahadevan, 1994; Schwartz, 1993) that such problems are often best modeled using average reward per stage as the optimality criterion. For a discussion of average-reward optimality and its many variants and refinements, see (Puterman, 1994).

[20] The number of iterations n is based on a stopping criterion that generally involves measuring the difference between V^t and V^{t+1}. For a discussion of stopping criteria and convergence of the algorithm, see (Puterman, 1994).

3.1.2 Policy Iteration

Howard's (1960) policy-iteration algorithm is an alternative to value iteration for infinite-horizon problems. Rather than iteratively improving the estimated value function, it instead modifies policies directly. It begins with an arbitrary policy \pi_0, then iterates, computing \pi_{i+1} from \pi_i. Each iteration of the algorithm comprises two steps, policy evaluation and policy improvement:

1. (Policy evaluation) For each s \in S, compute the value function V_{\pi_i}(s) based on the current policy \pi_i.

2. (Policy improvement) For each s \in S, find the action a* that maximizes

Q_{i+1}(a, s) = R(s) + C(a) + \beta \sum_{s' \in S} \Pr(s' \mid a, s) \, V_{\pi_i}(s')     (7)

If Q_{i+1}(a*, s) > V_{\pi_i}(s), let \pi_{i+1}(s) = a*; otherwise \pi_{i+1}(s) = \pi_i(s).[21]

The algorithm iterates until \pi_{i+1}(s) = \pi_i(s) for all states s. Step 1 evaluates the current policy by solving the N x N linear system represented by Equation 3 (one equation for each s \in S, with N = |S|), and can be computationally expensive.
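The two steps can be sketched as follows. For simplicity, this sketch evaluates each policy by successive approximation (iterating the policy-restricted backup to near-convergence) rather than by solving the linear system directly; the improvement step is unchanged:

```python
def policy_iteration(S, A, R, C, P, beta, pi0=None):
    """Howard-style policy iteration for a discounted flat MDP.

    P[(a, s)]: dict mapping s' -> Pr(s' | a, s); beta in (0, 1).
    Policy evaluation here iterates Equation 5 to near-convergence
    instead of solving the N x N linear system exactly.
    """
    def q(a, s, V):                        # right-hand side of Equation 7
        return R[s] + C[a] + beta * sum(p * V[s2]
                                        for s2, p in P[(a, s)].items())

    pi = dict(pi0) if pi0 else {s: A[0] for s in S}
    while True:
        # Step 1: policy evaluation by successive approximation
        V = {s: 0.0 for s in S}
        while True:
            newV = {s: q(pi[s], s, V) for s in S}
            done = max(abs(newV[s] - V[s]) for s in S) < 1e-10
            V = newV
            if done:
                break
        # Step 2: policy improvement (switch only on strict improvement)
        new_pi = {}
        for s in S:
            best = max(A, key=lambda a: q(a, s, V))
            new_pi[s] = best if q(best, s, V) > q(pi[s], s, V) else pi[s]
        if new_pi == pi:
            return pi, V
        pi = new_pi
```

Switching actions only on strict improvement guards against cycling between equal-valued actions, so the loop terminates once no state can be improved.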
However, the algorithm converges to an optimal policy at least linearly, and under certain conditions converges superlinearly or quadratically (Puterman, 1994). In practice, policy iteration tends to converge in many fewer iterations than does value iteration. Policy iteration thus spends more computational time at each individual stage, with the result that fewer stages need be computed.^22

Modified policy iteration (Puterman & Shin, 1978) provides a middle ground between policy iteration and value iteration. The structure of the algorithm is exactly the same as that of policy iteration, alternating evaluation and improvement phases. The key insight is that one need not evaluate a policy exactly in order to improve it. Therefore, the evaluation phase involves some (usually small) number of iterations of successive approximation (i.e., setting V_π = V^t for some small t, using Equation 5). With some tuning of the value of t used at each iteration, modified policy iteration can work extremely well in practice (Puterman, 1994). Both value iteration and policy iteration are special cases of modified policy iteration, corresponding to setting t = 0 and t = ∞, respectively.

A number of other variants of both value and policy iteration have been proposed. For instance, asynchronous versions of these algorithms do not require that the value function be constructed, or the policy improved, at each state in lockstep. In the case of value iteration for infinite-horizon problems, as long as each state is updated sufficiently often, convergence can be assured. Similar guarantees can be provided for asynchronous forms of policy iteration. Such variants are important tools for understanding various online approaches to solving MDPs (Bertsekas & Tsitsiklis, 1996). For a nice discussion of asynchronous dynamic programming, see (Bertsekas, 1987; Bertsekas & Tsitsiklis, 1996).

21.
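Modified policy iteration, described above, is easy to express as a variation on the same scheme: greedy improvement alternates with a small, fixed number of successive-approximation backups of the current policy. The fragment below is a sketch under the same hypothetical dictionary encoding as the MDP components; the choice of five evaluation backups per phase and 200 outer sweeps is arbitrary.

```python
def modified_policy_iteration(S, A, R, C, P, beta, t_eval=5, sweeps=200):
    """Alternate greedy improvement with t_eval partial-evaluation backups
    (Equation 5) in place of exact policy evaluation."""
    V = {s: 0.0 for s in S}
    for _ in range(sweeps):
        # improvement phase: greedy with respect to the current estimate V
        pi = {s: max(A, key=lambda a: R[s] + beta *
                     (C[a] + sum(P[a][s][s2] * V[s2] for s2 in S)))
              for s in S}
        # partial evaluation phase: t_eval backups of the fixed policy pi
        for _ in range(t_eval):
            V = {s: R[s] + beta * (C[pi[s]] +
                                   sum(P[pi[s]][s][s2] * V[s2] for s2 in S))
                 for s in S}
    return pi, V
```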
The Q-function defined by Equation 7, so called because of its use in Q-learning (Watkins & Dayan, 1992), gives the value of performing action a at state s assuming the value function V accurately reflects future value.
22. See (Littman et al., 1995) for a discussion of the complexity of the algorithm.

3.1.3 Undiscounted Infinite-Horizon Problems

The difficulty with finding optimal solutions to infinite-horizon problems is that total reward can grow without limit over time. Thus, the problem definition must provide some way to ensure that the value metric is bounded over arbitrarily long horizons. The use of expected total discounted reward as the optimality criterion offers a particularly elegant way to guarantee a bound, since the infinite sum of discounted rewards is finite. However, although discounting is appropriate for certain classes of problems (e.g., economic problems, or those where the system may terminate at any point with a certain probability), for many realistic AI domains it is difficult to justify counting future rewards less than present rewards, and the discounted-reward criterion is not appropriate.

There are a variety of ways to bound total reward in undiscounted problems. In some cases the problem itself is structured so that reward is bounded. In planning problems, for example, the goal reward can be collected at most once, and all actions incur a cost. In that case total reward is bounded from above, and in many cases (e.g., if the goal can be reached with certainty) the problem can legitimately be posed in terms of maximizing total expected undiscounted reward. In cases where discounting is inappropriate and total reward is unbounded, different success criteria can be employed. For example, the problem can instead be posed as one in which we wish to maximize expected average reward per stage, or gain.
Computational techniques for constructing gain-optimal policies are similar to the dynamic-programming algorithms described above, but are generally more complicated, and the convergence rate tends to be quite sensitive to the communicating structure and periodicity of the MDP. Refinements to gain optimality have also been studied. For example, bias optimality can be used to distinguish two gain-optimal policies by giving preference to the policy whose total reward over some initial segment of policy execution is larger. Again, while the algorithms are more complicated than those for discounted problems, they are variants of standard policy or value iteration. See (Puterman, 1994) for details.

3.1.4 Dynamic Programming and POMDPs

Dynamic programming techniques can be applied in partially observable settings as well (Smallwood & Sondik, 1973). The main difficulty in building policies for situations in which the state is not fully observable is that, since past observations can provide information about the system's current state, decisions must be based on information gleaned in the past. As a result, the optimal policy can depend on all observations the agent has made since the beginning of execution. These history-dependent policies can grow exponentially in size with the length of the horizon. While history-dependence precludes dynamic programming, the observable history can be summarized adequately with a probability distribution over S (Åström, 1965), and policies can be computed as a function of these distributions, or belief states. A key observation of Sondik (Smallwood & Sondik, 1973; Sondik, 1978) is that when one views a POMDP with a time-separable value function by taking the state space to be the set of probability distributions over S, one obtains a fully observable MDP that can be solved by dynamic programming.
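The belief-state construction relies on the standard Bayes-filter update, which the paper does not spell out at this point: after taking action a in belief state b and observing o, the new belief is b'(s') ∝ Pr(o | s') Σ_s Pr(s' | a, s) b(s). A minimal sketch, with our own dictionary encoding of the transition model P and observation model O:

```python
def belief_update(b, a, o, P, O):
    """Bayes-filter update of a belief state b (a distribution over S) after
    performing action a and receiving observation o.
    P[a][s][s2] = Pr(s2 | a, s); O[o][s2] = Pr(o | s2)."""
    unnorm = {s2: O[o][s2] * sum(P[a][s].get(s2, 0.0) * b[s] for s in b)
              for s2 in b}
    z = sum(unnorm.values())           # Pr(o | a, b): the normalizing constant
    if z == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s2: p / z for s2, p in unnorm.items()}
```

Successive updates compress an arbitrarily long observation history into a single distribution, which is exactly the summary that makes the belief-space MDP fully observable.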
The (computational) problem with this approach is that the state space for this FOMDP is an N-dimensional continuous space,^23 and special techniques must be used to solve it (Smallwood & Sondik, 1973; Sondik, 1978). We do not explore these techniques here, but note that they are currently practical only for very small problems (Cassandra et al., 1994; Cassandra, Littman, & Zhang, 1997; Littman, 1996; Lovejoy, 1991b). A number of approximation methods, developed both in OR (Lovejoy, 1991a; White III & Scherer, 1989) and AI (Brafman, 1997; Hauskrecht, 1997; Parr & Russell, 1995; Zhang & Liu, 1997), can be used to increase the range of solvable problems, but even these techniques are presently of limited practical value.

POMDPs play a key role in reinforcement learning as well, where the "natural state space" consisting of agent observations provides incomplete information about the underlying system state (see, e.g., McCallum, 1995).

3.2 AI Planning and State-Based Search

We noted in Section 2.7 that the classical AI planning problem can be formulated as an infinite-horizon MDP and can therefore be solved using an algorithm like value iteration. Recall that two assumptions in classical planning specialize the general MDP model, namely determinism of actions and the use of goal states instead of a more general reward function. A third assumption, that we want to construct an optimal course of action starting from a known initial state, does not have a counterpart in the FOMDP model as presented above, since the policy dictates the optimal action from any state at any stage of the plan. As we will see below, the interest in online algorithms within AI has led to revised formulations of FOMDPs that do take initial and current states into account.
Though we defined the classical planning problem earlier as a non-observable process (NOMDP), it can be solved as if it were fully observable. We let G be the set of goal states and s_init be the initial state. Applying value iteration to this type of problem is equivalent to determining the reachability of goal states from all system states. For instance, if we make goal states absorbing and assign a reward of 1 to all transitions from any s ∈ S − G to some g ∈ G and 0 to all others, then the set of all states where V^k(s) > 0 is exactly the set of states that can lead to a goal state.^24 In particular, if V^k(s_init) > 0, then a successful plan can be constructed by extracting actions from the k-stage (finite-horizon) policy produced by value iteration. The determinism assumption means that the agent can predict the state perfectly at every stage of execution; the fact that it cannot observe the state is unimportant.

The assumptions commonly made in classical planning can be exploited computationally in value iteration. First, we can terminate the process at the first iteration k where V^k(s_init) > 0, because we are interested only in plans that begin at s_init, not in acting optimally from every possible start state. Second, we can terminate value iteration after |S| iterations: if V^{|S|}(s_init) = 0 at that point, the algorithm will have searched every possible state and can guarantee that no solution plan exists. Therefore, we can view classical planning as a finite-horizon decision problem with a horizon of |S|. This use of value iteration

23. More accurately, it is an N-dimensional simplex, or an (N−1)-dimensional space.
24. Specifically, in stochastic settings V^k(s) indicates the probability with which one reaches the goal region under the optimal policy from s ∈ S − G. In the deterministic case being discussed, this value must be 1 or 0.
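The reachability view of value iteration just described can be made concrete for the deterministic case. The sketch below uses our own encoding, not the paper's: succ maps a (state, action) pair to its unique successor, a state enters the table as soon as some action leads to a state already known to reach the goal (mirroring V^k(s) > 0), and |S| sweeps suffice to detect that no plan exists.

```python
def extract_plan(S, A, succ, goals, s_init):
    """Deterministic goal reachability via value-iteration-style sweeps.
    Returns a list of actions leading from s_init to a goal, or None."""
    steps = {s: 0 for s in goals}        # steps[s] defined  <=>  V^k(s) > 0
    act = {}
    for _ in range(len(S)):              # |S| iterations suffice
        for s in S:
            if s in steps:
                continue
            for a in A:
                s2 = succ.get((s, a))
                if s2 in steps:          # some action reaches a known state
                    steps[s], act[s] = steps[s2] + 1, a
                    break
    if s_init not in steps:
        return None                      # no solution plan exists
    plan, s = [], s_init
    while s not in goals:                # extract actions along the path
        plan.append(act[s])
        s = succ[(s, act[s])]
    return plan
```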
is equivalent to using the Floyd-Warshall algorithm to find a minimum-cost path through a weighted graph (Floyd, 1962).

3.2.1 Planning and Search

While value iteration can, in theory, be used for classical planning, it does not take advantage of the fact that the goal and initial states are known. In particular, it computes the value and policy assignment for all states at all stages. This can be very wasteful, since optimal actions will be computed for states that cannot be reached from s_init or that cannot possibly lead to any state g ∈ G. It is also problematic when |S| is large, since each iteration of value iteration requires O(|S||A|) computations. For this reason dynamic programming approaches have not been used extensively in AI planning.

The restricted form of the value function, especially the fact that initial and goal states are given, makes it more advantageous to view planning as a graph-search problem. Unlike general FOMDPs, where it is generally not known a priori which states are most desirable with respect to (long-term) value, the well-defined set of target states in a classical planning problem makes search-based algorithms appropriate. This is the approach taken by most AI planning algorithms.

One way to formulate the problem as a graph search is to make each node of the graph correspond to a state in S. The initial state and goal states can then be identified, and the search can proceed either forward or backward through the graph, or in both directions simultaneously.

In forward search, the initial state is the root of the search tree. A node is then chosen from the tree's fringe (the set of all leaf nodes), and all feasible actions are applied. Each action application extends the plan by one step (or one stage) and generates a unique new successor state, which is a new leaf node in the tree.
This node can be pruned if the state it defines is already in the tree. The search ends when a state is identified as a member of the goal set (in which case a solution plan can be extracted from the tree), or when all branches have been pruned (in which case no solution plan exists). Forward search attempts to build a plan from beginning to end, adding actions to the end of the current sequence of actions. Forward search never considers states that cannot be reached from s_init.

Backward search can be viewed in several different ways. We could arbitrarily select some g ∈ G as the root of the search tree, and expand the search tree at the fringe by selecting a state on the fringe and adding to the tree all states such that some action would cause the system to enter the chosen state. In general, an action can give rise to more than one predecessor vertex, even if actions are deterministic. A state can again be pruned if it appears in the search tree already. The search terminates when the initial state is added to the tree, and a solution plan can again be extracted from the tree. This search is similar to dynamic-programming-based algorithms for finding the shortest path through a graph. The difference is that backward search considers only those states at depth k in the search tree that can reach the chosen goal state within k steps. Dynamic programming algorithms, in contrast, visit every state at every stage of the search.

One difficulty with the backward approach as described above is the commitment to a particular goal state. Of course, this assumption can be relaxed, as an algorithm could "simultaneously" search for paths to all goal states by adding, at the first level of the search tree, any vertex that can reach some g ∈ G. We will see in Section 5 that goal regression can be viewed as doing this, at least implicitly.
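The forward search just described, with duplicate-state pruning, is essentially breadth-first search over the state graph. A minimal sketch under our own encoding assumptions (a callable actions(s) listing the feasible actions at s, and a deterministic successor function succ(s, a)):

```python
from collections import deque

def forward_search(s_init, goals, actions, succ):
    """Expand fringe nodes in order, pruning any state already in the tree;
    stop when a goal is reached or every branch has been pruned."""
    fringe = deque([(s_init, [])])       # (state, plan prefix leading to it)
    seen = {s_init}
    while fringe:
        s, plan = fringe.popleft()
        if s in goals:
            return plan                  # a solution plan
        for a in actions(s):
            s2 = succ(s, a)
            if s2 not in seen:           # prune duplicate states
                seen.add(s2)
                fringe.append((s2, plan + [a]))
    return None                          # no solution plan exists
```

Backward search has the same shape with the roles of succ and the goal test reversed, expanding predecessor states instead of successors.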
It is generally thought that regression (or backward) techniques are more effective in practice than progression (or forward) methods. The reasoning is that the branching factor in the forward graph, which is the number of actions that can feasibly be applied in a given state, is substantially larger than the branching factor in the reverse graph, which is the number of operators that could bring the system into a given state.^25 This is especially true when goal sets are represented by a small set of propositional literals (Section 5). The two approaches are not mutually exclusive, however: one can mix forward and backward expansions of the underlying problem graph and terminate when a forward path and a backward path meet.

The important thing to observe about these algorithms is that they restrict their attention to the relevant and reachable states. In forward search, only those states that can be reached from s_init are ever considered: this can provide benefit over dynamic programming methods if few states are reachable, since unreachable states cannot play a role in constructing a successful plan. In backward approaches, similarly, only states lying on some path to the goal region G are considered, and this can have significant advantages over dynamic programming if only a fraction of the state space is connected to the goal region.

In contrast, dynamic programming methods (with the exception of asynchronous methods) must examine the entire state space at every iteration. Of course, the ability to ignore parts of the state space comes from planning's stringent definition of what is relevant: states in G have positive reward, no other states matter except to the extent they move the agent closer to the goal, and the choice of action at states unreachable from s_init is not of interest.
While state-based search techniques use knowledge of a specific initial state and a specific goal set to constrain the search process, forward search does not exploit knowledge of the goal set, nor does backward search exploit knowledge of the initial state. The GraphPlan algorithm (Blum & Furst, 1995) can be viewed as a planning method that integrates the propagation of forward reachability constraints with backward goal-informed search. We describe this approach in Section 5. Furthermore, work on partial-order planning (POP) can be viewed as a slightly different approach to this form of search. It too is described in Section 5, after we discuss feature-based or intensional representations for MDPs and planning problems.

3.2.2 Decision Trees and Real-Time Dynamic Programming

State-based search techniques are not limited to deterministic, goal-oriented domains. Knowledge of the initial state can be exploited in more general MDPs as well, forming the basis of decision tree search algorithms. Assume we have been given a finite-horizon FOMDP with horizon T and initial state s_init. A decision tree rooted at s_init is constructed in much the same way as a search tree for a deterministic planning problem (French, 1986). Each action applicable at s_init forms level 1 of the tree. The states s' that result with positive probability when any of those actions is applied at s_init are placed at level 2, with an arc

25. See Bacchus et al. (1995, 1998) for some recent work that makes the case for progression with good search control, and Bonet et al. (1997), who argue that progression in deterministic planning is useful when integrating planning and execution.
Figure 12: The initial stages of a decision tree for evaluating action choices at s_init. The value of an action is the expected value of its successor states, while the value of a state is the maximum of the values of its successor actions (as indicated by dashed arrows at selected nodes).

labeled with probability Pr(s' | a, s_init) relating s' with a. Level 3 has the actions applicable at the states at level 2, and so on, until the tree is grown to depth 2T, at which point each branch of the tree is a path consisting of a positive-probability length-T trajectory rooted at s_init (see Figure 12).

The relevant part of the optimal T-stage value function and the optimal policy can easily be computed using this tree. We say that the value of any node in the tree labeled with an action is the expected value of its successor states in the tree (using the probabilities labeling the arcs), while the value of any node in the tree labeled with state s is the sum of R(s) and the maximum value of its successor actions.^26 The rollback procedure, whereby values at the leaves of the tree are first computed and then values at successively higher levels of the tree are determined using the preceding values, is, in fact, a form of value iteration. The value of any state s at level 2t is precisely V^{T−t}(s), and the maximizing actions form the optimal finite-horizon policy. This form of value iteration is directed: (T−t)-stage-to-go values are computed only for states that are reachable from s_init within t steps.

26. States at level 2T are given value R(s).
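The rollback procedure can be written as a short recursion; because the tree is defined implicitly by the transition model, it never needs to be materialized. The sketch below is our own simplification: it folds R(s) into every state node, omits action costs and discounting, and assumes a dictionary transition model P[a][s] mapping successors to probabilities.

```python
def rollback(s, t, A, R, P):
    """Return (value, best_action) for state s with t stages to go.
    Action nodes take the expectation of their successors' values; state
    nodes take R(s) plus the maximum over their action children; leaves
    (t == 0) get value R(s)."""
    if t == 0:
        return R[s], None
    best_v, best_a = float("-inf"), None
    for a in A:
        v = sum(p * rollback(s2, t - 1, A, R, P)[0]
                for s2, p in P[a][s].items())
        if v > best_v:
            best_v, best_a = v, a
    return R[s] + best_v, best_a
```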
Infinite-horizon problems can be solved in an analogous fashion if one can determine a priori the depth required (i.e., the number of iterations of value iteration needed) to ensure convergence to an optimal policy.

Unfortunately, the branching factor for stochastic problems is generally much greater than that for deterministic problems. More troublesome still is the fact that one must construct the entire decision tree to be sure that the proper values are computed, and hence the optimal policy constructed. This stands in contrast with classical planning search, where attention can be focused on a single branch: if a goal state is reached, the path constructed determines a satisfactory plan. While worst-case behavior for planning may require searching the whole tree, decision-tree evaluation is especially problematic because the entire tree must be generated in general to ensure optimal behavior. Furthermore, infinite-horizon problems pose the difficulty of determining a sufficiently deep tree.

One way around this difficulty is the use of real-time search (Korf, 1990). In particular, real-time dynamic programming, or RTDP, has been proposed in (Barto et al., 1995) as a way of approximately solving large MDPs in an online fashion. One can interleave search with execution of an approximately optimal policy using a form of RTDP similar to decision-tree evaluation as follows. Imagine the agent finds itself in a particular state s_init. It can then build a partial search tree to some depth, perhaps uniformly or perhaps with some branches expanded more deeply than others. Partial tree construction may be halted due to time pressure or due to an assessment by the agent that further expansion of the tree may not be fruitful.

26. States at level 2T are given value R(s).
When a decision to act must be made, the rollback procedure is applied to this partial, possibly unevenly expanded decision tree. Reward values can be used to evaluate the leaves of the tree, but this may offer an inaccurate picture of the value of nodes higher in the tree. Heuristic information can be used to estimate the long-term value of states labeling leaves. As with value iteration, the deeper the tree, the more accurate the estimated value at the root (generally speaking) for a fixed heuristic. We will see in Section 5 that structured representations of MDPs can provide a means to construct such heuristics (Dearden & Boutilier, 1994, 1997). Specifically, with admissible heuristics or upper and lower bounds on the true values of leaf nodes in the tree, methods such as A* or branch-and-bound search can be used. A key advantage of integrating search with execution is that the actual outcome of the action taken can be used to prune from the tree the branches rooted at the unrealized outcomes. The subtree rooted at the realized state can then be expanded further to make the next action choice. The algorithm of Hansen and Zilberstein (1998) can be viewed as a variant of these methods in which stationary policies (i.e., state-action mappings) can be extracted during the search process.

RTDP is formulated by Barto et al. (1995) more generally as a form of online, asynchronous value iteration. Specifically, the values "rolled back" can be cached and used as improved heuristic estimates of the value function at the states in question. This technique is also investigated in (Bonet et al., 1997; Dearden & Boutilier, 1994, 1997; Koenig & Simmons, 1995), and is closely tied to Korf's (1990) LRTA* algorithm.
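A single trial of this kind of online value iteration is easy to sketch. Everything below is illustrative rather than the algorithm of Barto et al.: the depth-one lookahead (rather than a deeper partial tree), the dictionary transition model, the discount factor, and the heuristic h supplying initial value estimates are all our own simplifications.

```python
import random

def rtdp_trial(s, goals, A, R, P, V, h, beta=0.9, max_steps=50, rng=random):
    """One trial of real-time dynamic programming: back up the value of the
    current state only (caching the rolled-back value in V), act greedily,
    and move to a sampled outcome of the chosen action."""
    def val(s2):
        return V.get(s2, h(s2))          # cached value, else heuristic estimate
    for _ in range(max_steps):
        if s in goals:
            break
        q = {a: R[s] + beta * sum(p * val(s2) for s2, p in P[a][s].items())
             for a in A}
        a = max(q, key=q.get)            # greedy action choice
        V[s] = q[a]                      # asynchronous Bellman backup at s
        outcomes = list(P[a][s].items())
        s = rng.choices([s2 for s2, _ in outcomes],
                        weights=[p for _, p in outcomes])[0]
    return V
```

Repeated trials from the same start state propagate heuristic information backward along executed trajectories, improving the cached estimates exactly where the agent actually goes.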
These value updates also need not proceed strictly using a decision tree to determine the states; the key requirement of RTDP is simply that the actual state s_init be one of the states whose value is updated at each decision-action iteration.

A second way to avoid some of the computational difficulties that arise in large search spaces is to use sampling methods. These methods sample from the space of possible trajectories and use this sampled information to provide estimates of the values of specific courses of action. This approach is quite common in reinforcement learning (Sutton & Barto, 1998), where simulation models are often used to generate experience from which a value function can be learned. In the present context, Kearns, Mansour and Ng (1999) have investigated search methods for infinite-horizon MDPs where the successor states of any specific action are randomly sampled according to the transition distribution. Thus, rather than expand all successor states, only sampled states are searched. Though this method is exponential in the "effective" horizon (or mixing rate) of the MDP and is required to expand all actions, the number of states expanded can be less than that required by full search, even if the underlying transition graph is not sparse. They are able to provide polynomial bounds (ignoring action branching and horizon effects) on the number of trajectories that need to be sampled in order to generate approximately optimal behavior with high probability.

3.3 Summary

We have seen that dynamic programming methods and state-based search methods can both be used for fully observable and non-observable MDPs, with forward search methods interpretable as "directed" forms of value iteration.
Dynamic programming algorithms generally require explicit enumeration of the state space at each iteration, while search techniques enumerate only reachable states; but the branching factor may require that, at sufficient depth in the search tree, search methods enumerate individual states multiple times, whereas states are considered only once per stage in dynamic programming. Overcoming this difficulty in search requires the use of cycle-checking and multiple-path-checking methods.

We note that search techniques can be applied to partially observable problems as well. One way to do this is to search through the space of belief states (just as dynamic programming can be applied to the belief-space MDP; see Section 2.10.2). Specifically, belief states play the role of system states, and the stochastic effects of actions on belief states are induced by specific observation probabilities, since each observation has a distinct, but fixed, effect on any belief state. This type of approach has been pursued in (Bonet & Geffner, 1998; Koenig & Simmons, 1995).

4. Factored Representations

To this point our discussion of MDPs has used an explicit or extensional representation for the set of states (and actions), in which states are enumerated directly. In many cases it is advantageous, from both the representational and computational points of view, to talk about properties of states or sets of states: the set of possible initial states, the set of states in which action a can be executed, and so on. It is generally more convenient and compact to describe sets of states based on certain properties or features than to enumerate them explicitly. Representations in which descriptions of objects substitute for the objects themselves are called intensional and are the technique of choice in AI systems.
An intensional representation for planning systems is often built by defining a set of features that are sufficient to describe the state of the dynamic system of interest. In the example in Figure 2, the state was described by a set of six features: the robot's location, the lab's tidiness, whether or not mail is present, whether or not the robot has mail, whether or not there is a pending coffee request, and whether or not the robot has coffee. The first and second features can each take one of five values, and the last four can each take one of two values (true or false). An assignment of values to the six features completely defines a state; the state space thus comprises all possible combinations of feature values, with |S| = 400. Each feature, or factor, is typically assigned a unique symbolic name, as indicated in the second column of Figure 2.

The fundamental tradeoff between extensional and intensional representations becomes clear in this example. An extensional representation of the coffee example views the space of possible states as a single variable that takes on 400 possible values, whereas the intensional or factored representation views a state as the cross product of six variables, each of which takes on substantially fewer values. Generally, the state space grows exponentially in the number of features required to describe a system.

The fact that the state of a system can be described using a set of features allows one to adopt factored representations of actions, rewards and other components of an MDP. In a factored action representation, for instance, one generally describes the effect of an action on specific state features rather than on entire states. This often provides considerable representational economy.
For instance, in the Strips action representation (Fikes & Nilsson, 1971), the state transitions induced by actions are represented implicitly by describing the effects of actions on only those features that change value when the action is executed. Factored representations can be very compact when individual actions affect relatively few features, or when their effects exhibit certain regularities. Similar remarks apply to the representation of reward functions, observation models, and so on. The regularities that make factored representations suitable for many planning problems can often be exploited by planning and decision-making algorithms.

While factored representations have long been used in classical AI planning, similar representations have also been adopted in the recent use of MDP models within AI. In this section (Section 4), we focus on the economy of representation afforded by exploiting the structure inherent in many planning domains. In the following section (Section 5), we describe how this structure, when made explicit by the factored representations, can be exploited computationally in plan and policy construction.

4.1 Factored State Spaces and Markov Chains

We begin by examining structured states, or systems whose state can be described using a finite set of state variables whose values change over time.^27 To simplify our illustration of the potential space savings, we assume that these state variables are boolean. If there are M such variables, then the size of the state space is |S| = N = 2^M. For large M, specifying or representing the dynamics explicitly using state-transition diagrams or N × N matrices is impractical. Furthermore, representing a reward function as an N-vector, and specifying the observational probabilities, is similarly infeasible. In Section 4.2, we define a class of problems in which the dynamics can be represented in O(M) space in many cases.
We begin by considering how to represent Markov chains compactly, and then consider incorporating actions, observations and rewards. We let a state variable X take on a finite number of values, and let val(X) stand for the set of possible values. We assume that val(X) is finite, though much of what follows can be applied to countable state and action spaces as well. We say the state space is flat if it is specified using one state variable (this variable is denoted S as in the general model, taking values from S). The state space is factored if there is more than one state variable. A state is any possible assignment of values to these variables. Letting X_i represent the i-th state variable, the state space is the cross product of the value spaces for the individual state variables; that is, S = ∏_{i=1}^{M} val(X_i). Just as S^t denotes the state of the process at stage t, we let X_i^t be the random variable representing the value of the i-th state variable at stage t.

27. These variables are often called fluents in the AI literature (McCarthy & Hayes, 1969). In classical planning, these are the atomic propositions used to describe the domain.

A Bayesian network (Pearl, 1988) is a representational framework for compactly representing a probability distribution in factored form. Although these networks have most typically been used to represent atemporal problem domains, we can apply the same techniques to represent Markov chains, encoding the chain's transition probabilities in the network structure (Dean & Kanazawa, 1989). Formally, a Bayes net is a directed acyclic graph in which vertices correspond to random variables and an edge between two variables indicates a direct probabilistic dependency between them. A network so constructed also reflects implicit independencies among the variables.
The network must be quantified by specifying a probability for each variable (vertex) conditioned on all possible values of its immediate parents in the graph. In addition, the network must include a marginal distribution: an unconditional probability for each vertex that has no parents. This quantification is captured by associating a conditional probability table (CPT) with each variable in the network. Together with the independence assumptions defined by the graph, this quantification defines a unique joint distribution over the variables in the network. The probability of any event over this space can then be computed using algorithms that exploit the independencies represented within the graphical structure. We refer to Pearl (1988) for details. Figures 3(a)-(c) (page 7) are special cases of Bayes nets called "temporal" Bayesian networks. In these networks, vertices in the graph represent the system's state at different time points and arcs represent dependencies across time points. In these temporal networks, each vertex's parent is its temporal predecessor, the conditional distributions are transition probability distributions, and the marginal distributions are distributions over initial states. The networks in Figure 3 reflect an extensional representation scheme in which states are explicitly enumerated, but techniques for building and performing inference in probabilistic temporal networks are designed especially for application to factored representations. Figure 13 illustrates a two-stage temporal Bayes net (2TBN) describing the state-transition probabilities associated with the Markov chain induced by the fixed policy of executing the action CClk (repeatedly moving counterclockwise).
In a 2TBN, the set of variables is partitioned into those corresponding to state variables at a given time (or stage) t and those corresponding to state variables at time t+1. Directed arcs indicate probabilistic dependencies between those variables in the Markov chain. Diachronic arcs are those directed from time t variables to time t+1 variables, while synchronic arcs are directed between variables at time t+1. Figure 13 contains only diachronic arcs; synchronic arcs will be discussed later in this section. Given any state at time t, the network induces a unique distribution over states at t+1. The quantification of the network describes how the state of any particular variable changes as a function of certain state variables. The lack of a direct arc (or more generally a directed path if there are synchronic arcs among the t+1 variables) from a variable X^t to another variable Y^{t+1} means that knowledge of X^t is irrelevant to the prediction of the (immediate, or one-stage) evolution of variable Y in the Markov process. Figure 13 shows how compact this representation can be in the best of circumstances, as many of the potential links between one stage and the next can be omitted. The graphical representation makes explicit the fact that the policy (i.e., the action CClk) can affect only the state variable Loc, and the exogenous events ArrM, ReqC, and Mess can affect only the variables M, CR, and Tidy, respectively.

Figure 13: A factored 2TBN for the Markov chain induced by moving counterclockwise (with selected CPTs shown).
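The unique distribution over t+1 states induced by a 2TBN can be computed variable by variable. The sketch below is restricted to three variables for brevity, using only the conditional probabilities quoted in the surrounding text (the CClk movement probabilities, the ReqC-driven CR dynamics, and persistent RHC); the rest of the variables and CPT rows are omitted.

```python
# Next-state probabilities under the CClk chain of Figure 13, restricted
# to three variables. Each time-(t+1) variable depends only on its own
# time-t value, so the joint is a product of per-variable conditionals.

# Pr(Loc^{t+1} | Loc^t): counterclockwise movement succeeds w.p. 0.9.
loc_cpt = {"O": {"M": 0.9, "O": 0.1}}   # only the row for Loc = O shown

def next_state_prob(loc, cr, rhc, loc2, cr2, rhc2):
    p = loc_cpt[loc].get(loc2, 0.0)
    # CR: becomes true w.p. 0.2 (a ReqC event); once true, stays true.
    p_cr_true = 1.0 if cr else 0.2
    p *= p_cr_true if cr2 else 1.0 - p_cr_true
    # RHC: persists with certainty under this policy.
    p *= 1.0 if rhc2 == rhc else 0.0
    return p

print(next_state_prob("O", False, True, "M", True, True))  # 0.9 * 0.2 * 1.0
```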
^28 Furthermore, the dynamics of Loc (and the other variables) can be described using only knowledge of the state of their parent variables; for instance, the distribution over Loc at t+1 depends only on the value of Loc at the previous stage (e.g., if Loc^t = O, then Loc^{t+1} = M with probability 0.9 and Loc^{t+1} = O with probability 0.1). Similarly, CR can become true with probability 0.2 (due to a ReqC event), but once true, cannot become false (under this simple policy); and RHC remains true (or false) with certainty if it was true (or false) at the previous stage. Finally, the effects on the relevant variables are independent. For any instantiation of the variables at time t, the distribution over next states can be computed by multiplying the conditional probabilities of relevant t+1 variables. The ability to omit arcs from the graph based on the locality and independence of action effects has a strong effect on the number of parameters that must be supplied to complete the model. Although the full transition matrix for CClk would be of size 400^2 = 160,000, the transition model in Figure 13 requires only 66 parameters.^29 The example above shows how 2TBNs exploit independence to represent Markov chains compactly, but the example is extreme in that there is effectively no relationship between the variables: the chain can be viewed as the product of six independently evolving processes.

28. We show only some of the CPTs for brevity.

29. In fact, we can exploit the fact that probabilities sum to one and leave one entry unspecified per row of any CPT or explicit transition matrix. In this case, the 2TBN requires only 48 explicit parameters, while the transition matrix requires 400 x 399 = 159,600 entries. We generally ignore this fact when comparing the sizes of representations.
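The parameter counts quoted above can be reproduced by a short calculation. This is a sketch consistent with those counts: each time-(t+1) variable in Figure 13 depends only on its own time-t value, and the 5-valued domains for Loc and T are inferred from the 400-state figure.

```python
# Parameter counts for the CClk chain: flat matrix vs. 2TBN.
# Domain sizes (Loc and T 5-valued, the rest boolean) are inferred
# from the 400-state count; parent sets are only diachronic self-arcs.
domains = {"Loc": 5, "T": 5, "CR": 2, "RHC": 2, "RHM": 2, "M": 2}
parents = {v: [v] for v in domains}

def cpt_entries(var):
    rows = 1
    for p in parents[var]:
        rows *= domains[p]          # one row per parent assignment
    return rows * domains[var]      # one entry per child value

n_states = 1
for size in domains.values():
    n_states *= size

print(n_states)                               # 400 states
print(n_states ** 2)                          # 160000 flat-matrix entries
print(sum(cpt_entries(v) for v in domains))   # 66 entries in the 2TBN
```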
Figure 14: A 2TBN for the Markov chain induced by moving counterclockwise and delivering coffee.

In general, these "subprocesses" will interact, but still exhibit certain independencies and regularities that can be exploited by a 2TBN representation. We consider two distinct Markov chains that exhibit different types of dependencies. Figure 14 illustrates a 2TBN representing the Markov chain induced by the following policy: the robot consistently moves counterclockwise unless it is in the office and has coffee, in which case it delivers coffee to the user. Notice that different variables are now dependent: for instance, predicting the value of RHC at t+1 requires knowing the values of Loc and RHC at t. The CPT for RHC shows that the robot retains coffee at stage t+1 with certainty, if it has it at stage t, in all locations except O (where it executes DelC, thus losing the coffee). The variable Loc also depends on the value of RHC. The location will change as in Figure 13 with one exception: if the robot is in the office with coffee, the location remains the same (since the robot does not move, but executes DelC). The effect on the variable CR is explained as follows: if the robot is in the office and delivers coffee in its possession, it will fulfill any outstanding coffee request.
However, the 0.05 chance of CR remaining true under these conditions indicates a 5% chance of spilling the coffee. Even though there are more dependencies (i.e., additional diachronic arcs) in this 2TBN, it still requires only 118 parameters. Again, the distribution over resulting states is determined by multiplying the conditional distributions for the individual variables. Even though the variables are "related," when the state S^t is known, the variables at time t+1 (Loc^{t+1}, RHC^{t+1}, etc.) are independent. In other words,

Pr(Loc^{t+1}, T^{t+1}, CR^{t+1}, RHC^{t+1}, RHM^{t+1}, M^{t+1} | S^t) = Pr(Loc^{t+1} | S^t) Pr(T^{t+1} | S^t) Pr(CR^{t+1} | S^t) Pr(RHC^{t+1} | S^t) Pr(RHM^{t+1} | S^t) Pr(M^{t+1} | S^t)   (8)

Figure 15 illustrates a 2TBN representing the Markov chain induced by the same policy as above, but where we assume that the act of moving counterclockwise has a slightly different effect. We now suppose that, when the robot moves from the hallway into some adjacent location, it has a 0.3 chance of spilling any coffee it has in its possession: the fragment of the CPT for RHC in Figure 15 illustrates this possibility. Furthermore, should the robot be carrying mail whenever it loses coffee (whether "accidentally" or "intentionally" via the DelC action), there is a 0.5 chance it will lose the mail. Notice that the effects of this policy on the variables RHC and RHM are correlated: one cannot accurately predict the probability of RHM^{t+1} without determining the probability of RHC^{t+1}. This correlation is modeled by the synchronic arc between RHC and RHM at the t+1 slice of the network. The independence of the t+1 variables given S^t does not hold in 2TBNs with synchronic arcs. Determining the probability of a resulting state requires some simple probabilistic reasoning, for example, application of the chain rule.
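For the correlated pair RHC and RHM, this chain-rule computation can be sketched directly. The numbers below come from the prose (a 0.3 chance of spilling coffee when leaving the hallway, and a 0.5 chance of then losing the mail); the assumption that mail is always kept when coffee is kept is implicit in the text.

```python
# Chain-rule sketch for the synchronic arc RHC^{t+1} -> RHM^{t+1}:
# Pr(RHC', RHM' | S^t) = Pr(RHM' | RHC', S^t) * Pr(RHC' | S^t).

p_keep_coffee = 0.7                  # Pr(RHC' = true | S^t): 0.3 spill chance
p_keep_mail_given = {True: 1.0,      # mail kept if coffee is kept (assumed)
                     False: 0.5}     # mail lost w.p. 0.5 if coffee is lost

joint = {}
for rhc in (True, False):
    for rhm in (True, False):
        p_rhc = p_keep_coffee if rhc else 1.0 - p_keep_coffee
        p_keep = p_keep_mail_given[rhc]
        p_rhm = p_keep if rhm else 1.0 - p_keep
        joint[(rhc, rhm)] = p_rhm * p_rhc

print(joint[(False, False)])   # Pr(lose both) = 0.3 * 0.5, i.e. about 0.15
```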
In this example, we can write

Pr(RHC^{t+1}, RHM^{t+1} | S^t) = Pr(RHM^{t+1} | RHC^{t+1}, S^t) Pr(RHC^{t+1} | S^t)

The joint distribution over t+1 variables given S^t can then be computed as in Equation 8 above, with this term replacing the product Pr(RHC^{t+1} | S^t) Pr(RHM^{t+1} | S^t); while these two variables are correlated, the remaining variables are independent. We refer to 2TBNs with no synchronic arcs, like the one in Figure 14, as simple 2TBNs. General 2TBNs allow synchronic as well as diachronic arcs, as in Figure 15.

Figure 15: A 2TBN for the Markov chain induced by moving counterclockwise and delivering coffee with correlated effects.

4.2 Factored Action Representations

Just as we extended Markov chains to account for different actions, we must extend the 2TBN representation to account for the fact that the state transitions are influenced by the agent's choice of action. We discuss a variety of techniques for specifying the transition matrices that exploit the factored state representation to produce representations that are more natural and compact than explicit transition matrices.

4.2.1 Implicit-Event Models

We begin with the implicit-event model from Section 2.3 in which the effects of actions and exogenous events are combined in a single transition matrix. We will consider explicit-event models in Section 4.2.4.
As we saw in the previous section, algorithms such as value and policy iteration require the use of transition models that reflect the ultimate transition probabilities, including the effects of any exogenous events. One way to model the dynamics of a fully observable MDP is to represent each action by a separate 2TBN. The 2TBN shown above in Figure 13 can be seen as a representation of the action CClk (since the policy inducing the Markov chain in that example consists of the repeated application of that action alone). The network fragment in Figure 16(a) illustrates the interesting aspects of the 2TBN for the DelC action including the effects of exogenous events. As above, the robot satisfies an outstanding coffee request if it delivers coffee while it is in the office and has coffee (with a 0.05 chance of spillage), as shown in the conditional probability table for CR. The effect on RHC can be explained as follows: the robot loses the coffee (to the user or to spillage) if it delivers it in the office; if it attempts delivery elsewhere, there is a 0.7 chance that a random passerby will take the coffee from the robot. As in the case of Markov chains, the effects of actions on different variables can be correlated, in which case we must introduce synchronic arcs. Such correlations can be thought of as ramifications (Baker, 1991; Finger, 1986; Lin & Reiter, 1994).

Figure 16: A factored 2TBN for action DelC (a) and structured CPT representations (b,c).
4.2.2 Structured CPTs

The conditional probability table (CPT) for the node CR in Figure 16(a) has 20 rows, one for each assignment to its parents. However, the CPT contains a number of regularities. Intuitively, this reflects the fact that the coffee request will be met successfully (i.e., the variable becomes false) 95% of the time when DelC is executed, if the robot has coffee and is in the right location (the user's office). Otherwise, CR remains true if it was true and becomes true with probability 0.2 if it was not. In other words, there are three distinct cases to be considered, corresponding to three "rules" governing the (stochastic) effect of DelC on CR. This can be represented more compactly by using a decision tree representation (with "else" branches to summarize groups of cases involving multivalued variables such as Loc) like that shown in Figure 16(b), or more compactly still using a decision graph (Figure 16(c)). In tree- and graph-based representations of CPTs, interior nodes are labeled by parent variables, edges by values of the variables, and leaves or terminals by distributions over the child variable's values.^30 Decision-tree and decision-graph representations are used to represent actions in fully observable MDPs in (Boutilier et al., 1995; Hoey, St-Aubin, Hu, & Boutilier, 1999) and are described in detail in (Poole, 1995; Boutilier & Goldszmidt, 1996).^31 Intuitively, trees and graphs embody the rule-like structure present in the family of conditional distributions represented by the CPT, and in the settings we consider often yield considerable representational compactness.

30. When the child is boolean, we label the leaves with only the probability of that variable being true (the probability of the variable being false is one minus this value).
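The three "rules" above can be written as a tree-structured CPT lookup: interior tests are on parent variables, and each leaf gives Pr(CR^{t+1} = true). This is an illustrative sketch; the leaf for the case where the robot delivers in the office but no request is outstanding is an assumption (a fresh request still arrives with probability 0.2, as in the other branch).

```python
# A decision-tree CPT for the effect of DelC on CR (sketch).
def pr_cr_true(rhc, loc, cr):
    """Pr(CR^{t+1} = true) as a function of the parent values at time t."""
    if rhc and loc == "O":          # delivery in the office: 95% success
        return 0.05 if cr else 0.2  # (the cr-false leaf is an assumption)
    return 1.0 if cr else 0.2      # request persists, or arrives w.p. 0.2

print(pr_cr_true(True, "O", True))   # 0.05
print(pr_cr_true(False, "M", True))  # 1.0
```

Three leaves replace the 20-row table, which is precisely the compression that tree- and graph-structured CPTs provide.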
Rule-based representations have been used directly by Poole (1995, 1997a) in the context of decision processes and can often be more compact than trees (Poole, 1997b). We generically refer to representations of this type as 2TBNs with structured CPTs.

4.2.3 Probabilistic STRIPS Operators

The 2TBN representation can be viewed as oriented toward describing the effects of actions on distinct variables. The CPT for each variable expresses how it (stochastically) changes (or persists), perhaps as a function of the state of certain other variables. However, it has long been noted in AI research on planning and reasoning about action that most actions change the state in limited ways; that is, they affect a relatively small number of variables. One difficulty with variable-oriented representations such as 2TBNs is that one must explicitly assert that variables unaffected by a specific action persist in value (e.g., see the CPT for RHC in Figure 13); this is an instance of the infamous frame problem (McCarthy & Hayes, 1969). Another form of representation for actions might be called an outcome-oriented representation: one explicitly describes the possible outcomes of an action or the possible joint effects over all variables. This was the idea underlying the Strips representation from classical planning (Fikes & Nilsson, 1971). A classical Strips operator is described by a precondition and a set of effects. The former identifies the set of states in which the action can be executed, and the latter describes how the input state changes as a result of taking the action. A probabilistic Strips operator (PSO) (Hanks, 1990; Hanks & McDermott, 1994; Kushmerick et al., 1995) extends the Strips representation in two ways. First, it allows actions to have different effects depending on context, and second, it recognizes that the effects of actions are not always known with certainty.
^32 Formally, a PSO consists of a set of mutually exclusive and exhaustive logical formulae, called contexts, and a stochastic effect associated with each context. Intuitively, a context discriminates situations under which an action can have differing stochastic effects. A stochastic effect is itself a set of change sets (a simple list of variable values) with a probability attached to each change set, with the requirement that these probabilities sum to one. The semantics of a stochastic effect can be described as follows: when the stochastic effect of an action is applied at state s, the possible resulting states are determined by the change sets, each occurring with the corresponding probability; the resulting state associated with a change set is constructed by changing variable values at state s to match those in the change set, while all unmentioned variables persist in value.

31. The fact that certain direct dependencies among variables in a Bayes net are rendered irrelevant under specific variable assignments has been studied more generally in the guise of context-specific independence (Boutilier, Friedman, Goldszmidt, & Koller, 1996); see (Geiger & Heckerman, 1991; Shimony, 1993) for related notions.

32. The conditional nature of effects is also a feature of a deterministic extension of Strips known as ADL (Pednault, 1989).

Figure 17: A PSO representation for the DelC action.

Figure 18: A PSO representation of a simplified CClk action.

Note that since only one
context can hold in any state s, the transition distribution for the action at any state s is easily determined. Figure 17 gives a graphical depiction of the PSO for the DelC action (shown as a 2TBN in Figure 16). The three contexts ~RHC, RHC & Loc(O), and RHC & ~Loc(O) are represented using a decision tree. At the leaf of each branch in the decision tree is the stochastic effect (set of change sets and associated probabilities) determined by the corresponding context. For example, when RHC & Loc(O) holds, the action has four possible effects: the robot loses the coffee; it may or may not satisfy the coffee request (due to the 0.05 chance of spillage); and mail may or may not arrive. Notice that each outcome is spelled out completely. The number of outcomes in the other two contexts is rather large due to possible exogenous events (we discuss this further in Section 4.2.4).^33 A key difference between PSOs and 2TBNs lies in their treatment of persistence. All variables that are unaffected by an action must be given CPTs in the 2TBN model, while such variables are not mentioned at all in the PSO model (e.g., compare the variable Loc in both representations of DelC). In this way, PSOs can be said to "solve" the frame problem, since unaffected variables need not be mentioned in an action's description.^34

33. To keep Figure 17 manageable, we ignore the effect of the exogenous event Mess on variable T.

34. For a discussion of the frame problem in 2TBNs, see (Boutilier & Goldszmidt, 1996).

Figure 19: A simplified explicit-event model for DelC.

PSOs can provide an effective means for representing actions with correlated effects.
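The change-set semantics above can be sketched as code. This simplified DelC omits exogenous events, so its outcome lists are much shorter than those of Figure 17; the probabilities are the ones quoted in the text (0.95/0.05 delivery success, 0.7 chance of a passerby taking the coffee elsewhere).

```python
# A PSO sketch: contexts map to stochastic effects, each a list of
# (change_set, probability) pairs; unmentioned variables persist.
def apply_pso(state, contexts):
    """Return the next-state distribution for the first matching context."""
    for holds, effect in contexts:
        if holds(state):
            dist = {}
            for change, p in effect:
                nxt = dict(state)
                nxt.update(change)               # unmentioned vars persist
                key = tuple(sorted(nxt.items()))
                dist[key] = dist.get(key, 0.0) + p
            return dist
    raise ValueError("contexts must be exhaustive")

delc = [  # mutually exclusive, exhaustive contexts (exogenous events omitted)
    (lambda s: s["RHC"] and s["Loc"] == "O",
     [({"RHC": False, "CR": False}, 0.95),       # delivery succeeds
      ({"RHC": False}, 0.05)]),                  # spill: request stands
    (lambda s: s["RHC"],
     [({"RHC": False}, 0.7),                     # passerby takes the coffee
      ({}, 0.3)]),
    (lambda s: True, [({}, 1.0)]),               # no coffee: nothing happens
]

dist = apply_pso({"Loc": "O", "RHC": True, "CR": True}, delc)
for outcome, p in sorted(dist.items()):
    print(dict(outcome), p)
```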
Recall the description of the CClk action captured in Figure 15, where the robot may drop its coffee as it moves from the hallway, and may drop its mail only if it drops the coffee. In the 2TBN representation of CClk, one must have both RHC^t and RHC^{t+1} as parents of RHM^{t+1}: we must model the dependence of RHM on a change in value in the variable RHC. Figure 18 shows the CClk action in PSO format (for simplicity, we ignore the occurrence of exogenous events). The PSO representation can offer an economical representation of correlated effects such as this since the possible outcomes of moving in the hallway are spelled out explicitly. Specifically, the (possible) simultaneous change in values of the variables in question is made clear.

4.2.4 Explicit-Event Models

Explicit-event models can also be represented using 2TBNs in a somewhat different form. As in our discussion in Section 2.3, the form taken by explicit-event models depends crucially on one's assumptions about the interplay between the effects of the action itself and exogenous events. However, under certain assumptions even explicit-event models can be rather concise. To illustrate, Figure 19 shows the deliver-coffee action represented as a 2TBN with exogenous events explicitly represented. The first "slice" of the network shows the effects of the action DelC without the presence of exogenous events. The subsequent slices describe the effects of the events ArrM and Mess (we use only two events for illustration). Notice the presence of the extra random variables representing the occurrence of the events in question. The CPTs for these nodes reflect the occurrence probabilities for the events under various conditions, while the directed arcs from the event variables to state variables indicate the effects of these events.
These probabilities do not depend on all state variables in general; thus, this 2TBN represents the occurrence vectors (see Section 2.3) in a compact form. Also notice that, in contrast to the event occurrence variables, we do not explicitly represent the action occurrence as a variable in the network, since we are modeling the effect on the system given that the action was taken.^35 This example reflects the assumptions described in Section 2.3, namely, that events occur after the action takes place and that event effects are commutative, and for this reason the ordering of the events ArrM and Mess in the network is irrelevant. Under this model, the system actually passes through two intermediate though not necessarily distinct states as it goes from stage t to stage t+1; we use subscripts e1 and e2 to suggest this process. Of course, as described earlier, not all actions and events can be combined in such a decomposable way; more complex combination functions can also be modeled using 2TBNs (for one example, see Boutilier & Puterman, 1995).

4.2.5 Equivalence of Representations

An obvious question one might ask concerns the extent to which certain representations are inherently more concise than others. Here we focus on the standard implicit-event models, describing some of the domain features that make the different representations more or less suitable. Both 2TBN and PSO representations are oriented toward representing the changes in the values of the state variables induced by an action; a key distinction lies in the fact that 2TBNs model the influence on each variable separately, while the PSO model explicitly represents complete outcomes. A simple 2TBN (a network with no synchronic arcs) can be used to represent an action in cases where there are no correlations among the action's effects on different state variables.
In the worst case, when the effect on each variable differs at each state, each time t+1 variable must have all time t variables as parents. If there are no regularities that can be exploited in structured CPT representations, then such an action requires the specification of O(n 2^n) parameters (assuming boolean variables), compared with the 2^{2n} entries required by an explicit transition matrix. When the number of parents of any variable is bounded by k, then we need specify no more than n 2^k conditional probabilities. This can be further reduced if the CPTs exhibit structure (e.g., can be represented concisely in a decision tree). For instance, if the CPT can be captured by the representation of choice with no more than f(k) entries, where f is a polynomial function of the number of parents of a variable, then the representation size, O(n f(k)), is polynomial in the number of state variables. This is often the case, for instance, in actions where one of their (stochastic) effects on a variable requires that some number of (pre-)conditions hold; if any of them do not, a different effect comes into play. A PSO representation may not be as concise as a 2TBN when an action has multiple independent stochastic effects. A PSO requires that each possible change list be enumerated with its corresponding probability of occurrence. The number of such changes grows exponentially with the number of variables affected by the action.

35. Sections 4.2.7 and 4.3 discuss representations that model the choice of action explicitly as a variable in the network.

Figure 20: A "factored" PSO representation for the DelC action.

This fact is evident in
Figure 17, where the impact of exogenous events affects a number of variables stochastically and independently. The problem can arise with respect to "direct" action effects, as well. Consider an action in which a set of 10 unpainted parts is spray painted; each part is successfully painted with probability 0.9, and these successes are uncorrelated. Ignoring the complexity of representing different conditions under which the action could take place, a simple 2TBN can represent such an action with 10 parameters (one success probability per part). In contrast, a PSO representation might require one to list all 2^10 distinct change lists and their associated probabilities. Thus, a PSO representation can be exponentially larger (in the number of affected variables) than a simple 2TBN representation. Fortunately, if certain variables are affected deterministically, these do not cause the PSO representation to blow up. Furthermore, PSO representations can also be modified to exploit the independence of an action's effects on different state variables (Boutilier & Dearden, 1994; Dearden & Boutilier, 1997), thus escaping this combinatorial difficulty. For instance, we might represent the DelC action shown in Figure 17 in the more "factored" form illustrated in Figure 20 (for simplicity, we show only the effect of the action and the exogenous event ArrM). Much like a 2TBN, we can determine an overall effect by combining the change sets (in the appropriate contexts) and multiplying the corresponding probabilities. Simple 2TBNs defined over the original set of state variables are not sufficient to represent all actions.^36 Correlated action effects require the presence of synchronic arcs. In the worst case, this means that time t+1 variables can have up to 2n - 1 parents.
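The contrast in the spray-painting example (10 parameters versus 2^10 change lists) can be checked directly; this sketch enumerates the naive PSO outcome list for 10 independent boolean effects.

```python
from itertools import product

# The spray-painting example: 10 parts, each painted independently
# with probability 0.9. A simple 2TBN needs one parameter per part,
# while a naive PSO enumerates every joint outcome as a change list.
n_parts = 10
tbn_params = [0.9] * n_parts        # 10 numbers suffice for the 2TBN

pso_outcomes = {}                   # change list -> probability
for outcome in product([True, False], repeat=n_parts):
    p = 1.0
    for painted in outcome:
        p *= 0.9 if painted else 0.1
    pso_outcomes[outcome] = p

print(len(tbn_params))      # 10
print(len(pso_outcomes))    # 1024 = 2^10 change lists
```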
In fact, the acyclicity condition assures that in the worst case, the total number of parents is sum_{k=1}^{n} (2k - 1); thus, we end up specifying O(2^{2n}) entries, the same as required by an explicit transition matrix. However, if the number of parents (whether occurring within the time slice t or t+1) can be bounded, or if regularities in the CPTs allow a compact representation, then 2TBNs can still be profitably used. PSO representations compare more favorably to 2TBNs in cases in which most of an action's effects on different variables are correlated. In this case, PSOs can provide a somewhat more economical representation of action effects, primarily because one needn't worry about frame conditions. The main advantage of PSOs is that one need not enlist the aid of probabilistic reasoning procedures to determine the transitions induced by actions with correlated effects. Contrast the explicit specification of outcomes in PSOs with the type of reasoning required to determine the joint effects of an action represented in 2TBN form with synchronic arcs, as described in Section 4.1. Essentially, correlated effects are "compiled" into explicit outcomes in PSOs. Recent results by Littman (1997) have shown that simple 2TBNs and PSOs can both be used to represent any action represented as a 2TBN without an exponential blowup in representation size. This is effected by a clever problem transformation in which new sets of actions and propositional variables are introduced (using either a simple 2TBN or PSO representation). The structure of the original 2TBN is reflected in the new planning problem, incurring no more than a polynomial increase in the size of the input action descriptions and the description of any policy.

36. However, Section 4.2.6 discusses certain problem transformations that do render simple 2TBNs sufficient for any MDP.
Though the resulting policy consists of actions that do not exist in the underlying domain, extracting the true policy is not difficult. It should be noted, however, that while such a representation can automatically be constructed from a general 2TBN specification, it is unlikely that it could be provided directly, since the actions and variables in the transformed problem have no "physical" meaning in the original MDP.

4.2.6 Transformations to Eliminate Synchronic Constraints

The discussion above has assumed that the variables or propositions used in the 2TBN or PSO action descriptions are the original state variables. However, certain problem transformations can be used to ensure that one can represent any action using simple 2TBNs, as long as one does not require the original state variables to be used. One such transformation simply clusters all variables on which some action has a correlated effect. A new compound variable, which takes as values assignments to the clustered variables, can then be used in the 2TBN, removing the need for synchronic arcs. Of course, this variable will have a domain size exponential in the number of clustered variables. Some of the intuitions underlying PSOs can be used to convert general 2TBN action descriptions to simple 2TBN descriptions with explicit "events" dictating the precise outcome of the action. Intuitively, this event can occur in k different forms, each corresponding to a different change list induced by the action (or a change list with respect to the variables in question). As an example, we can convert the "action" description for CClk in Figure 15 into the explicit-event model shown in Figure 21.^37 Notice that the "event" takes on values corresponding to the possible effects on the correlated variables RHC and RHM.
Specifically, a denotes the event of the robot escaping the hallway successfully without losing its cargo, b denotes the event of the robot losing only its coffee, and c denotes the event of losing both the coffee and the mail. In effect, the event space represents all possible "combined" effects, obviating the need for synchronic arcs in the network.

37. While Figure 15 describes a Markov chain induced by a policy, the representation of CClk can easily be extracted from it.

Figure 21: An explicit-event model that removes correlations.

4.2.7 Actions as Explicit Nodes in the Network

One difficulty with the 2TBN and PSO approach to action description is that each action is represented separately, offering no opportunity to exploit patterns across actions. For instance, the fact that location persists in all actions except moving clockwise or counterclockwise means that the "frame axiom" is duplicated in the 2TBN for all other actions (this is not the case for PSOs, of course). In addition, ramifications (or correlated action effects) are duplicated across actions as well. For instance, if a coffee request occurs (with probability 0.2) only when the robot ends up in the office, then this correlation is duplicated across all actions. A more compelling example might be one in which the robot can move a briefcase to a new location in one of a number of ways. We'd like to capture the fact (or ramification) that the contents of the briefcase move to the same location as the briefcase regardless of the action that moves the briefcase.
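The event-based conversion can be sketched as follows. Each joint outcome becomes one value of an event variable; every affected state variable then depends only on that event, with no synchronic arcs. The outcome probabilities below are illustrative stand-ins, not numbers read off Figure 21.

```python
# Hypothetical correlated outcomes for an action taken in the hallway:
# each "event" pairs a change list over the affected variables with its
# probability (assumed values, for illustration only).
EVENTS = {
    "a": ({}, 0.70),                            # keeps both coffee and mail
    "b": ({"RHC": False}, 0.15),                # loses only the coffee
    "c": ({"RHC": False, "RHM": False}, 0.15),  # loses coffee and mail
}

def event_cpt():
    """P(Event): the one stochastic node that replaces the correlated effect."""
    return {e: p for e, (_, p) in EVENTS.items()}

def var_cpt(var):
    """CPT for a variable given the event: a deterministic new value per event,
    with persistence when the event's change list omits the variable."""
    return {e: chg.get(var, "persist") for e, (chg, _) in EVENTS.items()}

print(event_cpt())
print(var_cpt("RHC"))  # {'a': 'persist', 'b': False, 'c': False}
print(var_cpt("RHM"))  # {'a': 'persist', 'b': 'persist', 'c': False}
```

Note that only the event node is stochastic; conditioned on the event, each variable's transition is deterministic, which is exactly what lets the synchronic arcs be dropped.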
To circumvent this difficulty, we can introduce the choice of action as a "random variable" in the network, conditioning the distribution over state variable transitions on the value of this variable. Unlike state variables (or event variables in explicit event models), we do not generally require a distribution over this action variable; the intent is simply to model schematically the conditional state-transition distributions given any particular choice of action. This is because the choice of action will be dictated by the decision maker once a policy is determined. For this reason, anticipating terminology used for influence diagrams (see Section 4.3), we call these nodes decision nodes and depict them in our network diagrams with boxes. Such a variable can take as its value any action available to the agent. A 2TBN with an explicit decision node is shown in Figure 22. In this restricted example, we might imagine the decision node can take one of two values, Clk or CClk. The fact that the issuance of a coffee request at t+1 depends on whether the robot successfully moved from (or remained in) the office is now represented "once" by the arc between Loc_{t+1} and CR_{t+1}, rather than repeated across multiple action networks. Furthermore, the noisy persistence of M under both actions is also represented only once (adding the action PUM, however, undercuts this advantage, as we will see when we try to combine actions). One difficulty with this straightforward use of decision nodes (which is the standard representation in the influence diagram literature) is that adding candidate actions can cause an explosion in the network's dependency structure. For example, consider the two action networks shown in Figure 23(a) and (b).

Figure 22: An influence diagram for a restricted process.
Figure 23: Unwanted dependencies in influence diagrams. (a) action a1; (b) action a2; (c) influence diagram; (d) tree-structured CPT for Y.

Action a1 makes Y true with probability 0.9 if X is true (having no effect otherwise), while a2 makes Y true if Z is true. Combining these actions in a single network in the obvious way produces the influence diagram shown in Figure 23(c). Notice that Y now has four parent nodes, inheriting the union of all its parents in the individual networks (plus the action node) and requiring a CPT with 16 entries for actions a1 and a2, together with eight additional entries for each action that does not affect Y. The individual networks reflect the fact that Y depends on X only when a1 is performed and on Z only when a2 is performed. This fact is lost in the naively constructed influence diagram. However, structured CPTs can be used to recapture this independence and compactness of representation: the tree of Figure 23(d) captures the distribution much more concisely, requiring only eight entries. This structured representation also allows us to express concisely that Y persists under all other actions. In large domains, we expect variables to generally be unaffected by a substantial number of (perhaps most) actions, thus requiring representations such as this for influence diagrams. See (Boutilier & Goldszmidt, 1996) for a deeper discussion of this issue and its relationship to the frame problem. While we provide no distributional information over the action choice, it is not hard to see that a 2TBN with an explicit decision node can be used to represent the Markov chain induced by a particular policy in a very natural way.
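A minimal sketch of such a tree-structured CPT follows. The variable names track Figure 23, but the exact tree shape is an assumption consistent with the text (branch on the action first; each action then tests only the parent it actually uses, and every other action lets Y persist), not a transcription of the figure.

```python
def p_y_true(act, x, z, y):
    """Tree-structured CPT for P(Y'=true | Act, X, Z, Y).
    a1 cares only about X, a2 only about Z; under any other action Y persists.
    Eight leaves in total, versus 16+ rows in the flat tabular CPT."""
    if act == "a1":
        return 1.0 if y else (0.9 if x else 0.0)   # a1: Y set w.p. 0.9 when X
    if act == "a2":
        return 1.0 if y else (1.0 if z else 0.0)   # a2: Y set when Z
    return 1.0 if y else 0.0                       # all other actions: persist

print(p_y_true("a1", x=True, z=False, y=False))    # 0.9
print(p_y_true("a2", x=False, z=True, y=False))    # 1.0
print(p_y_true("noop", x=True, z=True, y=True))    # 1.0
```

The final branch is what the flat influence diagram cannot express compactly: one shared "persistence" leaf covering every action that leaves Y alone.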
Specifically, by adding arcs from state variables at time t to the decision node, the value of the decision node (i.e., the choice of action at that point) can be dictated by the prevailing state.38

4.3 Influence Diagrams

Influence diagrams (Howard & Matheson, 1984; Shachter, 1986) extend Bayesian networks to include special decision nodes to represent action choices, and value nodes to represent the effect of action choice on a value function. The presence of decision nodes means that action choice is treated as a variable under the decision maker's control. Value nodes treat reward as a variable influenced (usually deterministically) by certain state variables. Influence diagrams have not typically been associated with the schematic representation of stationary systems, instead being used as a tool for decision analysts, where the sequential decision problem is carefully handcrafted. This more generic use of influence diagrams has been discussed by Tatman and Shachter (1990). In any event, there is no theory of plan construction associated with influence diagrams: the choice of all possible actions at each stage must be explicitly encoded in the model. Influence diagrams are, therefore, usually used to model finite-horizon decision problems by explicitly describing the evolution of the process at each stage in terms of state variables. As in Section 4.2.7, decision nodes take as values specific actions, though the set of possible actions can be tailored to the particular stage. In addition, an analyst will generally include at each stage only state variables that are thought relevant to the decision at that or subsequent stages. Value nodes are also a key feature of influence diagrams and are discussed in Section 4.5. Usually, a single value node is specified, with arcs indicating the influence of particular state and decision variables (often over multiple stages) on the overall value function.
38. More generally, a randomized policy can be represented by specifying a distribution over possible actions conditioned on the state.

Figure 24: The representation of a reward function in an influence diagram.

Influence diagrams are typically used to model partially observable problems. An arc from a state variable to a decision node reflects the fact that the value of that state variable is available to the decision maker at the time the action is to be chosen. In other words, this variable's value forms part of the observation made at time t prior to the action being selected at time t+1, and the policy constructed can refer to this variable. Once again, this allows a compact specification of the observation probabilities associated with a system. The fact that the probability of a given observation depends directly only on certain variables and not on others can mean that far fewer model parameters are required.

4.4 Factored Reward Representation

We have already noted that it is very common in formulating MDP problems to adopt a simplified value function: assigning rewards to states and costs to actions, and evaluating histories by combining these factors according to some simple function like addition. This simplification alone allows a representation of the value function significantly more parsimonious than one based on a more complex comparison of complete histories. Even this representation requires an explicit enumeration of the state and action space, however, motivating the need for more compact representations for these parameters.
Factored representations for rewards and action costs can often obviate the need to enumerate state and action parameters explicitly. Like an action's effect on a particular variable, the reward associated with a state often depends only on the values of certain features of the state. For example, in our robot domain, we can associate rewards or penalties with undelivered mail, with unfulfilled coffee requests, and with untidiness in the lab. This reward or penalty is independent of other variables, and individual rewards can be associated with the groups of states that differ on the values of the relevant variables. The relationship between rewards and state variables is represented in value nodes in influence diagrams, represented by the diamond in Figure 24. The conditional reward table (CRT) for such a node is a table that associates a reward with every combination of values for its parents in the graph. This table, not shown in Figure 24, is locally exponential in the number of relevant variables. Although Figure 24 shows the case of a stationary Markovian reward function, influence diagrams can be used to represent nonstationary or history-dependent rewards and are often used to represent value functions for finite-horizon problems. Although in the worst case the CRT will take exponential space to store, in many cases the reward function exhibits structure, allowing it to be represented compactly using decision trees or graphs (Boutilier et al., 1995), Strips-like tables (Boutilier & Dearden, 1994), or logical rules (Poole, 1995, 1997a). Figure 24 shows a fragment of one possible decision-tree representation for the reward function used in the running example.
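A factored reward of this kind can be sketched directly: each relevant feature contributes its own penalty, and the state's reward is their sum. The particular penalty values and the state encoding below are illustrative.

```python
def reward(state):
    """Additive factored reward: the sum of independent component penalties.
    `state` maps variable names to values; T is a tidiness level in 0..4
    (4 = fully tidy). All numeric penalties are assumed for illustration."""
    r = 0.0
    if state["CR"]:                   # outstanding coffee request
        r += -3.0
    if state["M"] or state["RHM"]:    # undelivered mail (waiting or held)
        r += -2.0
    r += (state["T"] - 4) / 2         # untidiness penalty, 0 when fully tidy
    return r

worst = {"CR": True, "M": True, "RHM": False, "T": 0}
print(reward(worst))   # -7.0
```

Only three small component tables (here, three lines of code) are needed, rather than one table over all 2^n states; the flat CRT can always be recovered by evaluating `reward` on each state.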
The independence assumptions studied in multiattribute utility theory (Keeney & Raiffa, 1976) provide yet another way in which reward functions can be represented compactly. If we assume that the component attributes of the reward function make independent contributions to a state's total reward, the individual contributions can be combined functionally. For instance, we might imagine penalizing states where CR holds with a (partial) reward of −3, penalizing situations where there is undelivered mail (M ∨ RHM) with −2, and penalizing untidiness level T(i) with (i − 4)/2 (i.e., in proportion to how untidy things are). The reward for any state can then be determined simply by adding the individual penalties associated with each feature. The individual component rewards, along with the combination function, constitute a compact representation of the reward function. The tree fragment in Figure 24, which reflects the additive independent structure just described, is considerably more complex than a representation that defines the (independent) rewards for individual propositions separately. The use of additive reward functions for MDPs is considered in (Boutilier, Brafman, & Geib, 1997; Meuleau, Hauskrecht, Kim, Peshkin, Kaelbling, Dean, & Boutilier, 1998; Singh & Cohn, 1998). Another example of structured rewards is the goal structure studied in classical planning. Goals are generally specified by a single proposition (or a set of literals) to be achieved. As such, they can generally be represented very compactly. Haddawy and Hanks (1998) explore generalizations of goal-oriented models that permit extensions such as partial goal satisfaction, yet still admit compact representations.

4.5 Factored Policy and Value Function Representation

The techniques studied so far have been concerned with the input specification of the MDP: the states, actions, and reward function.
The components of a problem's solution, the policy and the optimal value function, are also candidates for compact structured representation. In the simplest case, that of a stationary policy for a fully observable problem, a policy must associate an action with every state, nominally requiring a representation of size O(|S|). The problem is exacerbated for nonstationary policies and POMDPs. For example, a finite-horizon FOMDP with T stages requires a policy of size O(T|S|). For a finite-horizon POMDP, each possible observable history of length t < T might require a different action choice; as many as ∑_{k=1}^{T} b^k such histories can be generated by a fixed policy, where b is the maximum number of possible observations one can make following an action.39 The fact that policies require too much space motivates the need to find compact functional representations, and standard techniques like the tree structures discussed above for actions and reward functions can be used to represent policies and value functions as well. Here we focus on stationary policies and value functions for FOMDPs, for which any logical function representation may be used. For example, Schoppers (1987) uses a Strips-style representation for universal plans, which are deterministic, plan-like policies. Decision trees have also been used for policies and value functions (Boutilier et al., 1995; Chapman & Kaelbling, 1991). An example policy for the robot domain specified with a decision tree is given in Figure 25.

39. Other methods of dealing with POMDPs, by conversion to FOMDPs over belief space (see Section 2.10.2), are more complex still.

Figure 25: A tree representation of a policy.
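One branch of such a decision-tree policy can be sketched as a nested test. This is a hypothetical fragment in the spirit of Figure 25, covering only states where CR and RHC hold; the predicate names follow the running example, and the choice of Clk as the "move toward the office" action is an assumption.

```python
def policy(state):
    """Decision-tree policy fragment: assumes CR (coffee requested) and
    RHC (robot has coffee) both hold; the tree's other branches are elided."""
    assert state["CR"] and state["RHC"]
    if state["Loc"] == "Office":
        return "DelC"                          # deliver the coffee
    if state["M"] and state["Loc"] == "Mailroom":
        return "PUM"                           # grab the mail on the way
    return "Clk"                               # head toward the office

print(policy({"CR": True, "RHC": True, "M": False, "Loc": "Office"}))   # DelC
print(policy({"CR": True, "RHC": True, "M": True, "Loc": "Mailroom"}))  # PUM
```

Every state reaching the same leaf gets the same action, so the tree's size tracks the number of behaviorally distinct regions rather than |S|.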
This policy dictates, for instance, that if CR and RHC are true: (a) the robot delivers the coffee to the user if it is in the office; and (b) it moves toward the office if it is not in the office, unless (c) there is mail and it is in the mailroom, in which case it should pick up the mail on its way.

4.6 Summary

In this section we discussed a number of compact factored representations for the components of an MDP. We began by discussing intensional state representations, then temporal Bayesian networks as a device for representing the system dynamics. Tree-structured conditional probability tables (CPTs) and probabilistic Strips operators (PSOs) were introduced as an alternative to transition matrices. Similar tree structures and other logical representations were introduced for representing reward functions, value functions, and policies. While these representations can often be used to describe a problem compactly, by themselves they offer no guarantee that the problem can be solved effectively. In the next section we explore algorithms that use these factored representations to avoid iterating explicitly over the entire set of states and actions.

5. Abstraction, Aggregation, and Decomposition Methods

The greatest challenge in using MDPs as the basis for DTP lies in discovering computationally feasible methods for the construction of optimal, approximately optimal, or satisficing policies. Of course, arbitrary decision problems are intractable; even producing satisficing or approximately optimal policies is generally infeasible. However, the previous sections suggest that many realistic application domains may exhibit considerable structure, and furthermore that this structure can be modeled explicitly and exploited so that typical problems can be solved effectively.
For instance, structure of this type can lead to compact factored representations of both input data and output policies, often polynomial-sized with respect to the number of variables and actions describing the problem. This suggests that, for these compact problem representations, policy construction techniques can be developed that exploit this structure and are tractable for many commonly occurring problem instances. Both the dynamic programming and state-based search techniques described in Section 3 exploit structure of a different kind. Value functions that can be decomposed into state-dependent reward functions, or state-based goal functions, can be tackled by dynamic programming and regression search, respectively. These algorithms exploit the structure in decomposable value functions to avoid searching explicitly through all possible policies. However, while these algorithms are polynomial in the size of the state space, the curse of dimensionality makes even these algorithms infeasible for practical problems. Though compact problem representations aid in the specification of large problems, it is clear that a large system can be specified compactly only if the representation exploits "regularities" found in the domain. Recent AI research on DTP has stressed using the regularities implicit in compact representations to speed up the planning process. These techniques focus on both optimal and approximately optimal policy construction. In the following subsection we focus on abstraction and aggregation techniques, especially those that manipulate factored representations. Roughly, these techniques allow the explicit or implicit grouping of states that are indistinguishable with respect to certain characteristics (e.g., value or optimal action choice).
We refer to a set of states grouped in this manner as an aggregate or abstract state, or sometimes as a cluster, and assume that the set of abstract states constitutes a partition of the state space; that is to say, every state is in exactly one abstract state and the union of all abstract states comprises the entire state space.40 By grouping similar states, each abstract state can be treated as a single state, thus alleviating the need to perform computations for each state individually. These techniques can be used for approximation if the elements of an abstract state are only approximately indistinguishable (e.g., if the values of those states lie within some small interval). We then look at the use of problem decomposition techniques, in which an MDP is broken into various pieces, each of which is solved independently; the solutions are then pieced together or used to guide the search for a global solution. If subprocesses whose solutions interact minimally are treated as independent, we might expect an approximately optimal global solution. Furthermore, if the structure of the problem requires a solution to a particular subproblem only, then the solutions to other subproblems can be ignored altogether. Related is the use of reachability analysis to restrict attention to "relevant" regions of state space. Indeed, reachability analysis and the communicating structure of an MDP can be used to form certain types of decompositions. Specifically, we distinguish serial decompositions from parallel decompositions. The result of a serial decomposition can be viewed as a partitioning of the state space into blocks, each representing a (more or less) independent subprocess to be solved. In a serial decomposition, the relationship between blocks is generally more complicated than in the case of abstraction or aggregation. In a partition resulting from decomposition, the states within a particular block may behave quite differently with respect to (say) value or dynamics.
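The partition requirement (every state in exactly one abstract state, and the abstract states jointly covering the state space) can be checked mechanically. A toy sketch, with an invented state space:

```python
def is_partition(clusters, states):
    """True iff the clusters are pairwise disjoint and jointly cover `states`."""
    seen = [s for cluster in clusters for s in cluster]
    # No state may appear twice, and the union must equal the state space.
    return len(seen) == len(set(seen)) and set(seen) == set(states)

states = {0, 1, 2, 3, 4, 5}
print(is_partition([{0, 1}, {2, 3, 4}, {5}], states))   # True: a partition
print(is_partition([{0, 1, 2}, {2, 3, 4, 5}], states))  # False: 2 appears twice
```

The second call illustrates exactly the overlapping, non-disjoint grouping that footnote 40's soft-state aggregation would permit but that is excluded here.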
40. We might also group states into non-disjoint sets that cover the entire state space. We do not consider such soft-state aggregation here, but see (Singh, Jaakkola, & Jordan, 1994).

The important consideration in choosing a decomposition is that it be possible to represent each block compactly, to compute efficiently the consequences of moving from one block to another, and, further, that the subproblems corresponding to the subprocesses can themselves be solved efficiently. A parallel decomposition is somewhat more closely related to an abstract MDP. An MDP is divided into "parallel sub-MDPs" such that each decision or action causes the state to change within each sub-MDP. Thus, the MDP is the cross product or join of the sub-MDPs (in contrast to the union, as in serial decomposition). We briefly discuss several methods that are based on parallel MDP decomposition.

5.1 Abstraction and Aggregation

One way problem structure can be exploited in policy construction relies on the notion of aggregation: grouping states that are indistinguishable with respect to certain problem characteristics. For example, we might group together all states that have the same optimal action, or that have the same value with respect to the k-stage-to-go value function. These aggregates can be constructed during the solution of the problem. In AI, emphasis has generally been placed on a particular form of aggregation, namely abstraction methods, in which states are aggregated by ignoring certain problem features. The policy in Figure 25 illustrates this type of abstraction: those states in which CR, RHC and Loc(O) are true are grouped, and the same action is selected for each such state.
Intuitively, when these three propositions hold, other problem features are ignored and abstracted away (i.e., they are deemed irrelevant). A decision-tree representation of a policy or a value function partitions the state space into a distinct cluster for each leaf of the tree. Other representations (e.g., Strips-like rules) abstract the state space similarly. It is precisely this type of abstraction that is used in the compact, factored representations of actions and goals discussed in Section 4. In the 2TBN shown in Figure 16, the effect of the action DelC on the variable CR is given by the CPT for CR_{t+1}; however, this (stochastic) effect is the same at any state for which the parent variables have the same value. This representation abstracts away other variables, combining states that have distinct values for the irrelevant (non-parent) variables. Intensional representations often make it easy to decide which features to ignore at a certain stage of problem solving, and thus (implicitly) how to aggregate the state space. There are at least three dimensions along which abstractions of this type can be compared. The first is uniformity: a uniform abstraction is one in which variables are deemed relevant or irrelevant uniformly across the state space, while a nonuniform abstraction allows certain variables to be ignored under certain conditions and not under others. The distinction is illustrated schematically in Figure 26. The tabular representation of a CPT can be viewed as a form of uniform abstraction (the effect of an action on a variable is distinguished for all clusters of states that differ on the value of a parent variable, and is not distinguished for states that agree on the parent variables but disagree on others), while a decision-tree representation of a CPT embodies a nonuniform abstraction. A second dimension of comparison is accuracy.
Figure 26: Different forms of state space abstraction (uniform vs. nonuniform; exact vs. approximate; adaptive vs. fixed).

States are grouped together on the basis of certain characteristics, and the abstraction is called exact if all states within a cluster agree on this characteristic. A non-exact abstraction is called approximate. This is illustrated schematically in Figure 26: the exact abstraction groups together states that agree on the value assigned to them by a value function, while the approximate abstraction allows states to be grouped together that differ in value. The extent to which these states differ is often used as a measure of the quality of an approximate abstraction. A third dimension is adaptivity. Technically, this is a property not of an abstraction itself, but of how abstractions are used by a particular algorithm. An adaptive abstraction technique is one in which the abstraction can change during the course of computation, while a fixed abstraction scheme groups states together once and for all (again, see Figure 26). For example, one can imagine using an abstraction in the representation of a value function V^k, then revising this abstraction to represent V^{k+1} more accurately. Abstraction and aggregation techniques have been studied in the OR literature on MDPs. Bertsekas and Castanon (1989) develop an adaptive aggregation (as opposed to abstraction) technique. The proposed method operates on flat state spaces, however, and therefore does not exploit implicit structure in the state space itself. An adaptive, uniform abstraction method is proposed by Schweitzer et al. (1985) for solving stochastic queuing models.
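The exact/approximate distinction can be illustrated by clustering states on their values: a zero tolerance yields an exact abstraction (states grouped only when their values agree), while a positive tolerance yields an approximate one. The state names and values below are invented for illustration.

```python
def group_by_value(values, tol=0.0):
    """Group states whose values lie within tol of a cluster's first value.
    tol=0 gives an exact abstraction; tol>0 an approximate abstraction."""
    clusters = []  # list of (anchor value, [states])
    for s, v in sorted(values.items(), key=lambda kv: kv[1]):
        if clusters and abs(v - clusters[-1][0]) <= tol:
            clusters[-1][1].append(s)
        else:
            clusters.append((v, [s]))
    return [sorted(members) for _, members in clusters]

V = {"s1": 5.3, "s2": 5.5, "s3": 2.7, "s4": 9.3}   # hypothetical values
print(group_by_value(V))           # exact: four singleton clusters
print(group_by_value(V, tol=0.5))  # approximate: s1 and s2 merge
```

The tolerance also bounds the quality loss: within any approximate cluster, no two states' values differ by more than the interval width, matching the "small interval" criterion mentioned earlier.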
These methods, often referred to as aggregation-disaggregation procedures, are typically used to accelerate the calculation of the value function for a fixed policy. Value-function calculation requires computational effort at least quadratic in the size of the state space, which is impractical for very large state spaces. In aggregation-disaggregation procedures, the states are first aggregated into clusters. A system of equations is then solved, or a series of summations performed, requiring effort no more than cubic in the number of clusters. Next, a disaggregation step is performed for each cluster, requiring effort at least linear in the size of the cluster. The net result is that the total work, while at least linear in the total number of states, is at worst cubic in the size of the largest cluster. In DTP it is generally assumed that computations even linear in the size of the full state space are infeasible. Therefore it is important to develop methods that perform work polynomial in the log of the size of the state space. Not all problems are amenable to such reductions without some (perhaps unacceptable) sacrifice in solution quality. In the following section, we review some recent techniques for DTP aimed at achieving such reductions.

5.1.1 Goal Regression and Classical Planning

In Section 3.2 we introduced the general technique of regression (or backward) search through state space to solve classical planning problems, those involving deterministic actions and performance criteria specified in terms of reaching a goal-satisfying state. One difficulty is that such a search requires that any branch of the search tree lead to a particular goal state. This commitment to a goal state may have to be retracted (by backtracking the search process) if no sequence of actions can lead to that particular goal state from the initial state.
However, a goal is usually specified as a set of literals G representing a set of states, where reaching any state in G is equally suitable; it may, therefore, be wasteful to restrict the search to finding a plan that reaches a particular element of G. Goal regression is an abstraction technique that avoids the problem of choosing a particular goal state to pursue. A regression planner works by searching for a sequence of actions as follows: the current set of subgoals SG_0 is initialized as G. At each iteration an action α is selected that achieves one or more of the current subgoals in SG_i without deleting the others, and whose preconditions do not conflict with the unachieved subgoals. The subgoals so achieved are removed from the current subgoal set and replaced by a formula representing the context under which α will achieve the current subgoals, forming SG_{i+1}. This process is known as regressing SG_i through α. The process is repeated until one of two conditions holds: (a) the current subgoal set is satisfied by the initial state, in which case the sequence of actions so selected is a successful plan; or (b) no action can be applied, in which case the current sequence cannot be extended into a successful plan and some earlier action choice must be reconsidered.

Example 5.1 As an example, consider the simplified version of the robot planning example used in Section 3.1 to illustrate value iteration: the robot has only four actions, PUM, GetC, DelC and DelM, which we make deterministic in the obvious way. The initial state s_init is ⟨CR, M, ¬RHC, ¬RHM⟩ and the goal set G is {¬CR, ¬M}. Regressing G through DelM results in SG_1 = {¬CR, M, RHM}. Regressing SG_1 through DelC results in SG_2 = {RHC, M, RHM}. Regressing SG_2 through PUM results in SG_3 = {RHC, M}. Regressing SG_3 through GetC results in SG_4 = {M}.
Note that s_init ∈ SG_4, so the sequence of actions GetC, PUM, DelC, DelM will successfully reach a goal state. □

To see how this algorithm implements a form of abstraction, first note that the goal itself provides an initial partition of the state space, dividing it into one set of states in which the goal is satisfied (G) and a second set in which it is not (the complement of G). Viewed as a partition of a zero-stage-to-go value function, G represents those states whose value is positive, while its complement represents those states whose value is zero. Every regression step can be thought of as revising this partition. When the planning algorithm attempts to satisfy the current subgoal set SG_i by applying action α, it uses regression to compute the (largest) set of states such that, after executing α, all subgoals are satisfied. In particular, the state space is repartitioned into two abstract states: SG_{i+1} and its complement. In this way, the abstraction mechanism implemented by goal regression should be considered adaptive. This can be viewed as an (i+1)-stage value function: any state satisfying SG_{i+1} can reach a goal state in i+1 steps using the action sequence that produced SG_{i+1}.41 The regression process can be stopped when the initial state is a member of the abstract state SG_{i+1}. Figure 27 illustrates the repartitioning of the state space into the different regions SG_{i+1} for each of the steps in the example above.

Figure 27: An example of goal regression.
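Example 5.1 can be reproduced with a small STRIPS-style regression sketch. The precondition and effect lists below are one reading of the example, not a specification taken from the paper; in particular, M is read as "undelivered mail exists," which PUM leaves true and DelM deletes.

```python
# A subgoal set or state is a dict {variable: required truth value}.
ACTIONS = {
    # name: (preconditions, effects) -- hypothetical STRIPS readings of the
    # four deterministic robot actions in Example 5.1.
    "GetC": ({}, {"RHC": True}),
    "PUM":  ({"M": True}, {"RHM": True}),
    "DelC": ({"RHC": True}, {"CR": False, "RHC": False}),
    "DelM": ({"M": True, "RHM": True}, {"M": False, "RHM": False}),
}

def regress(subgoals, action):
    """Regress a subgoal set through an action; None if the action achieves
    nothing, deletes another subgoal, or has conflicting preconditions."""
    pre, eff = ACTIONS[action]
    achieved = {v: x for v, x in subgoals.items() if eff.get(v) == x}
    if not achieved:
        return None
    if any(eff.get(v, x) != x for v, x in subgoals.items() if v not in achieved):
        return None  # action would delete an unachieved subgoal
    rest = {v: x for v, x in subgoals.items() if v not in achieved}
    if any(rest.get(v, x) != x for v, x in pre.items()):
        return None  # preconditions conflict with the remaining subgoals
    rest.update(pre)
    return rest

# Walk the example backward: goal {¬CR, ¬M}, plan GetC, PUM, DelC, DelM.
sg = {"CR": False, "M": False}
for a in ["DelM", "DelC", "PUM", "GetC"]:  # regression visits the plan in reverse
    sg = regress(sg, a)
    print(a, sg)  # successively SG_1, SG_2, SG_3, SG_4

s_init = {"CR": True, "M": True, "RHC": False, "RHM": False}
assert all(s_init[v] == x for v, x in sg.items())  # s_init ∈ SG_4
```

Under these assumed operator definitions, the loop reproduces the subgoal sets SG_1 through SG_4 of the example, and the final assertion is the termination test: the initial state satisfies the last subgoal set.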
While regression produces a compact representation of something like a value function (as in our discussion of deterministic, goal-based dynamic programming in Section 3.2), the analogy is not exact in that the regions produced by regression record only the property of goal reachability contingent on a particular choice of action or action sequence. Standard dynamic programming methods can be implemented in a structured way by simply noticing that a number of different regions can be produced at the i-th iteration by considering all actions that can be regressed at that stage. The union of all of these regressions forms the states that have positive values in V_i, thus making the representation of the i-stage-to-go value function exact. Notice that each iteration is now more costly, since regression through all actions must be attempted, but this approach obviates the need for backtracking and can ensure that a shortest plan is found. Standard regression does not provide such guarantees without commitment to a particular search strategy (e.g., breadth-first). This use of dynamic programming with Strips action descriptions forms the basic idea of Schoppers's universal planning method (Schoppers, 1987).

Another general technique for solving classical planning problems is partial-order planning (POP) (Chapman, 1987; Sacerdoti, 1975), embodied in such popular planning algorithms as SNLP (McAllester & Rosenblitt, 1991) and UCPOP (Penberthy & Weld, 1992). [42] The main motivation for the least-commitment approach comes from the realization that regression techniques incrementally build a plan from the end to the beginning (in the temporal dimension). Thus, each iteration must commit to inserting a step last in the plan.
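The all-actions variant just described can be sketched by keeping, at each stage, every region producible by regressing every current region through every action; the union of the regions at stage i then covers the states with positive i-stage-to-go value, and the first region containing the initial state yields a shortest plan. The operators below are hypothetical STRIPS-style reconstructions of the robot domain, not the paper's exact definitions.

```python
def regress(subgoals, action):
    """Regress a subgoal set through an action (None on failure)."""
    pre, add, dele = action
    achieved = subgoals & add
    if not achieved or (subgoals - achieved) & dele:
        return None
    new = (subgoals - achieved) | pre
    if any(("-" + l) in new for l in new if not l.startswith("-")):
        return None
    return new

# Hypothetical operators: (preconditions, adds, deletes); "-X" negates X.
ACTIONS = {
    "GetC": (frozenset(), frozenset({"RHC"}), frozenset()),
    "PUM":  (frozenset({"M"}), frozenset({"RHM"}), frozenset()),
    "DelC": (frozenset({"RHC"}), frozenset({"-CR"}), frozenset({"RHC"})),
    "DelM": (frozenset({"M", "RHM"}), frozenset({"-M"}), frozenset({"RHM"})),
}

goal = frozenset({"-CR", "-M"})
regions = {goal: []}                  # region -> plan suffix reaching the goal
for stage in range(1, 5):
    new = dict(regions)
    for sg, plan in regions.items():  # regress EVERY region through EVERY action
        for name, act in ACTIONS.items():
            r = regress(sg, act)
            if r is not None and r not in new:
                new[r] = [name] + plan
    regions = new
    print("stage", stage, ":", len(regions), "regions")

s_init = frozenset({"CR", "M", "-RHC", "-RHM"})
plans = [p for sg, p in regions.items() if sg <= s_init]
print("a shortest plan:", min(plans, key=len))
```

Because each region is recorded the first stage it appears, the plan attached to the first region covering s_init has minimal length (four steps in this reconstruction), with no backtracking.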
In many cases it can be determined that a particular step must appear somewhere in the plan, but not necessarily as the last step; and, indeed, in many cases the step under consideration cannot appear last, but this fact cannot be recognized until later choices reveal an inconsistency. In these cases, a regression algorithm will prematurely commit to the incorrect ordering and will eventually have to backtrack over that choice. For example, suppose in the problem scenario above that the robot can hold only one item at a time, coffee or mail. Picking up mail causes the robot to spill any coffee in its possession, and similarly grasping the coffee makes it drop the mail. The plan generated by regression would no longer be valid: once the first two actions (DelC and DelM) have been inserted into the plan, no action can be added to achieve RHC or RHM without making the other one false; the search for a plan would have to backtrack. Ultimately it would be discovered that no successful plan can end with these two actions performed in sequence.

Partial-order planning algorithms proceed much like regression algorithms, choosing actions to achieve unachieved subgoals and using regression to determine new subgoals, but leaving actions unordered to whatever extent possible. Strictly speaking, subgoal sets aren't regressed; rather, each unachieved goal or action precondition is addressed separately, and actions are ordered relative to one another only if one action threatens to negate the desired effect of another.

[41] It is not the case, however, that states outside SG_{i+1} cannot reach the goal region in i+1 steps. It is only the case that they cannot do so using the specific sequence of actions chosen so far.

[42] This type of planning is also sometimes called nonlinear or least-commitment planning. See Weld's (1994) survey for a nice overview.
In the example above, the algorithm might first place actions DelC and DelM into the plan, but leave them unordered. PUM can be added to the plan to achieve the requirement RHM of DelM; it is ordered before DelM but is still unordered with respect to DelC. When GetC is finally added to the plan so as to achieve RHC for action DelC, two threats arise. First, GetC threatens the desired effect RHM of PUM. This can be resolved by ordering GetC before PUM or after DelM. Assume the former ordering is chosen. Second, PUM threatens the desired effect RHC of GetC. This threat can also be resolved by placing PUM before GetC or after DelC; since the first threat was resolved by ordering GetC before PUM, the latter ordering is the only consistent one. The result is the plan GetC, DelC, PUM, DelM. No backtracking was required to generate the plan, because the actions were initially unordered, and orderings were introduced only when the discovery of threats required them.

In terms of abstraction, any incomplete, partially ordered plan that is threat-free, but perhaps has certain "open conditions" (unachieved preconditions or subgoals), can be viewed in much the same way as a partially completed regression plan: any state satisfying the open conditions can reach a goal state by executing any total ordering of the plan's actions consistent with the current set of ordering constraints. See (Kambhampati, 1997) for a framework that unifies various approaches to solving classical plan-generation problems.

While techniques relying on regression have been studied extensively in the deterministic setting, they have only recently been applied to probabilistic unobservable (Kushmerick et al., 1995) and partially observable (Draper, Hanks, & Weld, 1994b) domains.
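The threat bookkeeping in this example can be illustrated by brute force: enumerate the linearizations of the four steps and discard any that violate an ordering constraint or let a step delete a causal link's condition between its producer and consumer. The delete lists below are assumptions encoding the one-item-at-a-time variant of the domain.

```python
from itertools import permutations

steps = ["GetC", "PUM", "DelC", "DelM"]
# Delete lists under the one-item-at-a-time assumption (hypothetical).
deletes = {"GetC": {"RHM"}, "PUM": {"RHC"}, "DelC": {"RHC"}, "DelM": {"RHM"}}
# Causal links: (producer, condition, consumer).
links = [("GetC", "RHC", "DelC"), ("PUM", "RHM", "DelM")]
orderings = {("GetC", "DelC"), ("PUM", "DelM")}   # producers precede consumers

def consistent(total, orderings):
    pos = {s: i for i, s in enumerate(total)}
    return all(pos[a] < pos[b] for a, b in orderings)

def threat_free(total, links, deletes):
    pos = {s: i for i, s in enumerate(total)}
    for prod, cond, cons in links:
        for s in total:
            # s threatens the link if it deletes cond between producer and consumer
            if cond in deletes[s] and pos[prod] < pos[s] < pos[cons]:
                return False
    return True

valid = [t for t in permutations(steps)
         if consistent(t, orderings) and threat_free(t, links, deletes)]
print(valid)
```

Two linearizations survive: the plan in the text, GetC, DelC, PUM, DelM, and its mirror image that delivers the mail first. Which one POP produces depends on its particular threat-resolution choices; least commitment simply avoids ruling either out prematurely.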
For the most part, these techniques assume a goal-based performance criterion and attempt to construct plans whose probability of reaching a goal state exceeds some threshold. They augment standard POP methods with techniques for evaluating a plan's probability of achieving the goal, and techniques for improving this probability by adding further structure to the plan. In the next section, we consider how to use regression-related techniques to solve MDPs with performance criteria more general than goals.

5.1.2 Stochastic Dynamic Programming with Structured Representations

A key idea underlying propositional goal regression, namely that one need only regress the relevant propositions through an action, can be extended to stochastic dynamic programming methods, like value iteration and policy iteration, and used to solve general MDPs. There are, however, two key difficulties to overcome: the lack of a specific goal region and the uncertainty associated with action effects. Instead of viewing the state space as partitioned into goal and non-goal clusters, we consider grouping states according to their expected values. Ideally, we might want to group states according to their value with respect to the optimal policy. Here we consider a somewhat less difficult task, that of grouping states according to their value with respect to a fixed policy. This is essentially the task performed by the policy evaluation step in policy iteration, and the same insights can be used to construct optimal policies. For a fixed policy, we want to group states that have the same value under that policy. Generalizing the goal versus non-goal distinction, we begin with a partition that groups states according to their immediate rewards.
Then, using an analogue of regression developed for the stochastic case, we reason backward to construct a new partition in which states are grouped according to their value with respect to the one-stage-to-go value function. We iterate in this manner so that on the k-th iteration we produce a new partition that groups states according to the k-stage-to-go value function. On each iteration, we perform work polynomial in the number of abstract states (and the size of the MDP representation) and, if we are lucky, the total number of abstract states will be bounded by some logarithmic factor of the size of the state space. To implement this scheme effectively, we have to perform operations like regression without ever enumerating the set of all states, and this is where the structured representations for state-transition, value, and policy functions play a role.

For FOMDPs, approaches of this type are taken in (Boutilier, 1997; Boutilier & Dearden, 1996; Boutilier et al., 1995; Boutilier, Dearden, & Goldszmidt, 1999; Dietterich & Flann, 1995; Hoey et al., 1999). We illustrate the basic intuitions behind this approach by describing how value iteration for discounted infinite-horizon FOMDPs might work. We assume that the MDP is specified using a compact representation of the reward function (such as a decision tree) and actions (such as 2TBNs). In value iteration, we produce a sequence of value functions V_0, V_1, ..., V_n, each V_k representing the utility of the optimal k-stage policy. Our aim is to produce a compact representation of each value function and, using V_n for some suitable n, to produce a compact representation of the optimal stationary policy. Given a compact representation of the reward function R, it is clear that this constitutes a compact representation of V_0. As usual, we think of each leaf of the tree as a cluster of states having identical utility.
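The effect of this grouping can be seen even in an explicit-state toy example: performing ordinary k-stage policy-evaluation backups and counting distinct values shows how few clusters the structured methods would need to represent. The six-state MDP and its transition probabilities below are made up purely for illustration.

```python
n, gamma = 6, 0.9
reward = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]   # initial partition: rewards 0 vs 1
# P[s][t]: transition probability from s to t under the fixed policy
P = [[0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
     [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.2, 0.8, 0.0],
     [0.0, 0.0, 0.0, 0.2, 0.8, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]]

V = reward[:]                             # zero-stage-to-go value function
for k in range(1, 4):
    V = [reward[s] + gamma * sum(P[s][t] * V[t] for t in range(n))
         for s in range(n)]
    clusters = {round(v, 10) for v in V}  # states sharing a value = one cluster
    print(k, "stages to go:", len(clusters), "clusters for", n, "states")
```

Here the six states collapse to three value clusters at every stage; a structured representation would manipulate those three abstract states directly rather than six concrete ones.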
To produce V_1 in compact form, we can proceed in two phases. Each branch of the tree for V_0 provides an intensional description (namely, the conjunction of variable values labeling the branch) of an abstract state, or region, comprising states with identical value with respect to the initial value function V_0. For any deterministic action a, we can perform a regression step using this description to determine the conditions under which, should we perform a, we would end up in this cluster. This would, furthermore, determine a region of the state space containing states of identical future value with respect to the execution of a with one stage to go. [43] Unfortunately, nondeterministic actions cannot be handled in quite this way: at any given state, the action might lead to several different regions of V_0 with non-zero probability. However, for each leaf in the tree representing V_0 (i.e., for each region of V_0), we can regress the conjunction X describing that region through action a to produce the conditions under which X becomes true or false with a specified probability. In other words, instead of regressing in the standard fashion to determine the conditions under which X becomes true, we produce a set of distinct conditions under which X becomes true with different probabilities. By piecing together the regions produced for the different labels in the description of V_0, we can construct a set of regions such that each state in a given region: (a) transitions (under action a) to a particular part of V_0 with identical probability; and hence (b) has identical expected future value (Boutilier et al., 1995). We can view this as a generalization of propositional goal regression suitable for decision-theoretic problems.

Figure 28: An example action.
Example 5.2 To illustrate, consider the example action a shown in Figure 28 and the value function V_0 shown to the left of Figure 29. In order to generate the set of regions consisting of states whose future value (w.r.t. V_0) under a is identical, we proceed in two steps (see Figure 29). We first determine the conditions under which a has a fixed probability of making Y true (hence we have a fixed probability of moving to the left or right subtree of V_0). These conditions are given by the tree representing the CPT for node Y, which makes up the first portion of the tree representing V_1; see Step 1 of Figure 29. Notice that this tree has leaves labeled with the probability of making Y true or (implicitly) false. If a makes Y true, then we know that its future value (i.e., value with zero stages to go) is 8.1; but if Y becomes false, we need to know whether a makes Z true (to determine whether the future value is 0 or 9.0). The probability with which Z becomes true is given by the tree representing the CPT for node Z.

[43] We ignore immediate reward and cost distinctions within the regions so produced in our description; recall that the value of performing action a at any state s is given by R(s), C(a, s) and expected future value. We simply focus on abstract states whose elements have identical future expected value. Differences in immediate reward and cost can be added after the fact.

Figure 29: An iteration of decision-theoretic regression. Step 1 produces the portion of the tree with dashed lines, while Step 2 produces the portion with dotted lines.
In Step 2 in Figure 29, the conditions in that CPT are conjoined to the conditions required for predicting Y's probability (by "grafting" the tree for Z to the tree for Y given by the first step). This grafting is slightly different at each of the three leaves of the tree for Y: (a) the full tree for Z is attached to the leaf X = t; (b) the tree for Z is simplified where it is attached to the leaf X = f and Y = f by removal of the redundant test on variable Y; (c) there is no need to attach the tree for Z to the leaf X = f and Y = t, since a makes Y true with probability 1 under those conditions (and Z is relevant to the determination of V_0 only when Y is false). At each of the leaves of the newly formed tree we have both Pr(Y) and Pr(Z). Each of these joint distributions over Y and Z (the effects of a on these variables are independent by the semantics of the network) tells us the probability of having Y and Z true with zero stages to go given that the conditions labeling the appropriate branch of the tree hold with one stage to go. In other words, the new tree uniquely determines, for any state with one stage remaining, the probability of making any of the conditions labeling the branches of V_0 true. The computation of expected future value obtained by performing a with one stage to go can then be placed at the leaves of this tree by taking expectations over the values at the leaves of V_0.

The new set of regions produced this way describes the function Q_a^1, where Q_a^1(s) is the value associated with performing a at state s with one stage to go and acting optimally thereafter. These functions (one for each action a) can be pieced together (i.e., "maxed"; see Section 3.1) to determine V_1. Of course, the process can be repeated some number of times to produce V_n for some suitable n, as well as the optimal policy with respect to V_n.
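A minimal sketch of one such decision-theoretic regression step, using nested-dict decision trees: the CPT for each post-action variable is evaluated only for the variables V_0 actually mentions, and the expected future value follows from the resulting (independent) marginals. The numbers echo Example 5.2, but the exact tree shapes are assumptions, since the figures are not reproduced here.

```python
def tree_eval(tree, state):
    """Walk a decision tree: internal nodes test a variable, leaves are values."""
    while isinstance(tree, dict):
        tree = tree[state[tree["test"]]]
    return tree

# P(Y'=true | X, Y) and P(Z'=true | Y, Z) for action a (hypothetical shapes)
cpt_Y = {"test": "X", True: 0.9, False: {"test": "Y", True: 1.0, False: 0.0}}
cpt_Z = {"test": "Y", True: 0.9, False: {"test": "Z", True: 1.0, False: 0.0}}
# V_0 as a tree over the post-action state (values from Example 5.2)
v0 = {"test": "Y", True: 8.1, False: {"test": "Z", True: 9.0, False: 0.0}}

def expected_future_value(state):
    """Regress only the variables V_0 mentions (Y and Z), never enumerating
    the state space; the effects on Y and Z are independent given the state."""
    pY = tree_eval(cpt_Y, state)
    pZ = tree_eval(cpt_Z, state)
    ev = 0.0
    for y, py in ((True, pY), (False, 1 - pY)):
        for z, pz in ((True, pZ), (False, 1 - pZ)):
            ev += py * pz * tree_eval(v0, {"Y": y, "Z": z})
    return ev

# Every state on the branch X = f, Y = t gets the same expected value,
# whatever the value of Z: one abstract region, one number.
print(expected_future_value({"X": False, "Y": True, "Z": False}))
```

All states satisfying the same branch conditions receive the same expected value, which is exactly what allows the method to label regions rather than individual states.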
This basic technique can be used in a number of different ways. Dietterich and Flann (1995) propose ideas similar to these, but restrict attention to MDPs with goal regions and deterministic actions (represented using Strips operators), thus rendering true goal-regression techniques directly applicable. [44] Boutilier et al. (1995) develop a version of modified policy iteration to produce tree-structured policies and value functions, while Boutilier and Dearden (1996) develop the version of value iteration described above. These algorithms are extended to deal with correlations in action effects (i.e., synchronic arcs in the 2TBNs) in (Boutilier, 1997). These abstraction schemes can be categorized as nonuniform, exact and adaptive.

The utility of such exact abstraction techniques has not been tested on real-world problems to date. In (Boutilier et al., 1999), results on a series of abstract process-planning examples are reported, and the scheme is shown to be very useful, especially for larger problems. For example, in one specific problem with 1.7 million states, the tree representation of the value function has only 40,000 leaves, indicating a tremendous amount of regularity in the value function. Schemes like this exploit such regularity to solve problems more quickly (in this example, in much less than half the time required by modified policy iteration) and with much lower memory demands. However, these schemes do involve substantial overhead in tree construction, and for smaller problems with little regularity, the overhead is not repaid in time savings (simple vector-matrix representation methods are faster), though they still generally provide substantial memory savings. What might be viewed as best- and worst-case behavior is also described in (Boutilier et al., 1999).
In a series of "linear" examples (i.e., problems with value functions that can be represented with trees whose size is linear in the number of problem variables), the tree-based scheme solves problems many orders of magnitude faster than classical state-based techniques. In contrast, problems with exponentially many distinct values are also tested (i.e., with a distinct value at each state): here tree-construction methods are required to construct a complete decision tree in addition to performing the same number of expected-value and maximization computations as classical methods. In this worst case, tree-construction overhead makes the algorithm run about 100 times slower than standard modified policy iteration.

In (Hoey et al., 1999), a similar algorithm is described that uses algebraic decision diagrams (ADDs) (Bahar, Frohm, Gaona, Hachtel, Macii, Pardo, & Somenzi, 1993) rather than trees. ADDs are a simple generalization of binary decision diagrams (BDDs) (Bryant, 1986) that allow terminal nodes to be labeled with real values instead of just boolean values. Essentially, ADD-based algorithms are similar to the tree-based algorithms except that isomorphic subtrees can be shared. This lets ADDs provide more compact representations of certain types of value functions. Highly optimized ADD manipulation and evaluation software developed in the verification community can also be applied to solving MDPs. Initial results provided in (Hoey et al., 1999) are encouraging, showing considerable savings over tree-based algorithms on the same problems. For example, the ADD algorithm applied to the 1.7-million-state example described above revealed the value function to have only 178 distinct values (cf. the 40,000 tree leaves required) and produced an ADD description of the value function with less than 2200 internal nodes.
It also solved the same problem in seven minutes, about 40 times faster than earlier reported timing results using decision trees (though some of this improvement was due to the use of optimized ADD software packages). Similar results obtain with other problems (problems of up to 268 million states were solved in about four hours). Most encouraging is the fact that on the worst-case (exponential) examples, the overhead associated with using ADDs, compared to classical vector-based methods, is much less than with trees (about a factor of 20 compared to "flat" modified policy iteration with 12 state variables), and lessens as problems become larger. Like tree-based algorithms, these methods have yet to be applied to real-world problems.

With these exact abstraction schemes it is clear that, while in some examples the resulting policies and value functions may be compact, in others the set of regions may get very large (even reaching the level of individual states; Boutilier et al., 1995), thus precluding any computational savings. Boutilier and Dearden (1996) develop an approximation scheme that exploits the tree-structured nature of the value functions produced. At each stage k, the value function V_k can be pruned to produce a smaller, less accurate tree that approximates V_k. Specifically, approximate value functions are represented using trees whose leaves are labeled with an upper and lower bound on the value function in that region; decision-theoretic regression is performed on these bounds. Certain subtrees of the value tree can be pruned when leaves of the subtree are very close in value or when the tree is too large given computational constraints.

[44] Dietterich and Flann (1995) also describe their work in the context of reinforcement learning rather than as a method for solving MDPs directly.
This scheme is nonuniform, approximate and adaptive. The approximation scheme can be tailored to provide (roughly) the most accurate value function of a given maximum tree size, or the smallest value function (with respect to tree size) of some given minimum accuracy. Results reported in (Boutilier & Dearden, 1996) show that approximation on a small set of examples (including the worst-case examples for tree-based algorithms) allows substantial reduction in computational cost. For instance, in a 10-variable worst-case example, a small amount of pruning introduced an average error of only 0.5% but reduced computation time by a factor of 50. More aggressive pruning tends to increase error and decrease computation time very rapidly; making appropriate tradeoffs in these two dimensions is still to be addressed. This method too remains to be tested and evaluated on realistic problems.

Structured representations and solution algorithms can be applied to problems other than FOMDPs. Methods for solving influence diagrams (Shachter, 1986) exploit structure in a natural way; Tatman and Shachter (1990) explore the connection between influence diagrams and FOMDPs and the relationship between influence diagram solution techniques and dynamic programming. Boutilier and Poole (1996) show how classic history-independent methods for solving POMDPs, based on conversion to a FOMDP with belief states, can exploit the types of structured representations described here. However, exploiting structured representations of POMDPs remains to be explored in depth.

5.1.3 Abstract Plans

One of the difficulties with the adaptive abstraction schemes suggested above is the fact that different abstractions must be constructed repeatedly, incurring substantial computational overhead.
If this overhead is compensated for by the savings obtained during policy construction (e.g., by reducing the number of backups), then it is not problematic. But in many cases the savings can be dominated by the time and space required to generate the abstractions, which motivates the development of cheaper but less accurate approximate clustering schemes.

Another way to reduce this overhead is to adopt a fixed abstraction scheme so that only one abstraction is ever produced. This approach has been adopted in classical planning in hierarchical or abstraction-based planners, pioneered by Sacerdoti's AbStrips system (Sacerdoti, 1974). A similar form of abstraction is studied by Knoblock (1993) (see also Knoblock, Tenenberg, & Yang, 1991). In this work, variables (in this case propositional) are ranked according to criticality (roughly, how important such variables are to the solution of the planning problem) and an abstraction is constructed by deleting from the problem description a set of propositions of low criticality. A solution to this abstract problem is a plan that achieves the elements of the original goal that have not been deleted. However, preconditions and effects of actions that have been deleted are not accounted for in this solution, so it might not be a solution to the original problem. Even so, the abstract solution can be used to restrict the search for a solution in the underlying concrete space. Very often hierarchies of more and more refined abstractions are used, and propositions are introduced back into the domain in stages. This form of abstraction is uniform (propositions are deleted uniformly) and fixed. Since the abstract solution need not be a solution to the problem, we might be tempted to view it as an approximate abstraction method.
However, it is best not to think of the abstract plan as a solution at all, but rather as a form of heuristic information that can help solve the true problem more quickly.

The intuitions underlying Knoblock's scheme are applied to DTP by Boutilier and Dearden (1994, 1997): variables are ranked according to their degree of influence on the reward function and a subset of the most important variables is deemed relevant. Once this subset is determined, those variables that influence the relevant variables through the effects of actions (which can be determined easily using Strips or 2TBN action descriptions) are also deemed relevant, and so on. All remaining variables are deemed irrelevant and are deleted from the description of the problem (both action and reward descriptions). This leaves an abstract MDP with a smaller state space (i.e., fewer variables) that can be solved by standard methods. Recall that the state space reduction is exponential in the number of variables removed. We can view this method as a uniform, fixed, approximate abstraction scheme. Unlike the output of classical abstraction methods, the abstract policy produced can be implemented and has a value. The degree to which the optimal abstract policy and the true optimal policy differ in value can be bounded a priori once the abstraction is fixed.

Example 5.3 As a simple illustration, suppose that the reward for satisfying coffee requests (or the penalty for not satisfying them) is substantially greater than that for keeping the lab tidy or for delivering mail. Suppose that time pressure requires our agent to focus on a specific subset of objectives in order to produce a small abstract state space. In this case, of the four reward-laden variables in our problem (see Figure 24), only CR will be judged to be important.
When the action descriptions are used to determine the variables that can (directly or indirectly) affect the probability of achieving CR, only CR, RHC and Loc will be deemed relevant, allowing T, M, and RHM to be ignored. The state space is thus reduced from size 400 to size 20. In addition, several of the action descriptions (e.g., Tidy) become trivial and can be deleted.

The advantage of these abstractions is that they are easily computed and incur little overhead. The disadvantages are that the uniform nature of such abstractions is restrictive, and the relevant "reward variables" are determined before the policy is constructed and without knowledge of the agent's ability to control these variables. As a result, important variables (those that have a large impact on reward) over which the agent has no control may be taken into account, while less important variables that the agent can actually influence are ignored. However, a series of such abstractions can be used that take into account objectives of decreasing importance, and the a posteriori most valuable objectives can be dealt with once risk and controllability are taken into account (Boutilier et al., 1997). The policies generated at more abstract levels can also be used to "seed" value or policy iteration at less abstract levels, in certain cases reducing the time to convergence (Dearden & Boutilier, 1997). It has also been suggested (Dearden & Boutilier, 1994, 1997) that the abstract value function be used as a heuristic in an online search for policies that improve the abstract policy so constructed, as discussed in Section 3.2.2. Thus, the error in the approximate value function is overcome to some extent by search, and the heuristic function can be improved by asynchronous updates.
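The relevance computation of this scheme amounts to a closure over dependency sets. The sketch below assumes hypothetical parent sets mirroring Example 5.3: each variable's next-stage value depends on the listed current-stage variables, as would be read off a 2TBN or Strips description.

```python
# var -> vars its next-stage value depends on (assumed, in the spirit of Ex. 5.3)
parents = {
    "CR":  {"CR", "RHC", "Loc"},
    "RHC": {"RHC", "Loc"},
    "Loc": {"Loc"},
    "T":   {"T", "Loc"},
    "M":   {"M", "Loc"},
    "RHM": {"RHM", "M", "Loc"},
}

def relevant_closure(important):
    """Grow the reward-relevant variable set through action dependencies
    until a fixed point; everything outside the result can be deleted."""
    rel = set(important)
    while True:
        grown = rel | set().union(*(parents[v] for v in rel))
        if grown == rel:
            return rel
        rel = grown

print(sorted(relevant_closure({"CR"})))   # CR, Loc, RHC survive; T, M, RHM go
```

With CR as the only important reward variable, the closure keeps exactly CR, RHC and Loc, matching the reduction in the example; the deleted variables account for the 400-to-20 shrinkage of the state space.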
A different use of abstraction is adopted in the DRIPS planner (Haddawy & Suwandi, 1994; Haddawy & Doan, 1994). Actions can be abstracted by collapsing "branches," or possible outcomes, and maintaining probabilistic intervals over the abstract, disjunctive effects. Actions are also combined in a decomposition hierarchy, much like those in hierarchical task networks. Planning is done by evaluating abstract plans in the decomposition network, producing ranges of utility for the possible instantiations of those plans, and refining only those plans that are possibly optimal. The use of task networks means that search is restricted to finite-horizon, open-loop plans with action choice restricted to possible refinements of the network. Such task networks offer a useful way to encode a priori heuristic knowledge about the structure of good plans.

5.1.4 Model Minimization and Reduction Methods

The abstraction techniques defined above can be recast in terms of minimizing a stochastic automaton, providing a unifying view of the different methods and offering new insights into the abstraction process (Dean & Givan, 1997). From automata theory we know that for any given finite-state machine M recognizing a language L there exists a unique minimal finite-state machine M' that also recognizes L. It could be that M = M', but it might also be that M' is exponentially smaller than M. This minimal machine, called the minimal model for the language L, captures every relevant aspect of M, and so the machines are said to be equivalent. We can define similar notions of equivalence for MDPs. Since we are primarily concerned with planning, it is important that equivalent MDPs agree on the value functions for all policies. From a practical standpoint, it may not be necessary to find the minimal model if we can find a reduced model that is sufficiently small but still equivalent.
We apply the idea of model minimization (or model reduction) to planning as follows: we begin by using an algorithm that takes as input an implicit MDP model in factored form and produces (if we are lucky) an explicit, reduced model whose size is within a polynomial factor of the size of the factored representation. We then use our favorite state-based dynamic programming algorithms to solve the explicit model.

We can think of the dynamic programming techniques that rely on structured representations discussed earlier as operating on a reduced model without ever explicitly constructing that model. In some cases, building the reduced model once and for all may be appropriate; in other cases, one might save considerable effort by explicitly constructing only those parts of the reduced model that are absolutely necessary.

There are some potential computational problems with the model-minimization techniques sketched above. A small minimal model may exist, but it may be hard to find. Instead, we might look for a reduced model that is easier to find but not necessarily minimal. This too could fail, in which case we might look for a model small enough to be useful but only approximately equivalent to the original factored model. We have to be careful what we mean by "approximate," but intuitively two MDPs are approximately equivalent if the corresponding optimal value functions are within some small factor of one another.

In order to be practical, MDP model reduction schemes operate directly on the implicit or factored representation of the original MDP. Lee and Yannakakis (1992) call this online model minimization. Online model minimization starts with an initial partition of the states. Minimization then iteratively refines the partition by splitting clusters into smaller clusters.
A cluster is split if and only if the states in the cluster behave differently with respect to transitions to states in the same or other clusters. If this local property is satisfied by all clusters in a given partition, then the model consisting of aggregate states that correspond to the clusters of this partition is equivalent to the original model. In addition, if the initial partition and the method of splitting clusters satisfy certain properties,45 then we are guaranteed to find the minimal model. In the case of MDP reduction, the initial partition groups together states that have the same reward, or nearly the same reward in the case of approximation methods.

The clusters of the partitions manipulated by online model reduction methods are represented intensionally as formulas involving the state variables. For instance, the formula RHC ∧ Loc(M) represents the set of all states such that the robot has coffee and is located in the mail room. The operations performed on these clusters require conjoining, complementing, simplifying, and checking for satisfiability. In the worst case, these operations are intractable, and so the successful application of these methods depends critically on the problem and the way in which it is represented. We illustrate the basic idea on a simple example.

Example 5.4 Figure 30 depicts a simple version of our running example with a single action. There are three boolean state variables: RHC, meaning the robot has coffee (¬RHC otherwise); CR, meaning there is an outstanding request for coffee (¬CR otherwise); and, considering only two location possibilities, Loc(C), meaning the robot is in the coffee room (¬Loc(C) otherwise). Whether there is an outstanding coffee request depends on whether there was a request in the previous stage and whether the robot was in the coffee room.
Location depends only on the location at the previous stage, and the reward depends only on whether or not there is an outstanding coffee request.

45. The property required of the initial partition is that, if two states are in the same cluster of the partition defining the minimal model (recall that the minimal model is unique), then they must be in the same cluster in the initial partition.

Figure 30: Factored model illustrating model-reduction techniques.

Figure 31: Models involving aggregate states: (a) the model corresponding to the initial partition and (b) the minimal model.

The initial partition shown in Figure 31(a) is defined in terms of immediate rewards. We say that all the states in a particular starting cluster behave the same with respect to a particular destination cluster if the probability of ending up in the destination cluster is the same for all states in the starting cluster. This property is not satisfied for starting cluster ¬CR and destination cluster CR in Figure 31(a), and so we split the cluster labeled ¬CR to obtain the model in Figure 31(b). Now the property is satisfied for all pairs of clusters and the model in Figure 31(b) is the minimal model. □

The Lee and Yannakakis algorithm for non-deterministic finite-state machines has been extended by Givan and Dean to handle classical Strips planning problems (Givan & Dean, 1997) and MDPs (Dean & Givan, 1997). The basic step of splitting a cluster is closely related to goal regression, a relationship explored in (Givan & Dean, 1997).
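The cluster-splitting test can be made concrete on a small explicit MDP. The sketch below is a minimal illustration in Python, assuming a single action and a handful of enumerated states with hypothetical transition probabilities (not the actual numbers of Figure 30); the function name `refine_partition` and the data layout are our own. It starts from the reward-based initial partition and splits any block whose states disagree on the probability of entering some block:

```python
from collections import defaultdict

def refine_partition(states, P, R):
    """Split blocks until every block is stable: all of its states have the
    same probability of transitioning into every block of the partition.
    P[s] maps successor states to probabilities (single action assumed);
    R[s] is the immediate reward of state s."""
    # Initial partition: group states with identical immediate reward.
    by_reward = defaultdict(list)
    for s in states:
        by_reward[R[s]].append(s)
    partition = [frozenset(b) for b in by_reward.values()]

    changed = True
    while changed:
        changed = False
        for block in partition:
            for target in partition:
                # Signature of each state: probability of landing in `target`.
                sig = defaultdict(list)
                for s in block:
                    p = sum(P[s].get(t, 0.0) for t in target)
                    sig[round(p, 9)].append(s)
                if len(sig) > 1:          # states behave differently: split
                    partition.remove(block)
                    partition.extend(frozenset(g) for g in sig.values())
                    changed = True
                    break
            if changed:
                break
    return partition

# Four states over (CR, Loc(C)); reward 1 iff no outstanding request (¬CR).
# Transition numbers are invented for illustration.
states = ["s0", "s1", "s2", "s3"]   # s0,s1: CR; s2: ¬CR,Loc(C); s3: ¬CR,¬Loc(C)
R = {"s0": 0, "s1": 0, "s2": 1, "s3": 1}
P = {
    "s0": {"s0": 0.35, "s1": 0.35, "s2": 0.15, "s3": 0.15},
    "s1": {"s0": 0.35, "s1": 0.35, "s2": 0.15, "s3": 0.15},
    "s2": {"s0": 0.10, "s1": 0.10, "s2": 0.40, "s3": 0.40},
    "s3": {"s0": 0.25, "s1": 0.25, "s2": 0.25, "s3": 0.25},
}
final = refine_partition(states, P, R)
```

With these numbers the reward-based two-block partition is refined into three blocks: the ¬CR block splits on Loc(C) while the CR block stays aggregated, mirroring the kind of split shown in Figure 31(b).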
Variants of the model reduction approach apply when the action space is large and represented in a factored form (Dean, Givan, & Kim, 1998); for example, when each action is specified by a set of parameters such as those corresponding to the allocations of several different resources in an optimization problem. There also exist algorithms for computing approximate models (Dean, Givan, & Leach, 1997) and efficient planning algorithms that use these approximate models (Givan, Leach, & Dean, 1997).

Figure 32: Reachability and serial problem decomposition.

5.2 Reachability Analysis and Serial Problem Decomposition

5.2.1 Reachability Analysis

The existence of goal states can be exploited in different settings. For instance, in deterministic classical planning problems, regression can be viewed as a form of directed dynamic programming. Without uncertainty, a given policy either reaches a goal state or does not, and the dynamic programming backups need be performed only from goal states, not from all possible states. Regression, therefore, implicitly exploits certain reachability characteristics of the domain along with the special structure of the value function.

Reachability analysis applied much more broadly forms the basis for various types of problem decomposition. In decomposition problem solving, the MDP is broken into several subprocesses that can be solved independently, or roughly independently, and the solutions can be pieced together. If subprocesses whose solutions interact marginally are treated as independent, we might expect a good but nonoptimal global solution to result. Furthermore, if the structure of the problem requires that only a solution to a particular subproblem is needed, then the solutions to other subproblems can be ignored or need not be computed at all.
For instance, in regression analysis, the optimal action for states that cannot reach a goal region is irrelevant to the solution of a classical AI planning problem. This is shown schematically in Figure 32(a), where regions A and B are never explored in the backward search through state space: only states that can reach the goal within the search horizon are ever deemed relevant. While regions A and B may be reachable from the start state, the fact that they do not reach the goal state means they are known to be irrelevant. Should the system dynamics be stochastic, such a scheme can form the basis of an approximately optimal solution method: regions A and B can be ignored if they are unlikely to transition to the regression of the goal region (region R). Similar remarks using progression or forward search from the start state apply, as illustrated in Figure 32(b).

Several schemes have been proposed in the AI literature for exploiting such reachability constraints, apart from the usual forward- or backward-search approaches. Peot and Smith (1993) introduce the operator graph, a structure computed prior to problem solving that caches reachability relationships among propositions. The graph can be consulted during the planning process in deciding which actions to insert into the plan and how to resolve threats. The GraphPlan algorithm of Blum and Furst (1995) attempts to blend considerations of both forward and backward reachability in a deterministic planning context. One of the difficulties with regression is that we may regress the goal region through a sequence of operators only to find ourselves in a region that cannot be reached from the initial state. In Figure 32(a), for example, not all states in region R may be reachable from the initial state.
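The backward relevance computation just described can be sketched as a simple graph search. The following is an illustrative sketch, assuming an explicit predecessor relation over states; the names and the tiny example graph are hypothetical, not from the paper:

```python
def backward_relevant(goal, predecessors, horizon):
    """Regression-style reachability: the set of states that can reach the
    goal region within `horizon` steps. `predecessors[s]` lists the states
    with a transition into s under some action. States outside the returned
    set (like regions A and B in Figure 32(a)) never need a backup."""
    relevant = set(goal)
    frontier = set(goal)
    for _ in range(horizon):
        frontier = {p for s in frontier
                      for p in predecessors.get(s, ())} - relevant
        if not frontier:
            break
        relevant |= frontier
    return relevant

# Tiny hypothetical graph: s0 -> r1 -> g, while a -> b is a disconnected
# region that can never reach the goal.
preds = {"g": ["r1"], "r1": ["s0"], "b": ["a"]}
relevant = backward_relevant({"g"}, preds, horizon=5)
```

Only the states on some path into the goal end up in `relevant`; the disconnected region is never touched, which is exactly the saving that regression provides over full dynamic programming.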
GraphPlan constructs a variant of the operator graph called the planning graph, in which certain forward reachability constraints are posted. Regression is then implemented as usual, but if the current subgoal set violates the forward reachability constraints at any point, this subgoal set is abandoned and the regression search backtracks. Conceptually, one might think of GraphPlan as constructing a forward search tree through state space with the initial state as the root, then doing a backward search from the goal region backward through this tree. Of course, the process is not state-based: instead, constraints on the possible variable values that can hold simultaneously at different planning stages are recorded, and regression is used to search backward through the planning graph. In a sense, GraphPlan can be viewed as constructing an abstraction in which forward-reachable states are distinguished from unreachable states at each planning stage, and using this distinction among abstract states to quickly identify infeasible regression paths. Note, however, that GraphPlan approximates this distinction by overestimating the set of reachable states. Overestimation (as opposed to underestimation) ensures that the regression search space contains all legitimate plans.

Reachability has also been exploited in the solution of more general MDPs. Dean et al. (1995) propose an envelope method for solving "goal-based" MDPs approximately. Assuming some path can be generated quickly from a given start state to the goal region, an MDP consisting of the states on this path and perhaps neighboring states is solved. To deal with transitions that lead out of this envelope, a heuristic method estimates a value for these states.46 As time permits, the set of neighboring states can be expanded, increasing solution quality by more accurately evaluating the quality of alternative actions.
Some of the ideas underlying GraphPlan have been applied to more general MDPs in (Boutilier, Brafman, & Geib, 1998), where the construction of a planning graph is generalized to deal with the stochastic, conditional action representation offered by 2TBNs. Given an initial state (or set of initial states), this algorithm discovers reachability constraints that have a form like those in GraphPlan: for instance, that two variable values X = x1 and Y = y3 cannot both obtain simultaneously; that is, no action sequence starting at the given initial state can lead to a state in which these values both hold.47 The reachability constraints discovered by this process are then used to simplify the action and reward representation of an MDP so that it refers only to reachable states. In this case, any action that requires an unreachable set of values to hold is effectively deleted. In some cases, certain variables are discovered to be immutable given the initial conditions and can themselves be deleted, leading to much smaller MDPs. This simplified representation retains the original propositional structure so standard abstraction methods can be applied to the reachable MDP. It is also suggested that a strong synergy exists between abstraction and reachability analysis such that together these techniques reduce the size of the "effective" MDP to be solved much more dramatically than either does in isolation.

46. The approximate abstraction techniques described in Section 5.1.3 might be used to generate such heuristic information.
47. General k-ary constraints of this type are considered in (Boutilier et al., 1998).
Just as reachability constraints can be used to prune regression paths in deterministic domains, they can be used to prune value function and policy estimates generated by decision-theoretic regression and abstraction algorithms (Boutilier et al., 1998).

The results reported in (Boutilier et al., 1998) are limited to a single process-planning domain, but show that reachability analysis together with abstraction can provide substantial reductions in the size of the effective MDP that must be solved, at least in some domains. In a domain with 31 binary variables, reachability considerations generally eliminated on the order of 10 to 15 variables (depending on the initial state and the arity, binary or ternary, of the constraints considered), reducing the state space from size 2^31 to anywhere from 2^22 to 2^15. Incorporating abstraction on the reachable MDP provided considerably more reduction, reducing the MDP to sizes ranging from 2^8 to effectively zero states. The latter case would occur if it is discovered that no values of variables that impact reward can be altered, in which case every course of action has the same expected utility and the MDP needn't be solved (or can be solved by applying null actions with zero cost).

5.2.2 Serial Problem Decomposition and Communicating Structure

The communicating or reachability structure of an MDP provides a way to formalize different types of problem decomposition. We can classify an MDP according to the Markov chains induced by the stationary policies it admits. For a fixed Markov chain, we can group states into maximal recurrent classes and transient states, as described in Section 2.1. An MDP is recurrent if each policy induces a Markov chain with a single recurrent class. An MDP is unichain if each policy induces a single recurrent class with (possibly) some transient states.
An MDP is communicating if for any pair of states s, t, there is some policy under which s can reach t. An MDP is weakly communicating if there exists a closed set of states that is communicating plus (possibly) a set of states transient under every policy. We call other MDPs noncommunicating. These notions are crucial in the construction of optimal average-reward policies, but can also be exploited in problem decomposition.

Suppose an MDP is discovered to consist of a set of recurrent classes C_1, ..., C_n (i.e., no matter what policy is adopted, the agent cannot leave any such class once it enters that class) and a set of transient states.48 It is clear that the optimal policy restricted to any class C_i can be constructed without reference to the policy decisions made at any states outside of C_i or even their values. Essentially, each C_i can be viewed as an independent subprocess.

48. A simple way to view these classes is to think of the agent adopting a randomized policy where each action is adopted at any state with positive probability. The classes of the induced Markov chain correspond to the classes of the MDP.

This observation leads to the following suggestion for optimal policy construction:49 we solve the subprocesses consisting of the recurrent classes of the MDP; we then remove these states from the MDP, forming a reduced MDP consisting only of the transient states. We then break the reduced MDP into its recurrent classes and solve these independently. The key to doing this effectively is to use the value function for the original recurrent states (computed in solving the independent subproblems in Step 1) to take into account transitions out of the recurrent classes in the reduced MDP. Figure 32(c) shows an MDP broken into the classes that might be constructed this way.
In the original MDP, classes C and E are recurrent and can be solved independently. Once removed from the MDP, class D is recurrent in the reduced MDP. It can, of course, be solved without reference to classes A and B, but does rely on the values of the states it transitions to in class E. However, the value function for E is available for this purpose, and can be used to solve for D as if D consisted of only |D| states. With this in hand, B can then be solved, and finally A can be solved.

Lin and Dean (1995) provide a version of this type of decomposition that also employs a factored representation. The factored representation allows dimensionality reduction in different state subspaces by aggregating states that differ only in the values of the irrelevant variables in their subspaces.

A key to such a decomposition is the discovery of the recurrent classes of an MDP. Puterman (1994) suggests an adaptation of the Fox-Landi algorithm (Fox & Landi, 1968) for discovering the structure of Markov chains that is O(N^2) (recall N = |S|).50 To alleviate the difficulties of algorithms that work with an explicit state-based representation, Boutilier and Puterman (1995) propose a variant of the algorithm that works with a factored 2TBN representation.

One difficulty with this form of decomposition is its reliance on strongly independent subproblems (i.e., recurrent classes) within the MDP. Others have explored exact and approximate techniques that work under less restrictive assumptions. One simple method of approximation is to construct "approximately recurrent classes." In Figure 32(c) we might imagine that C and E are nearly independent in the sense that all transitions between them are very low-probability or high-cost. Treating them as independent might lead to approximately optimal policies whose error can be bounded.
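The solving order in this example (C and E first, then D, then B, and finally A) is just a reverse topological ordering of the block connectivity graph. A minimal sketch, assuming the blocks and their connectivity are given; the block graph, the one-state-per-block "solver," and all numbers here are illustrative stand-ins, not from the paper:

```python
from graphlib import TopologicalSorter

def solve_serially(block_edges, solve_block):
    """Solve an MDP block by block. `block_edges[b]` lists the blocks that b
    can transition into; each block is solved only after all of those, so its
    boundary values are already available when it is reached."""
    values = {}
    # TopologicalSorter treats the listed blocks as prerequisites, so sink
    # (recurrent) classes are solved first.
    for b in TopologicalSorter(block_edges).static_order():
        downstream = {}
        for d in block_edges.get(b, ()):
            downstream.update(values[d])
        values[b] = solve_block(b, downstream)
    return values

# Hypothetical block graph loosely patterned on Figure 32(c):
# A -> {B, C}, B -> D, D -> E; C and E are recurrent (no outgoing edges).
edges = {"A": ["B", "C"], "B": ["D"], "D": ["E"], "C": [], "E": []}

def toy_solver(block, downstream):
    # One state per block: reward 1 inside recurrent classes, otherwise the
    # discounted best value among downstream boundary states.
    best = max(downstream.values(), default=0.0)
    reward = 1.0 if not edges[block] else 0.0
    return {block: reward + 0.9 * best}

vals = solve_serially(edges, toy_solver)
```

The ordering constraint does all the work: by the time a transient block is solved, the values of every state it can exit into are fixed boundary data, exactly as in the walkthrough above.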
If the solutions to C and E interact strongly enough that the solutions should not be constructed completely independently, a different approach to solving the decomposed problem can be taken. If we have the optimal value function for E then, as pointed out, we can calculate the optimal value function for D. The first thing to note is that we don't need to know the value function for all of the states in E, just the value of every state in E that is reachable from some state in D in a single step. The set of all states outside D reachable in a single step from a state inside D is referred to as the periphery of D. The values of the states in the intersection of E and the periphery of D summarize the value of exiting D and ending up in E. We refer to the set of all states that are in the periphery of some block as the kernel of the MDP. All of the different blocks interact with one another through states in the kernel.

49. Ross and Varadarajan (1991) make a related suggestion for solving average-reward problems.
50. A slight correction is made to the suggested algorithm in (Boutilier & Puterman, 1995).

Figure 33: Decomposition based on location.

Figure 34: Kernel-based decomposition depicting the kernel states.

Example 5.5 Spatial features often provide a natural dimension along which to decompose a domain. In our running example, the location of the robot might be used to decompose the state space into blocks of states, one block for each of the possible locations. Figure 33 shows such a decomposition superimposed over the state-transition diagram for the MDP. States in the kernel are shaded and might correspond to the entrances and exits of locations.
The star-shaped topology induced by the kernel decomposition used in (Kushner & Chen, 1974) and (Dean & Lin, 1995) is illustrated in Figure 34. In Figure 33, the hallway location is not explicitly represented. This simplification may be reasonable if the hallway is only a conduit for moving from one room to another; in this case the function of the hallway is accounted for in the dynamics governing states in the kernel. Figures 33 and 34 are idealized in that, given the full set of features in our running example, the kernel would contain many more states. □

One technique for computing the optimal policy for the entire MDP involves repeatedly solving the MDPs corresponding to the individual blocks. The technique works as follows: initially, we guess the value of every state in the kernel.51 Given a current estimate for the values of the kernel states, we solve the component MDPs; this solution produces a new estimate for the states in the kernel. We adjust the values of the states in the kernel by considering the difference between the current and the new estimates and iterate until this difference is negligible. This iterative method for solving a decomposed MDP is a special case of the Lagrangian method for finding the extrema of a function. The OR literature is replete with such methods for both linear and nonlinear systems of equations (Winston, 1992).

It is possible to formulate an MDP as a linear program (D'Epenoux, 1963; Puterman, 1994). Dantzig and Wolfe (1960) developed a method of decomposing a system of equations involving a very large number of variables into a set of smaller systems of equations interacting through a set of coupling variables (variables shared by two or more blocks).
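The iterative kernel technique just described can be sketched on a toy problem. Below, a deterministic four-state cycle split into two blocks is solved by repeatedly re-solving each block with the current kernel estimates held fixed on its periphery; the problem data, function names, and the exact within-block solver are all hypothetical stand-ins for whatever block-level method one prefers:

```python
GAMMA = 0.9
NEXT = {0: 1, 1: 2, 2: 3, 3: 0}            # deterministic four-state cycle
REWARD = {0: 0.0, 1: 1.0, 2: 0.0, 3: 0.0}

def solve_block(block, kernel_vals):
    """Solve the chain inside one block exactly, treating the kernel values
    of states outside the block as fixed boundary values."""
    vals = {}
    for s in reversed(block):              # last state feeds off-block
        nxt = NEXT[s]
        v_next = vals.get(nxt, kernel_vals.get(nxt, 0.0))
        vals[s] = REWARD[s] + GAMMA * v_next
    return vals

def solve_via_kernel(blocks, kernel, solver, tol=1e-6, max_iters=1000):
    """Guess values for the kernel states, solve each block with those values
    held fixed, read new kernel values off the block solutions, and repeat
    until the adjustment is negligible."""
    kernel_vals = {s: 0.0 for s in kernel}
    for _ in range(max_iters):
        new_vals = dict(kernel_vals)
        for block in blocks:
            solution = solver(block, kernel_vals)
            for s in block:
                if s in kernel:
                    new_vals[s] = solution[s]
        diff = max(abs(new_vals[s] - kernel_vals[s]) for s in kernel)
        kernel_vals = new_vals
        if diff < tol:
            break
    return kernel_vals

# Blocks {0,1} and {2,3}; the block-entry states 0 and 2 play the kernel role.
kv = solve_via_kernel([[0, 1], [2, 3]], {0, 2}, solve_block)
```

On this cycle the iteration converges to the exact discounted values (state 0 converges to 0.9 / (1 - 0.9^4)); in general the scheme only converges under conditions like those discussed in the decomposition literature cited here.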
In the Dantzig-Wolfe decomposition method, the original, very large system of equations is solved by iteratively solving the smaller systems and adjusting the coupling variables on each iteration until no further adjustment is required. In the linear programming formulation of an MDP, the values of the states are encoded as variables. Kushner and Chen (1974) exploit the fact that MDPs can be modeled as linear programs by using the Dantzig-Wolfe decomposition method to solve MDPs involving a large number of states. Dean and Lin (1995) describe a general framework for solving decomposed MDPs, pointing to the work of Kushner and Chen as a special case, but neither work addresses the issue of where the decompositions come from.

Dean et al. (1995) investigate methods for decomposing the state space into two blocks: those reachable in k steps or fewer and those not reachable in k steps (see the discussion of reachability above). The set of states reachable in k or fewer steps is used to construct an MDP that is the basis for a policy that approximates the optimal policy. As k increases, the size of the block of states reachable in k steps increases, ensuring a better solution; but the amount of time required to compute a solution also increases. Dean et al. (1995) discuss methods for solving MDPs in time-critical problems by trading off quality against time.

51. Ideally we would aggregate kernel states with the same value so as to provide a compact representation. In the remainder of this section, however, we won't consider this or any other opportunities for combining aggregation and decomposition methods.

We have ignored the issue of how to obtain decompositions that expedite our calculations.
Ideally, each component of the decomposition would yield to simplification via aggregation and abstraction, reducing the dimensionality in each component and thereby avoiding explicit enumeration of all the states. Lin (1997) presents methods for exploiting structure for certain special cases in which the communicating structure is revealed by a domain expert. In general, however, finding a decomposition so as to minimize the effort spent in solving the component MDPs is quite hard (at least as hard as finding the smallest circuit consistent with a given input-output behavior) and so the best we can hope for are good heuristic methods. Unfortunately, we are not aware of any particularly useful heuristics for finding serial decompositions for Markov decision processes. Developing such heuristics is clearly an area for investigation.

Related to this form of decomposition is the development of macro operators for MDPs (Sutton, 1995). Macros have a long history in classical planning and problem solving (Fikes, Hart, & Nilsson, 1972; Korf, 1985), but only recently have they been generalized to MDPs (Hauskrecht, Meuleau, Kaelbling, Dean, & Boutilier, 1998; Parr, 1998; Parr & Russell, 1998; Precup, Sutton, & Singh, 1998; Stone & Veloso, 1999; Sutton, 1995; Thrun & Schwartz, 1995). In most of this work, a macro is taken to be a local policy over a region of state space (or block in the above terminology). Given an MDP comprising these blocks and a set of macros defined for each block, the MDP can be solved by selecting a macro action for each block such that the global policy induced by the set of macros so picked is close to optimal, or at the very least is the best combination of macros from the set available.
In (Sutton, 1995; Precup et al., 1998), macros are treated as temporally abstract actions, and models are defined by which a macro can be treated as if it were a single action and used in policy or value iteration (along with concrete actions). In (Hauskrecht et al., 1998; Parr, 1998; Parr & Russell, 1998), these models are exploited in a hierarchical fashion, with a high-level MDP consisting only of states lying on the boundaries of blocks, and macros the only "actions" that can be chosen at these states. The issue of macro generation, that is, constructing a set of macros guaranteed to provide the flexibility to select close to optimal global behavior, is addressed in (Hauskrecht et al., 1998; Parr, 1998).

The relationship to serial decomposition techniques is quite close; thus, the problems of discovering good decompositions, constructing good sets of macros, and exploiting intensional representations are areas in which clearer, compelling solutions are required. To date, work in this area has not provided much computational utility in the solution of MDPs, except in cases where good, hand-crafted, region-based decompositions and macros can be provided, and little of this work has taken into account the factored nature of many MDPs. For this reason, we do not discuss it in detail. However, the general notion of serial decomposition continues to develop and shows great promise.

5.3 Multiattribute Reward and Parallel Decomposition

Another form of decomposition is parallel decomposition, in which an MDP is broken into a set of sub-MDPs that are "run in parallel." Specifically, at each stage of the (global) decision process, the state of each subprocess is affected.

Figure 35: Parallel problem decomposition.

For instance, in Figure 35, action
In tuitiv ely , an action is suitable for execution in the original MDP at some state if it is reasonably go o d in eac h of the sub-MDPs. Generally , the sub-MDPs form either a pro duct or join decomp osition of the original state space (con trast this with the union decomp ositions of state space determined b y serial decomp ositions): the state space is formed b y taking the cross pro duct of the sub-MDP state spaces, or the join if certain states in the subpro cesses cannot b e link ed. The subpro cesses ma y ha v e iden tical action spaces (as in Figure 35), or eac h ma y ha v e its o wn action space, with the global action c hoice b eing factored in to a c hoice for eac h subpro cess. In the latter case, the sub-MDPs ma y b e completely indep enden t, in whic h case the (global) MDP can b e solv ed exp onen tially faster. A more c hallenging problem arises when there are constrain ts on the legal action com binations. F or example, if the actions in the subpro cesses eac h require certain shared resources, in teractions in the global c hoice ma y arise. In a parallel MDP decomp osition, w e wish to solv e the sub-MDPs and use the p olicies or v alue functions generated to help construct an optimal or appro ximately optimal solution to the original MDP , highligh ting the need to nd appropriate decomp ositions for MDPs and to dev elop suitable merging tec hniques. Recen t parallel decomp osition metho ds ha v e all in v olv ed decomp osing an MDP in to subpro cesses suitable for distinct ob jectiv es. Since rew ard functions often deal with m ultiple ob jectiv es, eac h asso ciated with an indep enden t rew ard, and whose rew ards can b e summed to determine a global rew ard, this is often a v ery natural w a y to decomp ose MDPs. Th us, ideas from m ultiattribute utilit y theory can b e seen to pla y a role in the solution of MDPs. Boutilier et al. 
(1997) decompose an MDP specified using 2TBNs and an additive reward function using the abstraction technique described in Section 5.1.3. For each component of the reward function, abstraction is used to generate an MDP referring only to variables relevant to that component.52 Since certain state variables may be present in multiple sub-MDPs (i.e., relevant to more than one objective), the original state space is the join of the subspaces. Thus, decomposition is tackled automatically. Merging is tackled in several ways. One involves using the sum of the value functions obtained by solving the sub-MDPs as a heuristic estimate of the true value function. This heuristic is then used to guide online, state-based search (see Section 3.2.1). If the sub-MDPs do not interact, then this heuristic is perfect and leads to backtrack-free optimal action selection; if they interact, search is required to detect conflicts. Note that each sub-MDP has an identical set of actions. If the action space is large, the branching factor of the search process may be prohibitive.

52. Note that the existence of a factored MDP representation is crucial for this abstraction method.

Singh and Cohn (1998) also deal with parallel decomposition, though they assume the global MDP is specified explicitly as a set of parallel MDPs; thus generating decompositions of a global MDP is not at issue. The global MDP is given by the cross product of the state and action spaces of these sub-MDPs, and the reward functions are summed. However, constraints on the feasible action combinations couple the solutions of these sub-MDPs. To solve the global MDP, the sum of the sub-MDP value functions is used as an upper bound on the optimal global value function, while the maximum of these (at any global state) is used as a lower bound.
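The bounds just described admit a direct sketch. Assuming nonnegative rewards and sub-MDP Q-values indexed by (local state, action), the sum of sub-MDP values upper-bounds the optimal global value, their maximum lower-bounds it, and any action whose optimistic Q-value falls below the lower bound cannot be optimal; the function name and the Q-values below are invented for illustration:

```python
def bounds_and_prune(sub_q, state_parts, actions):
    """For a global state (s_1, ..., s_n) of n sub-MDPs with shared actions,
    compute the additive upper bound and max-based lower bound described
    above, then prune actions whose optimistic Q-value is below the lower
    bound (valid when rewards are nonnegative).
    sub_q[i][(s_i, a)] is the Q-value of action a in sub-MDP i."""
    q_upper = {a: sum(q[(s, a)] for q, s in zip(sub_q, state_parts))
               for a in actions}
    v_lower = max(max(q[(s, a)] for a in actions)
                  for q, s in zip(sub_q, state_parts))
    surviving = {a for a in actions if q_upper[a] >= v_lower}
    return q_upper, v_lower, surviving

# Invented Q-values for two sub-MDPs sharing actions a1, a2, a3.
actions = ["a1", "a2", "a3"]
sub_q = [
    {("x", "a1"): 3.0, ("x", "a2"): 1.0, ("x", "a3"): 2.0},
    {("y", "a1"): 1.0, ("y", "a2"): 1.0, ("y", "a3"): 4.0},
]
q_upper, v_lower, surviving = bounds_and_prune(sub_q, ("x", "y"), actions)
```

Here a2's optimistic combined value falls below the guaranteed lower bound, so it can be discarded without ever evaluating it in the global cross-product model.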
These bounds then form the basis of an action-elimination procedure in a value-iteration algorithm for solving the global MDP.53 Unfortunately, value iteration is run over the explicit state space of the global MDP. Since the action space is also a cross product, this is a potential computational bottleneck for value iteration as well.

Meuleau et al. (1998) use parallel decomposition to approximate the solution of stochastic resource allocation problems with very large state and action spaces. Much like Singh and Cohn (1998), an MDP is specified in terms of a number of independent MDPs, each involving a distinct objective, whose action choices are linked through shared resource constraints. The value functions for the individual MDPs are constructed offline and then used in a set of online action-selection procedures. Unlike many of the approximation procedures we have discussed, this approach makes no attempt to construct a policy explicitly (and is similar to real-time search or RTDP in this respect) nor to construct the value function explicitly. This method has been applied to very large MDPs, with state spaces of size 2^1000 and action spaces that are even larger, and can solve such problems in roughly half an hour. The solutions produced are approximate, but the size of the problem precludes exact solution, so good estimates of solution quality are hard to derive. However, when the same method is applied to smaller problems of the same nature whose exact solution can be computed, the approximations have very high quality (Meuleau et al., 1998). While able to solve very large MDPs (with large, but factored, state and action spaces), the model relies on somewhat restrictive assumptions about the nature of the local value functions that ensure good solution quality.
However, the basic approach appears to be generalizable, and offers great promise for solving very large factored MDPs.

The algorithms in both (Singh & Cohn, 1998) and (Meuleau et al., 1998) can be seen to rely at least implicitly on structured MDP representations involving almost independent subprocesses. It seems likely that such approaches could take further advantage of automatic MDP decomposition algorithms such as that of (Boutilier et al., 1997), where factored representations explicitly play a part.

53. Singh and Cohn (1998) also incorporate methods for removing unreachable states during value iteration.

5.4 Summary

We have seen a number of ways in which intensional representations can be exploited to solve MDPs effectively without enumeration of the state space. These include techniques for abstraction of MDPs, including those based on relevance analysis, goal regression, and decision-theoretic regression; techniques relying on reachability analysis and serial decomposition; and methods for parallel MDP decomposition exploiting the multiattribute nature of reward functions. Many of these methods can, in fortunate circumstances, offer exponential reductions in solution time and in the space required to represent a policy and value function; but none comes with guarantees of such reductions except in certain special cases. While most of the methods described provide approximate solutions (often with error bounds provided), some of them offer optimality guarantees in general, and most can provide optimal solutions under suitable assumptions.

One avenue that has not been explored in detail is the relationship between the structured solution methods developed for MDPs described above and techniques used for solving Bayesian networks.
Since many of the algorithms discussed in this section rely on the structure inherent in the 2TBN representation of the MDP, it is natural to ask whether they embody some of the intuitions that underlie solution algorithms for Bayes nets, and thus whether the solution techniques for Bayes nets can be (directly or indirectly) applied to MDPs in ways that give rise to algorithms similar to those discussed here. This remains an open question at this point, but undoubtedly some strong ties exist. Tatman and Shachter (1990) have explored the connections between influence diagrams and MDPs. Kjaerulff (1992) has investigated computational considerations involved in applying join tree methods for reasoning tasks such as monitoring and prediction in temporal Bayes nets. The abstraction methods discussed in Section 5.1.2 can be interpreted as a form of variable elimination (Dechter, 1996; Zhang & Poole, 1996). Elimination of variables occurs in temporal order, but good orderings within a time slice must also exploit the tree or graph structure of the CPTs. Approximation schemes based on variable elimination (Dechter, 1997; Poole, 1998) may also be related to certain of the approximation methods developed for MDPs. The independence-based decompositions of MDPs discussed in Section 5.3 can clearly be viewed as exploiting the independence relations made explicit by "unrolling" a 2TBN. The development of these and other connections to Bayes net inference algorithms will no doubt prove very useful in enhancing our understanding of existing methods, increasing their range of applicability, and pointing to new algorithms.

6. Concluding Remarks

The search for effective algorithms for controlling automated agents has a long and important history, and the problem will only continue to grow in importance as more decision-making functionality is automated.
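The core step of the variable-elimination algorithms cited above (Dechter, 1996; Zhang & Poole, 1996) can be sketched briefly. This is a minimal illustrative encoding of my own (boolean variables, factors as tables over 0/1 assignments), not code from any of the cited papers: all factors mentioning a variable are multiplied together, and the variable is then summed out.

```python
# Illustrative sketch of one elimination step in variable elimination.
# A factor is a pair (scope, table): scope is a list of variable names,
# and table maps tuples of 0/1 values (ordered as in scope) to numbers.
from itertools import product

def eliminate(var, factors):
    """Multiply all factors mentioning `var`, then sum `var` out."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    scope = sorted({v for s, _ in touching for v in s} - {var})
    table = {}
    for asst in product([0, 1], repeat=len(scope)):
        env = dict(zip(scope, asst))
        total = 0.0
        for val in (0, 1):            # sum out the eliminated variable
            env[var] = val
            prod = 1.0
            for s, t in touching:     # product of the touching factors
                prod *= t[tuple(env[v] for v in s)]
            total += prod
        table[asst] = total
    return rest + [(scope, table)]

# P(a) * P(b|a): eliminating 'a' yields the marginal P(b).
pa = (['a'], {(0,): 0.6, (1,): 0.4})
pba = (['a', 'b'], {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})
[(scope, pb)] = eliminate('a', [pa, pba])
# pb[(0,)] = 0.6*0.9 + 0.4*0.2 = 0.62
```

In the temporal setting discussed in the text, such steps would be applied in temporal order over the unrolled 2TBN, with orderings within a time slice chosen to exploit CPT structure.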
Work in several disciplines, among them AI, decision analysis, and OR, has addressed the problem, but each has carried with it different problem definitions, different sets of simplifying assumptions, different viewpoints, and hence different representations and algorithms for problem solving. More often than not, the assumptions seem to have been made for historical reasons or reasons of convenience, and it is often difficult to separate the essential assumptions from the accidental. It is important to clarify the relationships among problem definitions, crucial assumptions, and solution techniques, because only then can a meaningful synthesis take place.

In this paper we analyzed various approaches to a particular class of sequential decision problems that have been studied in the OR, decision analysis, and AI literature. We started with a general, reasonably neutral statement of the problem, couched, for convenience, in the language of Markov decision processes. From there we demonstrated how various disciplines define the problem (i.e., what assumptions they make), and the effect of these assumptions on the worst-case time complexity of solving the problem so defined. Assumptions regarding two main factors seem to distinguish the most commonly studied classes of decision problems:

- observation or sensing: does sensing tend to be fast, cheap, and accurate, or laborious, costly, and noisy?
- the incentive structure for the agent: is its behavior evaluated on its ability to perform a particular task, or on its ability to control a system over an interval of time?

Moving beyond the worst-case analysis, it is generally assumed that, although pathological cases are inevitably difficult, the agent should be able to solve "typical" or "easy" cases effectively. To do so, the agent needs to be able to identify structure in the problem and to exploit that structure algorithmically.
We identified three ways in which structural regularities can be recognized, represented, and exploited computationally. The first is structure induced by domain-level simplifying assumptions like full observability, goal satisfaction, or time-separable value functions. The second is structure exploited by compact domain-specific encodings of states, actions, and rewards. The designer can use these techniques to make structure explicit, and decision-making algorithms can then exploit the structural regularities as they apply to the particular problem at hand. The third involves aggregation, abstraction, and decomposition techniques, whereby structural regularities can be discovered and exploited during the problem-solving process itself.

In developing this framework (one that allows comparison of domains, assumptions, problems, and techniques drawn from different disciplines), we discover the essential problem structure required for specific representations and algorithms to prove effective; and we do so in such a way that the insights and techniques developed for certain problems, or within certain disciplines, can be evaluated and potentially applied to new problems, or within other disciplines.

A main focus of this work has been the elucidation of various forms of structure in decision problems and of how each can be exploited representationally or computationally. For the most part, we have focused on propositional structure, which is most commonly associated with planning in AI circles. A more complete treatment would also have included other compact representations of dynamics, rewards, policies, and value functions often considered in continuous, real-valued domains.
For instance, we have not discussed linear dynamics and quadratic cost functions, often used in control theory (Caines, 1988), or the use of neural-network representations of value functions, as frequently adopted within the reinforcement learning community (Bertsekas & Tsitsiklis, 1996; Tesauro, 1994),[54] nor have we discussed the partitioning of continuous state spaces often addressed in reinforcement learning (Moore & Atkeson, 1995). Neither have we addressed the relational or quantificational structure used in first-order planning representations. However, even these techniques can be cast within the framework described here; for example, the use of piecewise-linear value functions can be seen as a form of abstraction in which different linear components are applied to different regions or clusters of state space.

54. Bertsekas and Tsitsiklis (1996) provide an in-depth treatment of neural network and linear function approximators for MDPs and reinforcement learning.

Although in certain cases we have indicated how to devise methods that exploit several types of structure at once, research along these lines has been limited. To some extent, many of the representations and algorithms described in this paper are complementary and should pose few obstacles to combination. It remains to be seen how they interact with techniques developed for other forms of structure, such as those used for continuous state and action spaces.

So our analysis raises opportunities and challenges: by understanding the assumptions, the techniques, and their relationships, a designer of decision-making agents has many more tools with which to build effective problem solvers; and the challenges lie in the development of additional tools and the integration of existing ones.

Acknowledgments

Many thanks to the careful comments of the referees.
Thanks to Ron Parr and Robert St-Aubin for their comments on an earlier draft of this paper. The students taking CS3710 (Spring 1999) taught by Martha Pollack at the University of Pittsburgh and CPSC 522 (Winter 1999) at the University of British Columbia also deserve thanks for their detailed comments. Boutilier was supported by NSERC Research Grant OGP0121843 and the NCE IRIS-II program Project IC-7. Dean was supported in part by a National Science Foundation Presidential Young Investigator Award IRI-8957601 and by the Air Force and the Advanced Research Projects Agency of the Department of Defense under Contract No. F30602-91-C-0041. Hanks was supported in part by ARPA / Rome Labs Grant F30602-95-1-0024 and in part by NSF grant IRI-9523649.

References

Allen, J., Hendler, J., & Tate, A. (Eds.). (1990). Readings in Planning. Morgan Kaufmann, San Mateo.

Åström, K. J. (1965). Optimal control of Markov decision processes with incomplete state estimation. J. Math. Anal. Appl., 10, 174-205.

Bacchus, F., Boutilier, C., & Grove, A. (1996). Rewarding behaviors. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pp. 1160-1167, Portland, OR.

Bacchus, F., Boutilier, C., & Grove, A. (1997). Structured solution methods for non-Markovian decision processes. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pp. 112-117, Providence, RI.

Bacchus, F., & Kabanza, F. (1995). Using temporal logic to control search in a forward chaining planner. In Proceedings of the Third European Workshop on Planning (EWSP'95), Assisi, Italy. Available via the URL ftp://logos.uwaterloo.ca:/pub/tlplan/tlplan.ps.Z.

Bacchus, F., & Teh, Y. W. (1998). Making forward chaining relevant. In Proceedings of the Fourth International Conference on AI Planning Systems, pp. 54-61, Pittsburgh, PA.
Bahar, R. I., Frohm, E. A., Gaona, C. M., Hachtel, G. D., Macii, E., Pardo, A., & Somenzi, F. (1993). Algebraic decision diagrams and their applications. In International Conference on Computer-Aided Design, pp. 188-191. IEEE.

Baker, A. B. (1991). Nonmonotonic reasoning in the framework of the situation calculus. Artificial Intelligence, 49, 5-23.

Barto, A. G., Bradtke, S. J., & Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2), 81-138.

Bellman, R. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ.

Bertsekas, D. P., & Castanon, D. A. (1989). Adaptive aggregation for infinite horizon dynamic programming. IEEE Transactions on Automatic Control, 34(6), 589-598.

Bertsekas, D. P. (1987). Dynamic Programming. Prentice-Hall, Englewood Cliffs, NJ.

Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic Programming. Athena, Belmont, MA.

Blackwell, D. (1962). Discrete dynamic programming. Annals of Mathematical Statistics, 33, 719-726.

Blum, A. L., & Furst, M. L. (1995). Fast planning through graph analysis. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1636-1642, Montreal, Canada.

Bonet, B., & Geffner, H. (1998). Learning sorting and decision trees with POMDPs. In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 73-81, Madison, WI.

Bonet, B., Loerincs, G., & Geffner, H. (1997). A robust and fast action selection mechanism. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pp. 714-719, Providence, RI.

Boutilier, C. (1997). Correlated action effects in decision theoretic regression. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pp. 30-37, Providence, RI.

Boutilier, C., Brafman, R. I., & Geib, C. (1997).
Prioritized goal decomposition of Markov decision processes: Toward a synthesis of classical and decision theoretic planning. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, pp. 1156-1162, Nagoya, Japan.

Boutilier, C., Brafman, R. I., & Geib, C. (1998). Structured reachability analysis for Markov decision processes. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 24-32, Madison, WI.

Boutilier, C., & Dearden, R. (1994). Using abstractions for decision-theoretic planning with time constraints. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pp. 1016-1022, Seattle, WA.

Boutilier, C., & Dearden, R. (1996). Approximating value trees in structured dynamic programming. In Proceedings of the Thirteenth International Conference on Machine Learning, pp. 54-62, Bari, Italy.

Boutilier, C., Dearden, R., & Goldszmidt, M. (1995). Exploiting structure in policy construction. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1104-1111, Montreal, Canada.

Boutilier, C., Dearden, R., & Goldszmidt, M. (1999). Stochastic dynamic programming with factored representations. (manuscript).

Boutilier, C., Friedman, N., Goldszmidt, M., & Koller, D. (1996). Context-specific independence in Bayesian networks. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, pp. 115-123, Portland, OR.

Boutilier, C., & Goldszmidt, M. (1996). The frame problem and Bayesian network action representations. In Proceedings of the Eleventh Biennial Canadian Conference on Artificial Intelligence, pp. 69-83, Toronto.

Boutilier, C., & Poole, D. (1996). Computing optimal policies for partially observable decision processes using compact representations.
In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pp. 1168-1175, Portland, OR.

Boutilier, C., & Puterman, M. L. (1995). Process-oriented planning and average-reward optimality. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1096-1103, Montreal, Canada.

Brafman, R. I. (1997). A heuristic variable-grid solution method for POMDPs. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pp. 727-733, Providence, RI.

Bryant, R. E. (1986). Graph-based algorithms for boolean function manipulation. IEEE Transactions on Computers, C-35(8), 677-691.

Bylander, T. (1994). The computational complexity of propositional STRIPS planning. Artificial Intelligence, 69, 161-204.

Caines, P. E. (1988). Linear Stochastic Systems. Wiley, New York.

Cassandra, A. R., Kaelbling, L. P., & Littman, M. L. (1994). Acting optimally in partially observable stochastic domains. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pp. 1023-1028, Seattle, WA.

Cassandra, A. R., Littman, M. L., & Zhang, N. L. (1997). Incremental pruning: A simple, fast, exact method for POMDPs. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pp. 54-61, Providence, RI.

Chapman, D. (1987). Planning for conjunctive goals. Artificial Intelligence, 32(3), 333-377.

Chapman, D., & Kaelbling, L. P. (1991). Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, pp. 726-731, Sydney, Australia.

Dantzig, G., & Wolfe, P. (1960). Decomposition principle for dynamic programs. Operations Research, 8(1), 101-111.

Dean, T., Allen, J., & Aloimonos, Y. (1995).
Artificial Intelligence: Theory and Practice. Benjamin Cummings.

Dean, T., & Givan, R. (1997). Model minimization in Markov decision processes. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pp. 106-111, Providence, RI. AAAI.

Dean, T., Givan, R., & Kim, K.-E. (1998). Solving planning problems with large state and action spaces. In Proceedings of the Fourth International Conference on AI Planning Systems, pp. 102-110, Pittsburgh, PA.

Dean, T., Givan, R., & Leach, S. (1997). Model reduction techniques for computing approximately optimal solutions for Markov decision processes. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pp. 124-131, Providence, RI.

Dean, T., Kaelbling, L., Kirman, J., & Nicholson, A. (1993). Planning with deadlines in stochastic domains. In Proceedings of the Eleventh National Conference on Artificial Intelligence, pp. 574-579.

Dean, T., Kaelbling, L., Kirman, J., & Nicholson, A. (1995). Planning under time constraints in stochastic domains. Artificial Intelligence, 76(1-2), 3-74.

Dean, T., & Kanazawa, K. (1989). A model for reasoning about persistence and causation. Computational Intelligence, 5(3), 142-150.

Dean, T., & Lin, S.-H. (1995). Decomposition techniques for planning in stochastic domains. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1121-1127.

Dean, T., & Wellman, M. (1991). Planning and Control. Morgan Kaufmann, San Mateo, California.

Dearden, R., & Boutilier, C. (1994). Integrating planning and execution in stochastic domains. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pp. 162-169, Washington, DC.

Dearden, R., & Boutilier, C. (1997). Abstraction and approximate decision theoretic planning. Artificial Intelligence, 89, 219-283.

Dechter, R. (1996).
Bucket elimination: A unifying framework for probabilistic inference. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, pp. 211-219, Portland, OR.

Dechter, R. (1997). Mini-buckets: A general scheme for generating approximations in automated reasoning in probabilistic inference. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, pp. 1297-1302, Nagoya, Japan.

D'Epenoux, F. (1963). Sur un problème de production et de stockage dans l'aléatoire. Management Science, 10, 98-108.

Dietterich, T. G., & Flann, N. S. (1995). Explanation-based learning and reinforcement learning: A unified approach. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 176-184, Lake Tahoe, NV.

Draper, D., Hanks, S., & Weld, D. (1994a). A probabilistic model of action for least-commitment planning with information gathering. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pp. 178-186, Washington, DC.

Draper, D., Hanks, S., & Weld, D. (1994b). Probabilistic planning with information gathering and contingent execution. In Proceedings of the Second International Conference on AI Planning Systems, pp. 31-36.

Etzioni, O., Hanks, S., Weld, D., Draper, D., Lesh, N., & Williamson, M. (1992). An approach to planning with incomplete information. In Proceedings of the Third International Conference on Principles of Knowledge Representation and Reasoning, pp. 115-125, Boston, MA.

Fikes, R., Hart, P., & Nilsson, N. (1972). Learning and executing generalized robot plans. Artificial Intelligence, 3, 251-288.

Fikes, R., & Nilsson, N. J. (1971). STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2, 189-208.

Finger, J. (1986).
Exploiting Constraints in Design Synthesis. Ph.D. thesis, Stanford University, Stanford.

Floyd, R. W. (1962). Algorithm 97 (shortest path). Communications of the ACM, 5(6), 345.

Fox, B. L., & Landi, D. M. (1968). An algorithm for identifying the ergodic subchains and transient states of a stochastic matrix. Communications of the ACM, 2, 619-621.

French, S. (1986). Decision Theory. Halsted Press, New York.

Geiger, D., & Heckerman, D. (1991). Advances in probabilistic reasoning. In Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence, pp. 118-126, Los Angeles, CA.

Givan, R., & Dean, T. (1997). Model minimization, regression, and propositional STRIPS planning. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, pp. 1163-1168, Nagoya, Japan.

Givan, R., Leach, S., & Dean, T. (1997). Bounded-parameter Markov decision processes. In Proceedings of the Fourth European Conference on Planning (ECP'97), pp. 234-246, Toulouse, France.

Goldman, R. P., & Boddy, M. S. (1994). Representing uncertainty in simple planners. In Proceedings of the Fourth International Conference on Principles of Knowledge Representation and Reasoning, pp. 238-245, Bonn, Germany.

Haddawy, P., & Doan, A. (1994). Abstracting probabilistic actions. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pp. 270-277, Washington, DC.

Haddawy, P., & Hanks, S. (1998). Utility models for goal-directed decision-theoretic planners. Computational Intelligence, 14(3).

Haddawy, P., & Suwandi, M. (1994). Decision-theoretic refinement planning using inheritance abstraction. In Proceedings of the Second International Conference on AI Planning Systems, pp. 266-271, Chicago, IL.

Hanks, S. (1990). Projecting plans for uncertain worlds. Ph.D.
thesis 756, Yale University, Department of Computer Science, New Haven, CT.

Hanks, S., & McDermott, D. V. (1994). Modeling a dynamic and uncertain world I: Symbolic and probabilistic reasoning about change. Artificial Intelligence, 66(1), 1-55.

Hanks, S., Russell, S., & Wellman, M. (Eds.). (1994). Decision Theoretic Planning: Proceedings of the AAAI Spring Symposium. AAAI Press, Menlo Park.

Hansen, E. A., & Zilberstein, S. (1998). Heuristic search in cyclic AND/OR graphs. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pp. 412-418, Madison, WI.

Hauskrecht, M. (1997). A heuristic variable-grid solution method for POMDPs. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pp. 734-739, Providence, RI.

Hauskrecht, M. (1998). Planning and Control in Stochastic Domains with Imperfect Information. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge.

Hauskrecht, M., Meuleau, N., Kaelbling, L. P., Dean, T., & Boutilier, C. (1998). Hierarchical solution of Markov decision processes using macro-actions. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 220-229, Madison, WI.

Hoey, J., St-Aubin, R., Hu, A., & Boutilier, C. (1999). SPUDD: Stochastic planning using decision diagrams. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Stockholm. To appear.

Howard, R. A. (1960). Dynamic Programming and Markov Processes. MIT Press, Cambridge, Massachusetts.

Howard, R. A., & Matheson, J. E. (1984). Influence diagrams. In Howard, R. A., & Matheson, J. E. (Eds.), The Principles and Applications of Decision Analysis. Strategic Decisions Group, Menlo Park, CA.

Kambhampati, S. (1997). Refinement planning as a unifying framework for plan synthesis.
AI Magazine, Summer 1997, 67-97.

Kearns, M., Mansour, Y., & Ng, A. Y. (1999). A sparse sampling algorithm for near-optimal planning in large Markov decision processes. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Stockholm. To appear.

Keeney, R. L., & Raiffa, H. (1976). Decisions with Multiple Objectives: Preferences and Value Tradeoffs. John Wiley and Sons, New York.

Kjaerulff, U. (1992). A computational scheme for reasoning in dynamic probabilistic networks. In Proceedings of the Eighth Conference on Uncertainty in AI, pp. 121-129, Stanford.

Knoblock, C. A. (1993). Generating Abstraction Hierarchies: An Automated Approach to Reducing Search in Planning. Kluwer, Boston.

Knoblock, C. A., Tenenberg, J. D., & Yang, Q. (1991). Characterizing abstraction hierarchies for planning. In Proceedings of the Ninth National Conference on Artificial Intelligence, pp. 692-697, Anaheim, CA.

Koenig, S. (1991). Optimal probabilistic and decision-theoretic planning using Markovian decision theory. M.Sc. thesis UCB/CSD-92-685, University of California at Berkeley, Computer Science Department.

Koenig, S., & Simmons, R. (1995). Real-time search in nondeterministic domains. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1660-1667, Montreal, Canada.

Korf, R. (1985). Macro-operators: A weak method for learning. Artificial Intelligence, 26, 35-77.

Korf, R. E. (1990). Real-time heuristic search. Artificial Intelligence, 42, 189-211.

Kushmerick, N., Hanks, S., & Weld, D. (1995). An algorithm for probabilistic planning. Artificial Intelligence, 76, 239-286.

Kushner, H. J., & Chen, C.-H. (1974). Decomposition of systems governed by Markov chains. IEEE Transactions on Automatic Control, 19(5), 501-507.

Lee, D., & Yannakakis, M. (1992). Online minimization of transition systems.
In Proceedings of the 24th Annual ACM Symposium on the Theory of Computing, pp. 264-274, Victoria, BC.

Lin, F., & Reiter, R. (1994). State constraints revisited. Journal of Logic and Computation, 4(5), 655-678.

Lin, S.-H. (1997). Exploiting Structure for Planning and Control. Ph.D. thesis, Department of Computer Science, Brown University.

Lin, S.-H., & Dean, T. (1995). Generating optimal policies for high-level plans with conditional branches and loops. In Proceedings of the Third European Workshop on Planning (EWSP'95), pp. 187-200.

Littman, M. L. (1997). Probabilistic propositional planning: Representations and complexity. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pp. 748-754, Providence, RI.

Littman, M. L., Dean, T. L., & Kaelbling, L. P. (1995). On the complexity of solving Markov decision problems. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 394-402, Montreal, Canada.

Littman, M. L. (1996). Algorithms for sequential decision making. Ph.D. thesis CS-96-09, Brown University, Department of Computer Science, Providence, RI.

Lovejoy, W. S. (1991a). Computationally feasible bounds for partially observed Markov decision processes. Operations Research, 39(1), 162-175.

Lovejoy, W. S. (1991b). A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28, 47-66.

Luenberger, D. G. (1973). Introduction to Linear and Nonlinear Programming. Addison-Wesley, Reading, Massachusetts.

Luenberger, D. G. (1979). Introduction to Dynamic Systems: Theory, Models and Applications. Wiley, New York.

Madani, O., Condon, A., & Hanks, S. (1999). On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems.
In Proceedings of the Sixteenth National Conference on Artificial Intelligence, Orlando, FL. To appear.

Mahadevan, S. (1994). To discount or not to discount in reinforcement learning: A case study in comparing R-learning and Q-learning. In Proceedings of the Eleventh International Conference on Machine Learning, pp. 164-172, New Brunswick, NJ.

McAllester, D., & Rosenblitt, D. (1991). Systematic nonlinear planning. In Proceedings of the Ninth National Conference on Artificial Intelligence, pp. 634-639, Anaheim, CA.

McCallum, R. A. (1995). Instance-based utile distinctions for reinforcement learning with hidden state. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 387-395, Lake Tahoe, Nevada.

McCarthy, J., & Hayes, P. J. (1969). Some philosophical problems from the standpoint of artificial intelligence. Machine Intelligence, 4, 463-502.

Meuleau, N., Hauskrecht, M., Kim, K., Peshkin, L., Kaelbling, L., Dean, T., & Boutilier, C. (1998). Solving very large weakly coupled Markov decision processes. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pp. 165-172, Madison, WI.

Moore, A. W., & Atkeson, C. G. (1995). The parti-game algorithm for variable resolution reinforcement learning in multidimensional state spaces. Machine Learning, 21, 199-234.

Papadimitriou, C. H., & Tsitsiklis, J. N. (1987). The complexity of Markov chain decision processes. Mathematics of Operations Research, 12(3), 441-450.

Parr, R. (1998). Flexible decomposition algorithms for weakly coupled Markov decision processes. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 422-430, Madison, WI.

Parr, R., & Russell, S. (1995). Approximating optimal policies for partially observable stochastic domains.
In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1088-1094, Montreal.

Parr, R., & Russell, S. (1998). Reinforcement learning with hierarchies of machines. In Jordan, M., Kearns, M., & Solla, S. (Eds.), Advances in Neural Information Processing Systems 10, pp. 1043-1049. MIT Press, Cambridge.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo.

Pednault, E. (1989). ADL: Exploring the middle ground between STRIPS and the situation calculus. In Proceedings of the First International Conference on Principles of Knowledge Representation and Reasoning, pp. 324-332, Toronto, Canada.

Penberthy, J. S., & Weld, D. S. (1992). UCPOP: A sound, complete, partial order planner for ADL. In Proceedings of the Third International Conference on Principles of Knowledge Representation and Reasoning, pp. 103-114, Boston, MA.

Peot, M., & Smith, D. (1992). Conditional nonlinear planning. In Proceedings of the First International Conference on AI Planning Systems, pp. 189-197, College Park, MD.

Perez, M. A., & Carbonell, J. G. (1994). Control knowledge to improve plan quality. In Proceedings of the Second International Conference on AI Planning Systems, pp. 323-328, Chicago, IL.

Poole, D. (1995). Exploiting the rule structure for decision making within the independent choice logic. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 454-463, Montreal, Canada.

Poole, D. (1997a). The independent choice logic for modelling multiple agents under uncertainty. Artificial Intelligence, 94(1-2), 7-56.

Poole, D. (1997b). Probabilistic partial evaluation: Exploiting rule structure in probabilistic inference.
In Pr o c e e dings of the Fifte enth International Joint Confer enc e on A rticial Intel ligenc e , pp. 1284{1291 Nago y a, Japan. P o ole, D. (1998). Con text-sp ecic appro ximation in probabilistic inference. In Pr o c e e dings of the F ourte enth Confer enc e on Unc ertainty in A rticial Intel ligenc e , pp. 447{454 Madison, WI. Precup, D., Sutton, R. S., & Singh, S. (1998). Theoretical results on reinforcemen t learning with temp orally abstract b eha viors. In Pr o c e e dings of the T enth Eur op e an Confer enc e on Machine L e arning , pp. 382{393 Chemnitz, German y . Pry or, L., & Collins, G. (1993). CASSANDRA: Planning for con tingencies. T ec hnical rep ort 41, North w estern Univ ersit y , The Institute for the Learning Sciences. Puterman, M. L. (1994). Markov De cision Pr o c esses . John Wiley & Sons, New Y ork. Puterman, M. L., & Shin, M. (1978). Mo died p olicy iteration algorithms for discoun ted Mark o v decision problems. Management Scienc e , 24 , 1127{1137. Ross, K. W., & V aradara jan, R. (1991). Multic hain Mark o v decision pro cesses with a sample-path constrain t: A decomp osition approac h. Mathematics of Op er ations R e- se ar ch , 16 (1), 195{207. Russell, S., & Norvig, P . (1995). A rticial Intel ligenc e: A Mo dern Appr o ach . Pren tice Hall, Englew o o d Clis, NJ. Sacerdoti, E. D. (1974). Planning in a hierarc h y of abstraction spaces. A rticial Intel ligenc e , 5 , 115{135. Sacerdoti, E. D. (1975). The nonlinear nature of plans. In Pr o c e e dings of the F ourth International Joint Confer enc e on A rticial Intel ligenc e , pp. 206{214. Sc hopp ers, M. J. (1987). Univ ersal plans for reactiv e rob ots in unpredictable en vironmen ts. In Pr o c e e dings of the T enth International Joint Confer enc e on A rticial Intel ligenc e , pp. 1039{1046 Milan, Italy . Sc h w artz, A. (1993). A reinforcemen t learning metho d for maximizing undiscoun ted re- w ards. 
In Pr o c e e dings of the T enth International Confer enc e on Machine L e arning , pp. 298{305 Amherst, MA. Sc h w eitzer, P . L., Puterman, M. L., & Kindle, K. W. (1985). Iterativ e aggregation- disaggregation pro cedures for discoun ted semi-Mark o v rew ard pro cesses. Op er ations R ese ar ch , 33 , 589{605. Shac h ter, R. D. (1986). Ev aluating inuence diagrams. Op er ations R ese ar ch , 33 (6), 871{ 882. Shimon y , S. E. (1993). The role of relev ance in explanation I: Irrelev ance as statistical indep endence. International Journal of Appr oximate R e asoning , 8 (4), 281{324. 92 Decision-Theoretic Planning: Str uctural Assumptions Simmons, R., & Ko enig, S. (1995). Probabilistic rob ot na vigation in partially observ able en vironmen ts. In Pr o c e e dings of the F ourte enth International Joint Confer enc e on A rticial Intel ligenc e , pp. 1080{1087 Mon treal, Canada. Singh, S. P ., & Cohn, D. (1998). Ho w to dynamically merge Mark o v decision pro cesses. In A dvanc es in Neur al Information Pr o c essing Systems 10 , pp. 1057{1063. MIT Press, Cam bridge. Singh, S. P ., Jaakk ola, T., & Jordan, M. I. (1994). Reinforcemen t learning with soft state aggregation. In Hanson, S. J., Co w an, J. D., & Giles, C. L. (Eds.), A dvanc es in Neur al Information Pr o c essing Systems 7 . Morgan-Kaufmann, San Mateo. Smallw o o d, R. D., & Sondik, E. J. (1973). The optimal con trol of partially observ able Mark o v pro cesses o v er a nite horizon. Op er ations R ese ar ch , 21 , 1071{1088. Smith, D., & P eot, M. (1993). P ostp oning threats in partial-order planning. In Pr o c e e dings of the Eleventh National Confer enc e on A rticial Intel ligenc e , pp. 500{506 W ashing- ton, DC. Sondik, E. J. (1978). The optimal con trol of partially observ able Mark o v pro cesses o v er the innite horizon: Discoun ted costs. Op er ations R ese ar ch , 26 , 282{304. Stone, P ., & V eloso, M. (1999). T eam-partitioned, opaque-transition reinforcemen t learning. In Asada, M. 
(Ed.), Rob oCup-98: R ob ot So c c er World Cup II . Springer V erlag, Berlin. Sutton, R. S. (1995). TD mo dels: Mo deling the w orld at a mixture of time scales. In Pr o c e e dings of the Twelfth International Confer enc e on Machine L e arning , pp. 531{ 539 Lak e T aho e, NV. Sutton, R. S., & Barto, A. G. (1998). R einfor c ement L e arning: A n Intr o duction . MIT Press, Cam bridge, MA. T ash, J., & Russell, S. (1994). Con trol strategies for a sto c hastic planner. In Pr o c e e dings of the Twelfth National Confer enc e on A rticial Intel ligenc e , pp. 1079{1085 Seattle, W A. T atman, J. A., & Shac h ter, R. D. (1990). Dynamic programming and inuence diagrams. IEEE T r ansactions on Systems, Man, and Cyb ernetics , 20 (2), 365{379. T esauro, G. J. (1994). TD-Gammon, a self-teac hing bac kgammon program, ac hiev es master- lev el pla y . Neur al Computation , 6 , 215{219. Thrun, S., F o x, D., & Burgard, W. (1998). A probabilistic approac h to concurren t mapping and lo calization for mobile rob ots. Machine L e arning , 31 , 29{53. Thrun, S., & Sc h w artz, A. (1995). Finding structure in reinforcemen t learning. In T esauro, G., T ouretzky , D., & Leen, T. (Eds.), A dvanc es in Neur al Information Pr o c essing Systems 7 Cam bridge, MA. MIT Press. W arren, D. (1976). Generating conditional plans and programs. In Pr o c e e dings of AISB Summer Confer enc e , pp. 344{354 Univ ersit y of Edin burgh. 93 Boutilier, Dean, & Hanks W atkins, C. J. C. H., & Da y an, P . (1992). Q-learning. Machine L e arning , 8 , 279{292. W eld, D. S. (1994). An in tro duction to least commitmen t planning. AI Magazine , Winter 1994 , 27{61. White I I I, C. C., & Sc herer, W. T. (1989). Solutions pro cedures for partially observ ed Mark o v decision pro cesses. Op er ations R ese ar ch , 37 (5), 791{797. Williamson, M. (1996). A v alue-directed approac h to planning. Ph.D. thesis 96{06{03, Univ ersit y of W ashington, Departmen t of Computer Science and Engineering. 
Williamson, M., & Hanks, S. (1994). Optimal planning with a goal-directed utilit y mo del. In Pr o c e e dings of the Se c ond International Confer enc e on AI Planning Systems , pp. 176{180 Chicago, IL. Winston, P . H. (1992). A rticial Intel ligenc e, Thir d Edition . Addison-W esley , Reading, Massac h usetts. Y ang, Q. (1998). Intel ligent Planning : A De c omp osition and A bstr action Base d Appr o ach . Springer V erlag. Zhang, N. L., & Liu, W. (1997). A mo del appro ximation sc heme for planning in partially observ able sto c hastic domains. Journal of A rticial Intel ligenc e R ese ar ch , 7 , 199{230. Zhang, N. L., & P o ole, D. (1996). Exploiting causal indep endence in Ba y esian net w ork inference. Journal of A rticial Intel ligenc e R ese ar ch , 5 , 301{328. 94