Journal of Artificial Intelligence Research 19 (2003) 569-629. Submitted 05/01; published 12/03.

Accelerating Reinforcement Learning through Implicit Imitation

Bob Price (price@cs.ubc.ca)
Department of Computer Science, University of British Columbia, Vancouver, B.C., Canada V6T 1Z4

Craig Boutilier (cebly@cs.toronto.edu)
Department of Computer Science, University of Toronto, Toronto, ON, Canada M5S 3H5

Abstract

Imitation can be viewed as a means of enhancing learning in multiagent environments. It augments an agent's ability to learn useful behaviors by making intelligent use of the knowledge implicit in behaviors demonstrated by cooperative teachers or other more experienced agents. We propose and study a formal model of implicit imitation that can accelerate reinforcement learning dramatically in certain cases. Roughly, by observing a mentor, a reinforcement-learning agent can extract information about its own capabilities in, and the relative value of, unvisited parts of the state space. We study two specific instantiations of this model, one in which the learning agent and the mentor have identical abilities, and one designed to deal with agents and mentors with different action sets. We illustrate the benefits of implicit imitation by integrating it with prioritized sweeping, and demonstrating improved performance and convergence through observation of single and multiple mentors. Though we make some stringent assumptions regarding observability and possible interactions, we briefly comment on extensions of the model that relax these restrictions.

1. Introduction

The application of reinforcement learning to multiagent systems offers unique opportunities and challenges.
When agents are viewed as independently trying to achieve their own ends, interesting issues in the interaction of agent policies (Littman, 1994) must be resolved (e.g., by appeal to equilibrium concepts). However, the fact that agents may share information for mutual gain (Tan, 1993) or distribute their search for optimal policies and communicate reinforcement signals to one another (Mataric, 1998) offers intriguing possibilities for accelerating reinforcement learning and enhancing agent performance. Another way in which individual agent performance can be improved is by having a novice agent learn reasonable behavior from an expert mentor. This type of learning can be brought about through explicit teaching or demonstration (Atkeson & Schaal, 1997; Lin, 1992; Whitehead, 1991a), by sharing of privileged information (Mataric, 1998), or through an explicit cognitive representation of imitation (Bakker & Kuniyoshi, 1996). In imitation, the agent's own exploration is used to ground its observations of other agents' behaviors in its own capabilities and resolve any ambiguities in observations arising from partial observability and noise.

A common thread in all of this work is the use of a mentor to guide the exploration of the observer. Typically, guidance is achieved through some form of explicit communication between mentor and observer. A less direct form of teaching involves an observer extracting information from a mentor without the mentor making an explicit attempt to demonstrate a specific behavior of interest (Mitchell, Mahadevan, & Steinberg, 1985).

© 2003 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.
In this paper we develop an imitation model we call implicit imitation that allows an agent to accelerate the reinforcement learning process through the observation of an expert mentor (or mentors). The agent observes the state transitions induced by the mentor's actions and uses the information gleaned from these observations to update the estimated value of its own states and actions. We will distinguish two settings in which implicit imitation can occur: homogeneous settings, in which the learning agent and the mentor have identical actions; and heterogeneous settings, where their capabilities may differ. In the homogeneous setting, the learner can use the observed mentor transitions directly to update its own estimated model of its actions, or to update its value function. In addition, a mentor can provide hints to the observer about the parts of the state space on which it may be worth focusing attention. The observer's attention to an area might take the form of additional exploration of the area or additional computation brought to bear on the agent's prior beliefs about the area. In the heterogeneous setting, similar benefits accrue, but with the potential for an agent to be misled by a mentor that possesses abilities different from its own. In this case, the learner needs some mechanism to detect such situations and to make efforts to temper the influence of these observations.

We derive several new techniques to support implicit imitation that are largely independent of any specific reinforcement learning algorithm, though they are best suited for use with model-based methods. These include model extraction, augmented backups, feasibility testing, and k-step repair. We first describe implicit imitation in homogeneous domains, then we describe the extension to heterogeneous settings.
We illustrate its effectiveness empirically by incorporating it into Moore and Atkeson's (1993) prioritized sweeping algorithm.

The implicit imitation model has several advantages over more direct forms of imitation and teaching. It does not require any agent to explicitly play the role of mentor or teacher. Observers learn simply by watching the behavior of other agents; if an observed "mentor" shares certain subtasks with the observer, the observed behavior can be incorporated (indirectly) by the observer to improve its estimate of its own value function. This is important because there are many situations in which an observer can learn from a mentor that is unwilling or unable to alter its behavior to accommodate the observer, or even communicate information to it. For example, common communication protocols may be unavailable to agents designed by different developers (e.g., Internet agents); agents may find themselves in a competitive situation in which there is disincentive to share information or skills; or there may simply be no incentive for one agent to provide information to another.[1]

Another key advantage of our approach, which arises from formalizing imitation in the reinforcement learning context, is the fact that the observer is not constrained to directly imitate (i.e., duplicate the actions of) the mentor. The learner can decide whether such "explicit imitation" is worthwhile. Implicit imitation can thus be seen as blending the advantages of explicit teaching or explicit knowledge transfer with those of independent learning.

[1] For reasons of consistency, we will use the term "mentor" to describe any agent from which an observer can learn, even if the mentor is an unwilling or unwitting participant.
In addition, because an agent learns by observation, it can exploit the existence of multiple mentors, essentially distributing its search. Finally, we do not assume that the observer knows the actual actions taken by the mentor, or that the mentor shares a reward function (or goals) with the observer. Again, this stands in sharp contrast with many existing models of teaching, imitation, and behavior learning by observation. While we make some strict assumptions in this paper with respect to observability, complete knowledge of reward functions, and the existence of mappings between agent state spaces, the model can be generalized in interesting ways. We will elaborate on some of these generalizations near the end of the paper.

The remainder of the paper is structured as follows. We provide the necessary background on Markov decision processes and reinforcement learning for the development of our implicit imitation model in Section 2. In Section 3, we describe a general formal framework for the study of implicit imitation in reinforcement learning. Two specific instantiations of this framework are then developed. In Section 4, a model for homogeneous agents is developed. The model extraction technique is explained and the augmented Bellman backup is proposed as a mechanism for incorporating observations into model-based reinforcement learning algorithms. Model confidence testing is then introduced to ensure that misleading information does not have undue influence on a learner's exploration policy. The use of mentor observations to focus attention on interesting parts of the state space is also introduced. Section 5 develops a model for heterogeneous agents.
The model extends the homogeneous model through feasibility testing, a device by which a learner can detect whether the mentor's abilities are similar to its own, and k-step repair, whereby a learner can attempt to "mimic" the trajectory of a mentor that cannot be duplicated exactly. Both of these techniques prove crucial in heterogeneous settings. The effectiveness of these models is demonstrated on a number of carefully chosen navigation problems. Section 6 examines conditions under which implicit imitation will and will not work well. Section 7 describes several promising extensions to the model. Section 8 examines the implicit imitation model in the context of related work and Section 9 considers future work before drawing some general conclusions about implicit imitation and the field of computational imitation more broadly.

2. Reinforcement Learning

Our aim is to provide a formal model of implicit imitation, whereby an agent can learn how to act optimally by combining its own experience with its observations of the behavior of an expert mentor. Before doing so, we describe in this section the standard model of reinforcement learning used in artificial intelligence. Our model will build on this single-agent view of learning how to act. We begin by reviewing Markov decision processes, which provide a model for sequential decision making under uncertainty, and then move on to describe reinforcement learning, with an emphasis on model-based methods.
2.1 Markov Decision Processes

Markov decision processes (MDPs) have proven very useful in modeling stochastic sequential decision problems, and have been widely used in decision-theoretic planning to model domains in which an agent's actions have uncertain effects, an agent's knowledge of the environment is uncertain, and the agent can have multiple, possibly conflicting objectives. In this section, we describe the basic MDP model and consider one classical solution procedure. We do not consider action costs in our formulation of MDPs, though these pose no special complications. Finally, we make the assumption of full observability. Partially observable MDPs (POMDPs) (Cassandra, Kaelbling, & Littman, 1994; Lovejoy, 1991; Smallwood & Sondik, 1973) are much more computationally demanding than fully observable MDPs. Our imitation model will be based on a fully observable model, though some of the generalizations of our model mentioned in the concluding section build on POMDPs. We refer the reader to Bertsekas (1987); Boutilier, Dean and Hanks (1999); and Puterman (1994) for further material on MDPs.

An MDP can be viewed as a stochastic automaton in which actions induce transitions between states, and rewards are obtained depending on the states visited by an agent. Formally, an MDP can be defined as a tuple ⟨S, A, T, R⟩, where S is a finite set of states or possible worlds, A is a finite set of actions, T is a state transition function, and R is a reward function. The agent can control the state of the system to some extent by performing actions a ∈ A that cause state transitions, movement from the current state to some new state. Actions are stochastic in that the actual transition caused cannot generally be predicted with certainty.
The transition function T : S × A → Δ(S) describes the effects of each action at each state. T(s_i, a) is a probability distribution over S; specifically, T(s_i, a)(s_j) is the probability of ending up in state s_j ∈ S when action a is performed at state s_i. We will denote this quantity by Pr(s_i, a, s_j). We require that 0 ≤ Pr(s_i, a, s_j) ≤ 1 for all s_i, s_j, and that for all s_i, Σ_{s_j ∈ S} Pr(s_i, a, s_j) = 1. The components S, A and T determine the dynamics of the system being controlled. The assumption that the system is fully observable means that the agent knows the true state at each time t (once that stage is reached), and its decisions can be based solely on this knowledge. Thus, uncertainty lies only in the prediction of an action's effects, not in determining its actual effect after its execution.

A (deterministic, stationary, Markovian) policy π : S → A describes a course of action to be adopted by an agent controlling the system. An agent adopting such a policy performs action π(s) whenever it finds itself in state s. Policies of this form are Markovian since the action choice at any state does not depend on the system history, and are stationary since action choice does not depend on the stage of the decision problem. For the problems we consider, optimal stationary Markovian policies always exist.

We assume a bounded, real-valued reward function R : S → ℝ. R(s) is the instantaneous reward an agent receives for occupying state s. A number of optimality criteria can be adopted to measure the value of a policy π, all measuring in some way the reward accumulated by an agent as it traverses the state space through the execution of π. In this work, we focus on discounted infinite-horizon problems: the current value of a reward received t stages in the future is discounted by some factor γ^t (0 ≤ γ < 1).
This allows simpler computational methods to be used, as discounted total reward will be finite. Discounting can be justified on other (e.g., economic) grounds in many situations as well. The value function V^π : S → ℝ reflects the value of a policy π at any state s; this is simply the expected sum of discounted future rewards obtained by executing π beginning at s. A policy π* is optimal if, for all s ∈ S and all policies π, we have V^{π*}(s) ≥ V^π(s). We are guaranteed that such optimal (stationary) policies exist in our setting (Puterman, 1994). The (optimal) value of a state V*(s) is its value V^{π*}(s) under any optimal policy π*. By solving an MDP, we refer to the problem of constructing an optimal policy.

Value iteration (Bellman, 1957) is a simple iterative approximation algorithm for optimal policy construction. Given some arbitrary estimate V^0 of the true value function V*, we iteratively improve this estimate as follows:

    V^n(s_i) = R(s_i) + max_{a ∈ A} { γ Σ_{s_j ∈ S} Pr(s_i, a, s_j) V^{n−1}(s_j) }    (1)

The computation of V^n(s) given V^{n−1} is known as a Bellman backup. The sequence of value functions V^n produced by value iteration converges linearly to V*. Each iteration of value iteration requires O(|S|^2 |A|) computation time, and the number of iterations is polynomial in |S|. For some finite n, the actions a that maximize the right-hand side of Equation 1 form an optimal policy, and V^n approximates its value. Various termination criteria can be applied; for example, one might terminate the algorithm when

    ‖V^{i+1} − V^i‖ ≤ ε(1 − γ) / 2γ    (2)

(where ‖X‖ = max{|x| : x ∈ X} denotes the supremum norm).
This ensures the resulting value function V^{i+1} is within ε/2 of the optimal function V* at any state, and that the induced policy is ε-optimal (i.e., its value is within ε of V*) (Puterman, 1994).

A concept that will be useful later is that of a Q-function. Given an arbitrary value function V, we define Q^V_a(s_i) as

    Q^V_a(s_i) = R(s_i) + γ Σ_{s_j ∈ S} Pr(s_i, a, s_j) V(s_j)    (3)

Intuitively, Q^V_a(s) denotes the value of performing action a at state s and then acting in a manner that has value V (Watkins & Dayan, 1992). In particular, we define Q*_a to be the Q-function defined with respect to V*, and Q^n_a to be the Q-function defined with respect to V^{n−1}. In this manner, we can rewrite Equation 1 as:

    V^n(s) = max_{a ∈ A} { Q^n_a(s) }    (4)

We define an ergodic MDP as an MDP in which every state is reachable from any other state in a finite number of steps with non-zero probability.

2.2 Model-based Reinforcement Learning

One difficulty with the use of MDPs is that the construction of an optimal policy requires that the agent know the exact transition probabilities Pr and reward model R. In the specification of a decision problem, these requirements, especially the detailed specification of the domain's dynamics, can impose an undue burden on the agent's designer. Reinforcement learning can be viewed as solving an MDP in which the full details of the model, in particular Pr and R, are not known to the agent. Instead, the agent learns how to act optimally through experience with its environment. We provide a brief overview of reinforcement learning in this section (with an emphasis on model-based approaches). For further details, please refer to the texts of Sutton and Barto (1998) and Bertsekas and Tsitsiklis (1996), and the survey of Kaelbling, Littman and Moore (1996).
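As a concrete illustration of the dynamic programming machinery of Section 2.1 (Equations 1-4), the following sketch encodes an MDP with plain dictionaries. The encoding, state names, and action names are our illustrative assumptions, not part of the paper.

```python
def q_value(s, a, V, T, R, gamma):
    """Q^V_a(s) = R(s) + gamma * sum_t Pr(s, a, t) * V(t)   (Equation 3)."""
    return R[s] + gamma * sum(p * V[t] for t, p in T[s][a].items())

def value_iteration(S, A, T, R, gamma, eps=1e-6):
    """Value iteration (Equations 1 and 4) with the stopping rule of Equation 2."""
    V = {s: 0.0 for s in S}                      # arbitrary initial estimate V^0
    threshold = eps * (1 - gamma) / (2 * gamma)  # Equation 2
    while True:
        V_new = {s: max(q_value(s, a, V, T, R, gamma) for a in A) for s in S}
        # Supremum-norm termination test: final V is within eps/2 of V*.
        if max(abs(V_new[s] - V[s]) for s in S) <= threshold:
            return V_new
        V = V_new

def greedy_policy(S, A, V, T, R, gamma):
    """The policy choosing, at each state, an action maximizing Q^V_a(s)."""
    return {s: max(A, key=lambda a: q_value(s, a, V, T, R, gamma)) for s in S}

# Tiny illustrative MDP: "go" moves s0 -> s1; s1 is absorbing and rewarding.
# T[s][a] maps successor states to probabilities; R maps states to rewards.
S = ["s0", "s1"]
A = ["stay", "go"]
T = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s1": 1.0}}}
R = {"s0": 0.0, "s1": 1.0}

V = value_iteration(S, A, T, R, gamma=0.5)   # V* is (1, 2) on (s0, s1)
pi = greedy_policy(S, A, V, T, R, gamma=0.5)
```

With γ = 0.5 the absorbing state satisfies V*(s1) = 1 + 0.5 V*(s1) = 2, and the greedy policy at s0 selects "go".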
In the general model, we assume that an agent is controlling an MDP ⟨S, A, T, R⟩ and initially knows its state and action spaces, S and A, but not the transition model T or reward function R. The agent acts in its environment, and at each stage of the process makes a "transition" ⟨s, a, r, t⟩; that is, it takes action a at state s, receives reward r and moves to state t. Based on repeated experiences of this type it can determine an optimal policy in one of two ways: (a) in model-based reinforcement learning, these experiences can be used to learn the true nature of T and R, and the MDP can be solved using standard methods (e.g., value iteration); or (b) in model-free reinforcement learning, these experiences can be used to directly update an estimate of the optimal value function or Q-function.

Probably the simplest model-based reinforcement learning scheme is the certainty equivalence approach. Intuitively, a learning agent is assumed to have some current estimated transition model T̂ of its environment consisting of estimated probabilities Pr̂(s, a, t) and an estimated reward model R̂(s). With each experience ⟨s, a, r, t⟩ the agent updates its estimated models, solves the estimated MDP M̂ to obtain a policy π̂ that would be optimal if its estimated models were correct, and acts according to that policy.

To make the certainty equivalence approach precise, a specific form of estimated model and update procedure must be adopted. A common approach is to use the empirical distribution of observed state transitions and rewards as the estimated model. For instance, if action a has been attempted C(s, a) times at state s, and on C(s, a, t) of those occasions state t has been reached, then the estimate Pr̂(s, a, t) = C(s, a, t)/C(s, a).
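This count-based scheme, together with the prior pseudo-counts discussed next, can be sketched as follows. The class and parameter names are our illustrative assumptions, not the authors' implementation.

```python
class CountTransitionModel:
    """Empirical transition estimates built from experience tuples <s, a, t>.

    With prior_count = 0 this is the pure empirical estimate
    C(s, a, t) / C(s, a); with prior_count > 0 it is the Dirichlet expected
    probability n(s, a, t) / sum_t' n(s, a, t'), where
    n(s, a, t) = C(s, a, t) + prior_count.  A positive prior also avoids
    division by zero when C(s, a) = 0 and keeps unobserved transitions
    from being assigned probability zero.
    """

    def __init__(self, S, A, prior_count=1.0):
        self.S = S
        self.prior = prior_count
        # counts[s][a][t] holds C(s, a, t)
        self.counts = {s: {a: {t: 0.0 for t in S} for a in A} for s in S}

    def observe(self, s, a, t):
        """Record one experienced transition <s, a, t>."""
        self.counts[s][a][t] += 1.0

    def estimate(self, s, a, t):
        """Expected probability of reaching t when a is taken at s."""
        n_sat = self.counts[s][a][t] + self.prior
        total = sum(self.counts[s][a].values()) + self.prior * len(self.S)
        return n_sat / total
```

For example, with a prior count of 1 over two states, an unvisited pair (s, a) yields a uniform estimate of 0.5, and two observations of the same transition shift the estimate to 3/4.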
If C(s, a) = 0, some prior estimate is used (e.g., one might assume all state transitions are equiprobable). A Bayesian approach (Dearden, Friedman, & Andre, 1999) uses an explicit prior distribution over the parameters of the transition distribution Pr(s, a, ·), and then updates these with each experienced transition. For instance, we might assume a Dirichlet (Generalized Beta) distribution (DeGroot, 1975) with parameters n(s, a, t) associated with each possible successor state t. The Dirichlet parameters are equal to the experience-based counts C(s, a, t) plus a "prior count" P(s, a, t) representing the agent's prior beliefs about the distribution (i.e., n(s, a, t) = C(s, a, t) + P(s, a, t)). The expected transition probability Pr(s, a, t) is then n(s, a, t) / Σ_{t′} n(s, a, t′). Assuming parameter independence, the MDP M̂ can be solved using these expected values. Furthermore, the model can be updated with ease, simply increasing n(s, a, t) by one with each observation ⟨s, a, r, t⟩. This model has the advantage over a counter-based approach of allowing a flexible prior model and generally does not assign probability zero to unobserved transitions. We will adopt this Bayesian perspective in our imitation model.

One difficulty with the certainty equivalence approach is the computational burden of re-solving an MDP M̂ with each update of the models T̂ and R̂ (i.e., with each experience). One could circumvent this to some extent by batching experiences and updating (and re-solving) the model only periodically. Alternatively, one could use computational effort judiciously to apply Bellman backups only at those states whose values (or Q-values) are likely to change the most given a change in the model.
Moore and Atkeson's (1993) prioritized sweeping algorithm does just this. When T̂ is updated by changing Pr̂(s, a, t), a Bellman backup is applied at s to update its estimated value V̂, as well as the Q-value Q̂(s, a). Suppose the magnitude of the change in V̂(s) is given by ΔV̂(s). For any predecessor w, the Q-values Q̂(w, a′) (and hence the values V̂(w)) can change if Pr̂(w, a′, s) > 0. The magnitude of the change is bounded by Pr̂(w, a′, s)ΔV̂(s). All such predecessors w of s are placed in a priority queue with Pr̂(w, a′, s)ΔV̂(s) serving as the priority. A fixed number of Bellman backups are applied to states in the order in which they appear in the queue. With each backup, any change in value can cause new predecessors to be inserted into the queue. In this way, computational effort is focused on those states where a Bellman backup has the greatest impact due to the model change. Furthermore, the backups are applied only to a subset of states, and are generally only applied a fixed number of times. By way of contrast, in the certainty equivalence approach, backups are applied until convergence. Thus prioritized sweeping can be viewed as a specific form of asynchronous value iteration, and has appealing computational properties (Moore & Atkeson, 1993).

Under certainty equivalence, the agent acts as if the current approximation of the model is correct, even though the model is likely to be inaccurate early in the learning process. If the optimal policy for this inaccurate model prevents the agent from exploring the transitions which form part of the optimal policy for the true model, then the agent will fail to find the optimal policy. For this reason, explicit exploration policies are invariably used to ensure that each action is tried at each state sufficiently often.
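The prioritized sweeping scheme just described can be sketched as follows. This is an illustrative sketch using an assumed dictionary encoding of the model, not Moore and Atkeson's implementation; the cutoff and backup-budget parameters are our own.

```python
import heapq

def prioritized_sweep(S, A, T, R, V, changed_state, gamma,
                      max_backups=40, min_priority=1e-4):
    """Sketch of prioritized sweeping (Moore & Atkeson, 1993).

    After a model update at `changed_state`, Bellman backups are applied to
    states in order of their potential value change: a predecessor w of s
    is queued with priority Pr(w, a', s) * |delta V(s)|.
    """
    # Predecessor pairs (w, a') with Pr(w, a', s) > 0, for each state s.
    preds = {s: [(w, a) for w in S for a in A if T[w][a].get(s, 0) > 0]
             for s in S}

    queue = [(0.0, changed_state)]   # max-priority queue via negated keys
    queued = {changed_state}
    for _ in range(max_backups):
        if not queue:
            break
        _, s = heapq.heappop(queue)
        queued.discard(s)
        # Bellman backup at s (Equation 1).
        new_v = R[s] + max(gamma * sum(p * V[t] for t, p in T[s][a].items())
                           for a in A)
        delta = abs(new_v - V[s])
        V[s] = new_v
        # Queue predecessors whose values may change by Pr(w, a', s) * delta.
        for w, a in preds[s]:
            priority = T[w][a][s] * delta
            if priority > min_priority and w not in queued:
                heapq.heappush(queue, (-priority, w))
                queued.add(w)
    return V

# Tiny illustrative MDP: "go" moves s0 -> s1; s1 is absorbing and rewarding.
S = ["s0", "s1"]
A = ["stay", "go"]
T = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s1": 1.0}}}
R = {"s0": 0.0, "s1": 1.0}
V = prioritized_sweep(S, A, T, R, {s: 0.0 for s in S}, "s1", gamma=0.5)
```

Sweeping outward from the rewarding state drives V close to the optimal values (1, 2) in a bounded number of backups, without ever sweeping the full state space to convergence as certainty equivalence would.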
By acting randomly (assuming an ergodic MDP), an agent is assured of sampling each action at each state infinitely often in the limit. Unfortunately, the actions of such an agent will fail to exploit (in fact, will be completely uninfluenced by) its knowledge of the optimal policy. This exploration-exploitation tradeoff refers to the tension between trying new actions in order to find out more about the environment and executing actions believed to be optimal on the basis of the current estimated model.

The most common method for exploration is the ε-greedy method in which the agent chooses a random action a fraction ε of the time, where 0 < ε < 1. Typically, ε is decayed over time to increase the agent's exploitation of its knowledge. In the Boltzmann approach, each action is selected with a probability proportional to its value:

    Pr_s(a) = e^{Q(s,a)/τ} / Σ_{a′ ∈ A} e^{Q(s,a′)/τ}    (5)

The proportionality can be adjusted nonlinearly with the temperature parameter τ. As τ → 0 the probability of selecting the action with the highest value tends to 1. Typically, τ is started high so that actions are randomly explored during the early stages of learning. As the agent gains knowledge about the effects of its actions and the value of these effects, the parameter τ is decayed so that the agent spends more time exploiting actions known to be valuable and less time randomly exploring actions.

More sophisticated methods attempt to use information about model confidence and value magnitudes to plan a utility-maximizing exploration plan. An early approximation of this scheme can be found in the interval estimation method (Kaelbling, 1993). Bayesian methods have also been used to calculate the expected value of information to be gained from exploration (Meuleau & Bourgine, 1999; Dearden et al., 1999).
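The ε-greedy rule and the Boltzmann rule of Equation 5 can be sketched as below, where Q_s maps each action to its current Q-value at state s. This is an illustrative sketch; the function names are ours.

```python
import math
import random

def epsilon_greedy_action(Q_s, epsilon):
    """With probability epsilon choose a uniformly random action, else a greedy one."""
    if random.random() < epsilon:
        return random.choice(list(Q_s))
    return max(Q_s, key=Q_s.get)

def boltzmann_action(Q_s, tau):
    """Sample an action with probability proportional to e^{Q(s,a)/tau}  (Equation 5)."""
    weights = {a: math.exp(q / tau) for a, q in Q_s.items()}
    r = random.random() * sum(weights.values())
    for a, w in weights.items():
        r -= w
        if r <= 0:
            return a
    return a  # guard against floating-point residue
```

As τ shrinks the softmax concentrates on the highest-valued action, while a large τ approaches uniform random exploration, matching the annealing schedule described above.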
We concentrate in this paper on model-based approaches to reinforcement learning. However, we should point out that model-free methods, those in which an estimate of the optimal value function or Q-function is learned directly, without recourse to a domain model, have attracted much attention. For example, TD-methods (Sutton, 1988) and Q-learning (Watkins & Dayan, 1992) have both proven to be among the more popular methods for reinforcement learning. Our methods can be modified to deal with model-free approaches, as we discuss in the concluding section. We also focus on so-called table-based (or explicit) representations of models and value functions. When state and action spaces are large, table-based approaches become unwieldy, and the associated algorithms are generally intractable. In these situations, approximators are often used to estimate the values of states. We will discuss ways in which our techniques can be extended to allow for function approximation in the concluding section.

3. A Formal Framework for Implicit Imitation

To model the influence that a mentor agent can have on the decision process or the learning behavior of an observer, we must extend the single-agent decision model of MDPs to account for the actions and objectives of multiple agents. In this section, we introduce a formal framework for studying implicit imitation. We begin by introducing a general model for stochastic games (Shapley, 1953; Myerson, 1991), and then impose various assumptions and restrictions on this general model that allow us to focus on the key aspects of implicit imitation.
We note that the framework proposed here is useful for the study of other forms of knowledge transfer in multiagent systems, and we briefly point out various extensions of the framework that would permit implicit imitation, and other forms of knowledge transfer, in more general settings.

3.1 Non-Interacting Stochastic Games

Stochastic games can be viewed as a multiagent extension of Markov decision processes. Though Shapley's (1953) original formulation of stochastic games involved a zero-sum (fully competitive) assumption, various generalizations of the model have been proposed allowing for arbitrary relationships between agents' utility functions (Myerson, 1991).[2] Formally, an n-agent stochastic game ⟨S, {A_i : i ≤ n}, T, {R_i : i ≤ n}⟩ comprises a set of n agents (1 ≤ i ≤ n), a set of states S, a set of actions A_i for each agent i, a state transition function T, and a reward function R_i for each agent i. Unlike an MDP, individual agent actions do not determine state transitions; rather it is the joint action taken by the collection of agents that determines how the system evolves at any point in time. Let A = A_1 × ··· × A_n be the set of joint actions; then T : S × A → Δ(S), with T(s_i, a)(s_j) = Pr(s_i, a, s_j) denoting the probability of ending up in state s_j ∈ S when joint action a is performed at state s_i. For convenience, we introduce the notation A_{−i} to denote the set of joint actions A_1 × ··· × A_{i−1} × A_{i+1} × ··· × A_n involving all agents except i. We use a_i · a_{−i} to denote the (full) joint action obtained by conjoining a_i ∈ A_i with a_{−i} ∈ A_{−i}.

[2] For example, see the fully cooperative multiagent MDP model proposed by Boutilier (1999).
Because the interests of the individual agents may be at odds, strategic reasoning and notions of equilibrium are generally involved in the solution of stochastic games. Because our aim is to study how a reinforcement agent might learn by observing the behavior of an expert mentor, we wish to restrict the model in such a way that strategic interactions need not be considered: we want to focus on settings in which the actions of the observer and the mentor do not interact. Furthermore, we want to assume that the reward functions of the agents do not conflict in a way that requires strategic reasoning.

We define noninteracting stochastic games by appealing to the notion of an agent projection function which is used to extract an agent's local state from the underlying game. In these games, an agent's local state determines all aspects of the global state that are relevant to its decision making process, while the projection function determines which global states are identical from an agent's local perspective. Formally, for each agent i, we assume a local state space S_i, and a projection function L_i : S → S_i. For any s, t ∈ S, we write s ∼_i t iff L_i(s) = L_i(t). This equivalence relation partitions S into a set of equivalence classes such that the elements within a specific class (i.e., L_i^{−1}(s) for some s ∈ S_i) need not be distinguished by agent i for the purposes of individual decision making. We say a stochastic game is noninteracting if there exists a local state space S_i and projection function L_i for each agent i such that:

1. If s ∼_i t, then ∀ a_i ∈ A_i, a_{−i} ∈ A_{−i}, w_i ∈ S_i we have

       Σ { Pr(s, a_i · a_{−i}, w) : w ∈ L_i^{−1}(w_i) } = Σ { Pr(t, a_i · a_{−i}, w) : w ∈ L_i^{−1}(w_i) }

2.
R_i(s) = R_i(t) if s ∼_i t.

Intuitively, condition 1 above imposes two distinct requirements on the game from the perspective of agent i. First, if we ignore the existence of other agents, it provides a notion of state space abstraction suitable for agent i. Specifically, L_i clusters together states s ∈ S only if each state in an equivalence class has identical dynamics with respect to the abstraction induced by L_i. This type of abstraction is a form of bisimulation of the type studied in automaton minimization (Hartmanis & Stearns, 1966; Lee & Yannakakis, 1992) and automatic abstraction methods developed for MDPs (Dearden & Boutilier, 1997; Dean & Givan, 1997). It is not hard to show (ignoring the presence of other agents) that the underlying system is Markovian with respect to the abstraction (or, equivalently, w.r.t. S_i) if condition 1 is met. The quantification over all a_{-i} imposes a strong noninteraction requirement, namely, that the dynamics of the game from the perspective of agent i is independent of the strategies of the other agents. Condition 2 simply requires that all states within a given equivalence class for agent i have the same reward for agent i. This means that no states within a class need to be distinguished; each local state can be viewed as atomic.

A noninteracting game induces an MDP M_i = ⟨S_i, A_i, Pr_i, R_i⟩ for each agent i, where Pr_i is given by condition 1 above. Specifically, for each s_i, t_i ∈ S_i:

   Pr_i(s_i, a_i, t_i) = Σ { Pr(s, a_i · a_{-i}, t) : t ∈ L_i^{-1}(t_i) }

where s is any state in L_i^{-1}(s_i) and a_{-i} is any element of A_{-i}. Let π_i : S_i → A_i be an optimal policy for M_i. We can extend this to a strategy π_i^G : S → A_i for the underlying stochastic game by simply applying π_i(s_i) to every state s ∈ S such that L_i(s) = s_i.
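To make the induced model concrete, the construction of Pr_i can be sketched in Python on a hypothetical toy game (the two-agent line world below is our own illustration, not an example from the paper): each agent moves independently on a three-cell line, the projection L_i picks out that agent's coordinate, and Pr_i is computed by summing over the preimage L_i^{-1}(t_i) using an arbitrary representative state and an arbitrary action for the other agent.

```python
# Illustrative non-interacting game: two agents on a 3-cell line.
# Global state is a pair (s1, s2); agent i's projection L_i picks out
# component i.  Each action 'L'/'R' moves only that agent's component,
# so condition 1 holds trivially.

STATES = [0, 1, 2]
ACTIONS = ['L', 'R']

def step(x, a):
    """Deterministic local dynamics on the 3-cell line."""
    return max(x - 1, 0) if a == 'L' else min(x + 1, 2)

def joint_pr(s, joint_a, t):
    """Pr(s, a1·a2, t) for the global game (deterministic here)."""
    (s1, s2), (a1, a2), (t1, t2) = s, joint_a, t
    return 1.0 if (step(s1, a1), step(s2, a2)) == (t1, t2) else 0.0

def induced_pr(i, s_i, a_i, t_i):
    """Pr_i(s_i, a_i, t_i): take any representative global state and any
    action for the other agent, then sum joint_pr over the preimage
    L_i^{-1}(t_i)."""
    other = STATES[0]        # arbitrary component for the other agent
    a_other = ACTIONS[0]     # arbitrary action for the other agent
    s = (s_i, other) if i == 0 else (other, s_i)
    total = 0.0
    for w in STATES:         # all global states projecting to t_i
        t = (t_i, w) if i == 0 else (w, t_i)
        joint_a = (a_i, a_other) if i == 0 else (a_other, a_i)
        total += joint_pr(s, joint_a, t)
    return total
```

Because the game is noninteracting, the result is independent of which representative state and which a_{-i} are chosen; in an interacting game, different choices would disagree, which is one way a failure of condition 1 could be detected.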
The following proposition shows that the term "noninteracting" indeed provides an appropriate description of such a game.

Proposition 1  Let G be a noninteracting stochastic game, M_i the induced MDP for agent i, and π_i some optimal policy for M_i. The strategy π_i^G extending π_i to G is dominant for agent i.

Thus each agent can solve the noninteracting game by abstracting away irrelevant aspects of the state space, ignoring other agent actions, and solving its "personal" MDP M_i. Given an arbitrary stochastic game, it can generally be quite difficult to discover whether it is noninteracting, requiring the construction of appropriate projection functions. In what follows, we will simply assume that the underlying multiagent system is a noninteracting game. Rather than specifying the game and projection functions, we will specify the individual MDPs M_i themselves. The noninteracting game induced by the set of individual MDPs is simply the "cross product" of the individual MDPs. Such a view is often quite natural. Consider the example of three robots moving in some two-dimensional office domain. If we are able to neglect the possibility of interaction (for example, if the robots can occupy the same 2-D position, at a suitable level of granularity, and do not require the same resources to achieve their tasks), then we might specify an individual MDP for each robot. The local state might be determined by the robot's x, y-position, orientation, and the status of its own tasks. The global state space would be the cross product S_1 × S_2 × S_3 of the local spaces. The individual components of any joint action would affect only the local state, and each agent would care (through its reward function R_i) only about its local state.

We note that the projection function L_i should not be viewed as equivalent to an observation function.
We do not assume that agent i can only distinguish elements of S_i; in fact, observations of other agents' states will be crucial for imitation. Rather, the existence of L_i simply means that, from the point of view of decision making with a known model, the agent need not worry about distinctions other than those made by L_i. Assuming no computational limitations, an agent i need only solve M_i, but may use observations of other agents in order to improve its knowledge about M_i's dynamics.³

3. We elaborate on the condition of computational limitations below.

3.2 Implicit Imitation

Despite the very independent nature of the agent subprocesses in a noninteracting multiagent system, there are circumstances in which the behavior of one agent may be relevant to another. To keep the discussion simple, we assume the existence of an expert mentor agent m, which is implementing some stationary (and presumably optimal) policy π_m over its local MDP M_m = ⟨S_m, A_m, Pr_m, R_m⟩. We also assume a second agent o, the observer, with local MDP M_o = ⟨S_o, A_o, Pr_o, R_o⟩. While nothing about the mentor's behavior is relevant to the observer if it knows its own MDP (and can solve it without computational difficulty), the situation can be quite different if o is a reinforcement learner without complete knowledge of the model M_o. It may well be that the observed behavior of the mentor provides valuable information to the observer in its quest to learn how to act optimally within M_o. To take an extreme case, if the mentor's MDP is identical to the observer's, and the mentor is an expert (in the sense of acting optimally), then the behavior of the mentor indicates exactly what the observer should do.
Even if the mentor is not acting optimally, or if the mentor and observer have different reward functions, mentor state transitions observed by the learner can provide valuable information about the dynamics of the domain. Thus we see that when one agent is learning how to act, the behavior of another can potentially be relevant to the learner, even if the underlying multiagent system is noninteracting. Similar remarks, of course, apply to the case where the observer knows the MDP M_o but computational restrictions make solving it difficult: observed mentor transitions might provide valuable information about where to focus computational effort.⁴

The main motivation underlying our model of implicit imitation is that the behavior of an expert mentor can provide hints as to appropriate courses of action for a reinforcement learning agent. Intuitively, implicit imitation is a mechanism by which a learning agent attempts to incorporate the observed experience of an expert mentor agent into its learning process. Like more classical forms of learning by imitation, the learner considers the effects of the mentor's action (or action sequence) in its own context. Unlike direct imitation, however, we do not assume that the learner must "physically" attempt to duplicate the mentor's behavior, nor do we assume that the mentor's behavior is necessarily appropriate for the observer. Instead, the influence of the mentor is on the agent's transition model and its estimates of the values of various states and actions. We elaborate on these points below.

In what follows, we assume a mentor m and associated MDP M_m, and a learner or observer o and associated MDP M_o, as described above. These MDPs are fully observable. We focus on the reinforcement learning problem faced by agent o.
The extension to multiple mentors is straightforward and will be discussed below, but for clarity we assume only one mentor in our description of the abstract framework.

It is clear that certain conditions must be met for the observer to extract useful information from the mentor. We list a number of assumptions that we make at different points in the development of our model.

Observability: We must assume that the learner can observe certain aspects of the mentor's behavior. In this work, we assume that the state of the mentor's MDP is fully observable to the learner. Equivalently, we interpret this as full observability of the underlying noninteracting game, together with knowledge of the mentor's projection function L_m. A more general partially observable model would require the specification of an observation or signal set Z and an observation function O : S_o × S_m → Δ(Z), where O(s_o, s_m)(z) denotes the probability with which the observer obtains signal z when the local states of the observer and mentor are s_o and s_m, respectively. We do not pursue such a model here. It is important to note that we do not assume that the observer has access to the action taken by m at any point in time. Since actions are stochastic, the state (even if fully observable) that results from the mentor invoking a specific control signal is generally insufficient to determine that signal.

4. For instance, algorithms like asynchronous dynamic programming and prioritized sweeping can benefit from such guidance. Indeed, the distinction between reinforcement learning and solving MDPs is viewed by some as rather blurry (Sutton & Barto, 1998; Bertsekas & Tsitsiklis, 1996). Our focus is on the case of an unknown model (i.e., the classical reinforcement learning problem) as opposed to one where computational issues are key.
Thus it seems much more reasonable to assume that states (and transitions) are observable than the actions that gave rise to them.

Analogy: If the observer and the mentor are acting in different local state spaces, it is clear that observations made of the mentor's state transitions can offer no useful information to the observer unless there is some relationship between the two state spaces. There are several ways in which this relationship can be specified. Dautenhahn and Nehaniv (1998) use a homomorphism to define the relationship between mentor and observer for a specific family of trajectories (see Section 8 for further discussion). A slightly different notion might involve the use of some analogical mapping h : S_m → S_o such that an observed state transition s → t provides some information to the observer about the dynamics or value of state h(s) ∈ S_o. In certain circumstances, we might require the mapping h to be homomorphic with respect to Pr(·, a, ·) (for some, or all, a), and perhaps even with respect to R. We discuss these issues in further detail below. In order to simplify our model and avoid undue attention to the (admittedly important) topic of constructing suitable analogical mappings, we will simply assume that the mentor and the observer have "identical" state spaces; that is, S_m and S_o are in some sense isomorphic. The precise sense in which the spaces are isomorphic (or, in some cases, presumed to be isomorphic until proven otherwise) is elaborated below when we discuss the relationship between agent abilities. Thus from this point we simply refer to the state space S without distinguishing the mentor's local space S_m from the observer's S_o.

Abilities: Even with a mapping between states, observations of a mentor's state transitions only tell the observer something about the mentor's abilities, not its own.
We must assume that the observer can in some way "duplicate" the actions taken by the mentor to induce analogous transitions in its own local state space. In other words, there must be some presumption that the mentor and the observer have similar abilities. It is in this sense that the analogical mapping between state spaces can be taken to be a homomorphism. Specifically, we might assume that the mentor and the observer have the same actions available to them (i.e., A_m = A_o = A) and that h : S_m → S_o is homomorphic with respect to Pr(·, a, ·) for all a ∈ A. This requirement can be weakened substantially, without diminishing its utility, by requiring only that the observer be able to implement the actions actually taken by the mentor at a given state s. Finally, we might have an observer that assumes it can duplicate the actions taken by the mentor until it finds evidence to the contrary. In this case, there is a presumed homomorphism between the state spaces. In what follows, we will distinguish between implicit imitation in homogeneous action settings (domains in which the analogical mapping is indeed homomorphic) and heterogeneous action settings (where the mapping may not be a homomorphism). There are more general ways of defining similarity of ability, for example, by assuming that the observer may be able to move through state space in a similar fashion to the mentor without following the same trajectories (Nehaniv & Dautenhahn, 1998). For instance, the mentor may have a way of moving directly between key locations in state space, while the observer may be able to move between analogous locations in a less direct fashion. In such a case, the analogy between states may not be determined by single actions, but rather by sequences of actions or local policies.
We will suggest ways of dealing with restricted forms of analogy of this type in Section 5.

Objectives: Even when the observer and mentor have similar or identical abilities, the value to the observer of the information gleaned from the mentor may depend on the actual policy being implemented by the mentor. We might suppose that the more closely related a mentor's policy is to the optimal policy of the observer, the more useful the information will be. Thus, to some extent, we expect that the more closely aligned the objectives of the mentor and the observer are, the more valuable the guidance provided by the mentor. Unlike in existing teaching models, we do not suppose that the mentor is making any explicit effort to instruct the observer. And because their objectives may not be identical, we do not force the observer to (attempt to) explicitly imitate the behavior of the mentor. In general, we will make no explicit assumptions about the relationship between the objectives of the mentor and the observer. However, we will see that, to some extent, the "closer" they are, the more utility can be derived from implicit imitation.

Finally, we remark on an important assumption we make throughout the remainder of this paper: the observer knows its reward function R_o; that is, for each state s, the observer can evaluate R_o(s) without having visited state s. This is consistent with the view of reinforcement learning as "automatic programming." A user may easily specify a reward function (e.g., in the form of a set of predicates that can be evaluated at any state) prior to learning; it may be more difficult to specify a domain model or optimal policy. In such a setting, the only unknown component of the MDP M_o is the transition function Pr_o.
We believe this approach to reinforcement learning is, in fact, more common in practice than the approach in which the reward function must be sampled.

To reiterate, our aim is to describe a mechanism by which the observer can accelerate its learning; but we emphasize our position that implicit imitation, in contrast to explicit imitation, is not merely replicating the behaviors (or state trajectories) observed in another agent, nor even attempting to reach "similar states." We believe the agent must learn about its own capabilities and adapt the information contained in observed behavior to them. Agents must also explore the appropriate application (if any) of observed behaviors, integrating these with their own, as appropriate, to achieve their own ends. We therefore see imitation as an interactive process in which the behavior of one agent is used to guide the learning of another.

Given this setting, we can list possible ways in which an observer and a mentor can (and cannot) interact, contrasting along the way our perspective and assumptions with those of existing models in the literature.⁵ First, the observer could attempt to directly infer a policy from its observations of mentor state-action pairs. This model has a conceptual simplicity and intuitive appeal, and forms the basis of the behavioral cloning paradigm (Sammut, Hurst, Kedzier, & Michie, 1992; Urbancic & Bratko, 1994). However, it assumes that the observer and mentor share the same reward function and action capabilities. It also assumes that complete and unambiguous trajectories (including action choices) can be observed. A related approach attempts to deduce constraints on the value function from the inferred action preferences of the mentor agent (Utgoff & Clouse, 1991; Šuc & Bratko, 1997).
Again, however, this approach assumes congruity of objectives. Our model is also distinct from models of explicit teaching (Lin, 1992; Whitehead, 1991b): we do not assume that the mentor has any incentive to move through its environment in a way that explicitly guides the learner to explore its own environment and action space more effectively.

Instead of trying to directly learn a policy, an observer could attempt to use observed state transitions of other agents to improve its own environment model Pr_o(s, a, t). With a more accurate model and its own reward function, the observer could calculate more accurate values for states. The state values could then be used to guide the agent towards distant rewards and reduce the need for random exploration. This insight forms the core of our implicit imitation model. This approach has not been developed in the literature, and is appropriate under the conditions listed above, specifically, under conditions where the mentor's actions are unobservable, and the mentor and observer have different reward functions or objectives. Thus, this approach is applicable under more general conditions than many existing models of imitation learning and teaching.

In addition to model information, mentors may also communicate information about the relevance or irrelevance of regions of the state space for certain classes of reward functions. An observer can use the set of states visited by the mentor as heuristic guidance about where to perform backup computations in the state space.

In the next two sections, we develop specific algorithms from our insights about how agents can use observations of others to both improve their own models and assess the relevance of regions within their state spaces.
We first focus on the homogeneous action case, then extend the model to deal with heterogeneous actions.

4. Implicit Imitation in Homogeneous Settings

We begin by describing implicit imitation in homogeneous action settings; the extension to heterogeneous settings will build on the insights developed in this section. We develop a technique called implicit imitation through which observations of a mentor can be used to accelerate reinforcement learning. First, we define the homogeneous setting. Then we develop the implicit imitation algorithm. Finally, we demonstrate how implicit imitation works on a number of simple problems designed to illustrate the role of the various mechanisms we describe.

5. We will describe other models in more detail in Section 8.

4.1 Homogeneous Actions

The homogeneous action setting is defined as follows. We assume a single mentor m and observer o, with individual MDPs M_m = ⟨S, A_m, Pr_m, R_m⟩ and M_o = ⟨S, A_o, Pr_o, R_o⟩, respectively. Note that the agents share the same state space (more precisely, we assume a trivial isomorphic mapping that allows us to identify their local states). We also assume that the mentor is executing some stationary policy π_m. We will often treat this policy as deterministic, but most of our remarks apply to stochastic policies as well. Let the support set Supp(π_m, s) for π_m at state s be the set of actions a ∈ A_m accorded nonzero probability by π_m at state s. We assume that the observer has the same abilities as the mentor in the following sense: for all s, t ∈ S and a_m ∈ Supp(π_m, s), there exists an action a_o ∈ A_o such that Pr_o(s, a_o, t) = Pr_m(s, a_m, t).
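This duplication condition can be stated as a small executable check. The sketch below uses our own dictionary representation (models map (state, action) pairs to successor distributions) and is purely illustrative:

```python
# Homogeneity check: for every state s and every mentor action in
# Supp(pi_m, s), the observer must have some action inducing exactly
# the same successor distribution at s.

def is_homogeneous(pr_o, pr_m, support):
    """Return True iff the observer can duplicate every action the
    mentor might actually take.

    pr_o, pr_m: dicts mapping (state, action) -> {successor: prob}.
    support:    dict mapping state -> list of mentor actions there.
    """
    for s, mentor_actions in support.items():
        # successor distributions the observer can induce at s
        observer_dists = [dist for (s2, _a), dist in pr_o.items() if s2 == s]
        for a_m in mentor_actions:
            if pr_m[(s, a_m)] not in observer_dists:
                return False
    return True
```

Note that the check quantifies only over actions in the mentor's support, which is exactly why the condition is weaker than demanding a full homomorphism.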
In other words, the observer is able to duplicate (in the sense of inducing the same distribution over successor states) the actual behavior of the mentor; or, equivalently, the agents' local state spaces are isomorphic with respect to the actions actually taken by the mentor at the subset of states where those actions might be taken. This is much weaker than requiring a full homomorphism from S_m to S_o. Of course, the existence of a full homomorphism is sufficient from our perspective; but our results do not require this.

4.2 The Implicit Imitation Algorithm

The implicit imitation algorithm can be understood in terms of its component processes. First, we extract action models from a mentor. Then we integrate this information into the observer's own value estimates by augmenting the usual Bellman backup with mentor action models. A confidence testing procedure ensures that we only use this augmented model when the observer's model of the mentor is more reliable than the observer's model of its own behavior. We also extract occupancy information from the observations of mentor trajectories in order to focus the observer's computational effort (to some extent) on specific parts of the state space. Finally, we augment our action selection process to choose actions that will explore high-value regions revealed by the mentor. The remainder of this section expands upon each of these processes and how they fit together.

4.2.1 Model Extraction

The information available to the observer in its quest to learn how to act optimally can be divided into two categories. First, with each action it takes, it receives an experience tuple ⟨s, a, r, t⟩; in fact, we will often ignore the sampled reward r, since we assume the reward function R is known in advance.
As in standard model-based learning, each such experience can be used to update its own transition model Pr_o(s, a, ·). Second, with each mentor transition, the observer obtains an experience tuple ⟨s, t⟩. Note again that the observer does not have direct access to the action taken by the mentor, only the induced state transition.

Assume the mentor is implementing a deterministic, stationary policy π_m, with π_m(s) denoting the mentor's choice of action at state s. This policy induces a Markov chain Pr_m(·, ·) over S, with Pr_m(s, t) = Pr(s, π_m(s), t) denoting the probability of a transition from s to t.⁶ Since the learner observes the mentor's state transitions, it can construct an estimate P̂r_m of this chain: P̂r_m(s, t) is simply estimated by the relative observed frequency of mentor transitions s → t (w.r.t. all transitions taken from s). If the observer has some prior over the possible mentor transitions, standard Bayesian update techniques can be used instead. We use the term model extraction for this process of estimating the mentor's Markov chain.

6. This is somewhat imprecise, since the initial distribution of the Markov chain is unknown. For our purposes, it is only the dynamics that are relevant to the observer, so only the transition probabilities are used.

4.2.2 Augmented Bellman Backups

Suppose the observer has constructed an estimate P̂r_m of the mentor's Markov chain. By the homogeneity assumption, the action π_m(s) can be replicated exactly by the observer at state s. Thus, the policy π_m can, in principle, be duplicated by the observer (were it able to identify the actual actions used). As such, we can define the value of the mentor's policy from the observer's perspective:

   V_m(s) = R_o(s) + γ Σ_{t∈S} Pr_m(s, t) V_m(t)    (6)

Notice that Equation 6 uses the mentor's dynamics but the observer's reward function. Letting V denote the optimal (observer's) value function, clearly V(s) ≥ V_m(s), so V_m provides a lower bound on the observer's value function.
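Model extraction and the evaluation of Equation 6 are simple to sketch. The representation below (a count table plus successive approximation) is our own illustrative choice, not code from the paper:

```python
from collections import defaultdict

# Model extraction: estimate the mentor's chain Pr_m(s, t) by relative
# observed frequency, then evaluate Equation 6,
#   V_m(s) = R_o(s) + gamma * sum_t Pr_m(s, t) V_m(t),
# by successive approximation.

class MentorChain:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, s, t):
        """Record one observed mentor transition s -> t."""
        self.counts[s][t] += 1

    def pr(self, s, t):
        """Estimated Pr_m(s, t): relative frequency of s -> t among all
        observed transitions out of s (0 if the mentor was never seen at s)."""
        n = sum(self.counts[s].values())
        return self.counts[s][t] / n if n else 0.0

def mentor_value(chain, states, R_o, gamma=0.9, iters=200):
    """Evaluate Equation 6 by iterating the fixed-point equation."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: R_o(s) + gamma * sum(chain.pr(s, t) * V[t] for t in states)
             for s in states}
    return V
```

With a prior over mentor transitions, `observe` would instead update Dirichlet pseudo-counts, as the text notes.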
More importantly, the terms making up V_m(s) can be integrated directly into the Bellman equation for the observer's MDP, forming the augmented Bellman equation:

   V(s) = R_o(s) + γ max { max_{a∈A_o} Σ_{t∈S} Pr_o(s, a, t) V(t) ,  Σ_{t∈S} Pr_m(s, t) V(t) }    (7)

This is the usual Bellman equation with an extra term added, namely, the second summation, Σ_{t∈S} Pr_m(s, t) V(t), denoting the expected value of duplicating the mentor's action a_m. Since this (unknown) action is identical to one of the observer's actions, the term is redundant and the augmented value equation is valid.

Of course, the observer using the augmented backup operation must rely on estimates of these quantities. If the observer's exploration policy ensures that each state is visited infinitely often, the estimates of the Pr_o terms will converge to their true values. If the mentor's policy is ergodic over the state space S, then P̂r_m will also converge to its true value. If the mentor's policy is restricted to a subset of states S′ ⊆ S (those forming the basis of its Markov chain), then the estimates of Pr_m for the subset will converge correctly with respect to S′ if the chain is ergodic. The states in S − S′ will remain unvisited and the estimates will remain uninformed by data. Since the mentor's policy is not under the control of the observer, there is no way for the observer to influence the distribution of samples attained for P̂r_m. An observer must therefore be able to reason about the accuracy of the estimated model P̂r_m for any s and restrict the application of the augmented equation to those states where P̂r_m is known with sufficient accuracy.
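Equation 7 translates directly into code. In the sketch below (our own representation: Pr_o and Pr_m are dictionaries of successor distributions), a state with no mentor data simply omits the mentor term, so the backup reduces to the standard Bellman backup:

```python
def augmented_backup(s, V, R_o, pr_o, pr_m, actions, gamma=0.95):
    """Augmented Bellman backup (Equation 7).

    pr_o maps (state, action) to a successor distribution; pr_m maps a
    state to the estimated successor distribution of the mentor's chain.
    """
    # Expected next-state values for each of the observer's own actions.
    candidates = [sum(p * V[t] for t, p in pr_o[(s, a)].items())
                  for a in actions]
    if s in pr_m:  # mentor term: expected value of duplicating pi_m(s)
        candidates.append(sum(p * V[t] for t, p in pr_m[s].items()))
    return R_o(s) + gamma * max(candidates)
```

Since the mentor term corresponds to one of the observer's own actions under the homogeneity assumption, appending it to the candidate list never invalidates the maximization; it can only make the backup better informed.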
While P̂r_m cannot be used indiscriminately, we argue that it can be highly informative early in the learning process. Assuming that the mentor is pursuing an optimal policy (or at least is behaving in some way so that it tends to visit certain states more frequently), there will be many states for which the observer has much more accurate estimates of Pr_m(s, t) than it does of Pr_o(s, a, t) for any specific a. Since the observer is learning, it must explore both its state space (causing less frequent visits to s) and its action space (thus spreading its experience at s over all actions a). This generally ensures that the sample size upon which P̂r_m is based is greater than that for P̂r_o for any action that forms part of the mentor's policy. Apart from being more accurate, the use of Pr_m(s, t) can often give more informed value estimates at state s, since prior action models are often "flat" or uniform, and only become distinguishable at a given state when the observer has sufficient experience at state s.

We note that the reasoning above holds even if the mentor is implementing a (stationary) stochastic policy (since the expected value of a stochastic policy for a fully observable MDP cannot be greater than that of an optimal deterministic policy). While the "direction" offered by a mentor implementing a deterministic policy tends to be more focused, empirically we have found that mentors offer broader guidance in moderately stochastic environments or when they implement stochastic policies, since they tend to visit more of the state space. We note that the extension to multiple mentors is straightforward: each mentor model can be incorporated into the augmented Bellman equation without difficulty.
4.2.3 Model Confidence

When the mentor's Markov chain is not ergodic, or if the mixing rate⁷ is sufficiently low, the mentor may visit a certain state s relatively infrequently. The estimated mentor transition model corresponding to a state that is rarely (or never) visited by the mentor may provide a very misleading estimate (based on the small sample, or on the prior for the mentor's chain) of the value of the mentor's (unknown) action at s; and since the mentor's policy is not under the control of the observer, this misleading value may persist for an extended period. Since the augmented Bellman equation does not consider the relative reliability of the mentor and observer models, the value of such a state s may be overestimated;⁸ that is, the observer can be tricked into overvaluing the mentor's (unknown) action, and consequently overestimating the value of state s.

To overcome this, we incorporate an estimate of model confidence into our augmented backups. For both the mentor's Markov chain and the observer's action transitions, we assume a Dirichlet prior over the parameters of each of these multinomial distributions (DeGroot, 1975). These reflect the observer's initial uncertainty about the possible transition probabilities. From sample counts of mentor and observer transitions, we update these distributions. With this information, we could attempt to perform optimal Bayesian estimation of the value function; but when the sample counts are small (and normal approximations are not appropriate), there is no simple, closed-form expression for the resultant distributions over values. We could attempt to employ sampling methods, but in the interest of simplicity we have employed an approximate method for combining information sources inspired by Kaelbling's (1993) interval estimation method.

7. The mixing rate refers to how quickly a Markov chain approaches its stationary distribution.
8. Note that underestimates based on such considerations are not problematic, since the augmented Bellman equation then reduces to the usual Bellman equation.

Figure 1: Lower bounds on action values incorporate an uncertainty penalty.

Let V denote the current estimated augmented value function, and let P̂r_o and P̂r_m denote the estimated observer and mentor transition models. We let σ²_o and σ²_m denote the variance in these model parameters. An augmented Bellman backup with respect to V using confidence testing proceeds as follows. We first compute the observer's optimal action a_o* based on the estimated augmented values for each of the observer's actions. Let Q(a_o*, s) = V_o(s) denote its value. For the best action, we use the model uncertainty encoded by the Dirichlet distribution to construct a lower bound V_o⁻(s) on the value of the state to the observer using the model (at state s) derived from its own behavior (i.e., ignoring its observations of the mentor). We employ transition counts n_o(s, a, t) and n_m(s, t) to denote the number of times the observer has made the transition from state s to state t when action a was performed, and the number of times the mentor was observed making the transition from state s to t, respectively. From these counts, we estimate the uncertainty in the model using the variance of a Dirichlet distribution. Let α = n_o(s, a, t) and β = Σ_{t′ ∈ S−{t}} n_o(s, a, t′).
Then the model variance is:

    \sigma^2_{model}(s,a,t) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}    (8)

The variance in the Q-value of an action due to the uncertainty in the local model can be found by simple application of the rule for combining linear combinations of independent variances, Var(cX + dY) = c²Var(X) + d²Var(Y), to the expression for the Bellman backup, Var(R(s) + γ Σ_t Pr(t|s,a) V(t)). The result is:

    \sigma^2(s,a) = \gamma^2 \sum_t \sigma^2_{model}(s,a,t)\, V(t)^2    (9)

Using Chebyshev's inequality, [9] we can obtain a confidence level even though the Dirichlet distributions for small sample counts are highly non-normal. The lower bound is then V⁻_o(s) = V_o(s) − c·σ_o(s, a*_o) for some suitable constant c. One may interpret this as penalizing the value of a state by subtracting its "uncertainty" from it (see Figure 1). [10]

9. Chebyshev's inequality states that 1 − 1/k² of the probability mass of an arbitrary distribution lies within k standard deviations of the mean.

FUNCTION augmentedBackup(V, Pr_o, σ²_omodel, Pr_m, σ²_mmodel, s, c)
    a* = argmax_{a ∈ A_o} Σ_{t ∈ S} Pr_o(s, a, t) V(t)
    V_o(s) = R_o(s) + γ Σ_{t ∈ S} Pr_o(s, a*, t) V(t)
    V_m(s) = R_o(s) + γ Σ_{t ∈ S} Pr_m(s, t) V(t)
    σ²_o(s, a*) = γ² Σ_{t ∈ S} σ²_omodel(s, a*, t) V(t)²
    σ²_m(s) = γ² Σ_{t ∈ S} σ²_mmodel(s, t) V(t)²
    V⁻_o(s) = V_o(s) − c·σ_o(s, a*)
    V⁻_m(s) = V_m(s) − c·σ_m(s)
    IF V⁻_o(s) > V⁻_m(s) THEN V(s) = V_o(s) ELSE V(s) = V_m(s)

Table 1: Implicit Backup

The value V_m(s) of the mentor's action π_m(s) is estimated similarly, and an analogous lower bound V⁻_m(s) on it is also constructed. If V⁻_o(s) > V⁻_m(s), then we say that V_o(s) supersedes V_m(s) and we write V_o(s) ≻ V_m(s).
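As a concrete illustration of Equations 8 and 9 and the backup of Table 1, the procedure might be sketched as follows. The dictionary-based model representation and the function names are our own assumptions; the paper leaves the data structures unspecified.

```python
import math

def dirichlet_variance(alpha, beta):
    """Variance of a single transition probability under a Dirichlet
    posterior (Equation 8): alpha counts transitions to the successor of
    interest, beta counts all other observed transitions from (s, a)."""
    n = alpha + beta
    return (alpha * beta) / (n ** 2 * (n + 1))

def augmented_backup(V, R_o, Pr_o, var_o, Pr_m, var_m, s, c, gamma=0.9):
    """Confidence-tested augmented backup in the style of Table 1.
    Pr_o[s][a] and Pr_m[s] map successor states to estimated probabilities;
    var_o[s][a] and var_m[s] map successors to local model variances
    (Equation 8). Updates V[s] in place and returns the backed-up value."""
    # Observer's greedy action a* under the current augmented values.
    a_star = max(Pr_o[s],
                 key=lambda a: sum(p * V[t] for t, p in Pr_o[s][a].items()))
    V_o = R_o[s] + gamma * sum(p * V[t] for t, p in Pr_o[s][a_star].items())
    V_m = R_o[s] + gamma * sum(p * V[t] for t, p in Pr_m[s].items())
    # Q-value standard deviations induced by model uncertainty (Equation 9).
    sig_o = math.sqrt(gamma ** 2 * sum(var_o[s][a_star][t] * V[t] ** 2
                                       for t in Pr_o[s][a_star]))
    sig_m = math.sqrt(gamma ** 2 * sum(var_m[s][t] * V[t] ** 2
                                       for t in Pr_m[s]))
    # Compare Chebyshev-style lower bounds; back up the winning mean value.
    V[s] = V_o if V_o - c * sig_o > V_m - c * sig_m else V_m
    return V[s]
```

Note that, as in Table 1, the bounds are used only to choose between the two models; the value actually backed up is the corresponding mean, not the lower bound itself.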
When V_o(s) ≻ V_m(s), either the mentor-inspired model has, in fact, a lower expected value (within a specified degree of confidence) because the mentor uses a nonoptimal action (from the observer's perspective), or the mentor-inspired model has lower confidence. In either case, we reject the information provided by the mentor and use a standard Bellman backup with the action model derived solely from the observer's experience (thus suppressing the augmented backup); the backed-up value is V_o(s) in this case. An algorithm for computing an augmented backup using this confidence test is shown in Table 1. The algorithm's parameters include the current estimate of the augmented value function V, the current estimated model Pr_o and its associated local variance σ²_omodel, and the model of the mentor's Markov chain Pr_m and its associated variance σ²_mmodel. It calculates lower bounds and returns the mean value, V_o or V_m, with the greater lower bound. The parameter c determines the width of the confidence interval used in the mentor rejection test.

4.2.4 Focusing

Augmented Bellman backups improve the accuracy of the observer's model. A second way in which an observer can exploit its observations of the mentor is to focus attention on the states visited by the mentor. In a model-based approach, the specific focusing mechanism we adopt is to require the observer to perform a (possibly augmented) Bellman backup at state s whenever the mentor makes a transition from s. This has three effects.

10. Ideally, we would like to take into account not only the uncertainty of the model at the current state, but also the uncertainty at future states (Meuleau & Bourgine, 1999).
First, if the mentor tends to visit interesting regions of the space (e.g., if it shares a certain reward structure with the observer), then the significant values backed up from mentor-visited states will bias the observer's exploration towards these regions. Second, computational effort will be concentrated on the parts of the state space where the estimated model Pr_m(s, t) changes, and hence where the estimated value of one of the observer's actions may change. Third, computation is focused where the model is likely to be more accurate (as discussed above).

4.2.5 Action Selection

The integration of exploration techniques into the action selection policy is important for any reinforcement learning algorithm to guarantee convergence. In implicit imitation, it plays a second, crucial role in helping the agent exploit the information extracted from the mentor. Our improved convergence results rely on the greedy quality of the exploration strategy to bias an observer towards the higher-valued trajectories revealed by the mentor.

For expediency, we have adopted the ε-greedy action selection method, using an exploration rate ε that decays over time. We could easily have employed other semi-greedy methods such as Boltzmann exploration. In the presence of a mentor, greedy action selection becomes more complex. The observer examines its own actions at state s in the usual way and obtains a best action a*_o with corresponding value V_o(s). A value V_m(s) is also calculated for the mentor's action. If V_o(s) ≻ V_m(s), then the observer's own action model is used and the greedy action is defined exactly as if the mentor were not present. If, however, V_m(s) ≻ V_o(s), then we would like to define the greedy action to be the action dictated by the mentor's policy at state s.
Unfortunately, the observer does not know which action this is, so we define the greedy action to be the observer's action "closest" to the mentor's action according to the observer's current model estimates at s. More precisely, the action most similar to the mentor's at state s, denoted κ_m(s), is the one whose outcome distribution has minimum Kullback-Leibler divergence from the mentor's action outcome distribution: [11]

    \kappa_m(s) = \arg\min_a \Big\{ -\sum_t \Pr{}_o(s,a,t) \log \Pr{}_m(s,t) \Big\}    (10)

The observer's own experience-based action models will be poor early in training, so there is a chance that the closest-action computation will select the wrong action. We rely on the exploration policy to ensure that each of the observer's actions is sampled appropriately in the long run. [11]

11. If the mentor is executing a stochastic policy, the test based on KL divergence can mislead the learner.

In our present work we have assumed that the state space is large and that the agent will therefore not be able to completely update the Q-function over the whole space. (The intractability of updating the entire state space is one of the motivations for using imitation techniques.) In the absence of information about the states' true values, we would like to bias the values of the states along the mentor's trajectories so that they look worthwhile to explore. We do this by assuming bounds on the reward function and setting the initial Q-values over the entire space below this bound. In our simple examples, rewards are strictly positive, so we set the bound to zero. If mentor trajectories intersect any states valued by the observing agent, backups will cause the states on these trajectories to have a higher value than the surrounding states.
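The closest-action computation of Equation 10 reduces to a cross-entropy minimization over the observer's estimated outcome distributions. A minimal sketch, assuming dictionary-based distributions (our own representation) and a small constant to guard against log of zero:

```python
import math

def closest_action(Pr_o_s, Pr_m_s, eps=1e-12):
    """Equation 10: pick the observer action at state s whose outcome
    distribution minimizes -sum_t Pr_o(s,a,t) log Pr_m(s,t) against the
    mentor's observed outcome distribution.
    Pr_o_s maps each action to {successor: probability};
    Pr_m_s maps successors to probabilities."""
    def cross_entropy(a):
        # eps prevents log(0) for successors the mentor was never seen to reach
        return -sum(p * math.log(Pr_m_s.get(t, 0.0) + eps)
                    for t, p in Pr_o_s[a].items())
    return min(Pr_o_s, key=cross_entropy)
```

If the observer's model of some action is still uniform (untrained), its cross-entropy score is uninformative, which is why the exploration policy must continue to sample all actions, as noted above.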
These elevated values along mentor trajectories cause the greedy step in the exploration method to prefer actions that lead to mentor-visited states over actions for which the agent has no information.

4.2.6 Model Extraction in Specific Reinforcement Learning Algorithms

Model extraction, augmented backups, the focusing mechanism, and our extended notion of greedy action selection can be integrated into model-based reinforcement learning algorithms with relative ease. Generically, our implicit imitation algorithm requires that: (a) the observer maintain an estimate Pr_m(s, t) of the Markov chain induced by the mentor's policy, updated with every observed transition; and (b) all backups performed to estimate its value function use the augmented backup (Equation 7) with confidence testing. Of course, these backups are implemented using the estimated models Pr_o(s, a, t) and Pr_m(s, t). In addition, the focusing mechanism requires that an augmented backup be performed at any state visited by the mentor.

We demonstrate the generality of these mechanisms by combining them with the well-known and efficient prioritized sweeping algorithm (Moore & Atkeson, 1993). As outlined in Section 2.2, prioritized sweeping works by maintaining an estimated transition model Pr and reward model R. Whenever an experience tuple ⟨s, a, r, t⟩ is sampled, the estimated model at state s can change; a Bellman backup is performed at s to incorporate the revised model, and some (usually fixed) number of additional backups are performed at selected states. States are selected using a priority that estimates the potential change in their values based on the changes precipitated by earlier backups. Effectively, computational resources (backups) are focused on those states that can most "benefit" from them.
Incorporating our ideas into prioritized sweeping simply requires the following changes:

• With each transition ⟨s, a, t⟩ the observer takes, the estimated model Pr_o(s, a, t) is updated and an augmented backup is performed at state s. Augmented backups are then performed at a fixed number of states using the usual priority queue implementation.

• With each observed mentor transition ⟨s, t⟩, the estimated model Pr_m(s, t) is updated and an augmented backup is performed at s. Augmented backups are then performed at a fixed number of states using the usual priority queue implementation.

Keeping samples of mentor behavior implements model extraction. Augmented backups integrate this information into the observer's value function, and performing augmented backups at observed transitions (in addition to experienced transitions) incorporates our focusing mechanism. The observer is not forced to "follow" or otherwise mimic the actions of the mentor directly, but it does back up value information along the mentor's trajectory as if it had. Ultimately, the observer must move to those states to discover which actions are to be used; in the meantime, important value information is being propagated that can guide its exploration.

Implicit imitation does not alter the long-run theoretical convergence properties of the underlying reinforcement learning algorithm. The implicit imitation framework is orthogonal to ε-greedy exploration, as it alters only the definition of the "greedy" action, not when the greedy action is taken. Given a theoretically appropriate decay factor, the ε-greedy strategy will thus ensure that the distributions for the action models at each state are sampled infinitely often in the limit and converge to their true values.
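The two update rules listed above can be sketched as a thin layer over a generic prioritized-sweeping learner. Here `backup` and `sweep` are placeholders for the confidence-tested augmented backup of Table 1 and the usual fixed-length priority-queue sweep; the class and method names are our own.

```python
from collections import defaultdict

class ImitatingSweeper:
    """Skeleton of prioritized sweeping with implicit imitation: both the
    observer's own transitions and observed mentor transitions update count
    models and trigger a backup at s plus a fixed number of swept backups."""

    def __init__(self, n_sweep=5):
        self.n_o = defaultdict(int)  # observer transition counts n_o(s, a, t)
        self.n_m = defaultdict(int)  # mentor transition counts n_m(s, t)
        self.n_sweep = n_sweep

    def Pr_m(self, s):
        """Estimated mentor Markov chain at s from observed counts."""
        total = sum(c for (s2, _), c in self.n_m.items() if s2 == s)
        return {t: c / total for (s2, t), c in self.n_m.items() if s2 == s}

    def observe_self(self, s, a, t):
        # Own transition <s, a, t>: update Pr_o estimate, back up s, sweep.
        self.n_o[(s, a, t)] += 1
        self.backup(s)
        self.sweep(self.n_sweep)

    def observe_mentor(self, s, t):
        # Mentor transition <s, t>: update Pr_m estimate, back up s
        # (the focusing mechanism), then sweep as usual.
        self.n_m[(s, t)] += 1
        self.backup(s)
        self.sweep(self.n_sweep)

    def backup(self, s):
        pass  # placeholder: augmented backup with confidence testing

    def sweep(self, n):
        pass  # placeholder: n highest-priority additional backups
```

The point of the sketch is the symmetry: mentor observations flow through exactly the same update-then-sweep path as the agent's own experience.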
Since the extracted model from the mentor corresponds to one of the observer's own actions, its effect on the value function calculations is no different from the effect of the observer's own sampled action models. The confidence mechanism ensures that the model with more samples will eventually come to dominate if it is, in fact, better. We can therefore be sure that the convergence properties of reinforcement learning with implicit imitation are identical to those of the underlying reinforcement learning algorithm. The benefit of implicit imitation lies in the way in which the models extracted from the mentor allow the observer to calculate a lower bound on the value function and use this lower bound to choose greedy actions that move the agent towards higher-valued regions of the state space. The result is quicker convergence to optimal policies and better short-term practical performance with respect to accumulated discounted reward while learning.

4.2.7 Extensions

The implicit imitation model can easily be extended to extract model information from multiple mentors, mixing and matching pieces extracted from each mentor to achieve good results. It does this by searching, at each state, the set of mentors it knows about to find the mentor with the highest value estimate. The value estimate of the "best" mentor is then compared, using the confidence test described above, with the observer's own value estimate. The formal expression of the algorithm is given by the multi-augmented Bellman equation:

    V(s) = R_o(s) + \gamma \max\Big\{ \max_{a \in A_o} \sum_{t \in S} \Pr{}_o(s,a,t) V(t),\; \max_{m \in M} \sum_{t \in S} \Pr{}_m(s,t) V(t) \Big\}    (11)

where M is the set of candidate mentors. Ideally, confidence estimates should be taken into account when comparing mentor estimates with each other, as we may get a mentor with a high mean value estimate but large variance.
If the observer has any experience with the state at all, this mentor will likely be rejected as having poorer-quality information than the observer already has from its own experience. The observer might have been better off picking a mentor with a lower mean but more confident estimate that would have succeeded in the test against the observer's own model. In the interests of simplicity, however, we investigate multiple-mentor combination without confidence testing.

Up to now, we have assumed no action costs (i.e., the agent's rewards depend only on the state and not on the action selected in the state); however, we can use more general reward functions (e.g., where reward has the form R(s, a)). The difficulty lies in backing up action costs when the mentor's chosen action is unknown. In Section 4.2.5 we defined the closest-action function κ, which can be used to choose the appropriate reward. The augmented Bellman equation with generalized rewards takes the following form:

    V(s) = \max\Big\{ \max_{a \in A_o} \Big[ R_o(s,a) + \gamma \sum_{t \in S} \Pr{}_o(s,a,t) V(t) \Big],\; R_o(s,\kappa(s)) + \gamma \sum_{t \in S} \Pr{}_m(s,t) V(t) \Big\}

We note that Bayesian methods could be used to estimate action costs in the mentor's chain as well. In any case, the generalized-reward augmented equation can readily be amended to use confidence estimates in a similar fashion to the transition model.

4.3 Empirical Demonstrations

The following empirical tests incorporate model extraction and our focusing mechanism into prioritized sweeping. The results illustrate the types of problems and scenarios in which implicit imitation can provide advantages to a reinforcement learning agent. In each of the experiments, an expert mentor is introduced to serve as a model for the observer.
In each case, the mentor follows an ε-greedy policy with a very small ε (on the order of 0.01). This tends to cause the mentor's trajectories to lie within a "cluster" surrounding optimal trajectories (and to reflect good, if not optimal, policies). Even with a small amount of exploration and some environment stochasticity, mentors generally do not "cover" the entire state space, so confidence testing is important.

In all of these experiments, prioritized sweeping is used with a fixed number of backups per observed or experienced sample. [12] ε-greedy exploration is used with decaying ε. Observer agents are given uniform Dirichlet priors, and Q-values are initialized to zero. Observer agents are compared to control agents that do not benefit from a mentor's experience but are otherwise identical (implementing prioritized sweeping with similar parameters and exploration policies). The tests are all performed on stochastic grid-world domains, since these make it clear to what extent the observer's and mentor's optimal policies overlap (or fail to).

In Figure 2, a simple 10 × 10 example shows a start and end state on a grid. A typical optimal mentor trajectory is illustrated by the solid line between the start and end states; the dotted line shows that a typical mentor-influenced trajectory will be quite similar to the observed mentor trajectory. We assume eight-connectivity between cells, so that any state in the grid has nine neighbors (including itself), but agents have only four possible actions. In most experiments, the four actions move the agent in the compass directions (North, South, East and West), although the agent does not initially know which action does which. We focus primarily on whether imitation improves performance during learning, since the learner will converge to an optimal policy whether it uses imitation or not.
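The style of noisy grid-world dynamics used in these experiments might be coded as below. This is a minimal sketch under our own assumptions about the simulator interface: the intended move succeeds with probability 1 − noise, one of the other three compass directions is taken otherwise, and moves off the grid leave the state unchanged.

```python
import random

def grid_step(state, action, size=10, noise=0.1, rng=random):
    """One transition in a size x size stochastic grid world. States are
    (row, col) with (0, 0) the upper-left corner; actions are the compass
    directions 'N', 'S', 'E', 'W'."""
    moves = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
    if rng.random() < noise:
        # Noise: slip into one of the three unintended directions uniformly.
        action = rng.choice([a for a in moves if a != action])
    r, c = state
    dr, dc = moves[action]
    nr, nc = r + dr, c + dc
    # Moves that would leave the grid have no effect.
    return (nr, nc) if 0 <= nr < size and 0 <= nc < size else state
```

Setting noise to 0.4 recovers the "Stoch" condition of Experiment 2, and size to 13 the "Scale" condition.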
4.3.1 Experiment 1: The Imitation Effect

In our first experiment, we compare the performance of an observer using model extraction and an expert mentor with the performance of a control agent using independent reinforcement learning. Given the uniform nature of this grid world and the lack of intermediate rewards, confidence testing is not required. Both agents attempt to learn a policy that maximizes discounted return in a 10 × 10 grid world. They start in the upper-left corner and seek a goal with value 1.0 in the lower-right corner. Upon reaching the goal, the agents are restarted in the upper-left corner.

12. Generally, the number of backups was set to be roughly equal to the length of the optimal "noise-free" path.

Figure 2: A simple grid world with start state S and goal state X.

Generally the mentor follows a similar if not identical trajectory on each run, as the mentors were trained using a greedy strategy that leaves one path slightly more highly valued than the rest. Action dynamics are noisy, with the "intended" direction being realized 90% of the time and one of the other directions taken otherwise (uniformly). The discount factor is 0.9.

In Figure 3, we plot the cumulative number of goals obtained over the previous 1000 time steps for the observer ("Obs") and control ("Ctrl") agents (results are averaged over ten runs). The observer is able to quickly incorporate a policy learned from the mentor into its value estimates, resulting in a steeper learning curve. In contrast, the control agent slowly explores the space to build a model first. The "Delta" curve shows the difference in performance between the agents.

Figure 3: Basic observer and control agent comparisons (average reward per 1000 steps vs. simulation steps).
Both agents converge to the same optimal value function.

4.3.2 Experiment 2: Scaling and Noise

The next experiment illustrates the sensitivity of imitation to the size of the state space and the action noise level. Again, the observer uses model extraction but not confidence testing. In Figure 4, we plot the Delta curves (i.e., difference in performance between observer and control agents) for the "Basic" scenario just described, the "Scale" scenario in which the state-space size is increased 69 percent (to a 13 × 13 grid), and the "Stoch" scenario in which the noise level is increased to 40 percent (results are averaged over ten runs).

Figure 4: Delta curves showing the influence of domain size and noise.

The total gain, represented by the area under the curves for the observer and the non-imitating prioritized-sweeping agent, increases with the state-space size. This reflects Whitehead's (1991a) observation that for grid worlds, exploration requirements can increase quickly with state-space size, but the optimal path length increases only linearly. Here we see that the guidance of the mentor can help more in larger state spaces. Increasing the noise level reduces the observer's ability to act upon the information received from the mentor and therefore erodes its advantage over the control agent. We note, however, that the benefit of imitation degrades gracefully with increased noise and is present even at this relatively extreme noise level.

4.3.3 Experiment 3: Confidence Testing

Sometimes the observer's prior beliefs about the transition probabilities of the mentor can mislead the observer and cause it to generate inappropriate values.
The confidence mechanism proposed in the previous section can prevent the observer from being fooled by misleading priors over the mentor's transition probabilities. To demonstrate the role of the confidence mechanism in implicit imitation, we designed an experiment based on the scenario illustrated in Figure 5. Again, the agent's task is to navigate from the top-left corner to the bottom-right corner of a 10 × 10 grid in order to attain a reward of +1.

Figure 5: An environment with misleading priors (islands of +5 reward enclosed by obstacles).

We have created a pathological scenario in which islands of high reward (+5) are enclosed by obstacles. Since the observer's priors reflect eight-connectivity and are uniform, the high-valued cells in the middle of each island are believed to be reachable from the states diagonally adjacent with some small prior probability. In reality, however, the agent's action set precludes this, and the agent will therefore never be able to realize this value. The four islands in this scenario thus create a fairly large region in the center of the space with a high estimated value, which could potentially trap an observer that persisted in its prior beliefs. Notice that a standard reinforcement learner will "quickly" learn that none of its actions takes it to the rewarding islands; in contrast, an implicit imitator using augmented backups could be fooled by its prior mentor model. If the mentor does not visit the states neighboring the islands, the observer will not have any evidence upon which to change its prior belief that the mentor's actions are equally likely to move it in any of the eight possible directions. The imitator may falsely conclude, on the basis of the mentor action model, that an action exists which would allow it to access the islands of value.
The observer therefore needs a confidence mechanism to detect when the mentor model is less reliable than its own model. To test the confidence mechanism, we have the mentor follow a path around the outside of the obstacles, so that its path cannot lead the observer out of the trap (i.e., it provides no evidence to the observer that the diagonal moves into the islands are infeasible). The combination of a high initial exploration rate and the ability of prioritized sweeping to spread value across large distances then virtually guarantees that the observer will be led to the trap.

Given this scenario, we ran two observer agents and a control. The first observer used a confidence interval with width given by 5σ, which, according to the Chebyshev rule, should cover approximately 96 percent of an arbitrary distribution. The second observer was given a 0σ interval, which effectively disables confidence testing. The observer with no confidence testing consistently became stuck: examination of the value function revealed consistent peaks within the trap region, and inspection of the agent's state trajectories showed that it was stuck in the trap. The observer with confidence testing consistently escaped the trap. Observation of its value function over time shows that the trap formed, but faded away as the observer gained enough experience with its own actions to overcome the erroneous priors over the mentor's actions. In Figure 6, the performance of the observer with confidence testing is shown alongside the performance of the control agent (results are averaged over 10 runs).

Figure 6: Misleading priors may degrade performance.
We see that the observer's performance is only slightly degraded from that of the unaugmented control agent, even in this pathological case.

4.3.4 Experiment 4: Qualitative Difficulty

The next experiment demonstrates how the potential gains of imitation can increase with the (qualitative) difficulty of the problem. The observer employs both model extraction and confidence testing, though confidence testing does not play a significant role here. [13] In the "maze" scenario, we introduce obstacles in order to increase the difficulty of the learning problem. The maze is set on a 25 × 25 grid (Figure 7) with 286 obstacles complicating the agent's journey from the top-left to the bottom-right corner. The optimal solution takes the form of a snaking 133-step path, with distracting paths (up to length 22) branching off from the solution path and necessitating frequent backtracking. The discount factor is 0.98. With 10 percent noise, the optimal goal-attainment rate is about six goals per 1000 steps.

13. The mentor does not provide evidence about some path choices in this problem, but there are no intermediate rewards that would cause the observer to make use of the misleading mentor priors at these states.

Figure 7: A complex maze.

From the graph in Figure 8 (with results averaged over ten runs), we see that the control agent takes on the order of 200,000 steps to build a decent value function that reliably leads to the goal. At this point, it is only achieving four goals per 1000 steps on average, as its exploration rate is still reasonably high (unfortunately, decreasing exploration more quickly leads to slower value-function formation). The imitation agent is able to take advantage of the mentor's expertise to build a reliable value function in about 20,000 steps. Since the control agent has been unable to reach the goal at all in the first 20,000 steps, the Delta between the control and the imitator is simply equal to the imitator's performance. The imitator quickly achieves the optimal goal-attainment rate of six goals per 1000 steps, as its exploration rate decays much more quickly.

Figure 8: Imitation in a complex space.

4.3.5 Experiment 5: Improving Suboptimal Policies by Imitation

The augmented backup rule does not require that the reward structures of the mentor and observer be identical. There are many useful scenarios where rewards are dissimilar but the induced value functions and policies share some structure. In this experiment, we demonstrate one interesting scenario in which it is relatively easy to find a suboptimal solution, but difficult to find the optimal solution. Once the observer finds this suboptimal path, however, it is able to exploit its observations of the mentor to see that there is a shortcut that significantly shortens the path to the goal.

Figure 9: A maze with a perilous shortcut.

The structure of the scenario is shown in Figure 9. The suboptimal solution lies on the path from location 1 around the "scenic route" to location 2 and on to the goal at location 3. The mentor takes the vertical path from location 4 to location 5 through the shortcut. [14] To discourage the use of the shortcut by novice agents, it is lined with cells (marked "*") from which the agent immediately jumps back to the start state. It is therefore difficult for a novice agent executing random exploratory moves to make it all the way to the end of the shortcut and obtain the value that would reinforce its future use. Both the observer and the control therefore generally find the scenic route first. In Figure 10, the performance (measured using goals reached over the previous 1000 steps) of the control and observer are compared (averaged over ten runs), indicating the value of these observations. We see that the observer and control agent both find the longer scenic route, though the control agent takes longer to find it. The observer goes on to find the shortcut and increases its return to almost double the goal rate. This experiment shows that mentors can improve observer policies even when the observer's goals are not on the mentor's path.

14. A mentor proceeding from 5 to 4 would not provide guidance without prior knowledge that actions are reversible.

Figure 10: Transfer with non-identical rewards.

4.3.6 Experiment 6: Multiple Mentors

The final experiment illustrates how model extraction can readily be extended so that the observer can extract models from multiple mentors and exploit the most valuable parts of each. Again, the observer employs model extraction and confidence testing. In Figure 11, the learner must move from start location 1 to goal location 4. Two expert agents with different start and goal states serve as potential mentors. One mentor repeatedly moves from location 3 to location 5 along the dotted line, while a second mentor departs from location 2 and ends at location 4 along the dashed line. In this experiment, the observer must combine the information from the examples provided by the two mentors with independent exploration of its own in order to solve the problem.

Figure 11: Multiple mentors scenario.

In Figure 12, we see that the observer successfully pulls together these information sources in order to learn much more quickly than the control agent (results are averaged over 10 runs). The use of a value-based technique allows the observer to choose which mentor's influence to use on a state-by-state basis in order to get the best solution to the problem.

Figure 12: Learning from multiple mentors.

5. Implicit Imitation in Heterogeneous Settings

When the homogeneity assumption is violated, the implicit imitation framework described above can cause the learner's convergence rate to slow dramatically and, in some cases, cause the learner to become stuck in a small neighborhood of the state space. In particular, if the learner is unable to make the same state transition (or a transition with the same probability) as the mentor at a given state, it may drastically overestimate the value of that state. The inflated value estimate causes the learner to return repeatedly to this state even though its exploration will never produce a feasible action that attains the inflated estimate. There is no mechanism for removing the influence of the mentor's Markov chain on value estimates: the observer can be extremely (and correctly) confident in its observations of the mentor's model.
The problem lies in the fact that the augmented Bellman backup is justified by the assumption that the observer can duplicate every mentor action. That is, at each state s, there is some a ∈ A such that Pr_o(s, a, t) = Pr_m(s, t) for all t. When an equivalent action a does not exist, there is no guarantee that the value calculated using the mentor action model can, in fact, be achieved.

5.1 Feasibility Testing

In such heterogeneous settings, we can prevent "lock-up" and poor convergence through the use of an explicit action feasibility test: before an augmented backup is performed at s, the observer tests whether the mentor's action a_m "differs" from each of its actions at s, given its current estimated models. If so, the augmented backup is suppressed and a standard Bellman backup is used to update the value function.15 By default, mentor actions are assumed to be feasible for the observer; however, once the observer is reasonably confident that a_m is infeasible at state s, augmented backups are suppressed at s. Recall that uncertainty about the agent's true transition probabilities is captured by a Dirichlet distribution derived from sampled transitions. Comparing a_m with a_o is effected by a difference-of-means test with respect to the corresponding Dirichlets. This is complicated by the fact that Dirichlets are highly non-normal for small parameter values and transition distributions are multinomial. We deal with the non-normality by requiring a minimum number of samples and using robust Chebyshev bounds on the pooled variance of the distributions to be compared.

15. The decision is binary, but we could envision a smoother decision criterion that measures the extent to which the mentor's action can be duplicated.
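To make the role of the augmented backup concrete, the following is a minimal sketch, assuming estimated models stored as dictionaries of successor-state probabilities; the function and variable names are ours, not the paper's:

```python
def augmented_backup(s, V, R, gamma, Pr_o, actions, Pr_m):
    """One augmented Bellman backup at state s: take the best of the
    observer's own estimated action models and the mentor's observed
    Markov chain.  Pr_o[(s, a)] and Pr_m[s] map successor states to
    probabilities; V is the current value estimate, R the reward."""
    def backed_up(dist):
        return R[s] + gamma * sum(p * V[t] for t, p in dist.items())

    own_best = max(backed_up(Pr_o[(s, a)]) for a in actions)
    mentor_value = backed_up(Pr_m[s])
    return max(own_best, mentor_value)
```

When the mentor's chain reaches high-value states the observer has never tried, the mentor term dominates the maximum and pulls value into unexplored regions; the feasibility test below decides when that term should be suppressed.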
Conceptually, we evaluate Equation 12:

\[
\frac{\bigl|\Pr_o(s,a_o,t) - \Pr_m(s,t)\bigr|}{\sqrt{\dfrac{n_o(s,a_o,t)\,\sigma^2_{omodel}(s,a_o,t) + n_m(s,t)\,\sigma^2_{mmodel}(s,t)}{n_o(s,a_o,t) + n_m(s,t)}}} > Z_{\alpha/2} \tag{12}
\]

Here Z_{α/2} is the critical value of the test. The parameter α is the significance of the test, that is, the probability that we will falsely judge two actions to be different when they are actually the same. Given our highly non-normal distributions early in the training process, the appropriate Z value for a given α can be computed from Chebyshev's bound by solving 2α = 1 − 1/Z² for Z_{α/2}. When we have too few samples to do an accurate test, we persist with augmented backups (embodying our default assumption of homogeneity). If the value estimate is inflated by these backups, the agent will be biased to obtain additional samples, which will then allow the agent to perform the required feasibility test. Our assumption is therefore self-correcting.

We deal with the multivariate complications by performing the Bonferroni test (Seber, 1984), which has been shown to give good results in practice (Mi & Sampson, 1993), is efficient to compute, and is known to be robust to dependence between variables. A Bonferroni hypothesis test is obtained by conjoining several single-variable tests. Suppose the actions a_o and a_m result in r possible successor states, s_1, ..., s_r (i.e., r transition probabilities to compare). For each s_i, the hypothesis E_i denotes that a_o and a_m have the same transition probability to successor state s_i; that is, Pr(s, a_m, s_i) = Pr(s, a_o, s_i). We let Ē_i denote the complementary hypothesis (i.e., that the transition probabilities differ).
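The single-successor test can be sketched as follows; the function names are ours, and the var arguments stand for the Dirichlet-derived model variances σ² of Equation 12:

```python
import math

def chebyshev_critical_value(alpha):
    """Critical value Z_{alpha/2} obtained from Chebyshev's bound by
    solving 2*alpha = 1 - 1/Z^2, as stated in the text (alpha < 0.5)."""
    return 1.0 / math.sqrt(1.0 - 2.0 * alpha)

def mean_difference_z(p_o, var_o, n_o, p_m, var_m, n_m):
    """The z statistic of Equation 12: absolute difference of the two
    estimated transition probabilities over the pooled standard
    deviation, weighted by the sample counts n_o and n_m."""
    pooled_var = (n_o * var_o + n_m * var_m) / (n_o + n_m)
    return abs(p_o - p_m) / math.sqrt(pooled_var)
```

The two actions are judged different on a given successor when mean_difference_z(...) exceeds chebyshev_critical_value(alpha).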
The Bonferroni inequality states:

\[
\Pr\Bigl[\,\bigcap_{i=1}^{r} E_i\Bigr] \;\ge\; 1 - \sum_{i=1}^{r} \Pr\bigl[\bar{E}_i\bigr]
\]

Thus we can test the joint hypothesis ∩_{i=1}^{r} E_i (the two action models are the same) by testing each of the r complementary hypotheses Ē_i at confidence level α/r. If we reject any of the hypotheses, we reject the notion that the two actions are equal with confidence at least α. The mentor action a_m is deemed infeasible if, for every observer action a_o, the multivariate Bonferroni test just described rejects the hypothesis that the action is the same as the mentor's.

Pseudo-code for the Bonferroni component of the feasibility test appears in Table 2. It assumes a sufficient number of samples. For efficiency reasons, we cache the results of the feasibility testing. When the duplication of the mentor's action at state s is first determined to be infeasible, we set a flag for state s to this effect.

FUNCTION feasible(m, s) : Boolean
    FOR each a in A_o do
        allSuccessorProbsSimilar := true
        FOR each t in successors(s) do
            μ_Δ := |Pr_o(s, a, t) − Pr_m(s, t)|
            z_Δ := μ_Δ / sqrt[ (n_o(s,a,t) σ²_omodel(s,a,t) + n_m(s,t) σ²_mmodel(s,t)) / (n_o(s,a,t) + n_m(s,t)) ]
            IF z_Δ > z_{α/2r} THEN allSuccessorProbsSimilar := false
        END FOR
        IF allSuccessorProbsSimilar RETURN true
    END FOR
    RETURN false

Table 2: Action Feasibility Testing

5.2 k-step Similarity and Repair

Action feasibility testing essentially makes a strict decision as to whether the agent can duplicate the mentor's action at a specific state: once it is decided that the mentor's action is infeasible, augmented backups are suppressed and all potential guidance offered is eliminated at that state. Unfortunately, the strictness of the test results in a somewhat impoverished notion of similarity between mentor and observer. This, in turn, unnecessarily limits the transfer between mentor and observer.
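A Python rendering of the test in Table 2 might look as follows; this is a sketch under the assumption that the estimated model statistics are available as dictionaries keyed by (s, a, t) and (s, t), and all names are illustrative:

```python
import math

def feasible(s, observer_actions, successors, pr_o, pr_m,
             var_o, var_m, n_o, n_m, z_crit):
    """Bonferroni feasibility test: the mentor's behavior at s is feasible
    iff some observer action matches the mentor's transition distribution
    on every successor.  z_crit is the per-successor critical value
    z_{alpha/(2r)}; pr/var/n hold estimated means, variances, counts."""
    for a in observer_actions:
        all_similar = True
        for t in successors:
            mu = abs(pr_o[(s, a, t)] - pr_m[(s, t)])
            pooled = ((n_o[(s, a, t)] * var_o[(s, a, t)]
                       + n_m[(s, t)] * var_m[(s, t)])
                      / (n_o[(s, a, t)] + n_m[(s, t)]))
            if mu / math.sqrt(pooled) > z_crit:
                all_similar = False
                break
        if all_similar:
            return True   # some observer action duplicates the mentor's
    return False          # every action rejected: a_m deemed infeasible
```

Note the conjunction structure: a single rejected successor eliminates an observer action, but a single fully matching action suffices for feasibility.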
We propose a mechanism whereby the mentor's influence may persist even if the specific action it chooses is not feasible for the observer; we instead rely on the possibility that the observer may approximately duplicate the mentor's trajectory rather than exactly duplicating it. Suppose an observer has previously constructed an estimated value function using augmented backups. Using the mentor action model (i.e., the mentor's chain Pr_m(s, t)), a high value has been calculated for state s. Subsequently, suppose the mentor's action at state s is judged to be infeasible. This is illustrated in Figure 13, where the estimated value at state s is originally due to the mentor's action π_m(s), which for the sake of illustration moves with high probability to state t, which itself can lead to some highly-rewarding region of state space. After some number of experiences at state s, however, the learner concludes that the action π_m(s), and the associated high-probability transition to t, is not feasible. At this point, one of two things must occur: either (a) the value calculated for state s and its predecessors will "collapse" and all exploration towards highly-valued regions beyond state s ceases; or (b) the estimated value drops slightly but exploration continues towards the highly-valued regions. The latter case may arise as follows. If the observer has previously explored in the vicinity of state s, the observer's own action model may be sufficiently developed that it still connects the higher-value regions beyond state s to state s through Bellman backups.
For example, if the learner has sufficient experience to have learned that the highly-valued region can be reached through the alternative trajectory s-u-v-w, the newly discovered infeasibility of the mentor's transition s-t will not have a deleterious effect on the value estimate at s. If s is highly-valued, it is likely that states close to the mentor's trajectory will be explored to some degree. In this case, state s will not be as highly-valued as it was when using the mentor's action model, but it will still be valued highly enough that it will likely guide further exploration toward the area. We call this alternative (in this case s-u-v-w) to the mentor's action a bridge, because it allows value from higher-value regions to "flow over" an infeasible mentor transition. Because the bridge was formed without the intention of the agent, we call this process spontaneous bridging.

Figure 13: An alternative path can bridge value backups around infeasible paths

Where a spontaneous bridge does not exist, the observer's own action models are generally undeveloped (e.g., they are close to their uniform prior distributions). Typically, these undeveloped models assign a small probability to every possible outcome and therefore diffuse value from higher-valued regions and lead to a very poor value estimate for state s. The result is often a dramatic drop in the value of state s and all of its predecessors, and exploration towards the highly-valued region through the neighborhood of state s ceases.
In our example, this could occur if the observer's transition model at state s assigns low probability (e.g., close to prior probability) to moving to state u due to lack of experience (or similarly if the surrounding states, such as u or v, have been insufficiently explored).

The spontaneous bridging effect motivates a broader notion of similarity. When the observer can find a "short" sequence of actions that bridges an infeasible action on the mentor's trajectory, the mentor's example can still provide extremely useful guidance. For the moment, we assume a short path is any path of length no greater than some given integer k. We say an observer is k-step similar to a mentor at state s if the observer can duplicate, in k or fewer steps, the mentor's nominal transition at state s with "sufficiently high" probability. Given this notion of similarity, an observer can now test whether a spontaneous bridge exists and determine whether the observer is in danger of value function collapse and the concomitant loss of guidance if it decides to suppress an augmented backup at state s. To do this, the observer initiates a reachability analysis starting from state s, using its own action model Pr_o(s, a, t) to determine if there is a sequence of actions that leads with sufficiently high probability from state s to some state t on the mentor's trajectory downstream of the infeasible action.16 If a k-step bridge already exists, augmented backups can be safely suppressed at state s. For efficiency, we maintain a flag at each state to mark it as "bridged." Once a state is known to be bridged, the k-step reachability analysis need not be repeated. If a spontaneous bridge cannot be found, it might still be possible to intentionally set out to build one.
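The reachability analysis can be sketched as a bounded breadth-first search over the observer's own estimated model; here p_min stands in for the "sufficiently high" probability threshold, and the names are ours:

```python
def bridge_exists(s, k, observer_actions, pr_o, downstream, p_min=0.1):
    """k-step reachability sketch: is there a sequence of at most k
    observer actions leading from s, through transitions the estimated
    model pr_o[(state, action)] rates above p_min, to some state on the
    mentor's trajectory downstream of the infeasible action?"""
    frontier = {s}
    for _ in range(k):
        next_frontier = set()
        for u in frontier:
            for a in observer_actions:
                for t, p in pr_o.get((u, a), {}).items():
                    if p >= p_min:
                        if t in downstream:
                            return True      # spontaneous bridge found
                        next_frontier.add(t)
        frontier = next_frontier
    return False
```

Once this returns true for a state, the "bridged" flag can be set and the analysis never repeated there, matching the caching described above.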
To build a bridge, the observer must explore from state s up to k steps away, hoping to make contact with the mentor's trajectory downstream of the infeasible mentor action. We implement a single search attempt as a k²-step random walk, which will result in a trajectory on average k steps away from s as long as ergodicity and local connectivity assumptions are satisfied. In order for the search to occur, we must motivate the observer to return to state s and engage in repeated exploration. We provide this motivation by having the observer assume that the infeasible action will be repairable. The observer therefore continues the augmented backups which support high-value estimates at state s, and repeatedly engages in exploration from this point. The danger, of course, is that there may not in fact be a bridge, in which case the observer will repeat this search for a bridge indefinitely. We therefore need a mechanism to terminate the repair process when a k-step repair is infeasible. We could attempt to explicitly keep track of all of the possible paths open to the observer and all of the paths explicitly tried by the observer, and determine when the repair possibilities had been exhausted. Instead, we elect to follow a probabilistic search that eliminates the need for bookkeeping: if a bridge cannot be constructed within n such random-walk attempts, the "repairability assumption" is judged falsified, the augmented backup at state s is suppressed, and the observer's bias to explore the vicinity of state s is eliminated.

16. In a more general state space where ergodicity is lacking, the agent must consider predecessors of state s up to k steps before s to guarantee that all k-step paths are checked.
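The repair process itself can be sketched as follows, with step(state, rng) a hypothetical one-step environment simulator and downstream the set of mentor states past the infeasible transition; all names are illustrative:

```python
import random

def attempt_repair(s, k, n, step, downstream, seed=0):
    """Probabilistic k-step repair sketch: up to n attempts, each a
    k*k-step random walk from s.  Success means touching the mentor's
    trajectory downstream of the infeasible action; after n failed
    attempts the repairability assumption is judged falsified."""
    rng = random.Random(seed)
    for _ in range(n):
        state = s
        for _ in range(k * k):
            state = step(state, rng)
            if state in downstream:
                return True   # bridge built: mark the state as bridged
    return False              # irreparable: suppress augmented backups
```

The probabilistic search trades completeness for the absence of path bookkeeping, exactly the trade described in the text.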
If no bridge is found for state s, a flag is used to mark the state as "irreparable." This approach is, of course, a very naïve heuristic strategy; but it illustrates the basic import of bridging. More systematic strategies could be used, involving explicit "planning" to find a bridge using, say, local search (Alissandrakis, Nehaniv, & Dautenhahn, 2000). Another aspect of this problem that we do not address is the persistence of search for bridges. In a specific domain, after some number of unsuccessful attempts to find bridges, a learner may conclude that it is unable to reconstruct a mentor's behavior, in which case the search for bridges may be abandoned. This involves simple, higher-level inference, and some notion of (or prior beliefs about) "similarity" of capabilities. These notions could also be used to automatically determine parameter settings (discussed below).

The parameters k and n must be tuned empirically, but can be estimated given knowledge of the connectivity of the domain and prior beliefs about how similar (in terms of length of average repair) the trajectories of the mentor and observer will be. For instance, n > 8k − 4 seems suitable in an 8-connected grid world with low noise, based on the number of trajectories required to cover the perimeter states of a k-step rectangle around a state. We note that very large values of n can reduce performance below that of non-imitating agents, as they result in temporary "lock-up."

Feasibility testing and k-step repair are easily integrated into the homogeneous implicit imitation framework. Essentially, we simply elaborate the conditions under which the augmented backup will be employed. Of course, some additional representation will be introduced to keep track of whether a state is feasible, bridged, or repairable, and how many repair attempts have been made.
The action selection mechanism will also be overridden by the bridge-building algorithm when required in order to search for a bridge. Bridge building always terminates after n attempts, however, so it cannot affect long-run convergence. All other aspects of the algorithm, such as the exploration policy, are unchanged.

The complete elaborated decision procedure used to determine when augmented backups will be employed at state s with respect to mentor m appears in Table 3. It uses some internal state to make its decisions. As in the original model, we first check to see if the observer's experience-based calculation for the value of the state supersedes the mentor-based calculation; if so, then the observer uses its own experience-based calculation. If the mentor's action is feasible, then we accept the value calculated using the observation-based value function. If the action is infeasible, we check to see if the state is bridged. The first time the test is requested, a reachability analysis is performed, but the results will be drawn from a cache for subsequent requests.

FUNCTION use_augmented?(s, m) : Boolean
    IF V_o(s) ≻ V_m(s) THEN RETURN false
    ELSE IF feasible(s, m) THEN RETURN true
    ELSE IF bridged(s, m) THEN RETURN false
    ELSE IF reachable(s, m) THEN
        bridged(s, m) := true
        RETURN false
    ELSE IF not repairable(s, m) THEN RETURN false
    ELSE  % we are searching
        IF 0 < search_steps(s, m) < k THEN  % search in progress
            RETURN true
        IF search_steps(s, m) > k THEN  % search failed
            IF attempts(s) > n THEN
                repairable(s) := false
                RETURN false
            ELSE
                reset_search(s, m)
                attempts(s) := attempts(s) + 1
                RETURN true
        attempts(s) := 1  % initiate first attempt of a search
        initiate_search(s)
        RETURN true

Table 3: Elaborated augmented backup test
If the state has been bridged, we suppress augmented backups, confident that this will not cause value function collapse. If the state is not bridged, we ask if it is repairable. For the first n requests, the agent will attempt a k-step repair. If the repair succeeds, the state is marked as bridged. If we cannot repair the infeasible transition, we mark it not-repairable and suppress augmented backups.

We may wish to employ implicit imitation with feasibility testing in a multiple-mentor scenario. The key change from implicit imitation without feasibility testing is that the observer will only imitate feasible actions. When the observer searches through the set of mentors for the one with the action that results in the highest value estimate, the observer must consider only those mentors whose actions are still considered feasible (or assumed to be repairable).

5.3 Empirical Demonstrations

In this section, we empirically demonstrate the utility of feasibility testing and k-step repair, and show how the techniques can be used to surmount both differences in actions between agents and small local differences in state-space topology. The problems here have been chosen specifically to demonstrate the necessity and utility of both feasibility testing and k-step repair.

5.3.1 Experiment 1: Necessity of Feasibility Testing

Our first experiment shows the importance of feasibility testing in implicit imitation when agents have heterogeneous actions. In this scenario, all agents must navigate across an obstacle-free, 10 × 10 grid world from the upper-left corner to a goal location in the lower-right. The agent is then reset to the upper-left corner. The first agent is a mentor with the "NEWS" action set (North, South, East, and West movement actions). The mentor is given an optimal stationary policy for this problem.
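The mentor-selection step just described can be sketched as follows; the names are illustrative, with pr_m[(m, s)] denoting mentor m's observed chain at s and usable(s, m) the feasible-or-repairable check:

```python
def best_mentor_backup(s, mentors, V, R, gamma, pr_m, usable):
    """Multiple-mentor selection sketch: among mentors whose action at s
    is still considered feasible or assumed repairable (usable(s, m)),
    return the mentor whose observed model yields the highest backed-up
    value, together with that value.  Returns (None, -inf) if no mentor
    is usable at s."""
    best_mentor, best_value = None, float('-inf')
    for m in mentors:
        if not usable(s, m):
            continue
        value = R[s] + gamma * sum(p * V[t]
                                   for t, p in pr_m[(m, s)].items())
        if value > best_value:
            best_mentor, best_value = m, value
    return best_mentor, best_value
```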
We study the performance of three learners, each with the "Skew" action set (N, S, NE, SW) and unable to duplicate the mentor exactly (e.g., duplicating a mentor's E-move requires the learner to move NE followed by S, or move SE then N). Due to the nature of the grid world, the control and imitation agents will actually have to execute more actions to get to the goal than the mentor, and the optimal goal rates for both the control agent and the imitator are therefore lower than that of the mentor. The first learner employs implicit imitation with feasibility testing, the second uses imitation without feasibility testing, and the third, a control agent, uses no imitation (i.e., it is a standard reinforcement learning agent). All agents experience limited stochasticity in the form of a 5% chance that their action will be randomly perturbed. As in the last section, the agents use model-based reinforcement learning with prioritized sweeping. We set k = 3 and n = 20.

The effectiveness of feasibility testing in implicit imitation can be seen in Figure 14. The horizontal axis represents time in simulation steps and the vertical axis represents the average number of goals achieved per 1000 time steps (averaged over 10 runs). We see that the imitation agent with feasibility testing converges much more quickly to the optimal goal-attainment rate than the other agents. The agent without feasibility testing achieves sporadic success early on, but frequently "locks up" due to repeated attempts to duplicate infeasible mentor actions. The agent still manages to reach the goal from time to time, as the stochastic actions do not permit the agent to become permanently stuck in this obstacle-free scenario.
The control agent, without any form of imitation, demonstrates a significant delay in convergence relative to the imitation agents due to the lack of any form of guidance, but easily surpasses the agent without feasibility testing in the long run. The more gradual slope of the control agent is due to the higher variance in the control agent's discovery time for the optimal path, but both the feasibility-testing imitator and the control agent converge to optimal solutions. As shown by the comparison of the two imitation agents, feasibility testing is necessary to adapt implicit imitation to contexts involving heterogeneous actions.

Figure 14: Utility of feasibility testing

5.3.2 Experiment 2: Changes to State Space

We developed feasibility testing and bridging primarily to deal with the problem of adapting to agents with heterogeneous actions. The same techniques, however, can be applied to agents with differences in their state-space connectivity (ultimately, these are equivalent notions). To test this, we constructed a domain where all agents have the same NEWS action set, but we alter the environment of the learners by introducing obstacles that are not present for the mentor. In Figure 15, the learners find that the mentor's path is obstructed by obstacles. Movement toward an obstacle causes a learner to remain in its current state. In this sense, its action has a different effect than the mentor's. In Figure 16, we see that the results are qualitatively similar to the previous experiment.

Figure 15: Obstacle map and mentor path
In contrast to the previous experiment, both imitator and control use the "NEWS" action set and therefore have a shortest path with the same length as that of the mentor. Consequently, the optimal goal rate of the imitators and control is higher than in the previous experiment. The observer without feasibility testing has difficulty with the maze, as the value function augmented by mentor observations consistently leads the observer to states whose path to the goal is directly blocked. The agent with feasibility testing quickly discovers that the mentor's influence is inappropriate at such states. We conclude that local differences in state are well handled by feasibility testing.

Figure 16: Interpolating around obstacles

Next, we demonstrate how feasibility testing can completely generalize the mentor's trajectory. Here, the mentor follows a path which is completely infeasible for the imitating agent. We fix the mentor's path for all runs and give the imitating agent the maze shown in Figure 17, in which all but two of the states the mentor visits are blocked by an obstacle.

Figure 17: Parallel generalization

The imitating agent is able to use the mentor's trajectory for guidance and builds its own parallel trajectory which is completely disjoint from the mentor's. The results in Figure 18 show that the gain of the imitator with feasibility testing over the control agent diminishes, but still exists marginally, when the imitator is forced to generalize a completely infeasible mentor trajectory. The agent without feasibility testing does very poorly, even when compared to the control agent. This is because it gets stuck around the doorway.
The high value gradient backed up along the mentor's path becomes accessible to the agents at the doorway. The imitation agent with feasibility testing will conclude that it cannot proceed south from the doorway (into the wall) and will then try a different strategy. The imitator without feasibility testing never explores far enough away from the doorway to set up an independent value gradient that will guide it to the goal. With a slower decay schedule for exploration, the imitator without feasibility testing would find the goal, but this would still reduce its performance below that of the imitator with feasibility testing.

Figure 18: Parallel generalization results

The imitator with feasibility testing makes use of its prior beliefs that it can follow the mentor to back up value perpendicular to the mentor's path. A value gradient will therefore form parallel to the infeasible mentor path, and the imitator can follow alongside the infeasible path towards the doorway, where it makes the necessary feasibility test and then proceeds to the goal.

As explained earlier, in simple problems there is a good chance that the informal effects of prior value leakage and stochastic exploration may form bridges before feasibility testing cuts off the value propagation that guides exploration. In more difficult problems, where the agent spends a lot more time exploring, it will accumulate sufficient samples to conclude that the mentor's actions are infeasible long before the agent has constructed its own bridge. The imitator's performance would then drop to that of an unaugmented reinforcement learner.
To demonstrate bridging, we devised a domain in which agents must navigate from the upper-left corner to the bottom-right corner, across a "river" which is three steps wide and exacts a penalty of −0.2 per step (see Figure 19). The goal state is worth 1.0. In the figure, the path of the mentor is shown starting from the top corner, proceeding along the edge of the river and then crossing the river to the goal. The mentor employs the "NEWS" action set. The observer uses the "Skew" action set (N, NE, S, SW) and attempts to reproduce the mentor trajectory. It will fail to reproduce the critical transition at the border of the river (because the "East" action is infeasible for a "Skew" agent). The mentor action can no longer be used to back up value from the rewarding state, and there will be no alternative paths because the river blocks greedy exploration in this region. Without bridging, or an optimistic and lengthy exploration phase, observer agents quickly discover the negative states of the river and curtail exploration in this direction before actually making it across.

Figure 19: River scenario

If we examine the value function estimate (after 1000 steps) of an imitator with feasibility testing but no repair capabilities, we see that, due to suppression by feasibility testing, the darkly shaded high-value states in Figure 19 (backed up from the goal) terminate abruptly at an infeasible transition without making it across the river. In fact, they are dominated by the lighter grey circles showing negative values. In this experiment, we show that bridging can prolong the exploration phase in just the right way. We employ the k-step repair procedure with k = 3. Examining the graph in Figure 20, we see that both imitation agents experience an early negative dip as they are guided deep into the river by the mentor's influence.
The agent without repair eventually decides the mentor's action is infeasible, and thereafter avoids the river (and the possibility of finding the goal). The imitator with repair also discovers the mentor's action to be infeasible, but does not immediately dispense with the mentor's guidance. It keeps exploring in the area of the mentor's trajectory using a random walk, all the while accumulating negative reward, until it suddenly finds a bridge and rapidly converges on the optimal solution.17 The control agent discovers the goal only once in the ten runs.

17. While repair steps take place in an area of negative reward in this scenario, this need not be the case. Repair does not imply short-term negative return.

Figure 20: Utility of bridging

6. Applicability

The simple experiments presented above demonstrate the major qualitative issues confronting an implicit imitation agent and how the specific mechanisms of implicit imitation address these issues. In this section, we examine how the assumptions and the mechanisms we presented in the previous sections determine the types of problems suitable for implicit imitation. We then present several dimensions that prove useful for predicting the performance of implicit imitation in these types of problems.

We have already identified a number of assumptions under which implicit imitation is applicable: some assumptions under which other models of imitation or teaching cannot be applied, and some assumptions that restrict the applicability of our model.
These include: lack of explicit communication between mentors and observer; independent objectives for mentors and observer; full observability of mentors by the observer; unobservability of mentors' actions; and (bounded) heterogeneity. Assumptions such as full observability are necessary for our model, as formulated, to work (though we discuss extension to the partially observable case in Section 7). The assumptions of lack of communication and unobservable actions extend the applicability of implicit imitation beyond other models in the literature; if these conditions do not hold, a simpler form of explicit communication may be preferable. Finally, the assumptions of bounded heterogeneity and independent objectives also ensure implicit imitation can be applied widely. However, the degree to which rewards are the same and actions are homogeneous can have an impact on the utility of implicit imitation (i.e., the acceleration of learning it offers). We turn our attention to predicting the performance of implicit imitation as a function of certain domain characteristics.

6.1 Predicting Performance

In this section we examine two questions: first, given that implicit imitation is applicable, when can implicit imitation bias an agent toward a suboptimal solution; and second, how will the performance of implicit imitation vary with structural characteristics of the domains one might want to apply it to? We show how analysis of the internal structure of state space can be used to motivate a metric that (roughly) predicts implicit imitation performance. We conclude with an analysis of how the problem space can be understood in terms of distinct regions playing different roles within an imitation context.
In the implicit imitation model, we use observations of other agents to improve the observer's knowledge about its environment and then rely on a sensible exploration policy to exploit this additional knowledge. A clear understanding of how knowledge of the environment affects exploration is therefore central to understanding how implicit imitation will perform in a domain.

Within the implicit imitation framework, agents know their reward functions, so knowledge of the environment consists solely of knowledge about the agent's action models. In general, these models can take any form. For simplicity, we have restricted ourselves to models that can be decomposed into local models for each possible combination of a system state and agent action. The local models for state-action pairs allow the prediction of a j-step successor state distribution given any initial state and sequence of actions or local policy. The quality of the j-step state predictions will be a function of every action model encountered between the initial state and the states at time j − 1. Unfortunately, the quality of the j-step estimate can be drastically altered by the quality of even a single intermediate state-action model. This suggests that connected regions of state space, the states of which all have fairly accurate models, will allow reasonably accurate future state predictions. Since the estimated value of a state s is based on both the immediate reward and the reward expected to be received in subsequent states, the quality of this value estimate will also depend on the quality of the action models in those states connected to s.
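The composition of local models into j-step predictions can be illustrated directly. The sketch below (our names, not the paper's) propagates a state distribution under a fixed policy, making plain that a single inaccurate intermediate model corrupts everything downstream:

```python
def j_step_distribution(start, policy, pr_o, j):
    """Predict the j-step successor state distribution from the local
    models pr_o[(s, a)] (successor -> probability) under policy(s):
    repeatedly push probability mass through one-step models."""
    dist = {start: 1.0}
    for _ in range(j):
        next_dist = {}
        for s, p in dist.items():
            for t, q in pr_o[(s, policy(s))].items():
                next_dist[t] = next_dist.get(t, 0.0) + p * q
        dist = next_dist
    return dist
```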
Now, since greedy exploration methods bias their exploration according to the estimated value of actions, the exploratory choices of an agent at state s will also depend on the connectivity of reliable action models at those states reachable from s. Our analysis of implicit imitation performance with respect to domain characteristics is therefore organized around the idea of state space connectivity and the regions such connectivity defines.

6.1.1 The Imitation Regions Framework

Since connected regions play an important role in implicit imitation, we introduce a classification of different regions within the state space, shown graphically in Figure 21. In what follows, we describe how these regions affect imitation performance in our model. We first observe that many tasks can be carried out by an agent in a small subset of states within the state space defined for the problem. More precisely, in many MDPs, the optimal policy will ensure that an agent remains in a small subspace of state space. This leads us to the definition of our first regional distinction: relevant vs. irrelevant regions. The relevant region is the set of states with nonzero probability of occupancy under the optimal policy.18 An ε-relevant region is a natural generalization in which the optimal policy keeps the system within the region a fraction 1 − ε of the time. Within the relevant region, we distinguish three additional subregions. The explored region contains those states where the observer has formulated reliable action models on the basis of its own experience. The augmented region contains those states where the observer lacks reliable action models but has improved value estimates due to mentor observations.

18. One often assumes that the system starts in one of a small set of states.
If the Markov chain induced by the optimal policy is not ergodic, then the irrelevant region will be nonempty; otherwise it will be empty.

Figure 21: Classification of regions of state space (showing the explored, augmented, blind and irrelevant regions, the mentor and observer, and a reward region)

Note that both the explored and augmented regions are created as the result of observations made by the learner (of either its own transitions or those of a mentor). These regions will therefore have significant "connected components"; that is, contiguous regions of state space where reliable action or mentor models are available. Finally, the blind region designates those states where the observer has neither (significant) personal experience nor the benefit of mentor observations. Any information about states within the blind region will come (largely) from the agent's prior beliefs.19 We can now ask how these regions interact with an imitation agent. First we consider the impact of relevance. Implicit imitation makes the assumption that more accurate dynamics models allow an observer to make better decisions, which will, in turn, result in higher returns sooner in the learning process. However, not all model information is equally helpful: the imitator needs only enough information about the irrelevant region to be able to avoid it. Since action choices are influenced by the relative values of actions, the irrelevant region will be avoided when it looks worse than the relevant region. Given diffuse priors on action models, none of the actions open to an agent will initially appear particularly attractive. However, a mentor that provides observations within the relevant region can quickly make the relevant region look much more promising as a method of achieving higher returns, and therefore constrain exploration significantly.
Therefore, considering problems just from the point of view of relevance, a problem with a small relevant region relative to the entire space, combined with a mentor that operates within the relevant region, will result in maximum advantage for an imitation agent over a non-imitating agent. In the explored region, the observer has sufficiently accurate models to compute a good policy with respect to rewards within the explored region. Additional observations on the states within the explored region provided by the mentor can still improve performance somewhat if significant evidence is required to accurately discriminate between the expected values of two actions. Hence, mentor observations in the explored region can help, but will not result in dramatic speedups in convergence. Now, we consider the augmented region, in which the observer's Q-values have been augmented with observations of a mentor. In experiments in previous sections, we have seen that an observer entering an augmented region can experience significant speedups in convergence due to the information inherent in the augmented value function about the location of rewards in the region. Characteristics of the augmented zone, however, can affect the degree to which augmentation improves convergence speed. Since the observer receives observations of only the mentor's state, and not its actions, the observer has improved value estimates for states in the augmented region, but no policy.

19. Our partitioning of states into explored, blind and augmented regions bears some resemblance to Kearns and Singh's (1998) partitioning of state space into known and unknown regions. Unlike Kearns and Singh, however, we use the partitions only for analysis. The implicit imitation algorithm does not explicitly maintain these partitions or use them in any way to compute its policy.
The observer must therefore infer which actions should be taken to duplicate the mentor's behavior. Where the observer has prior beliefs about the effects of its actions, it may be able to perform immediate inference about the mentor's actual choice of action (perhaps using KL-divergence or maximum likelihood). Where the observer's prior model is uninformative, the observer will have to explore the local action space. In exploring a local action space, however, the agent must take an action, and this action will have an effect. Since there is no guarantee that the agent took the action that duplicates the mentor's action, it may end up somewhere different than the mentor. If the action causes the observer to fall outside of the augmented region, the observer will lose the guidance that the augmented value function provides and fall back to the performance level of a non-imitating agent. An important consideration, then, is the probability that the observer will remain in augmented regions and continue to receive guidance. One quality of the augmented region that affects the observer's probability of staying within its boundaries is its relative coverage of the state space. The policy of the mentor may be sparse or complete. In a relatively deterministic domain with defined begin and end states, a sparse policy covering few states may be adequate. In a highly stochastic domain with many start and end states, an agent may need a complete policy (i.e., covering every state). Implicit imitation will provide more guidance to the agent in domains that are more stochastic and require more complete policies, since the policy will cover a larger part of the state space. As important as the completeness of a policy is in predicting its guidance, we must also take into account the probability of transitions into and out of the augmented region.
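The inference step described above, choosing the observer action most consistent with observed mentor transitions, can be sketched in its maximum-likelihood form; the toy two-action models and the function name are assumptions for illustration (the text also mentions KL-divergence as an alternative).

```python
import numpy as np

def infer_mentor_action(models, s, observed_successors):
    """Pick the observer action at state s whose local model assigns the
    highest log-likelihood to the successor states the mentor was
    observed to reach from s.  models[s][a] is a distribution over
    successor states."""
    log_liks = []
    for dist in models[s]:
        ll = sum(np.log(dist[t] + 1e-12) for t in observed_successors)
        log_liks.append(ll)
    return int(np.argmax(log_liks))

# Two candidate actions at state 0; the mentor is seen reaching state 1
# three times out of four, which matches action 1's model best.
models = {0: [np.array([0.8, 0.2]), np.array([0.2, 0.8])]}
guess = infer_mentor_action(models, 0, [1, 1, 1, 0])
```

With uninformative priors the log-likelihoods tie, which corresponds to the case in the text where the observer must fall back on exploring the local action space.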
Where the actions in a domain are largely invertible (directly, or effectively so), the agent has a chance of re-entering the augmented region. Where ergodicity is lacking, however, the agent may have to wait until the process undergoes some form of "reset" before it has the opportunity to gather additional evidence regarding the identity of the mentor's actions in the augmented region. The reset places the agent back into the explored region, from which it can make its way to the frontier where it last explored. The lack of ergodicity would reduce the agent's ability to make progress towards high-value regions before resets, but the agent is still guided on each attempt by the augmented region. Effectively, the agent will concentrate its exploration on the boundary between the explored region and the mentor-augmented region. The utility of mentor observations will depend on the probability of the augmented and explored regions overlapping in the course of the agent's exploration. In the explored regions, accurate action models allow the agent to move as quickly as possible to high-value regions. In augmented regions, augmented Q-values inform agents about which states lead to highly-valued outcomes. When an augmented region abuts an explored region, the improved value estimates from the augmented region are rapidly communicated across the explored region by accurate action models. The observer can use the resultant improved value estimates in the explored region, together with the accurate action models in the explored region, to rapidly move towards the most promising states on the frontier of the explored region. From these states, the observer can explore outward and thereby eventually expand the explored region to encompass the augmented region.
In the case where the explored region and the augmented region do not overlap, we have a blind region. Since the observer has no information beyond its priors for the blind region, the observer is reduced to random exploration. In a non-imitation context, any states that are not explored are blind. However, in an imitation context, the blind area is reduced in effective size by the augmented area. Hence, implicit imitation effectively shrinks the size of the search space of the problem even when there is no overlap between explored and augmented spaces. The most challenging case for implicit imitation transfer occurs when the region augmented by mentor observations fails to connect to both the observer's explored region and the regions with significant reward values. In this case, the augmented region will initially provide no guidance. Once the observer has independently located rewarding states, the augmented regions can be used to highlight "shortcuts". These shortcuts represent improvements on the agent's policy. In domains where a feasible solution is easy to find, but optimal solutions are difficult, implicit imitation can be used to convert a feasible solution into an increasingly optimal solution.

6.1.2 Cross-Regional Textures

We have seen how distinctive regions can be used to provide a certain level of insight into how imitation will perform in various domains. We can also analyze imitation performance in terms of properties that cut across the state space. In our analysis of how model information impacts imitation performance, we saw that regions connected by accurate action models allowed an observer to use mentor observations to learn about the most promising direction for exploration.
We see, then, that any set of mentor observations will be more useful if it is concentrated on a connected region, and less useful if dispersed about the state space in unconnected components. We are fortunate that, in completely observable environments, observations of mentors tend to capture continuous trajectories, thereby providing continuous regions of augmented states. In partially observable environments, occlusion and noise could lessen the value of mentor observations in the absence of a model to predict the mentor's state. The effects of heterogeneity, whether due to differences in the action capabilities of the mentor and observer or due to differences in the environments of the two agents, can also be understood in terms of the connectivity of action models. Value can propagate along chains of action models until we hit a state in which the mentor and observer have different action capabilities. At this state, it may not be possible to achieve the mentor's value and, therefore, value propagation is blocked. Again, the sequential decision-making aspect of reinforcement learning leads to the conclusion that many scattered differences between mentor and observer will create discontinuity throughout the problem space, whereas a contiguous region of differences between mentor and observer will cause discontinuity in a region, but leave other large regions fully connected. Hence, the distribution pattern of differences between mentor and observer capabilities is as important as the prevalence of difference. We will explore this pattern in the next section.

6.2 The Fracture Metric

We now try to characterize connectivity in the form of a metric.
Since differences in reward structure, environment dynamics and action models that affect connectivity would all manifest themselves as differences in policies between mentor and observer, we designed a metric based on differences in the agents' optimal policies. We call this metric fracture. Essentially, it computes the average minimum distance from a state in which the mentor and observer disagree on a policy to a state in which the mentor and observer agree on the policy. This measure roughly captures the difficulty the observer faces in profitably exploiting mentor observations to reduce its exploration demands. More formally, let π_m be the mentor's optimal policy and π_o be the observer's. Let S be the state space and S_{π_m ≠ π_o} be the set of disputed states where the mentor and observer have different optimal actions. A set of neighboring disputed states constitutes a disputed region. The set S − S_{π_m ≠ π_o} will be called the undisputed states. Let M be a distance metric on the space S. This metric corresponds to the number of transitions along the "minimal length" path between states (i.e., the shortest path using nonzero-probability observer transitions).20 In a standard grid world, it corresponds to the Manhattan distance. We define the fracture Φ(S) of state space S to be the average minimal distance between a disputed state and the closest undisputed state:

\Phi(S) = \frac{1}{|S_{\pi_m \neq \pi_o}|} \sum_{s \in S_{\pi_m \neq \pi_o}} \min_{t \in S - S_{\pi_m \neq \pi_o}} M(s, t).   (13)

Other things being equal, a lower fracture value will tend to increase the propagation of value information across the state space, potentially resulting in less exploration being required. To test our metric, we applied it to a number of scenarios with varying fracture coefficients. It is difficult to construct scenarios which vary in their fracture coefficient yet have the same expected value.
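The fracture computation of Eq. (13) can be sketched as a breadth-first search from each disputed state; the adjacency representation and the toy line graph below are illustrative assumptions.

```python
from collections import deque

def fracture(adj, disputed):
    """Average, over disputed states, of the shortest-path distance to
    the nearest undisputed state (Eq. 13).  adj[s] lists the states
    reachable from s via nonzero-probability observer transitions."""
    undisputed = set(adj) - set(disputed)
    total = 0
    for s in disputed:
        seen, queue = {s}, deque([(s, 0)])
        while queue:                      # BFS visits states in
            u, d = queue.popleft()        # order of distance from s
            if u in undisputed:
                total += d
                break
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append((v, d + 1))
    return total / len(disputed)

# Five states in a line; the agents' policies disagree on states 1-3,
# so the distances to the nearest undisputed state are 1, 2 and 1.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
phi = fracture(adj, {1, 2, 3})
```

In a grid world with free movement, this shortest-path distance reduces to the Manhattan distance mentioned in the text.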
The scenarios in Figure 22 have been constructed so that the lengths of all possible paths from the start state s to the goal state x are the same in each scenario. In each scenario, however, there is an upper path and a lower path. The mentor is trained in a scenario that penalizes the lower path, and so the mentor learns to take the upper path. The imitator is trained in a scenario in which the upper path is penalized, and should therefore take the lower path. We equalized the difficulty of these problems as follows: using a generic ε-greedy learning agent with a fixed exploration schedule (i.e., a fixed initial rate and decay) in one scenario, we tuned the magnitude of penalties and their exact placement along loops in the other scenarios so that a learner using the same exploration policy would converge to the optimal policy in roughly the same number of steps in each.

20. The expected distance would give a more accurate estimate of fracture, but is more difficult to calculate.

Figure 22: Fracture metric scenarios ((a) Φ = 0.5; (b) Φ = 1.7; (c) Φ = 3.5; (d) Φ = 6.0)

Figure 23: Percentage of runs (of ten) converging to the optimal policy given fracture Φ and initial exploration rate δ_I

In Figure 22(a), the mentor takes the top of each loop and, in an optimal run, the imitator would take the bottom of each loop. Since the loops are short and the length of the common path is long, the average fracture is low. When we compare this to Figure 22(d), we see that the loops are very long: the majority of states in the scenario are on loops.
Each of these states on a loop has a distance to the nearest state where the observer and mentor policies agree, namely, a state not on the loop. This scenario therefore has a high average fracture coefficient. Since the loops in the various scenarios differ in length, penalties inserted in the loops vary with respect to their distance from the goal state and therefore affect the total discounted expected reward in different ways. The penalties may also cause the agent to become stuck in a local minimum in order to avoid the penalties if the exploration rate is too low. In this set of experiments, we therefore compare observer agents on the basis of how likely they are to converge to the optimal solution given the mentor example. Figure 23 presents the percentage of runs (out of ten) in which the imitator converged to the optimal solution (i.e., taking only the lower loops) as a function of exploration rate and scenario fracture.21 We can see a distinct diagonal trend in the table, illustrating that increasing fracture requires the imitator to increase its level of exploration in order to find the optimal policy. This suggests that fracture reflects a feature of RL domains that may be important in predicting the efficacy of implicit imitation.

21. For reasons of computational expediency, only the entries near the diagonal have been computed. Sampling of other entries confirms the trend.

6.3 Suboptimality and Bias

Implicit imitation is fundamentally about biasing the exploration of the observer. As such, it is worthwhile to ask when this has a positive effect on observer performance. The short answer is that a mentor following an optimal policy for an observer will cause the observer to explore in the neighborhood of the optimal policy, and this will generally bias the observer towards finding the optimal policy.
A more detailed answer requires looking explicitly at exploration in reinforcement learning. In theory, an ε-greedy exploration policy with a suitable rate of decay will cause implicit imitators to eventually converge to the same optimal solution as their unassisted counterparts. However, in practice, the exploration rate is typically decayed more quickly in order to improve early exploitation of mentor input. Given practical, but theoretically unsound, exploration rates, an observer may settle for a mentor strategy that is feasible but non-optimal. We can easily imagine examples: consider a situation in which an agent is observing a mentor following some policy. Early in the learning process, the value of the policy followed by the mentor may look better than the estimated value of the alternative policies available to the observer. It could be the case that the mentor's policy actually is the optimal policy. On the other hand, it may be the case that one of the alternative policies, with which the observer has neither personal experience nor observations from a mentor, is actually superior. Given the lack of information, an aggressive exploitation policy might lead the observer to falsely conclude that the mentor's policy is optimal. While implicit imitation can bias the agent to a suboptimal policy, we have no reason to expect that an agent learning in a domain sufficiently challenging to warrant the use of imitation would have discovered a better alternative. We emphasize that even if the mentor's policy is suboptimal, it still provides a feasible solution, which will be preferable to no solution for many practical problems. In this regard, we see that the classic exploration/exploitation tradeoff has an additional interpretation in the implicit imitation setting.
A component of the exploration rate will correspond to the observer's belief about the sufficiency of the mentor's policy. In this paradigm, then, it seems somewhat misleading to think in terms of a decision about whether to "follow" a specific mentor or not. It is more a question of how much exploration to perform in addition to that required to reconstruct the mentor's policy.

6.4 Specific Applications

We see applications for implicit imitation in a variety of contexts. The emerging electronic commerce and information infrastructure is driving the development of vast networks of multi-agent systems. In networks used for competitive purposes such as trade, implicit imitation can be used by an RL agent to learn about the buying strategies or information filtering policies of other agents in order to improve its own behavior. In control, implicit imitation could be used to transfer knowledge from an existing learned controller, which has already adapted to its clients, to a new learning controller with a completely different architecture. Many modern products such as elevator controllers (Crites & Barto, 1998), cell traffic routers (Singh & Bertsekas, 1997) and automotive fuel injection systems use adaptive controllers to optimize the performance of a system for specific user profiles. When upgrading the technology of the underlying system, it is quite possible that the sensors, actuators and internal representation of the new system will be incompatible with the old system. Implicit imitation provides a method of transferring valuable user information between systems without any explicit communication. A traditional application for imitation-like technologies lies in the area of bootstrapping intelligent artifacts using traces of human behavior.
Research within the behavioral cloning paradigm has investigated transfer in applications such as piloting aircraft (Sammut et al., 1992) and controlling loading cranes (Šuc & Bratko, 1997). Other researchers have investigated the use of imitation to simplify the programming of robots (Kuniyoshi, Inaba, & Inoue, 1994). The ability of imitation to transfer complex, nonlinear and dynamic behaviors from existing human agents makes it particularly attractive for control problems.

7. Extensions

The model of implicit imitation presented above makes certain restrictive assumptions regarding the structure of the decision problem being solved (e.g., full observability, knowledge of the reward function, discrete state and action space). While these simplifying assumptions aided the detailed development of the model, we believe the basic intuitions and much of the technical development can be extended to richer problem classes. We suggest several possible extensions in this section, each of which provides a very interesting avenue for future research.

7.1 Unknown Reward Functions

Our current paradigm assumes that the observer knows its own reward function. This assumption is consistent with the view of RL as a form of automatic programming. We can, however, relax this constraint, assuming some ability to generalize observed rewards. Suppose that the expected reward can be expressed in terms of a probability distribution over features of the observer's state, Pr(r | f(s_o)). In model-based RL, this distribution can be learned by the agent through its own experience. If the same features can be applied to the mentor's state s_m, then the observer can use what it has learned about the reward distribution to estimate expected rewards for mentor states as well.
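The text leaves the form of Pr(r | f(s_o)) open; as one assumed concrete choice, a linear least-squares estimate of expected reward from state features can be fit on the observer's own experience and then applied to mentor states. The function name and toy data are illustrative, not the paper's method.

```python
import numpy as np

def learn_reward_model(feature_vectors, observed_rewards):
    """Fit expected reward as a linear function of state features from
    the observer's own experience (a simple least-squares stand-in for
    learning Pr(r | f(s_o))), returning a predictor that can equally be
    applied to the features f(s_m) of mentor states."""
    X = np.asarray(feature_vectors, dtype=float)
    y = np.asarray(observed_rewards, dtype=float)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda f: float(np.asarray(f, dtype=float) @ w)

# Observer experience: reward is (first feature) + 2 * (second feature).
predict = learn_reward_model([[1, 0], [0, 1], [1, 1]], [1, 2, 3])
r_mentor = predict([2, 1])  # estimated expected reward for a mentor state
```

A fully Bayesian treatment would instead maintain a distribution over reward parameters, but the key point, that the shared feature map lets observer experience generalize to mentor states, is the same.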
This extends the paradigm to domains in which rewards are unknown, but preserves the ability of the observer to evaluate mentor experiences on its "own terms." Imitation techniques designed around the assumption that the observer and the mentor share identical rewards, such as Utgoff's (1991), would of course work in the absence of a reward function. The notion of inverse reinforcement learning (Ng & Russell, 2000) could be adapted to this case as well. A challenge for future research would be to explore a synthesis between implicit imitation and reward-inversion approaches to handle an observer's prior beliefs about some intermediate level of correlation between the reward functions of observer and mentor.

7.2 Interaction of Agents

While we cast the general imitation model in the framework of stochastic games, the restriction of the model presented thus far to noninteracting games essentially means that the standard issues associated with multiagent interaction do not arise. There are, of course, many tasks that require interactions between agents; in such cases, implicit imitation offers the potential to accelerate learning. A general solution requires the integration of imitation into more general models for multiagent RL based on stochastic or Markov games (Littman, 1994; Hu & Wellman, 1998; Bowling & Veloso, 2001). This would no doubt be a rather challenging, yet rewarding, endeavor. To take a simple example, in simple coordination problems (e.g., two mobile agents trying to avoid each other while carrying out related tasks) we might imagine an imitator learning from a mentor by reversing their roles when considering how the observed state transition is influenced by their joint action.
In this and more general settings, learning typically requires great care, since agents learning in a nonstationary environment may not converge (say, to equilibrium). Again, imitation techniques offer certain advantages: for instance, mentor expertise can suggest means of coordinating with other agents (e.g., by providing a focal point for equilibrium selection, or by making clear a specific convention such as always "passing to the right" to avoid collision). Other challenges and opportunities present themselves when imitation is used in multiagent settings. For example, in competitive or educational domains, agents not only have to choose actions that maximize information from exploration and returns from exploitation; they must also reason about how their actions communicate information to other agents. In a competitive setting, one agent may wish to disguise its intentions, while in the context of teaching, a mentor may wish to choose actions whose purpose is abundantly clear. These considerations must become part of any action selection process.

7.3 Partially Observable Domains

The extension of this model to partially observable domains is critical, since it is unrealistic in many settings to suppose that a learner can constantly monitor the activities of a mentor. The central idea of implicit imitation is to extract model information from observations of the mentor, rather than duplicating mentor behavior. This means that the mentor's internal belief state and policy are not (directly) relevant to the learner. We take a somewhat behaviorist stance and concern ourselves only with what the mentor's observed behaviors tell us about the possibilities inherent in the environment.
The observer does have to keep a belief state about the mentor's current state, but this can be done using the same estimated world model the observer uses to update its own belief state. Preliminary investigation of such a model suggests that dealing with partial observability is viable. We have derived update rules for augmented partially observable updates. These updates are based on a Bayesian formulation of implicit imitation which is, in turn, based on Bayesian RL (Dearden et al., 1999). In fully observable contexts, we have seen that more effective exploration using mentor observations is possible when this Bayesian model of imitation is used (Price & Boutilier, 2003). The extension of this model to cases where the mentor's state is partially observable is reasonably straightforward. We anticipate that updates performed using a belief state about the mentor's state and action will help to alleviate the fracture that could be caused by incomplete observation of behavior. More interesting is dealing with an additional factor in the usual exploration-exploitation tradeoff: determining whether it is worthwhile to take actions that render the mentor "more visible" (e.g., ensuring the mentor remains in view so that this source of information remains available while learning).

7.4 Continuous and Model-Free Learning

In many realistic domains, continuous attributes and large state and action spaces prohibit the use of explicit table-based representations. Reinforcement learning in these domains is typically modified to make use of function approximators to estimate the Q-function at points where no direct evidence has been received.
Two important approaches are parameter-based models (e.g., neural networks) (Bertsekas & Tsitsiklis, 1996) and memory-based approaches (Atkeson, Moore, & Schaal, 1997). In both these approaches, model-free learning is generally employed. That is, the agent keeps a value function but uses the environment as an implicit model to perform backups using the sampling distribution provided by environment observations. One straightforward approach to casting implicit imitation in a continuous setting would employ a model-free learning paradigm (Watkins & Dayan, 1992). First, recall the augmented Bellman backup used in implicit imitation:

V(s) = R_o(s) + \gamma \max \left\{ \max_{a \in A_o} \sum_{t \in S} \Pr_o(s, a, t) V(t), \; \sum_{t \in S} \Pr_m(s, t) V(t) \right\}   (14)

When we examine the augmented backup equation, we see that it can be converted to a model-free form in much the same way as the ordinary Bellman backup. We use a standard Q-function with observer actions, but we add one additional action which corresponds to the action a_m taken by the mentor.22 Now imagine that the observer was in state s_o, took action a_o and ended up in state s'_o. At the same time, the mentor made the transition from state s_m to s'_m. We can then write:

Q(s_o, a_o) = (1 - \alpha) Q(s_o, a_o) + \alpha \left( R_o(s_o, a_o) + \gamma \max \left\{ \max_{a' \in A_o} Q(s'_o, a'), \; Q(s'_o, a_m) \right\} \right)

Q(s_m, a_m) = (1 - \alpha) Q(s_m, a_m) + \alpha \left( R_o(s_m, a_m) + \gamma \max \left\{ \max_{a' \in A_o} Q(s'_m, a'), \; Q(s'_m, a_m) \right\} \right)   (15)

As discussed earlier, the relative quality of mentor and observer estimates of the Q-function at specific states may vary. Again, in order to avoid having inaccurate prior beliefs about the mentor's action models bias exploration, we need to employ a confidence measure to decide when to apply these augmented equations.
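A single tabular backup in the spirit of Eq. (15) can be sketched as follows; the dictionary layout of `Q` and the toy values are illustrative assumptions (a continuous implementation would replace the table with a function approximator).

```python
def augmented_q_update(Q, s, a, r, s_next, actions, a_m, alpha, gamma):
    """One model-free augmented backup (cf. Eq. 15): the backed-up value
    of the successor state is the max over the observer's own actions
    and the distinguished mentor action a_m.  Q is a dict keyed by
    (state, action)."""
    best_next = max(max(Q[(s_next, a2)] for a2 in actions),
                    Q[(s_next, a_m)])
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

# Mentor observations have made the mentor action look good in state 1,
# so the backup from state 0 is pulled up by Q[(1, 'am')].
Q = {(0, 'a0'): 0.0, (0, 'am'): 0.0, (1, 'a0'): 0.0, (1, 'am'): 10.0}
augmented_q_update(Q, s=0, a='a0', r=1.0, s_next=1,
                   actions=['a0'], a_m='am', alpha=0.5, gamma=0.9)
```

Note that, as in the text, maintaining the extra entry Q[(·, a_m)] does not require the observer to know which of its own actions corresponds to a_m.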
We feel the most natural setting for these kinds of tests is in the memory-based approaches to function approximation. Memory-based approaches, such as locally-weighted regression (Atkeson et al., 1997), not only provide estimates for functions at points previously unvisited, they also maintain the evidence set used to generate these estimates. We note that the implicit bias of memory-based approaches assumes smoothness between points unless additional data proves otherwise. On the basis of this bias, we propose to compare the average squared distance of the query from the exemplars used in the estimate of the mentor's Q-value to the average squared distance from the query to the exemplars used in the observer-based estimate, in order to heuristically decide which agent has the more reliable Q-value. The approach suggested here does not benefit from prioritized sweeping. Prioritized sweeping has, however, been adapted to continuous settings (Forbes & Andre, 2000). We feel a reasonably efficient technique could be made to work.

22. This doesn't imply the observer knows which of its actions corresponds to a_m.

8. Related Work

Research into imitation spans a broad range of dimensions, from ethological studies, to abstract algebraic formulations, to industrial control algorithms. As these fields have cross-fertilized and informed each other, we have come to stronger conceptual definitions and a better understanding of the limits and capabilities of imitation. Many computational models have been proposed to exploit specialized niches in a variety of control paradigms, and imitation techniques have been applied to a variety of real-world control problems.

The conceptual foundations of imitation have been clarified by work on natural imitation.
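The exemplar-distance confidence test proposed in Section 7.4 can be sketched concretely. This is a hypothetical illustration; the function name and plain exemplar sets are our assumptions, and a full implementation would use the same weighting as the underlying locally-weighted regressor.

```python
import numpy as np

def more_reliable_estimate(query, mentor_exemplars, observer_exemplars):
    """Heuristic from Section 7.4: trust whichever Q-estimate was formed
    from exemplars closer (average squared distance) to the query point,
    exploiting the smoothness bias of memory-based approximators."""
    d_m = np.mean([np.sum((query - np.asarray(x)) ** 2) for x in mentor_exemplars])
    d_o = np.mean([np.sum((query - np.asarray(x)) ** 2) for x in observer_exemplars])
    return "mentor" if d_m < d_o else "observer"
```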
From work on apes (Russon & Galdikas, 1993), octopi (Fiorito & Scotto, 1992), and other animals, we know that socially facilitated learning is widespread throughout the animal kingdom. A number of researchers have pointed out, however, that social facilitation can take many forms (Conte, 2000; Noble & Todd, 1999). For instance, a mentor's attention to an object can draw an observer's attention to it and thereby lead the observer to manipulate the object independently of the model provided by the mentor. "True imitation" is therefore typically defined in a more restrictive fashion. Visalberghi and Fragazy (1990) cite Mitchell's definition:

1. something C (the copy of the behavior) is produced by an organism;
2. where C is similar to something else M (the model behavior);
3. observation of M is necessary for the production of C (above baseline levels of C occurring spontaneously);
4. C is designed to be similar to M;
5. the behavior C must be a novel behavior not already organized in that precise way in the organism's repertoire.

This definition perhaps presupposes a cognitive stance towards imitation in which an agent explicitly reasons about the behaviors of other agents and how these behaviors relate to its own action capabilities and goals. Imitation can be further analyzed in terms of the type of correspondence demonstrated by the mentor's behavior and the observer's acquired behavior (Nehaniv & Dautenhahn, 1998; Byrne & Russon, 1998). Correspondence types are distinguished by level. At the action level, there is a correspondence between actions. At the program level, the actions may be completely different but correspondence may be found between subgoals.
At the effect level, the agent plans a set of actions that achieve the same effect as the demonstrated behavior, but there is no direct correspondence between subcomponents of the observer's actions and the mentor's actions. The term abstract imitation has been proposed for the case where agents imitate behaviors which come from imitating the mental state of other agents (Demiris & Hayes, 1997).

The study of specific computational models of imitation has yielded insights into the nature of the observer-mentor relationship and how it affects the acquisition of behaviors by observers. For instance, in the related field of behavioral cloning, it has been observed that mentors that implement conservative policies generally yield more reliable clones (Urbancic & Bratko, 1994). Highly-trained mentors following an optimal policy with small coverage of the state space yield less reliable clones than those that make more mistakes (Sammut et al., 1992). For partially observable problems, learning from perfect oracles can be disastrous, as they may choose policies based on perceptions not available to the observer. The observer is therefore incorrectly biased away from less risky policies that do not require the additional perceptual capabilities (Scheffer, Greiner, & Darken, 1997). Finally, it has been observed that successful clones would often outperform the original mentor due to the "clean-up effect" (Sammut et al., 1992).

One of the original goals of behavioral cloning (Michie, 1993) was to extract knowledge from humans to speed up the design of controllers. For the extracted knowledge to be useful, it has been argued that rule-based systems offer the best chance of intelligibility (van Lent & Laird, 1999). It has become clear, however, that symbolic representations are not a complete answer. Representational capacity is also an issue.
Humans often organize control tasks by time, which is typically lacking in state- and perception-based approaches to control. Humans also naturally break tasks down into independent components and subgoals (Urbancic & Bratko, 1994). Studies have also demonstrated that humans will give verbal descriptions of their control policies which do not match their actual actions (Urbancic & Bratko, 1994). The potential for saving time in acquisition has been borne out by one study which explicitly compared the time to extract rules with the time required to program a controller (van Lent & Laird, 1999).

In addition to what has traditionally been considered imitation, an agent may also face the problem of "learning to imitate," or finding a correspondence between the actions and states of the observer and mentor (Nehaniv & Dautenhahn, 1998). A fully credible approach to learning by observation in the absence of communication protocols will have to deal with this issue.

The theoretical developments in imitation research have been accompanied by a number of practical implementations. These implementations take advantage of properties of different control paradigms to demonstrate various aspects of imitation. Early behavioral cloning research took advantage of supervised learning techniques such as decision trees (Sammut et al., 1992). The decision tree was used to learn how a human operator mapped perceptions to actions. Perceptions were encoded as discrete values. A time delay was inserted in order to synchronize perceptions with the actions they trigger. Learning apprentice systems (Mitchell et al., 1985) also attempted to extract useful knowledge by watching users, but the goal of apprentices is not to independently solve problems. Learning apprentices are closely related to programming-by-demonstration systems (Lieberman, 1993).
Later efforts used more sophisticated techniques to extract actions from visual perceptions and abstract these actions for future use (Kuniyoshi et al., 1994). Work on associative and recurrent learning models has allowed work in the area to be extended to learning of temporal sequences (Billard & Hayes, 1999). Associative learning has been used together with innate following behaviors to acquire navigation expertise from other agents (Billard & Hayes, 1997).

A related but slightly different form of imitation has been studied in the multi-agent reinforcement learning community. An early precursor to imitation can be found in work on sharing of perceptions between agents (Tan, 1993). Closer to imitation is the idea of replaying the perceptions and actions of one agent for a second agent (Lin, 1991; Whitehead, 1991a). Here, the transfer is from one agent to another, in contrast to behavioral cloning's transfer from human to agent. The representation is also different. Reinforcement learning provides agents with the ability to reason about the effects of current actions on expected future utility, so agents can integrate their own knowledge with knowledge extracted from other agents by comparing the relative utility of the actions suggested by each knowledge source.

The "seeding" approaches are closely related. Trajectories recorded from human subjects are used to initialize a planner which subsequently optimizes the plan in order to account for differences between the human effector and the robotic effector (Atkeson & Schaal, 1997). This technique has been extended to handle the notion of subgoals within a task (Atkeson & Schaal, 1997). Subgoals are also addressed by others (Šuc & Bratko, 1997).
Our own work is based on the idea of an agent extracting a model from a mentor and using this model information to place bounds on the value of actions using its own reward function. Agents can therefore learn from mentors with reward functions different than their own. Another approach in this family is based on the assumption that the mentor is rational (i.e., follows an optimal policy), has the same reward function as the observer, and chooses from the same set of actions. Given these assumptions, we can conclude that the action chosen by a mentor in a particular state must have higher value to the mentor than the alternatives open to the mentor (Utgoff & Clouse, 1991), and therefore higher value to the observer than any alternative. The system of Utgoff and Clouse therefore iteratively adjusts the values of the actions until this constraint is satisfied in its model. A related approach uses the methodology of linear-quadratic control (Šuc & Bratko, 1997). First a model of the system is constructed. Then the inverse control problem is solved to find a cost matrix that would result in the observed controller behavior given an environment model. Recent work on inverse reinforcement learning takes a related approach to reconstructing reward functions from observed behavior (Ng & Russell, 2000). It is similar to the inversion of the quadratic control approach, but is formulated for discrete domains.

Several researchers have picked up on the idea of common representations for perceptual functions and action planning. One approach to using the same representation for perception and control is based on the PID controller model. The PID controller represents the behavior. Its output is compared with observed behaviors in order to select the action which is closest to the observed behavior (Demiris & Hayes, 1999).
Explicit motor action schemas have also been investigated in the dual role of perceptual and motor representations (Matarić, Williamson, Demiris, & Mohan, 1998).

Imitation techniques have been applied in a diverse collection of applications. Classical control applications include control systems for robot arms (Kuniyoshi et al., 1994; Friedrich, Munch, Dillmann, Bocionek, & Sassin, 1996), aeration plants (Scheffer et al., 1997), and container-loading cranes (Šuc & Bratko, 1997; Urbancic & Bratko, 1994). Imitation learning has also been applied to the acceleration of generic reinforcement learning (Lin, 1991; Whitehead, 1991a). Less traditional applications include transfer of musical style (Cañamero, Arcos, & de Mantaras, 1999) and the support of a social atmosphere (Billard, Hayes, & Dautenhahn, 1999; Breazeal, 1999; Scassellati, 1999). Imitation has also been investigated as a route to language acquisition and transmission (Billard et al., 1999; Oliphant, 1999).

9. Concluding Remarks

We have described a formal and principled approach to imitation called implicit imitation. For stochastic problems in which explicit forms of communication are not possible, the underlying model-based framework combined with model extraction provides an alternative to other imitation and learning-by-observation systems. Our new approach makes use of a model to compute the actions an imitator should take without requiring that the observer duplicate the mentor's actions exactly. We have shown implicit imitation to offer significant transfer capability on several test problems, where it proves to be robust in the face of noise, capable of integrating subskills from multiple mentors, and able to provide benefits that increase with the difficulty of the problem.
We have seen that feasibility testing extends implicit imitation in a principled manner to deal with situations where the homogeneous action assumption is invalid. Adding bridging capabilities preserves and extends the mentor's guidance in the presence of infeasible actions, whether due to differences in action capabilities or local differences in state spaces. Our approach also relates to the idea of "following," in the sense that the imitator uses local search in its model to repair discontinuities in its augmented value function before acting in the world. In the process of applying imitation to various domains, we have learned more about its properties. In particular, we have developed the fracture metric to characterize the effectiveness of a mentor for a given observer in a specific domain. We have also made considerable progress in extending imitation to new problem classes. The model we have developed is rather flexible and can be extended in several ways: for example, a Bayesian approach to imitation building on this work shows great potential (Price & Boutilier, 2003); and we have initial formulations of promising approaches to extending implicit imitation to multi-agent problems, partially observable domains, and domains in which the reward function is not specified a priori.

A number of challenges remain in the field of imitation. Bakker and Kuniyoshi (1996) describe a number of these. Among the more intriguing problems unique to imitation are: the evaluation of the expected payoff for observing a mentor; inferring useful state and reward mappings between the domains of mentors and those of observers; and repairing or locally searching in order to fit observed behaviors to an observer's own capabilities and goals.
We have also raised the possibility of agents attempting to reason about the information revealed by their actions, in addition to whatever concrete value the actions have for the agent.

Model-based reinforcement learning has been applied to numerous problems. Since implicit imitation can be added to model-based reinforcement learning with relatively little effort, we expect that it can be applied to many of the same problems. Its basis in the simple but elegant theory of Markov decision processes makes it easy to describe and analyze. Though we have focused on some simple examples designed to illustrate the different mechanisms required for implicit imitation, we expect that variations on our approach will provide interesting directions for future research.

Acknowledgments

Thanks to the anonymous referees for their suggestions and comments on earlier versions of this work, and to Michael Littman for editorial suggestions. Price was supported by NCE IRIS-III Project BAC. Boutilier was supported by NSERC Research Grant OGP0121843 and the NCE IRIS-III Project BAC. Some parts of this paper were presented in "Implicit Imitation in Reinforcement Learning," Proceedings of the Sixteenth International Conference on Machine Learning (ICML-99), Bled, Slovenia, pp. 325-334 (1999) and "Imitation and Reinforcement Learning in Agents with Heterogeneous Actions," Proceedings of the Fourteenth Biennial Conference of the Canadian Society for Computational Studies of Intelligence (AI 2001), Ottawa, pp. 111-120 (2001).

References

Alissandrakis, A., Nehaniv, C. L., & Dautenhahn, K. (2000). Learning how to do things with imitation. In Bauer, M., & Rich, C. (Eds.), AAAI Fall Symposium on Learning How to Do Things, pp. 1-6, Cape Cod, MA.

Atkeson, C. G., & Schaal, S. (1997). Robot learning from demonstration.
In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 12-20, Nashville, TN.

Atkeson, C. G., Moore, A. W., & Schaal, S. (1997). Locally weighted learning for control. Artificial Intelligence Review, 11(1-5), 75-113.

Bakker, P., & Kuniyoshi, Y. (1996). Robot see, robot do: An overview of robot imitation. In AISB96 Workshop on Learning in Robots and Animals, pp. 3-11, Brighton, UK.

Bellman, R. E. (1957). Dynamic Programming. Princeton University Press, Princeton.

Bertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs.

Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic Programming. Athena, Belmont, MA.

Billard, A., & Hayes, G. (1997). Learning to communicate through imitation in autonomous robots. In Proceedings of the Seventh International Conference on Artificial Neural Networks, pp. 763-68, Lausanne, Switzerland.

Billard, A., & Hayes, G. (1999). DRAMA, a connectionist architecture for control and learning in autonomous robots. Adaptive Behavior Journal, 7, 35-64.

Billard, A., Hayes, G., & Dautenhahn, K. (1999). Imitation skills as a means to enhance learning of a synthetic proto-language in an autonomous robot. In Proceedings of the AISB'99 Symposium on Imitation in Animals and Artifacts, pp. 88-95, Edinburgh.

Boutilier, C. (1999). Sequential optimality and coordination in multiagent systems. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pp. 478-485, Stockholm.

Boutilier, C., Dean, T., & Hanks, S. (1999). Decision theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11, 1-94.

Bowling, M., & Veloso, M. (2001). Rational and convergent learning in stochastic games.
In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pp. 1021-1026, Seattle.

Breazeal, C. (1999). Imitation as social exchange between humans and robots. In Proceedings of the AISB'99 Symposium on Imitation in Animals and Artifacts, pp. 96-104, Edinburgh.

Byrne, R. W., & Russon, A. E. (1998). Learning by imitation: A hierarchical approach. Behavioral and Brain Sciences, 21, 667-721.

Cañamero, D., Arcos, J. L., & de Mantaras, R. L. (1999). Imitating human performances to automatically generate expressive jazz ballads. In Proceedings of the AISB'99 Symposium on Imitation in Animals and Artifacts, pp. 115-20, Edinburgh.

Cassandra, A. R., Kaelbling, L. P., & Littman, M. L. (1994). Acting optimally in partially observable stochastic domains. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pp. 1023-1028, Seattle.

Conte, R. (2000). Intelligent social learning. In Proceedings of the AISB'00 Symposium on Starting from Society: the Applications of Social Analogies to Computational Systems, Birmingham.

Crites, R., & Barto, A. G. (1998). Elevator group control using multiple reinforcement learning agents. Machine Learning, 33(2-3), 235-62.

Dean, T., & Givan, R. (1997). Model minimization in Markov decision processes. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pp. 106-111, Providence.

Dearden, R., & Boutilier, C. (1997). Abstraction and approximate decision theoretic planning. Artificial Intelligence, 89, 219-283.

Dearden, R., Friedman, N., & Andre, D. (1999). Model-based Bayesian exploration. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 150-159, Stockholm.

DeGroot, M. H. (1975). Probability and Statistics. Addison-Wesley, Reading, MA.
Demiris, J., & Hayes, G. (1997). Do robots ape?. In Proceedings of the AAAI Fall Symposium on Socially Intelligent Agents, pp. 28-31, Cambridge, MA.

Demiris, J., & Hayes, G. (1999). Active and passive routes to imitation. In Proceedings of the AISB'99 Symposium on Imitation in Animals and Artifacts, pp. 81-87, Edinburgh.

Fiorito, G., & Scotto, P. (1992). Observational learning in Octopus vulgaris. Science, 256, 545-47.

Forbes, J., & Andre, D. (2000). Practical reinforcement learning in continuous domains. Tech. rep. UCB/CSD-00-1109, Computer Science Division, University of California, Berkeley.

Friedrich, H., Munch, S., Dillmann, R., Bocionek, S., & Sassin, M. (1996). Robot programming by demonstration (RPD): Supporting the induction by human interaction. Machine Learning, 23, 163-189.

Hartmanis, J., & Stearns, R. E. (1966). Algebraic Structure Theory of Sequential Machines. Prentice-Hall, Englewood Cliffs.

Hu, J., & Wellman, M. P. (1998). Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 242-250, Madison, WI.

Kaelbling, L. P. (1993). Learning in Embedded Systems. MIT Press, Cambridge, MA.

Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237-285.

Kearns, M., & Singh, S. (1998). Near-optimal reinforcement learning in polynomial time. In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 260-268, Madison, WI.

Kuniyoshi, Y., Inaba, M., & Inoue, H. (1994). Learning by watching: Extracting reusable task knowledge from visual observation of human performance. IEEE Transactions on Robotics and Automation, 10(6), 799-822.

Lee, D., & Yannakakis, M. (1992).
Online minimization of transition systems. In Proceedings of the 24th Annual ACM Symposium on the Theory of Computing (STOC-92), pp. 264-274, Victoria, BC.

Lieberman, H. (1993). Mondrian: A teachable graphical editor. In Cypher, A. (Ed.), Watch What I Do: Programming by Demonstration, pp. 340-358. MIT Press, Cambridge, MA.

Lin, L.-J. (1991). Self-improvement based on reinforcement learning, planning and teaching. Machine Learning: Proceedings of the Eighth International Workshop (ML91), 8, 323-27.

Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8, 293-321.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pp. 157-163, New Brunswick, NJ.

Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28, 47-66.

Mataric, M. J. (1998). Using communication to reduce locality in distributed multi-agent learning. Journal of Experimental and Theoretical Artificial Intelligence, 10(3), 357-369.

Matarić, M. J., Williamson, M., Demiris, J., & Mohan, A. (1998). Behaviour-based primitives for articulated control. In Pfeifer, R., Blumberg, B., Meyer, J.-A., & Wilson, S. W. (Eds.), Fifth International Conference on Simulation of Adaptive Behavior (SAB'98), pp. 165-170, Zurich. MIT Press.

Meuleau, N., & Bourgine, P. (1999). Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning, 32(2), 117-154.

Mi, J., & Sampson, A. R. (1993). A comparison of the Bonferroni and Scheffé bounds. Journal of Statistical Planning and Inference, 36, 101-105.

Michie, D. (1993). Knowledge, learning and machine intelligence. In Sterling, L. (Ed.), Intelligent Systems.
Plenum Press, New York.

Mitchell, T. M., Mahadevan, S., & Steinberg, L. (1985). LEAP: A learning apprentice for VLSI design. In Proceedings of the Ninth International Joint Conference on Artificial Intelligence, pp. 573-580, Los Altos, California. Morgan Kaufmann Publishers, Inc.

Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13(1), 103-30.

Myerson, R. B. (1991). Game Theory: Analysis of Conflict. Harvard University Press, Cambridge.

Nehaniv, C., & Dautenhahn, K. (1998). Mapping between dissimilar bodies: Affordances and the algebraic foundations of imitation. In Proceedings of the Seventh European Workshop on Learning Robots, pp. 64-72, Edinburgh.

Ng, A. Y., & Russell, S. (2000). Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 663-670, Stanford.

Noble, J., & Todd, P. M. (1999). Is it really imitation? A review of simple mechanisms in social information gathering. In Proceedings of the AISB'99 Symposium on Imitation in Animals and Artifacts, pp. 65-73, Edinburgh.

Oliphant, M. (1999). Cultural transmission of communications systems: Comparing observational and reinforcement learning models. In Proceedings of the AISB'99 Symposium on Imitation in Animals and Artifacts, pp. 47-54, Edinburgh.

Price, B., & Boutilier, C. (2003). A Bayesian approach to imitation in reinforcement learning. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco. To appear.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, Inc., New York.

Russon, A., & Galdikas, B. (1993). Imitation in free-ranging rehabilitant orangutans (Pongo pygmaeus).
Journal of Comparative Psychology, 107(2), 147-161.

Sammut, C., Hurst, S., Kedzier, D., & Michie, D. (1992). Learning to fly. In Proceedings of the Ninth International Conference on Machine Learning, pp. 385-393, Aberdeen, UK.

Scassellati, B. (1999). Knowing what to imitate and knowing when you succeed. In Proceedings of the AISB'99 Symposium on Imitation in Animals and Artifacts, pp. 105-113, Edinburgh.

Scheffer, T., Greiner, R., & Darken, C. (1997). Why experimentation can be better than perfect guidance. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 331-339, Nashville.

Seber, G. A. F. (1984). Multivariate Observations. Wiley, New York.

Shapley, L. S. (1953). Stochastic games. Proceedings of the National Academy of Sciences, 39, 327-332.

Singh, S. P., & Bertsekas, D. (1997). Reinforcement learning for dynamic channel allocation in cellular telephone systems. In Advances in Neural Information Processing Systems, pp. 974-980, Cambridge, MA. MIT Press.

Smallwood, R. D., & Sondik, E. J. (1973). The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21, 1071-1088.

Šuc, D., & Bratko, I. (1997). Skill reconstruction as induction of LQ controllers with subgoals. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, pp. 914-919, Nagoya.

Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3, 9-44.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.

Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. In ICML-93, pp. 330-37.

Urbancic, T., & Bratko, I. (1994). Reconstructing human skill with machine learning.
In Eleventh European Conference on Artificial Intelligence, pp. 498-502, Amsterdam.

Utgoff, P. E., & Clouse, J. A. (1991). Two kinds of training information for evaluation function learning. In Proceedings of the Ninth National Conference on Artificial Intelligence, pp. 596-600, Anaheim, CA.

van Lent, M., & Laird, J. (1999). Learning hierarchical performance knowledge by observation. In Proceedings of the Sixteenth International Conference on Machine Learning, pp. 229-238, Bled, Slovenia.

Visalberghi, E., & Fragazy, D. (1990). Do monkeys ape?. In Parker, S., & Gibson, K. (Eds.), Language and Intelligence in Monkeys and Apes, pp. 247-273. Cambridge University Press, Cambridge.

Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279-292.

Whitehead, S. D. (1991a). Complexity analysis of cooperative mechanisms in reinforcement learning. In Proceedings of the Ninth National Conference on Artificial Intelligence, pp. 607-613, Anaheim.

Whitehead, S. D. (1991b). Complexity and cooperation in Q-learning. In Machine Learning: Proceedings of the Eighth International Workshop (ML91), pp. 363-367.