Learning Geometrically-Constrained Hidden Markov Models for Robot Navigation: Bridging the Topological-Geometrical Gap
Authors: L. P. Kaelbling, H. Shatkay
Journal of Artificial Intelligence Research 16 (2002) 167-207. Submitted 3/01; published 3/02.

Hagit Shatkay (hagit.shatkay@celera.com), Informatics Research Group, Celera Genomics, Rockville, MD 20850
Leslie Pack Kaelbling (lpk@ai.mit.edu), Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139

"You will come to a place where the streets are not marked. Some windows are lighted but mostly they're darked. A place you could sprain both your elbow and chin! Do you dare to stay out? Do you dare to go in?... And if you go in, should you turn left or right... or right-and-three-quarters? or, maybe, not quite?... Simple it's not, I'm afraid you will find, for a mind-maker-upper to make up his mind."
Oh, the Places You'll Go!, Dr. Seuss.

Abstract

Hidden Markov models (HMMs) and partially observable Markov decision processes (POMDPs) provide useful tools for modeling dynamical systems. They are particularly useful for representing the topology of environments such as road networks and office buildings, which are typical for robot navigation and planning. The work presented here describes a formal framework for incorporating readily available odometric information and geometrical constraints into both the models and the algorithm that learns them. By taking advantage of such information, learning HMMs/POMDPs can be made to generate better solutions and require fewer iterations, while being robust in the face of data reduction. Experimental results, obtained from both simulated and real robot data, demonstrate the effectiveness of the approach.

1 Introduction

This work is concerned with robots that need to perform tasks in structured environments.
A robot moving in the environment suffers from two main limitations: its noisy sensors prevent it from confidently knowing where it is, while its noisy effectors prevent it from knowing with certainty where its actions will take it. We concentrate here on structured environments, which can in turn be characterized by two main properties: such environments consist of vast uneventful and uninteresting areas, and are interspersed with relatively few interesting positions or situations.

Consider for instance a robot delivering a bagel in an office building. The interesting situations are the doors and the intersections in the building hallways, as well as the various positions where the bagel might be with respect to the robot's arm (e.g., the robot is holding the bagel, puts it down, etc.). Most other aspects of the environment, such as the desk positions in the offices, are inconsequential for the bagel-delivery task. A natural way to represent the combination of such an environment and the robot's interactions with it is as a probabilistic automaton, in which states represent interesting situations, and edges between states represent the actions leading from one situation to another. Probability distributions over the transitions and over the possible observations the robot may perceive at each situation model the robot's noisy effectors and sensors, respectively. Such models are formally known as POMDP (partially observable Markov decision process) models, and have been proven useful for robot planning and acting under the inherent world uncertainty (Simmons & Koenig, 1995; Nourbakhsh, Powers, & Birchfield, 1995; Cassandra, Kaelbling, & Kurien, 1996).

© 2002 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.
Despite much work on using such models, the task of learning them directly and automatically from the data has not been widely addressed. Research concerning this immediate topic to date consists mostly of the work done by Simmons and Koenig (1996b). The assumption underlying their work was that a human provides a rather accurate topological model of the states and their connections, and the exact probability distributions are then learned on top of this model, using a version of the Baum-Welch algorithm (Rabiner, 1989). Another interesting approach to the acquisition of topological models is that of Thrun and Bücken (1996a, 1996b; Thrun, 1999), who focused on extracting deterministic topological maps from previously acquired geometrical grid-based maps, where the latter were learned directly from the data. Further discussion of related research on both the geometrical and the topological approaches, in their probabilistic and deterministic versions, is given in the next section.

The work reported here is the first successful attempt we are aware of to learn purely probabilistic-topological models, directly and completely from recorded data, without using previous human-provided or grid-based models. It is based on using weak geometric information, recorded by the robot, to help learn the topology of the environment, and represent it as a probabilistic model. Therefore, it directly bridges the historically perceived gap between topological and geometrical information, and addresses the claim presented in Thrun's work (1999) that the main shortcoming of the topological approach is its failure to utilize the inherent geometry of the learnt environment.

Most robots are equipped with wheel encoders that enable an odometer to record the change in the robot's position as it moves through the environment. This data is typically very noisy and inaccurate.
The floors in the environment are rarely smooth, the wheels of the robot are not always aligned and neither are the motors, and the mechanics are imperfect, resulting in slippage and drift. All these effects accumulate, and if we were to mark the initial position of the robot and try to estimate its current position based on summing a long sequence of odometric recordings, the resulting estimate would be incorrect. That is, the raw recorded odometric information is not an effective tool, in and of itself, for determining the absolute location of the robot in the environment. While our approach is not aimed at determining absolute locations, the idea underlying it is that this weak odometric information, despite its noise and inaccuracy, still provides geometrical cues that can help to distinguish between different states, as well as to identify revisitation of the same state. Hence, such information enhances the ability to learn topological models. However, the use of geometrical information requires careful treatment of geometrical constraints and directional data. We demonstrate how the existing models and algorithms can be extended to take advantage of the noisy odometric data and the geometrical constraints. The geometrical information is directly incorporated into the probabilistic topological framework, producing a significant improvement over the standard Baum-Welch algorithm, without the need for a human-provided model.
The rest of this paper is organized as follows: Section 2 provides a survey of previous work in the area of learning maps for robot navigation, and briefly refers to earlier work on learning automata; Section 3 presents the formal framework for this work; Section 4 presents the main aspects of our iterative learning algorithm, while Section 5 describes the strategies for selecting the initial point from which the iterative process begins; Section 6 presents experimental results obtained from both simulated and real robot data in traditionally hard-to-learn environments. The experiments demonstrate that our algorithm indeed converges to better models with fewer iterations than the standard Baum-Welch method, and is robust in the face of data reduction.

2 Approaches to Learning Maps and Models

The work presented here lies in the intersection between the theoretical area of learning computational models (in particular, learning automata from data sequences) and the applied area of map acquisition for robot navigation. We concentrate here on surveying the work in the latter area, pointing out the distinction between our approach and its predecessors. We briefly review some results from automata and computational learning theory. A more comprehensive review of theoretical results is given by Shatkay (1999).

2.1 Modeling Environments for Robot Navigation

In the context of maps and models for robot navigation, a distinction is usually made between two principal kinds of maps: geometric and topological. Geometric maps describe the environment as a collection of objects or occupied positions in space, and the geometric relationships among them.
The topological framework is less concerned with the geometrical positions, and models the world as a collection of states and their connectivity, that is, which states are reachable from each of the other states and what actions lead from one state to the next.

We draw an additional distinction, between world-centric (1) maps that provide an "objective" description of the environment independent of the agent using the map, and robot-centric models which capture the interaction of a particular "subjective" agent with the environment. When learning a map, the agent needs to take into account its own noisy sensors and actuators and try to obtain an objectively correct map that other agents could use as well. Similarly, other agents using the map need to compensate for their own limitations in order to assess their position according to the map. When learning a model that captures interaction, the agent acquiring the model is the one who is also using it. Hence, the noisy sensors and actuators specific to the agent are reflected in the model. A different model is likely to be needed by different agents. Most of the related work described below, especially within the geometrical framework, is centered around learning objective maps of the world rather than agent-specific models. We shall point out in this survey the work that is concerned with the latter kind of models. Our work focuses on acquiring purely topological models, and is less concerned with learning geometrical relationships between locations or objects, or objective maps, although geometrical relationships do serve as an aid in our acquisition process.

1. We thank Sebastian Thrun for the terminology.
The concept of a state used in this topological framework is more general than the concept of a geometrical location, since a state can include information such as the battery level, the arm position, etc. Such information, which is of great importance for planning, is non-geometrical in nature and therefore cannot be readily captured in a purely geometrical framework. The following sections provide a survey of work done both within the geometrical framework and within the topological framework, as well as combinations of the two approaches.

2.2 Geometric Maps

Geometric maps provide a description of the environment in terms of the objects placed in it and their positions. For example, grid-based maps are an instance of the geometric approach. In a grid-based map, the environment is modeled as a grid (an array), where each position in the grid can be either vacant or occupied by some object (binary values placed in the array). This approach can be further refined to reflect uncertainty about the world, by having grid cells contain occupancy probabilities rather than just binary values. A lot of work has been done on learning such grid-based maps for robot navigation through the use of sonar readings and their interpretation, by Moravec and Elfes and others (Moravec & Elfes, 1985; Moravec, 1988; Elfes, 1989; Asada, 1991). An underlying assumption when learning such maps is that the robot can tell (or find out) where it is on the grid when it obtains a sonar reading indicating an object, and therefore can place the object correctly on the grid. A similar localization assumption, requiring the robot to identify its geometrical location, underlies other geometric mapping techniques by Leonard et al. (1991), Smith et al. (1991), Thrun et al. (1998b) and Dissanayake et al. (2001), even when an explicit grid is not part of the model. Explicit localization can be hard to satisfy.
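The occupancy-grid idea described above can be made concrete with a small sketch (illustrative only; the class and the Bayesian odds update are our own simplification, not code from any of the cited systems). Each cell holds an occupancy probability rather than a binary value, and an independent sensor reading is fused into a cell multiplicatively in odds space.

```python
import numpy as np

class OccupancyGrid:
    """Minimal sketch of a probabilistic occupancy grid (hypothetical API)."""

    def __init__(self, width, height, prior=0.5):
        # Every cell starts at the prior occupancy probability.
        self.p = np.full((height, width), prior)

    def update(self, row, col, p_hit):
        """Fuse an independent reading suggesting occupancy with prob. p_hit."""
        odds = self.p[row, col] / (1.0 - self.p[row, col])
        odds *= p_hit / (1.0 - p_hit)          # multiply likelihood ratios
        self.p[row, col] = odds / (1.0 + odds)  # back to a probability

grid = OccupancyGrid(4, 4)
grid.update(1, 2, 0.9)   # a sonar reading indicating cell (1, 2) is occupied
```

Note that this sketch still embodies the localization assumption discussed above: the caller must already know which cell `(row, col)` the reading refers to.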
Leonard et al. (1991) and Smith et al. (1991) address this issue through the use of geometrical beacons to estimate the location of the robot. In what is known as the Kalman filter method, a Gaussian probability distribution is used to model the robot's possible current location, based on observations collected up to the current point (without allowing the refinement of previous position estimates based on later observations). Research in this area has recently been extended in two directions: Leonard and Feder (2000) partition the task of learning one large map into learning multiple smaller map-sections, thus addressing the issue of computational efficiency. Dissanayake et al. (2001) conduct a theoretical study of the approach and show its convergence properties. The latter may lead to computational efficiency by identifying the cases for which a steady-state solution can be readily obtained, accordingly bounding the number of steps required by the algorithms to reach a useful solution in these cases.

Work by Thrun et al. (1998a) uses a similar probabilistic approach for obtaining grid-based maps. This work is refined (Thrun et al., 1998b) to first learn the location of significant landmarks in the environment and then fill in the details of the complete geometrical grid, based on laser range scans. The latter work extends the approach of Smith et al., by using observations obtained both before and after a location has been visited, in order to derive a probability distribution over possible locations. To achieve this, the authors use a forward-backward procedure similar to the one used in the Baum-Welch algorithm (Rabiner, 1989), in order to determine possible locations from observed data. The approach resembles ours both in the use of the forward-backward estimation procedure, and in its probabilistic basis, aiming at obtaining a maximum likelihood map of the environment.
It still significantly differs from ours both in its initial assumptions and in its final results. The data assumed to be provided to the learner includes both the motion model and the perceptual model of the robot. These consist of transition and observation probabilities within the grid. Both of these components are learnt by our algorithm, although not in a grid context but in a coarser-grained, topological framework. The end result of their algorithm is a probabilistic grid-based map, while ours is a probabilistic topological model, as further explained in the next section.

In addition to being concerned only with locations, rather than with the richer notion of state, a fundamental drawback of geometrical maps is their fine granularity and high accuracy. Geometrical maps, particularly grid-based ones, tend to give an accurate and detailed picture of the environment. In cases where it is necessary for a robot to know its exact location in terms of metric coordinates, metric maps are indeed the best choice. However, many planning tasks do not require such fine granularity or accurate measurements, and are better facilitated through a more abstract representation of the world. For example, if a robot needs to deliver a bagel from office a to office b, all it needs to have is a map depicting the relative location of a with respect to b, the passageways between the two offices, and perhaps a few other landmarks to help it orient itself if it gets lost. If it has a reasonably well-operating low-level obstacle avoidance mechanism to help it bypass flower pots and chairs that it might encounter on its way, such objects do not need to be part of the environment map.
Just as a driver traveling between cities needs to know neither his longitude and latitude coordinates on the globe, nor the location of the specific houses along the way, the robot does not need to know its exact location within the building nor the exact location of various items in the environment, in order to get from one point to another. Hence, the effort of obtaining such detailed maps is not usually justified. In addition, the maps can be very large, which makes planning inefficient, even though planning is polynomial in the size of the map.

2.3 Topological Maps and Models

An alternative to the detailed geometric maps are the more abstract topological maps. Such maps specify the topology of important landmarks and situations (states), and routes or transitions (arcs) between them. They are concerned less with the physical location of landmarks, and more with topological relationships between situations. Typically, they are less complex and support much more efficient planning than metric maps. Topological maps are built on lower-level abstractions that allow the robot to move along arcs (perhaps by wall- or road-following), to recognize properties of locations, and to distinguish significant locations as states; they are flexible in allowing a more general notion of state, possibly including information about the non-geometrical aspects of the robot's situation.

There are two typical strategies for deriving topological maps: one is to learn the topological map directly; the other is to first learn a geometric map, then to derive a topological model from it through some process of analysis. A nice example of the second approach is provided by Thrun and Bücken (1996a, 1996b; Thrun, 1999), who use occupancy-grid techniques to build the initial map. This strategy is appropriate when the primary cues for decomposition and abstraction of the map are geometric.
However, in many cases, the nodes of a topological map are defined in terms of other sensory data (e.g., labels on a door or whether or not the robot is holding a bagel). Learning a geometric map first also relies on the odometric abilities of a robot; if they are weak and the space is large, it is very difficult to derive a consistent map.

In contrast, our work concentrates on learning a topological model directly, assuming that abstraction of the robot's perception and action abilities has already been done. Such abstractions were manually encoded into the lower level of our robot navigational software, as described in Section 6. Work by Pierce and Kuipers (1997) discusses an automatic method for extracting abstract states and features from raw perceptual information.

Kuipers and Byun (1991) provide a strategy for learning deterministic topological maps. It works well in domains in which most of the noise in the robot's perception and action is abstracted away, learning from single visits to nodes and traversals of arcs. A strong underlying assumption for these strategies, when building the map, is that the current state can be reliably identified based on local information, or based on distance traversed from the previous well-identified state. These methods are unable to handle situations in which long sequences of actions and observations are necessary to disambiguate the robot's state. Mataric (1990) provides an alternative approach for learning deterministic topological maps, represented as distributed graphs. The learning process again relies on the assumption that the current state can be distinguished from all other states based on local information which includes compass and sonar readings. Uncertainty is not modeled through probability distributions.
Instead, matching of current readings to already existing states is not required to be exact, and thresholds of tolerated error are set empirically. Another difference from the work presented here is that while we learn the complete probabilistic topology of the environment, in Mataric's work the overall topology of the graph is assumed in advance to be a linear list, and additional edges are added during the learning process. No probability distribution is associated with the edges, and a mechanism for choosing which edge to take is determined as part of the goal-seeking process, and is not part of the model itself.

Engelson and McDermott (1992) learn "diktiometric" maps (topological maps with metric relations between nodes) from experience. The uncertainty model they use is interval-based rather than probabilistic, and the learned representation is deterministic. Ad hoc routines handle problems resulting from failures of the uncertainty representation. We prefer to learn a combined model of the world and the robot's interaction with the world; this allows robust planning that takes into account the likelihood of error in sensing and action.

The work most closely related to ours is by Koenig and Simmons (1996b, 1996a), who learn POMDP models (stochastic topological models) of a robot hallway environment. They also recognize the difficulty of learning a good model without initial information; they solve the problem by using a human-provided topological map, together with further constraints on the structure of the model. A modified version of the Baum-Welch algorithm learns the parameters of the model. They also developed an incremental version of Baum-Welch that can be used on-line. Their models contain very weak metric information, representing hallways as chains of one-meter segments and allowing the learning algorithm to select the most probable chain length.
This method is effective, but results in large models whose size is proportional to the hallways' length, and strongly depends on the quality of the human-provided initial model.

2.4 Learning Automata from Data

Informally speaking, an automaton consists of a set of states and a set of transitions that lead from one state to another. In the context of this work, the automaton states correspond to the states of the modeled environments, and the transitions, to the state changes due to actions performed in the environment. Each transition of the automaton is tagged by a symbol from an input alphabet, Σ, corresponding to the action or the input to the system that caused the state transition. Classical automata theory (e.g., Hopcroft & Ullman, 1979) distinguishes between deterministic and non-deterministic automata. If, for each alphabet symbol σ ∈ Σ, there is a single edge tagged by it going out of each state, the automaton is deterministic. Otherwise, the transition between states is not uniquely determined by the input symbol, and the automaton is non-deterministic. If we augment each transition edge of a non-deterministic automaton with a probability of taking it given a certain input σ ∈ Σ, the resulting automaton is called probabilistic.

The basic problem of learning finite deterministic automata from given data can be roughly described as follows: Given a set of positive and a set of negative example strings, S and T respectively, over alphabet Σ, and a fixed number of states k, construct a minimal deterministic finite automaton with no more than k states that accepts S and does not accept T. This problem has been shown to be NP-complete (Gold, 1978). Despite the hardness, positive results have been shown possible under various special settings.
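The deterministic/probabilistic distinction above can be illustrated with a small sketch (state names, symbols, and probabilities are hypothetical): a deterministic automaton maps each (state, symbol) pair to exactly one successor, while a probabilistic automaton maps it to a distribution over successors.

```python
# Deterministic transitions: (state, symbol) -> unique next state.
det = {("s0", "a"): "s1", ("s1", "a"): "s0"}

# Probabilistic transitions: (state, symbol) -> distribution over next states.
prob = {("s0", "a"): {"s0": 0.2, "s1": 0.8},
        ("s1", "a"): {"s0": 1.0}}

def run_deterministic(delta, start, string):
    """Follow the unique transition for each input symbol."""
    state = start
    for sym in string:
        state = delta[(state, sym)]
    return state

def end_state_distribution(delta, start, string):
    """Propagate a distribution over states through the input string."""
    dist = {start: 1.0}
    for sym in string:
        new = {}
        for s, p in dist.items():
            for t, q in delta[(s, sym)].items():
                new[t] = new.get(t, 0.0) + p * q
        dist = new
    return dist

end = run_deterministic(det, "s0", "aa")
dist = end_state_distribution(prob, "s0", "aa")
```

In the deterministic case, reading "aa" from s0 lands deterministically back in s0; in the probabilistic case, the same input yields a distribution over {s0, s1}.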
Angluin (1987) showed that if an oracle can answer membership queries and provide counterexamples to conjectures about the automaton, there is a polynomial-time learning algorithm from positive and negative examples. Rivest and Schapire (1987, 1989) provide several effective methods that, under various settings, learn deterministic automata that are correct with high probability. While the above work deals with learning from noise-free data, Basye, Dean and Kaelbling (1995) presented several algorithms that, with high probability, learn input-output deterministic automata when the data observed by the learner is corrupted by various forms of noise. In all these cases, the learned automaton is deterministic rather than probabilistic.

The basic learning problem in the probabilistic context is to find an automaton λ that assigns the same distribution as the true one to data sequences, using training data S that was generated by the true automaton. Another form of a learning problem is that of finding a probabilistic automaton λ that assigns the maximum likelihood to the training data S; that is, an automaton that maximizes Pr(S | λ). Abe and Warmuth (1992) show that finding a probabilistic automaton with 2 states, even when a small error with respect to the true model is allowed with some probability (the probably approximately correct, or PAC, learning model), cannot be done in polynomial time with a polynomial number of examples, unless NP = RP. From their work arises the broadly accepted conjecture, which has not yet been proven, that learning hidden Markov models is hard even in the PAC sense. There are two ways to address this hardness: one is to restrict the class of probabilistic models learned, while the other is to learn unrestricted hidden Markov models with good practical results but with no PAC guarantees on the quality of the result. Work by Ron et al.
(1994, 1995, 1998) pursues the first approach, learning restricted classes of automata, namely, acyclic probabilistic finite automata and probabilistic finite suffix automata. Both classes are useful for various applications related to natural language processing, and can be learned in polynomial time within the PAC framework.

The second approach, which is the one predominantly taken in this work, is to learn a model that is a member of the complete, unrestricted class of hidden Markov models. Only weak guarantees exist about the goodness of the model, but the learning procedure may be directed to obtain practically good results. This approach is based on guessing an automaton (model), and using an iterative procedure to make the automaton fit better to the training data. One algorithm commonly used for this purpose is the Baum-Welch algorithm (Baum, Petrie, Soules, & Weiss, 1970), which is presented in detail by Rabiner (1989). The iterative updates of the model are based on gathering sufficient statistics from the data given the current automaton, and the update procedure is guaranteed to converge to a model that locally maximizes the likelihood function Pr(data | model). Since the maximum is local, the model might not be close enough to the true automaton by which the data was generated, and a challenging problem is to find ways to force the algorithm into converging to higher-likelihood maxima, or at least to make it converge faster, facilitating multiple guesses of initial models, thus raising the probability of converging to higher-likelihood maxima. Such an approach is the one taken in the work presented here.

We assume, throughout this paper, that the number of states in the model we are learning is known. This is not a very strong assumption since there are methods for learning the number of states.
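The multiple-initial-guess strategy mentioned above can be sketched as a simple random-restart loop (a generic scaffold of our own; `train` is a stand-in for any locally convergent fitting routine such as Baum-Welch, not the paper's algorithm):

```python
import numpy as np

def random_restarts(train, n_states, n_restarts=10, seed=0):
    """Run a locally convergent fitter from several random initial models
    and keep the highest-likelihood result.  `train(A0)` is a hypothetical
    routine returning (fitted_model, log_likelihood)."""
    rng = np.random.default_rng(seed)
    best_model, best_ll = None, -np.inf
    for _ in range(n_restarts):
        # Draw a random row-stochastic initial transition matrix.
        A0 = rng.random((n_states, n_states))
        A0 /= A0.sum(axis=1, keepdims=True)
        model, ll = train(A0)
        if ll > best_ll:
            best_model, best_ll = model, ll
    return best_model, best_ll

# Toy stand-in fitter: "likelihood" is just the first matrix entry.
def toy_train(A0):
    return A0, float(A0[0, 0])

model, ll = random_restarts(toy_train, n_states=3, n_restarts=5)
```

Because each restart only changes the starting point, the loop never hurts the final likelihood; it simply raises the chance of landing in a higher local maximum, at a linear cost in the number of restarts.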
Regularization methods for deciding on the number of states and other model parameters are discussed, for instance, in Vapnik's book (1995). We do not address this issue here.

The rest of the work describes our approach to learning topological models. We use noisy odometric information that is readily available in most robots. This geometrical information is typically not used by topological mapping methods. We demonstrate how a topological model and the algorithm used to learn it can be extended to directly incorporate this weak odometric information. We further show that by doing so, we can avoid the use of human-provided a priori models and still learn stochastic environment models efficiently and effectively.

3 Models and Assumptions

This section describes the formal framework for our work. It starts by introducing the classic hidden Markov model. The model is then extended to accommodate noisy odometric information in its most naïve form, ignoring information about the robot's heading and orientation, and later adapted to accommodate heading information.

We concentrate here on describing models and algorithms for learning HMMs, rather than POMDPs. This means that the robot has no decisions to make regarding its next action at every state; only one action can be executed at each state. In our experiments, a human operator gave the action command associated with each state to the robot when gathering the data. Note that the action is not necessarily the same one for every state; e.g., the robot is told to always turn right in state 1 and move forward at state 2. However, at each state only one action can be taken. The extension to complete POMDPs, which we have implemented, is through learning an HMM for each of the possible actions; it is straightforward although notationally more cumbersome, thus we limit the discussion here to HMMs.
3.1 HMMs: The Basics

A hidden Markov model consists of states, transitions, observations and probabilistic behavior, and is formally defined as a tuple λ = ⟨S, O, A, B, π⟩, satisfying the following conditions:

- S = {s_0, ..., s_{N-1}} is a finite set of N states.
- O = {o_0, ..., o_{M-1}} is a finite set of M possible observation values.
- A is a stochastic transition matrix, with A_{i,j} = Pr(q_{t+1} = s_j | q_t = s_i), where 0 ≤ i, j ≤ N-1 and q_t is the state at time t. For every state s_i, Σ_{j=0}^{N-1} A_{i,j} = 1. A_{i,j} holds the transition probability from state s_i to state s_j.
- B is a stochastic observation matrix, with B_{j,k} = Pr(v_t = o_k | q_t = s_j), where 0 ≤ j ≤ N-1, 0 ≤ k ≤ M-1, and v_t is the observation recorded at time t. For every state s_j, Σ_{k=0}^{M-1} B_{j,k} = 1. B_{j,k} holds the probability of observing o_k while being at state s_j.
- π is a stochastic initial distribution vector, with π_i = Pr(q_0 = s_i), 0 ≤ i ≤ N-1, and Σ_{i=0}^{N-1} π_i = 1. π_i holds the probability of being in state s_i at time 0, when starting to record observations.

This model corresponds to a world whose actual state at any given time t, q_t ∈ S, is hidden and not directly observable, but some observable aspects of the state, v_t ∈ O, are detected and recorded when the state is visited at time t. An agent moves from one hidden state to the next according to the probability distribution encoded in matrix A. The observed information in each state is governed by the probability matrix B. Although our work is concerned with discrete observations, the extension to continuous observations is straightforward and has been well addressed in work on hidden Markov models (Liporace, 1982; Juang, 1985). Simply stated, the problem of learning an HMM is that of "reverse engineering" a hidden Markov model for a stochastic system from the sampled data generated by the system.
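The generative reading of the tuple above can be sketched in a few lines (a minimal illustration with made-up parameter values, not code from this work): draw q_0 from π, then repeatedly emit an observation according to the q_t-th row of B and move according to the q_t-th row of A.

```python
import numpy as np

def sample_hmm(A, B, pi, T, seed=0):
    """Sample T steps of hidden states and observations from an HMM."""
    rng = np.random.default_rng(seed)
    N, M = B.shape
    q = rng.choice(N, p=pi)                 # q_0 ~ pi
    states, obs = [], []
    for _ in range(T):
        states.append(int(q))
        obs.append(int(rng.choice(M, p=B[q])))  # v_t ~ B[q_t, .]
        q = rng.choice(N, p=A[q])               # q_{t+1} ~ A[q_t, .]
    return states, obs

# Illustrative 2-state, 2-observation model (rows sum to 1).
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],
              [0.1, 0.9]])
pi = np.array([1.0, 0.0])   # always start in state 0

states, obs = sample_hmm(A, B, pi, T=5)
```

The "hidden" part of the model is exactly the fact that a learner sees only `obs`, never `states`.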
We formalize the learning task in Section 4.1. The next section extends HMMs to account for geometric information.

3.2 Adding Odometry to Hidden Markov Models

The world is composed of a finite set of states. There is a fundamental distinction in our framework between the term state and the term location. The state of the robot does not directly correspond to its location. A state may include other information, such as the robot's battery level or its orientation at that location. A robot standing in the entrance to office 101 facing right is in a different state than a robot standing in the same place facing left; similarly, a robot standing with a bagel in its arm is in a different state from the same robot being in the same position without the bagel. The dynamics of the world are described by state-transition distributions that specify the probability of making transitions from one state to the next as a result of a certain action. There is a finite set of observations that can be perceived in each state; the relative frequency of each observation is described by a probability distribution and depends only on the current state. In our model, observations are multi-dimensional; an observation is a vector of values, each chosen from a finite domain. That is, we factorize the observation associated with each state into several components. For instance, as demonstrated in Section 6.1, we view the observation recorded by the robot when standing in an office environment as consisting of three components, corresponding to the three cardinal directions: front, left and right. In this example, the observation vector is thus 3-dimensional. It is assumed that the vector's components are conditionally independent, given the state. In addition to the above components, each state is assumed to be associated with a position in a metric space.
Whenever a state transition is made, the robot records an odometry vector, which estimates the position of the current state relative to the previous one. For the time being we assume that the odometry vector consists of readings along the x and y coordinates of a global coordinate system, and that these readings are corrupted with independent normal noise. The latter independence assumption is not a strict one, and can be relaxed by introducing a complete covariance matrix, although we have not done so in this work. In Section 3.3 we extend the odometry vector to include information about the heading of the robot, and drop the global coordinate framework. Note that the odometric relationship characterizes a transition rather than a state and, as described below, receives a different treatment than the observations that are associated with states. There are two important assumptions underlying our treatment of odometric relations between states: first, that there is an inherent "true" odometric relation between the positions of every two states in the world; second, that when the robot moves from one state to the next, there is normal, 0-mean noise around the correct expected odometric reading along each odometric dimension. This noise reflects two kinds of odometric error sources:

- The lack of precision in the discretization of the real world into states (e.g., there is a rather large area in which the robot can stand which can be regarded as "the doorway of the AI lab").
- The lack of precision of the odometric measures recorded by the robot, due to slippage, friction, misalignment of the wheels, imprecision of the measuring instruments, etc.
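This noise model is easy to illustrate with a small simulation (a hypothetical helper, not from the paper; the true relation, noise level and sample size are arbitrary choices):

```python
import random

def noisy_odometry(true_dx, true_dy, sigma, n, seed=0):
    """Simulate n odometry vectors for one transition: the true odometric
    relation plus independent 0-mean normal noise on each component."""
    rng = random.Random(seed)
    return [(true_dx + rng.gauss(0.0, sigma), true_dy + rng.gauss(0.0, sigma))
            for _ in range(n)]

# Hypothetical transition whose true relation is <5, 0> meters, noise sigma = 0.3.
readings = noisy_odometry(5.0, 0.0, 0.3, 1000)
mean_dx = sum(r[0] for r in readings) / len(readings)
mean_dy = sum(r[1] for r in readings) / len(readings)
```

With 0-mean noise, the sample mean of repeated readings converges to the true relation, which is exactly what the relation matrix introduced below stores as the mean.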
To formally introduce odometric information into the hidden Markov model framework, we define an augmented hidden Markov model as a tuple λ = ⟨S, O, A, B, R, π⟩, where:

- S = {s_0, ..., s_{N-1}} is a finite set of N states.
- O = \prod_{i=1}^{l} O_i is a finite set of observation vectors of length l. The i-th element of an observation vector is chosen from the finite set O_i.
- A is a stochastic transition matrix, with A_{i,j} = Pr(q_{t+1} = s_j | q_t = s_i), 0 ≤ i, j ≤ N-1, where q_t is the state at time t. For every state s_i, \sum_{j=0}^{N-1} A_{i,j} = 1. A_{i,j} holds the transition probability from state s_i to state s_j.
- B is an array of l stochastic observation matrices, with B_{i,j,k} = Pr(V_t[i] = o_k | q_t = s_j), 1 ≤ i ≤ l, 0 ≤ j ≤ N-1, o_k ∈ O_i, where V_t is the observation vector at time t and V_t[i] is its i-th component. B_{i,j,k} holds the probability of observing o_k along the i-th component of the observation vector, while being in state s_j.
- R is a relation matrix, specifying for each pair of states, s_i and s_j, the mean and variance of the D-dimensional² odometric relation between them. μ(R_{i,j}[m]) is the mean of the m-th component of the relation between s_i and s_j, and σ²(R_{i,j}[m]) the variance. Furthermore, R is geometrically consistent: for each component m, the relation μ_m(a, b) ≝ μ(R_{a,b}[m]) must be a directed metric, satisfying the following properties for all states a, b, and c:

  μ_m(a, a) = 0;
  μ_m(a, b) = -μ_m(b, a)  (anti-symmetry);
  μ_m(a, c) = μ_m(a, b) + μ_m(b, c)  (additivity).

This representation of odometric relations reflects the two assumptions, previously stated, regarding the nature of the odometric information. The "true" odometric relation between the positions of every two states is represented as the mean.

[2. For the time being we consider D to be 2, corresponding to (x, y) readings.]
The noise around the correct expected odometric relation, accounting for both the lack of precision in the real-world discretization and the inaccuracy in measurement, is represented through the variance.

- π is a stochastic initial probability vector describing the distribution of the initial state. For simplicity it is assumed here to be of the form ⟨0, ..., 0, 1, 0, ..., 0⟩, implying that there is one designated initial state, s_i, in which the robot is always started.

This model extends the standard hidden Markov model described in Section 3.1 in two ways:

- It facilitates observations that are factored into components and represented as vectors. These components are assumed to be conditionally independent of each other given the state. Such factorization, together with the conditional independence assumption, allows for a simple calculation of the probability of the complete observation vector from the probabilities of its components. It therefore results in fewer probabilistic parameters in the learnt model than if we were to view each observation vector, consisting of a possible combination of component values, as a single "atomic" observation.
- It introduces the odometric relation matrix R and constraints over its components. Using R and the constraints over it, as explained in Section 4, has proven useful for learning the other model parameters, as demonstrated in Section 6.

3.3 Handling Directional Data

We further extend the model to accommodate directional changes in addition to the positional changes. There are two issues stemming from directional changes while moving in an environment: the need for non-traditional distributions to model directional changes, and the need to correct for the cumulative rotational error, which severely interferes with location estimation within a global coordinate framework.
A detailed discussion of these two problems and their solution is given in an earlier paper by the authors (Shatkay & Kaelbling, 1998). For the sake of completeness, we briefly review these two issues here.

3.3.1 Circular Distributions

The robot's change in direction as it moves through the environment is expressed in terms of the angular change with respect to its original heading. Since angular measures are inherently circular, treating them as "normally distributed" and using the standard procedures for obtaining sufficient statistics from the data is not adequate. As a trivial example, if we were to average the two angular readings, 173° and -179°, using a simple average, we would obtain the angle -3°, which is far from the intuitive 177°, as illustrated in Figure 1.

[Figure 1: Simple average of two angles, depicted as vectors to the unit circle. The average angle is formed by the dashed vector.]
[Figure 2: Directional data represented as angles and as vectors on the unit circle.]

To address the circularity issue, we use the von Mises distribution, which is a circular version of the normal distribution, to model the change in heading between two states, as explained below. A collection of changes in heading within a two-dimensional space can be represented in terms of either Cartesian or polar coordinates. Using a Cartesian system, n changes in heading can be recorded as a sequence of 2-dimensional vectors, (⟨x_1, y_1⟩, ..., ⟨x_n, y_n⟩), on the unit circle, as shown in Figure 2. The same changes can also be represented as the corresponding angles between the radii from the center of the unit circle and the X axis, (θ_1, ..., θ_n), respectively.
The relationship between the two representations is: x_i = cos(θ_i), y_i = sin(θ_i), for 1 ≤ i ≤ n. The vector mean of the n points, ⟨\bar{x}, \bar{y}⟩, is calculated as:

\bar{x} = \frac{\sum_{i=1}^{n} \cos(\theta_i)}{n}, \qquad \bar{y} = \frac{\sum_{i=1}^{n} \sin(\theta_i)}{n}.   (1)

Using polar coordinates, we can express the mean vector in terms of an angle, μ, and a length, a, where (except for the case \bar{x} = \bar{y} = 0):

\mu = \arctan\!\left(\frac{\bar{y}}{\bar{x}}\right), \qquad a = (\bar{x}^2 + \bar{y}^2)^{1/2}.

The angle μ is the mean angle, while the length a is a measure (between 0 and 1) of how concentrated the sample angles are around μ. The closer a is to 1, the more concentrated the sample is around the mean, which corresponds to a smaller sample variance. Intuitively, a satisfactory circular version of the normal distribution would have a mean for which the maximum likelihood estimate is the average angle as calculated above. In a way analogous to Gauss' derivation of the normal distribution, von Mises developed such a circular version (Gumbel, Greenwood, & Durand, 1953; Mardia, 1972), which is defined as follows:

Definition: A circular random variable θ, 0 ≤ θ < 2π, is said to have the von Mises distribution with parameters μ and κ, where 0 ≤ μ < 2π and κ > 0, if its probability density function is:

f_{\mu,\kappa}(\theta) = \frac{1}{2\pi I_0(\kappa)}\, e^{\kappa \cos(\theta - \mu)},

where I_0(κ) is the modified Bessel function of the first kind and order 0:

I_0(\kappa) = \sum_{r=0}^{\infty} \frac{1}{(r!)^2} \left(\frac{\kappa}{2}\right)^{2r}.   (2)

The parameters μ and κ correspond to the distribution's mean and concentration, respectively. While other circular-normal distributions do exist, the von Mises has the desirable estimation procedure alluded to earlier: given a set of heading samples, angles θ_1, ..., θ_n, from a von Mises distribution, the maximum likelihood estimate for μ is:

\bar{\mu} = \arctan\!\left(\frac{\bar{y}}{\bar{x}}\right),

where \bar{y}, \bar{x} are as defined in Equation 1.
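The vector-mean computation of Equation 1 can be sketched directly (a hypothetical helper, not code from the paper; atan2 is used in place of arctan so that the quadrant is resolved automatically):

```python
import math

def circular_mean(angles_deg):
    """Vector mean of a set of angles.
    Returns (mean angle in degrees, concentration a in [0, 1])."""
    xs = [math.cos(math.radians(t)) for t in angles_deg]
    ys = [math.sin(math.radians(t)) for t in angles_deg]
    xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
    mu = math.degrees(math.atan2(ybar, xbar))   # mean angle
    a = math.hypot(xbar, ybar)                  # closer to 1 => more concentrated sample
    return mu, a

mu, a = circular_mean([173.0, -179.0])
# mu is near 177 degrees, not the naive arithmetic average of -3 degrees
```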
The maximum likelihood estimate for the concentration parameter, κ, is the \bar{\kappa} that satisfies:

\frac{I_1(\bar{\kappa})}{I_0(\bar{\kappa})} = \max\!\left[\frac{1}{n}\sum_{i=1}^{n} \cos(\theta_i - \bar{\mu}),\; 0\right],

where I_1 is the modified Bessel function of the first kind and order 1:

I_1(\kappa) = \sum_{r=0}^{\infty} \frac{1}{r!\,(r+1)!} \left(\frac{\kappa}{2}\right)^{2r+1}.   (3)

Further information about the estimation procedure is beyond the scope of this paper and can be found elsewhere (Gumbel et al., 1953; Mardia, 1972). To conclude, we assume that the change in heading is von Mises-distributed around a mean θ with concentration parameter κ. This assumption is reflected in the model learning procedures, as explained later in Section 4.2.3. The change in heading ⟨θ(a, b), κ(a, b)⟩ between each pair of states (a, b) completes the set of parameters included in the relation matrix R, which was introduced earlier in Section 3.2.

3.3.2 Cumulative Rotational Error

We tend to think about an environment as consisting of landmarks fixed in a global coordinate system and corridors or transitions connecting these landmarks. This idea underlies the typical maps constructed and used in everyday life. However, this view of the environment may be problematic when robots are involved. Conceptually, a robot operates at two levels: the abstract level, in which it centers itself in corridors, follows walls and avoids obstacles, and the physical level, in which motors turn the wheels as the robot moves.
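Because the ratio I_1(κ)/I_0(κ) is increasing in κ, the maximum likelihood κ can be recovered numerically from Equations 2 and 3; the sketch below (series truncation and bisection bounds are our own illustrative choices, not from the paper) evaluates both series and inverts the ratio by bisection:

```python
import math

def bessel_I(order, kappa, terms=80):
    """Modified Bessel function of the first kind, order 0 or 1,
    evaluated from its power series (Equations 2 and 3)."""
    s = 0.0
    for r in range(terms):
        if order == 0:
            s += (kappa / 2.0) ** (2 * r) / math.factorial(r) ** 2
        else:
            s += (kappa / 2.0) ** (2 * r + 1) / (math.factorial(r) * math.factorial(r + 1))
    return s

def kappa_mle(rbar, lo=1e-6, hi=50.0, iters=100):
    """Solve I1(k)/I0(k) = rbar by bisection; the ratio is increasing in k."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if bessel_I(1, mid) / bessel_I(0, mid) < rbar:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

In practice (as noted in Section 4.2.3) the same inversion can be done with precomputed lookup tables instead of a root search.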
At the physical level, many inaccuracies can manifest themselves: wheels can be unaligned with each other, resulting in a drift to the right or to the left; one motor can be slightly faster than the other, resulting in similar drifts; an obstacle under one of the wheels can cause the robot to rotate around itself slightly; or uneven floors may cause the robot to slip in a certain direction. In addition, the measuring instrumentation for odometric information may not be accurate in and of itself.

[Figure 3: A robot moving along the solid arrow, while correcting for drift in the direction of the dashed arrow. The dotted arrow marks its recorded change in position.]

At the abstract level, corrective actions are constantly executed to overcome the physical drift and drag. For example, if the left wheel is misaligned and drags the robot leftwards, a corrective action of moving to the right is constantly taken at the higher level to keep the robot centered in the corridor. The phenomena described above have a significant effect on the odometry recorded by the robot, if such data is interpreted with respect to one global framework. For example, consider the robot depicted in Figure 3. It drifts to the left when moving from one state to the next, and corrects for this by moving to the right in order to keep itself centered in the corridor. Let us assume that states are 5 meters apart along the center of the corridor, and that the center of the corridor is aligned with the Y axis of the global coordinate system. The robot steps back and forth in the corridor from one state to the next. Whenever the robot reaches a state, its odometry reading changes by ⟨Δx, Δy, Δθ⟩ along the ⟨X, Y, heading⟩ dimensions, respectively. As the robot proceeds, the deviation with respect to the X axis becomes more and more severe.
Thus, after going through several transitions, the odometric changes recorded between every pair of states, if taken with respect to a global coordinate system, become larger and larger. Similar problems of inconsistent odometric changes recorded between pairs of states can arise along any of the odometric dimensions. The problem is especially severe when such inconsistencies arise with respect to the heading, since this can lead to mistakenly switching movement along the X and the Y axes, as well as to confusion between forwards and backwards movement (when the deviation in the heading is around 90° or 180°, respectively). In early work (Shatkay & Kaelbling, 1997) we assumed perpendicularity of the corridors, which was taken advantage of while the robot collected the data. Odometric readings were recorded with respect to a global coordinate system, and the robot could re-align itself with the origin after each turn. A trajectory of odometry recorded under this perpendicularity assumption by our robot Ramona, along the x and y axes, is given in Figure 4. The sequence shown was recorded while the robot drove repeatedly around a loop of corridors. Further details about the data-gathering process are provided in Section 6. In contrast, Figure 5 shows a trajectory of another sequence of odometric readings recorded by Ramona, driving through the same corridors, without using the perpendicularity assumption. The data collected under the latter setting is subject to cumulative rotational error.

[Figure 4: Sequence gathered by Ramona, perpendicularity assumed.]
[Figure 5: Sequence gathered by Ramona, no perpendicularity assumed.]

Such data can be handled through state-relative coordinate systems (Shatkay & Kaelbling, 1998).
The latter implies that each state s_i has its own coordinate system, as shown in Figure 6: the origin is anchored at s_i, the Y axis is aligned with the robot's heading in the state (denoted by bold arrows in the figure), and the X axis is perpendicular to it. This is in contrast to a global coordinate system, which is anchored at the initial starting state. Within the global coordinate system, the relations recorded may vary greatly among multiple instances of the same transition between the same pair of states. By using the state-relative system, the recorded and learned relationship between each pair of states, ⟨s_i, s_j⟩, is reliable, despite the fact that it is based on multiple transitions recorded from s_i to s_j. Under state-relative coordinate systems, the geometric relation stored in R_{i,j} (which was introduced in Section 3.2) is expressed, for each pair of states s_i and s_j, with respect to the coordinate system associated with state s_i. Accordingly, the constraints imposed over the x and y components of the relation matrix must be specified with respect to the explicit coordinate system used, as explained below. Given a pair of states a and b, we denote by μ_{⟨x,y⟩}(a, b) the vector ⟨μ(R_{a,b}[x]), μ(R_{a,b}[y])⟩. Let us define T_{ab} to be the transformation that maps a point ⟨x_a, y_a⟩, represented with respect to the coordinate system of state a, to the same point represented with respect to the coordinate system of state b, ⟨x_b, y_b⟩. More explicitly, let \bar{θ}_{ab} be the mean change in heading from state a to state b.
Applying T_{ab} to a vector ⟨x_a, y_a⟩ results in the vector ⟨x_b, y_b⟩ as follows:

\begin{pmatrix} x_b \\ y_b \end{pmatrix} = T_{ab}\begin{pmatrix} x_a \\ y_a \end{pmatrix} = \begin{pmatrix} x_a \cos(\bar{\theta}_{ab}) - y_a \sin(\bar{\theta}_{ab}) \\ x_a \sin(\bar{\theta}_{ab}) + y_a \cos(\bar{\theta}_{ab}) \end{pmatrix}.

The consistency constraints within this framework must be restated as:

μ_{⟨x,y⟩}(a, a) = ⟨0, 0⟩;
μ_{⟨x,y⟩}(a, b) = -T_{ba}[μ_{⟨x,y⟩}(b, a)]  (anti-symmetry);
μ_{⟨x,y⟩}(a, c) = μ_{⟨x,y⟩}(a, b) + T_{ba}[μ_{⟨x,y⟩}(b, c)]  (additivity).

[Figure 6: A robot in state S_i faces in the Y-axis direction; the relation between S_i and S_j is with respect to S_i's coordinate system.]

These consistency constraints are the ones that need to be enforced by our learning algorithm, which constructs the HMM. It is important to note that the transformation T itself does not constitute a set of additional parameters that need to be learnt. Rather, it is calculated in terms of the heading-change parameter, θ, which is already an integral part of the relation matrix we defined in Sections 3.2 and 3.3.1. We have introduced the basic formal model that we use for representing environments and the robot's interaction with them. In the following section we state the learning problem and describe the basic algorithm for learning the model from data.

4 Learning HMMs with Odometric Information

This section formalizes the learning problem for HMMs, and discusses how odometric information is incorporated into the learning algorithm. An overview of the complete algorithm is provided in the Appendix of this paper.

4.1 The Learning Problem

The learning problem for hidden Markov models can be generally stated as follows: Given an experience sequence E, find a hidden Markov model that could have generated this sequence and is "useful" or "close to the original" according to some criterion.
An explicit, common statistical approach is to look for a model λ that maximizes the likelihood of the data sequence E given the model; formally stated, it maximizes Pr(E | λ). However, given the complicated landscape of typical likelihood functions in a multi-parameter domain, obtaining a maximum likelihood model is not feasible. All studied practical methods, and in particular the well-known Baum-Welch algorithm (Rabiner (1989) and references therein), can only guarantee a local-maximum likelihood model. Another way of evaluating the quality of a learned model is by comparing it to the true model. We note that stochastic models (such as HMMs) induce a probability distribution over all observation sequences of a given length. The Kullback-Leibler (Kullback & Leibler, 1951) divergence of a learned distribution from a true one is a commonly used measure for estimating how good a learned model is. Obtaining a model that minimizes this measure is a possible learning goal. The difficulty here is that in practice, when we learn a model from data, we do not have any "ground truth" model to compare the learned model with. Still, we can evaluate learning algorithms by measuring how well they perform on data obtained from known models. It is reasonable to expect that an algorithm that learns well from data generated from a model we do have will perform well on data generated from an unknown model, assuming that the models indeed form a suitable representation of the true generating process. We discuss the Kullback-Leibler (KL) divergence in more detail in Section 6.2, in the context of evaluating our experimental results.
To summarize, the learning problem as we address it in this work is that of obtaining a model by attempting to (locally) maximize the likelihood, while evaluating the results based on the KL-divergence with respect to the true underlying distribution, when such a distribution is available.

4.2 The Learning Algorithm

The learning algorithm starts from an initial model λ_0 and is given an experience sequence E; it returns a revised model \bar{λ}, which (locally) maximizes the likelihood Pr(E | λ). The experience sequence E is of length T; each element, E_t, for 0 ≤ t ≤ T-1, is a pair ⟨r_t, V_t⟩, where r_t is the observed relation vector along the x, y and θ dimensions between the states q_{t-1} and q_t, and V_t is the observation vector at time t. Our algorithm extends the standard Baum-Welch algorithm to deal with the relational information and the factored observation sets. The Baum-Welch algorithm is an expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977); it alternates between the E-step of computing the state-occupation and state-transition probabilities, γ and ξ, at each time in the sequence, given E and the current model λ, and the M-step of finding a new model, \bar{λ}, that maximizes Pr(E | \bar{λ}; γ, ξ), providing monotone convergence of the likelihood function Pr(E | λ) to a local maximum. However, our extension introduces an additional component, namely, the relation matrix R. It can be viewed as having two kinds of observations: state observations (as in the ordinary HMM, with the distinction that we observe integer vectors rather than integers) and transition observations (the odometry relations between states). The latter must satisfy geometrical constraints. Hence, an extension of the standard update formulae, as described below, is required.
4.2.1 State-Occupation Probabilities

Following Rabiner (1989), we first compute the forward (α) and backward (β) matrices. α_t(i) denotes the probability density value of observing E_0 through E_t and q_t = s_i, given λ; β_t(i) is the probability density of observing E_{t+1} through E_{T-1}, given q_t = s_i and λ. Formally:

α_t(i) = Pr(E_0, ..., E_t, q_t = s_i | λ);  β_t(i) = Pr(E_{t+1}, ..., E_{T-1} | q_t = s_i, λ).

When some of the measurements are continuous (as is the case with R), these matrices contain probability density values rather than probabilities. The forward procedure for calculating the α matrix is initialized with

α_0(i) = \begin{cases} b_{i,0} & \text{if } π_i = 1 \\ 0 & \text{otherwise}, \end{cases}

and continued for 0 < t ≤ T-1 with

α_t(j) = \sum_{i=0}^{N-1} α_{t-1}(i)\, A_{i,j}\, f(r_t \mid R_{i,j})\, b_{j,t}.   (4)

The expression f(r_t | R_{i,j}) denotes the density at point r_t according to the distribution represented by the means and variances in entry i, j of the relation matrix R, while b_{j,t} is the probability of observing vector V_t in state s_j; that is, b_{j,t} = \prod_{i=1}^{l} B_{i,j,V_t[i]}. The backward procedure for calculating the β matrix is initialized with β_{T-1}(j) = 1, and continued for 0 ≤ t < T-1 with

β_t(i) = \sum_{j=0}^{N-1} β_{t+1}(j)\, A_{i,j}\, f(r_{t+1} \mid R_{i,j})\, b_{j,t+1}.   (5)

Given α and β, we now compute, for each given time point t, the state-occupation and state-transition probabilities, γ and ξ.
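The augmented forward pass of Equation 4 can be sketched as follows, with a 1-D Gaussian standing in for the density f(r_t | R_{i,j}); the two-state model and all of its numbers are hypothetical, not from the paper:

```python
import math

def gauss(x, mu, var):
    """Density of a 1-D normal distribution at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def forward(A, B, R, start, obs, odom):
    """Forward recursion with the extra odometric density factor f(r_t | R[i][j])."""
    N = len(A)
    alpha = [B[i][obs[0]] if i == start else 0.0 for i in range(N)]
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] * gauss(odom[t], R[i][j][0], R[i][j][1])
                     for i in range(N)) * B[j][obs[t]]
                 for j in range(N)]
    return alpha

A = [[0.1, 0.9], [0.9, 0.1]]
B = [[0.8, 0.2], [0.2, 0.8]]
R = [[(0.0, 0.1), (5.0, 1.0)],      # R[i][j] = (mean, variance) of the 1-D relation i -> j
     [(-5.0, 1.0), (0.0, 0.1)]]
alpha = forward(A, B, R, 0, obs=[0, 1, 0], odom=[0.0, 5.0, -5.0])
```

With these numbers, the odometry readings of +5 then -5 make the state sequence 0, 1, 0 dominate the final α values, which is exactly the disambiguating effect the density factor is meant to provide.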
The state-occupation probabilities, γ_t(i), representing the probability of being in state s_i at time t given the experience sequence and the current model, are computed as follows:

γ_t(i) = Pr(q_t = s_i \mid E, λ) = \frac{α_t(i)\, β_t(i)}{\sum_{j=0}^{N-1} α_t(j)\, β_t(j)}.   (6)

Similarly, ξ_t(i, j), the state-transition probabilities from state i to state j at time t given the experience sequence and the current model, are computed as:

ξ_t(i, j) = Pr(q_t = s_i, q_{t+1} = s_j \mid E, λ) = \frac{α_t(i)\, A_{i,j}\, b_{j,t+1}\, f(r_{t+1} \mid R_{i,j})\, β_{t+1}(j)}{\sum_{i=0}^{N-1} \sum_{j=0}^{N-1} α_t(i)\, A_{i,j}\, b_{j,t+1}\, f(r_{t+1} \mid R_{i,j})\, β_{t+1}(j)}.   (7)

These are essentially the same formulae appearing in Rabiner's tutorial (Rabiner, 1989), but they also take into account the density of the odometric relations. In the next phase of the algorithm, the goal is to find a new model, \bar{λ}, that maximizes the likelihood conditioned on the current transition and observation probabilities, Pr(E | \bar{λ}; γ, ξ). Usually, this is simply done using maximum-likelihood estimation of the probability distributions in A and B, by computing expected transition and observation frequencies. In our model we must also compute a new relation matrix, \bar{R}, under the constraint that it remain geometrically consistent. Through the rest of this section we use the notation \bar{v} to denote a reestimated value, where v denotes the current value.

4.2.2 Updating Transition and Observation Parameters

The A and B matrices can be straightforwardly reestimated.
\bar{A}_{i,j} is the expected number of transitions from s_i to s_j divided by the expected number of transitions from s_i, and \bar{B}_{i,j,k} is the expected number of times o_k is observed along the i-th dimension when in state s_j, divided by the expected number of times of being in s_j:

\bar{A}_{i,j} = \frac{\sum_{t=0}^{T-2} ξ_t(i, j)}{\sum_{t=0}^{T-2} γ_t(i)};  \qquad \bar{B}_{i,j,k} = \frac{\sum_{t=0}^{T-1} δ_{[V_t[i]=o_k]}\, γ_t(j)}{\sum_{t=0}^{T-1} γ_t(j)}.   (8)

The expression δ_c denotes an indicator function with value 1 if condition c is true and 0 otherwise.

[Figure 7: Examples of two sets of normally distributed points with constrained means, in 1 and in 2 dimensions.]

4.2.3 Updating Relation Parameters

When reestimating the relation matrix, R, the geometrical constraints induce interdependencies among the optimal mean estimates, as well as between optimal variance estimates and mean estimates. Parameter estimation under this form of constraints is almost untreated in mainstream statistics (Bartels, 1984), and we found no previously existing solutions to the estimation problem addressed here. As an illustration of the issues involved in estimation under constraints, consider the following estimation problem of two normal means:

Example 4.1 The data consists of two sample sets of points, P = {p_1, p_2, ..., p_n} and Q = {q_1, q_2, ..., q_k}, independently drawn from two distinct normal distributions with means μ_P, μ_Q and variances σ²_P, σ²_Q, respectively. We are asked to find maximum likelihood estimates for the two distributions' parameters. Moreover, we are told that the means of the two distributions are related, such that μ_Q = -μ_P, as illustrated in Figure 7. If not for the latter constraint, the task is simple (DeGroot, 1986), and we have:

\bar{μ}_P = \frac{\sum_{i=1}^{n} p_i}{n}, \qquad \bar{σ}^2_P = \frac{\sum_{i=1}^{n} (p_i - \bar{μ}_P)^2}{n},

and similarly for \bar{μ}_Q and \bar{σ}^2_Q.
However, the constraint μ_Q = -μ_P requires finding a single mean, μ, and setting the other one to its negated value, -μ. Intuitively, when choosing such a maximum likelihood single mean, the more concentrated sample should have more effect, while the more varied sample should be more "submissive." Thus, the overall sample deviation from the means would be minimized and the likelihood of the data maximized. Therefore, there is a mutual dependence between the estimation of the mean and the estimation of the variance. Since the samples are independently drawn, their joint likelihood function is:

f(P, Q \mid μ_P, μ_Q, σ^2_P, σ^2_Q) = \prod_{i=1}^{n} \frac{e^{-\frac{(p_i - μ_P)^2}{2σ^2_P}}}{\sqrt{2π}\,σ_P} \cdot \prod_{j=1}^{k} \frac{e^{-\frac{(q_j - μ_Q)^2}{2σ^2_Q}}}{\sqrt{2π}\,σ_Q}.

By taking the derivatives of this joint log-likelihood function with respect to μ_P, σ_P and σ_Q, and equating them to 0, while using the constraint μ_Q = -μ_P, we obtain the following set of mutual equations for the maximum likelihood estimators:

\bar{μ}_P = \frac{(\bar{σ}^2_Q \sum_{i=1}^{n} p_i) - (\bar{σ}^2_P \sum_{j=1}^{k} q_j)}{n\bar{σ}^2_Q + k\bar{σ}^2_P}; \qquad \bar{μ}_Q = -\bar{μ}_P;

\bar{σ}^2_P = \frac{\sum_{i=1}^{n} (p_i - \bar{μ}_P)^2}{n}; \qquad \bar{σ}^2_Q = \frac{\sum_{j=1}^{k} (q_j + \bar{μ}_P)^2}{k}.

By substituting the expressions for \bar{σ}^2_P and \bar{σ}^2_Q into the expression for \bar{μ}_P, we obtain a cubic equation which is cumbersome, but still solvable (in this simple case). The solution provides a maximum likelihood estimate for the mean and variance under the constraint μ_Q = -μ_P. □

We now proceed to the actual update of the relation matrix under constraints. For clarity, we initially discuss only the first two geometrical constraints, and discuss the additivity constraint in Section 4.3. Recall that we concentrate here on the enforcement of global constraints, appropriate under the perpendicularity assumption, although the same idea is applied in the case of state-relative constraints.
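The mutual equations of Example 4.1 can also be solved numerically, by iterating them to a fixed point rather than solving the cubic; a sketch with synthetic samples (all values illustrative, not from the paper):

```python
def constrained_means(P, Q, iters=200):
    """ML estimation under the constraint mu_Q = -mu_P, by fixed-point
    iteration of the mutual equations: the tighter sample pulls the
    shared mean harder, as the variance weights in the update show."""
    n, k = len(P), len(Q)
    mu = (sum(P) - sum(Q)) / (n + k)          # initial guess, ignoring the variances
    for _ in range(iters):
        var_p = sum((p - mu) ** 2 for p in P) / n
        var_q = sum((q + mu) ** 2 for q in Q) / k
        mu = (var_q * sum(P) - var_p * sum(Q)) / (n * var_q + k * var_p)
    return mu, var_p, var_q

P = [4.9, 5.1, 5.0, 5.2, 4.8]                 # tight sample around  5
Q = [-3.0, -7.0, -5.0, -6.0, -4.0]            # loose sample around -5
mu, var_p, var_q = constrained_means(P, Q)
```

Here both samples happen to agree on |μ| = 5, so the iteration settles there, with the P sample carrying far more weight owing to its much smaller variance.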
Zero distances between states and themselves are trivially enforced by setting all the diagonal entries in the R matrix to 0, with a small variance. Anti-symmetry within a global coordinate system is enforced by using the data recorded along the transition from state s_j to s_i, as well as from state s_i to s_j, when reestimating μ(R_{i,j}). As demonstrated in Example 4.1, the variance has to be taken into account, leading to the following set of mutual equations:

\bar{μ}^m_{i,j} = \frac{\sum_{t=0}^{T-2} \left[ \frac{r_t[m]\, ξ_t(i,j)}{(\bar{σ}^m_{i,j})^2} - \frac{r_t[m]\, ξ_t(j,i)}{(\bar{σ}^m_{j,i})^2} \right]}{\sum_{t=0}^{T-2} \left[ \frac{ξ_t(i,j)}{(\bar{σ}^m_{i,j})^2} + \frac{ξ_t(j,i)}{(\bar{σ}^m_{j,i})^2} \right]};   (9)

(\bar{σ}^m_{i,j})^2 = \frac{\sum_{t=0}^{T-2} ξ_t(i,j)\, (r_t[m] - \bar{μ}^m_{i,j})^2}{\sum_{t=0}^{T-2} ξ_t(i,j)}.   (10)

For the x and y dimensions (m = x, y), this amounts to a complicated but still solvable cubic equation. However, in the more general case, when accounting for the orientation of the robot, and also when complete additivity is enforced, we do not obtain such closed-form reestimation formulae. To avoid these hardships, we use a lag-behind update rule: the yet-unupdated estimate of the variance is used for calculating a new estimate for the mean, and this new mean estimate is used to update the variance, using Equation 10.³ Thus, the mean is updated using a variance parameter that lags behind it in the update process, and the reestimation Equation (9) uses σ^m rather than \bar{σ}^m, as follows:

\bar{μ}^m_{i,j} = \frac{\sum_{t=0}^{T-2} \left[ \frac{r_t[m]\, ξ_t(i,j)}{(σ^m_{i,j})^2} - \frac{r_t[m]\, ξ_t(j,i)}{(σ^m_{j,i})^2} \right]}{\sum_{t=0}^{T-2} \left[ \frac{ξ_t(i,j)}{(σ^m_{i,j})^2} + \frac{ξ_t(j,i)}{(σ^m_{j,i})^2} \right]}.   (11)

As we have shown (Shatkay, 1999), this lag-behind policy is an instance of generalized EM (McLachlan & Krishnan, 1997). The latter guarantees monotone convergence to a local maximum of the likelihood function, even when each "maximization" step increases, rather than strictly maximizes, the expected likelihood of the data given the current model.

[3. A similar approach, termed a one-step-late update, is taken by others applying EM to highly non-linear optimization problems (McLachlan & Krishnan, 1997).]
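For a single state pair and one odometric component, one lag-behind step in the spirit of Equations 10 and 11 can be sketched as follows (the function and all data are hypothetical; a global frame with μ_{j,i} = -μ_{i,j} is assumed):

```python
def lag_behind_step(r_ij, xi_ij, r_ji, xi_ji, var_ij, var_ji):
    """One update for component m of the relation between states i and j:
    the mean uses the current (lagging) variances, as in Equation 11,
    then both variances are refreshed with the new mean, as in Equation 10.
    r_ij / xi_ij: odometry readings and transition weights for i -> j
    (r_ji / xi_ji likewise for j -> i)."""
    num = sum(r * x for r, x in zip(r_ij, xi_ij)) / var_ij \
        - sum(r * x for r, x in zip(r_ji, xi_ji)) / var_ji
    den = sum(xi_ij) / var_ij + sum(xi_ji) / var_ji
    mu = num / den
    new_var_ij = sum(x * (r - mu) ** 2 for r, x in zip(r_ij, xi_ij)) / sum(xi_ij)
    new_var_ji = sum(x * (r + mu) ** 2 for r, x in zip(r_ji, xi_ji)) / sum(xi_ji)
    return mu, new_var_ij, new_var_ji

# readings for i -> j cluster near +5; readings for j -> i near -5
mu, v_ij, v_ji = lag_behind_step([5.1, 4.9], [1.0, 1.0],
                                 [-5.2, -4.8], [1.0, 1.0], 1.0, 1.0)
```

Both directions of travel contribute to the single shared mean, with the readings from j to i entering with a negated sign, which is how anti-symmetry is preserved by construction.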
Similarly, the reestimation formulae for the von Mises mean (μ) and concentration (κ) parameters of the heading change between states s_i and s_j are the solution to the equations:

\[ \mu^\theta_{i,j} = \arctan\left( \frac{\sum_{t=0}^{T-2}\left[\sin(r_t[\theta])\,\big(\gamma_t(i,j)\,\kappa_{i,j} - \gamma_t(j,i)\,\kappa_{j,i}\big)\right]}{\sum_{t=0}^{T-2}\left[\cos(r_t[\theta])\,\big(\gamma_t(i,j)\,\kappa_{i,j} + \gamma_t(j,i)\,\kappa_{j,i}\big)\right]} \right); \]

\[ \frac{I_1[\kappa_{i,j}]}{I_0[\kappa_{i,j}]} = \max\left[ \frac{\sum_{t=0}^{T-2}\left[\gamma_t(i,j)\,\cos(r_t[\theta] - \mu^\theta_{i,j})\right]}{\sum_{t=0}^{T-2}\gamma_t(i,j)},\; 0 \right], \tag{12} \]

where I_0 and I_1 are the modified Bessel functions as defined by Equations 2 and 3 in Section 3.3.1. Again, to avoid the need to solve the mutual equations, we take advantage of the lag-behind strategy, updating the mean using the current estimates of the concentration parameters, \(\bar{\kappa}_{i,j}, \bar{\kappa}_{j,i}\), as follows:

\[ \mu^\theta_{i,j} = \arctan\left( \frac{\sum_{t=0}^{T-2}\left[\sin(r_t[\theta])\,\big(\gamma_t(i,j)\,\bar{\kappa}_{i,j} - \gamma_t(j,i)\,\bar{\kappa}_{j,i}\big)\right]}{\sum_{t=0}^{T-2}\left[\cos(r_t[\theta])\,\big(\gamma_t(i,j)\,\bar{\kappa}_{i,j} + \gamma_t(j,i)\,\bar{\kappa}_{j,i}\big)\right]} \right), \tag{13} \]

and then calculating the new concentration parameters based on the newly updated mean, as the solution to Equation 12, through the use of lookup tables.

A possible alternative to our lag-behind approach is to update the mean as though the assumption σ_{j,i} = σ_{i,j} holds. Under this assumption, the variance terms in Equation 9 cancel out, and the mean update is independent of the variance once again. Then the variances are updated as stated in Equation 10, without assuming any constraints over them. This approach was taken in earlier stages of this work (Shatkay & Kaelbling, 1997, 1998). The lag-behind strategy is superior, both according to our experiments and due to its being an instance of generalized EM.

3. A similar approach, termed one-step-late update, is taken by others applying EM to highly non-linear optimization problems (McLachlan & Krishnan, 1997).
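In practice the quadrant ambiguity of the arctan in Equation 13 is resolved by a two-argument arctangent. A minimal sketch of the lag-behind heading-mean update (our own naming and data layout):

```python
import math

def heading_mean_update(r_theta, gamma, kappa, i, j):
    """Lag-behind update of the von Mises mean (Eq. 13) for states i, j.

    r_theta[t]  -- recorded heading change r_t[theta], in radians
    gamma[a, b] -- lists of expected transition counts gamma_t(a, b)
    kappa[a, b] -- current (lagging) concentration estimates
    """
    s = sum(math.sin(th) * (gamma[i, j][t] * kappa[i, j]
                            - gamma[j, i][t] * kappa[j, i])
            for t, th in enumerate(r_theta))
    c = sum(math.cos(th) * (gamma[i, j][t] * kappa[i, j]
                            + gamma[j, i][t] * kappa[j, i])
            for t, th in enumerate(r_theta))
    return math.atan2(s, c)  # atan2 resolves the quadrant of the mean
```

For instance, with readings near +90° attributed to the i→j transition and readings near −90° attributed to j→i, the sine terms reinforce each other and the update returns a mean near +90°, as anti-symmetry demands.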
4.3 Enforcing Additivity

Note that the additivity constraint directly implies the other two geometrical constraints.⁴ Thus, enforcing it results in complete geometrical consistency. We present here the method for directly enforcing additivity through the reestimation procedure along the x and y dimensions. For the heading dimension we describe how complete geometrical consistency is achieved through the projection of anti-symmetric estimates onto a geometrically-consistent space. As before, to simplify the presentation, we focus on the case of global coordinate systems. The same basic idea applies to state-relative coordinate systems, but the relationship used to recover the mean μ_{ij} from individual state coordinates is more complex.

4.3.1 Additivity in the x, y Dimensions

The main observation underlying our approach is that the additivity constraint is a result of the fact that states can be embedded in a geometrical space. That is, assuming we have N states, s_0, ..., s_{N−1}, there are points on the X, Y and Θ axes, x_0, ..., x_{N−1}, y_0, ..., y_{N−1}, θ_0, ..., θ_{N−1}, respectively, such that each state, s_i, is associated with the coordinates ⟨x_i, y_i, θ_i⟩. Assuming one global coordinate system, the mean odometric relation from state s_i to state s_j can be expressed as ⟨x_j − x_i, y_j − y_i, θ_j − θ_i⟩. During the maximization phase of the EM iteration, rather than try to maximize with respect to N² odometric relation vectors, ⟨μ^X_{ij}, μ^Y_{ij}, μ^Θ_{ij}⟩, we reparameterize the problem. Specifically, we express each odometric relation as a function of two of the N state positions, and maximize with respect to the unconstrained N state positions. For instance, for the X dimension, rather than search for N² maximum-likelihood estimates for μ^x_{ij}, we use the maximization step to find N 1-dimensional points, x_0, ..., x_{N−1}.
We can then calculate μ^x_{ij} = x_j − x_i. Moreover, since all we are interested in is finding the best relationships between x_i and x_j, we can fix one of the x_i's at 0 (e.g., x_0 = 0), and find optimal estimates for the remaining N − 1 state positions. The variance reestimation remains as before, and the lag-behind policy is used to eliminate the interdependency between the update of the mean and the variance parameters.

4. {μ(a, a) = μ(a, a) + μ(a, a)} ⇒ (μ(a, a) = 0); {(μ(a, a) = 0), (μ(a, a) = μ(a, b) + μ(b, a))} ⇒ (μ(a, b) = −μ(b, a)).

4.3.2 Additive Heading Estimation

Unfortunately, the reparameterization described above is not feasible for estimating changes in heading, due to the von Mises distribution assumed over the heading measures. By reparameterizing μ^θ_{ij} as θ_j − θ_i and trying to maximize the likelihood function with respect to the θ parameters, we obtain a set of N − 1 trigonometric equations with terms of the form cos(θ_j) sin(θ_i), which do not admit a simple solution. As an alternative, it is possible to use the anti-symmetric reestimation procedure described earlier, followed by a perpendicular projection operator, mapping the resulting headings vector ⟨μ^θ_{0,0}, ..., μ^θ_{i,j}, ..., μ^θ_{N−1,N−1}⟩, 0 ≤ i, j ≤ N−1, which does not satisfy additivity, onto a vector of headings within an additive linear vector space. Simple orthogonal projection is not satisfactory within our setting, since it simply looks for the additive vector closest to the non-additive one. This procedure ignores the fact that some of the entries in the non-additive vector are based on many observations, and are therefore more reliable, while other, less reliable ones are based on hardly any data at all. Intuitively, we would like to keep the estimates that are well accounted for intact, and adapt the less reliable estimates to meet the additivity constraint.
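Returning to the x, y reparameterization of Section 4.3.1: under the simplifying assumption of equal relation variances, the reparameterized M-step for one dimension reduces to a weighted least-squares embedding of the states on that axis, with expected transition counts as weights. The sketch below (function name, solver, and the equal-variance assumption are ours; the paper's actual M-step also carries per-entry variances) solves the normal equations with x_0 fixed at 0:

```python
def embed_states(d, w, n):
    """Find positions x_0..x_{n-1}, with x_0 = 0, minimizing
    sum_ij w[i][j] * (d[i][j] - (x_j - x_i))**2,
    the weighted least-squares embedding implied by pairwise relation
    estimates d with weights w (e.g. expected transition counts)."""
    m = n - 1
    A = [[0.0] * m for _ in range(m)]
    b = [0.0] * m
    for i in range(n):
        for j in range(n):
            if i == j or w[i][j] == 0:
                continue
            # Residual d[i][j] - x_j + x_i contributes +1 to x_i, -1 to x_j.
            for v, sgn in ((j, 1.0), (i, -1.0)):
                if v == 0:
                    continue  # x_0 is pinned at 0
                b[v - 1] += sgn * w[i][j] * d[i][j]
                for u, sgn2 in ((j, 1.0), (i, -1.0)):
                    if u != 0:
                        A[v - 1][u - 1] += sgn * sgn2 * w[i][j]
    # Naive Gaussian elimination with partial pivoting (n is small here).
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        x[r] = (b[r] - sum(A[r][c] * x[c] for c in range(r + 1, m))) / A[r][r]
    return [0.0] + x
```

Given slightly inconsistent pairwise estimates, the embedding distributes the inconsistency across the positions rather than satisfying any single relation exactly, which is precisely the effect of maximizing over positions instead of over the N² relations.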
More precisely, there are heading-change estimates between states that are better accounted for than others, in the sense that the transitions between these states have higher expected counts than transitions between other states (higher Σ_t γ_t(i, j)). We would like to project the non-additive heading-estimates vector onto a subspace of the additive vector space in which the vectors have the same values as the non-additive vector in the entries that are well accounted for, that is, those with the highest values of Σ_t γ_t(i, j). The difficulty is that the latter subspace is not a linear vector space (for instance, it does not satisfy closure under scalar multiplication), and the projection operator over linear spaces cannot be applied directly. Still, this set of vectors does form an affine vector space, and we can project onto it using an algebraic technique, as explained below.⁵

Definition: A ⊆ R^n is an n-dimensional affine space if for all vectors v_a ∈ A, the set of vectors A − v_a =def { u_a − v_a | u_a ∈ A } is a linear space.

Hence, we can pick a vector in an affine space, v_{a1} ∈ A, and define the translation T_a : A → V, where V is the linear space V = A − v_{a1}. This translation is trivially extended to any vector v' ∈ R^n by defining T_a(v') = v' − v_{a1}. In order to project any vector v ∈ R^n onto A, we apply the translation T_a to v and project T_a(v) onto V, which results in a vector P(T_a(v)) in V. By applying the inverse transform T_a⁻¹ to it, we obtain the projection of v on A, as demonstrated in Figure 8. The linear space in the figure is the two-dimensional vector space {⟨x, y⟩ | y = −x}, and the affine space is {⟨x, y⟩ | y = −x + 4}. The transform T_a consists of subtracting the vector ⟨0, 4⟩. The solid arrow corresponds to the direct projection of the vector v onto the point P(v) of the affine space.
The dotted arrows represent the projection via translation of v to T_a(v), the projection of the latter onto the linear vector space, and the inverse translation of the result, P(T_a(v)), onto the affine space.

5. Many thanks to John Hughes for introducing us to this technique.

[Figure 8: Projecting v onto the affine vector space {⟨x, y⟩ | y = −x + 4}.]

Although the procedure for preserving additivity over headings is not formally proven to preserve monotone convergence of the likelihood function towards a local maximum, our extensive experiments, consisting of hundreds of runs, have shown that monotone convergence is preserved.

5 Choosing an Initial Model

Typically, in instances of the Baum-Welch algorithm, an initial model is picked uniformly at random from the space of all possible models, perhaps trying multiple initial models to find different local likelihood maxima. An alternative approach we have reported (Shatkay & Kaelbling, 1997) was based on clustering the accumulated odometric information using the simple k-means algorithm (Duda & Hart, 1973), taking the clusters to be the states in which the observations were recorded, to obtain state and observation counts and estimate the model parameters. If perpendicularity is assumed when collecting the data, as shown in Figure 4, the k-means algorithm assigns the same cluster (state) to odometric readings recorded at close locations, leading to reasonable initial models.
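The translate-project-translate recipe of Figure 8 is mechanical. A small sketch (our own; for brevity the spanning set of the linear space is assumed mutually orthogonal):

```python
def project_affine(v, anchor, basis):
    """Project v onto the affine space anchor + span(basis).

    Translate by -anchor into the linear space V, orthogonally project
    onto V, then translate back onto the affine space.
    """
    t = [vi - ai for vi, ai in zip(v, anchor)]          # T_a(v)
    proj = [0.0] * len(v)
    for u in basis:
        coef = sum(ti * ui for ti, ui in zip(t, u)) / sum(ui * ui for ui in u)
        proj = [pi + coef * ui for pi, ui in zip(proj, u)]
    return [pi + ai for pi, ai in zip(proj, anchor)]    # T_a^{-1}(P(T_a(v)))
```

Reproducing the figure's setting, projecting ⟨3, 2⟩ onto {⟨x, y⟩ | y = −x + 4} (anchor ⟨0, 4⟩, basis vector ⟨1, −1⟩) yields ⟨2.5, 1.5⟩, which indeed lies on the affine line.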
However, when this assumption is dropped, as illustrated in Figure 5, the cumulative rotational error distorts the odometric location recorded within a global coordinate system, so that the location assigned to the same state during multiple visits varies greatly and would not be recognized as "the same" by a simple location-based clustering algorithm. To overcome this, we developed an alternative initialization heuristic, which we call tag-based initialization. It is based directly on the recorded relations between states, rather than on the states' absolute locations. For clarity, the description here consists mostly of an illustrative example, and concentrates on the case where global consistency constraints are enforced.

Given a sequence of observations and odometric readings E, we begin by clustering the odometric readings into buckets. The number of buckets is at most the number of distinct state transitions recorded in the sequence. The goal at this stage is to have each bucket contain all the odometric readings that are close to each other along all three dimensions. To achieve this, we start by fixing a predetermined, small standard deviation value along the x, y, and θ dimensions. Denote these standard deviation values σ_x, σ_y, σ_θ, respectively (typically σ_x = σ_y). The first odometric reading is assigned to bucket 0, and the mean of this bucket is set to be the value of this reading. Through the rest of the process the subsequent odometric readings are examined.
If the next reading is within 1.5 standard deviations, along each of the three dimensions, from the mean of some existing non-empty bucket, we add it to the bucket and update the bucket mean accordingly. If not, we assign it to an empty bucket and set the mean of the bucket to be this reading. Intuitively, by using this heuristic each of the resulting buckets is tightly concentrated about its mean. We note that other clustering algorithms (Duda & Hart, 1973) could be used at the bucketing stage.

Example 5.1 We would like to learn a 4-state model from a sequence of odometric readings ⟨x, y, θ⟩ as follows: ⟨2, 94, 92⟩; ⟨1994, 0, 88⟩; ⟨3, −93, 86⟩; ⟨−1999, −1, 94⟩; ⟨−4, 102, 91⟩; ⟨1998, −5, 90⟩; ⟨−2, −106, 91⟩; ⟨−2003, 7, 87⟩. As a first stage we place these readings into buckets. Suppose the standard deviation constant is 20. The placement is as shown in Figure 9; the mean value associated with each bucket is shown as well.

[Figure 9: The bucket assignment of the example sequence. Bucket 1 (μ1 = ⟨−1, 98, 91.5⟩): ⟨−4, 102, 91⟩, ⟨2, 94, 92⟩; bucket 2 (μ2 = ⟨1996, −2.5, 89⟩): ⟨1994, 0, 88⟩, ⟨1998, −5, 90⟩; bucket 3 (μ3 = ⟨0.5, −99.5, 88.5⟩): ⟨−2, −106, 91⟩, ⟨3, −93, 86⟩; bucket 4 (μ4 = ⟨−2001, 3, 90.5⟩): ⟨−1999, −1, 94⟩, ⟨−2003, 7, 87⟩.]

The next stage of the algorithm is the state-tagging phase, in which each odometric reading, r_t, is assigned a pair of states, s_i, s_j, denoting the origin state (from which the transition took place) and the destination state (to which the transition led), respectively. In conjunction, the mean entries, μ_{ij}, of the relation matrix, R, are populated.

Example 5.1 (cont.) Returning to the sequence above, the process is demonstrated in Figure 10. We assume that the data recording starts at state 0, and that the odometric change through self-transitions is 0, with some small standard deviation (we use 20 here as well). This is shown in part A of the figure.
Since the first element in the sequence, ⟨2, 94, 92⟩, is more than two standard deviations away from the mean μ[0][0], and no other entry in the relation row of state 0 is populated, we pick 1 as the next state and populate the mean μ[0][1] to be the same as the mean of bucket 1, to which ⟨2, 94, 92⟩ belongs. To maintain geometrical consistency, the mean μ[1][0] is set to −μ[0][1], as shown in part B of the figure. We have now populated 2 off-diagonal entries, and the state sequence is ⟨0, 1⟩. The entry μ[0][1] in the matrix becomes associated with bucket 1, and this information is recorded to help with tagging future odometric readings belonging to the same bucket.

The next odometric reading, ⟨1994, 0, 88⟩, is a few standard deviations from any populated mean in row 1 (where 1 is the current believed state). Hence, we pick a new state, 2, and set the mean μ[1][2] to be μ2, the mean of bucket 2, to which the reading belongs (Figure 10 C). The entry μ[1][2] is recorded as associated with bucket 2. To preserve anti-symmetry and additivity, μ[2][1] is set to −μ[1][2], μ[0][2] is set to the sum μ[0][1] + μ[1][2], and μ[2][0] is set to −μ[0][2].
[Figure 10: Populating the odometric relation matrix and creating a state-tagging sequence.]

Similarly, μ[2][3] is updated to be the mean of bucket 3, causing the setting of μ[3][2], μ[1][3], μ[0][3], μ[3][1], and μ[3][0]. Bucket 3 is associated with μ[2][3]. At this stage the odometric table is fully populated, as shown in part D of Figure 10. The state sequence at this point is ⟨0, 1, 2, 3⟩. The next reading, ⟨−1999, −1, 94⟩, is within one standard deviation from μ[3][0] and therefore the next state is 0. Entry μ[3][0] is associated with bucket 4 (the bucket to which the reading was assigned), and the state sequence becomes ⟨0, 1, 2, 3, 0⟩. The next reading, being from bucket 1, is associated with the relation from state 0 that is tagged by bucket 1, namely, state 1. By repeating this for the last two readings, the final state-transition sequence becomes ⟨0, 1, 2, 3, 0, 1, 2, 3, 0⟩.

Note that the process described in the above illustration was simplified. In the general case, we need to take into account the rotational error in the data, use state-relative coordinate systems, and therefore populate the entries under the transformed anti-symmetry and additivity constraints: μ_{⟨x,y⟩}(a, b) = −T_{ba}[μ_{⟨x,y⟩}(b, a)]; μ_{⟨x,y⟩}(a, c) = μ_{⟨x,y⟩}(a, b) + T_{ba}[μ_{⟨x,y⟩}(b, c)], as defined in Section 3.3.2.

It is possible that by the end of the tagging algorithm, some rows or columns of the relation matrix are still unpopulated. This happens when there is too little data to learn from, or when the number of states provided to the algorithm is too large with respect to the actual model.
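The bucketing stage of Example 5.1 can be sketched as follows (our own minimal version of the 1.5-standard-deviation rule; greedy, first-fit assignment):

```python
def bucket_readings(readings, stds, k=1.5):
    """Greedy bucketing of odometric readings <x, y, theta>.

    A reading joins the first bucket whose mean lies within k standard
    deviations along every dimension; otherwise it opens a new bucket.
    Bucket means are recomputed after each assignment.
    """
    buckets = []
    for r in readings:
        for b in buckets:
            if all(abs(ri - mi) <= k * s
                   for ri, mi, s in zip(r, b['mean'], stds)):
                b['members'].append(r)
                m = len(b['members'])
                b['mean'] = [sum(x[d] for x in b['members']) / m
                             for d in range(len(r))]
                break
        else:
            buckets.append({'mean': [float(x) for x in r], 'members': [r]})
    return buckets
```

Run on the eight readings of Example 5.1 with σ = 20 along all dimensions, this recovers the four buckets of Figure 9, with the first bucket's mean at ⟨−1, 98, 91.5⟩.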
In such cases we can either "trim" the model, using the number of populated rows as the number of states, or pick random odometric readings to populate the rest of the table, improving these estimates later. Note that the first approach suggests a method for learning the number of states in the model when this is not given: starting from a gross over-estimate of the number, and truncating it to the number of populated rows in the odometric table after initialization is performed. Once the state-transition sequence is obtained, the rest of the initialization algorithm is the same as it is for k-means-based initialization: deriving state-transition counts from the state-transition sequence, assigning the observations to the states under the assumption that the state sequence is correct, and obtaining state-transition and observation probabilities. The initialization phase does not incur much computational overhead, and is equivalent time-wise to performing one additional iteration of the EM procedure.

6 Experiments and Results

The goal of the work described so far is to use odometry to improve the learning of topological models, while using fewer iterations and less data. We tested our algorithm in a simple robot-navigation world. Our experiments consist of running the algorithm both on data obtained from a simulated model and on data gathered by our mobile robot, Ramona. The amount of data gathered by Ramona is used here as a proof of concept, but is not sufficient for statistical analysis. For the latter, we use data obtained from the simulated model. We gathered data and used the algorithms both with and without the perpendicularity assumption (see Section 3.3.2), and results are provided from both settings.

6.1 Robot Domain

The robot used in our experiments, Ramona, is a modified RWI B21 robot.
It has a cylindrical synchro-drive base, 24 ultrasonic sensors and 24 infrared sensors, situated evenly around its circumference. The infrared sensors are used mostly for short-range obstacle avoidance. The ultrasonic sensors are longer-ranged, and are used for obtaining (noisy) observations of the environment. In the experiments described here, the robot follows a prescribed path through the corridors in the office environment of our department. Thus, there is no decision-making involved, and an HMM is a sufficient model, rather than a complete POMDP. Low-level software⁶ provides a level of abstraction that allows the robot to move through hallways from intersection to intersection and to turn ninety degrees to the left or right. The software uses sonar data to distinguish doors, openings, and intersections along the path, and to stop the robot's current action whenever such a landmark is detected. Each stop, whether due to the natural termination of an action or due to a landmark detection, is considered by the robot to be a "state". At each stop, ultrasonic data interpretation allows the robot to perceive, in each of the three cardinal directions (front, left and right), whether there is an open space, a door, a wall, or something unknown. Encoders on the robot's wheels allow it to estimate its pose (position and orientation) with respect to its pose at the previous intersection.

6. The low-level software was written and maintained by James Kurien.

[Figure 11: True model of the corridors Ramona traversed. Arrows represent the prescribed path direction.]

After recording both the sonar-based observations
and the odometric information, the robot goes on to execute the next prescribed action. The action command is issued manually by a human operator. Of course, both the action performance and the perception routines are subject to error. The path Ramona followed consists of 4 connected corridors in our building, which include 17 states, as shown in Figure 11.

[Figure 12: True model of a prescribed path through the simulated hallway environment.]

In our simulation, we manually generated an HMM representing a prescribed path of the robot through the complete office environment of our department, consisting of 44 states, and the associated transition, observation, and odometric distributions. The transition probabilities reflect an action failure rate of about 5-10%. That is, the probability of moving from the current state to the correct next state in the environment, under the predetermined action, is between 0.85 and 0.95. The probability of self-transition is typically between 0.05 and 0.15. Some small probability (typically smaller than 0.02) is sometimes assigned to other transitions. Our experience with the real robot proves that this is a reasonable transition model, since typically the robot moves to the next state correctly, and the only error that occurs with some significant frequency is when it does not move at all, due to sonar interpretation indicating a barrier when there is actually none. Once the action command is repeated, the robot usually performs the action correctly, moving to the expected next state. The observation distribution typically assigns probabilities of 0.85-0.95 to the true observation that should be perceived by the robot at each state, and probabilities of 0.05-0.15 to other observations that might be perceived.
For example, if a door should actually be perceived, a door is typically assigned a probability of 0.85-0.9, a wall is assigned a probability of 0.09-0.1, and an open space is assigned a probability of about 0.01 of being perceived. The standard deviation around odometric readings is about 5% of the mean. Figure 12 shows the HMM corresponding to the simulated hallway environment. Observations and orientation are omitted from the figure for clarity. Nodes correspond to states in the environment, while directed edges correspond to the corridors; the arrows point in the direction in which the corridors were traversed. Further interpretation of the figures is provided in the following section.

6.2 Evaluation Method

There are a number of different ways of evaluating the results of a model-learning algorithm. None are completely satisfactory, but they all give some insight into the utility of the results. In this domain, there are transitions and observations that usually take place, and are therefore more likely than the others. Furthermore, the relational information gives us a rough estimate of the metric locations of the states. To get a qualitative sense of the plausibility of a learnt model, we can extract an essential map from the learnt model, consisting of the states, the most likely transitions, and the metric measures associated with them, and ask whether this map corresponds to the essential map underlying the true world. Figures 11 and 12 are such essential versions of the true models, while Figures 15 and 17, shown later, are essential versions of representative learnt ones (obtained from sequences gathered under the perpendicularity assumption). Black dots represent the physical locations of states, and each state is assigned a unique number.
Multiple state numbers associated with a single location typically correspond to different orientations of the robot at that location. The larger black circle represents the initial state. Solid arrows represent the most likely non-self transitions between the states. Dashed arrows represent the other transitions when their probability is 0.2 or higher. Typically, due to the predetermined path we have taken, the connectivity of the modeled environment is low, and therefore the transitions represented by dashed arrows are almost as likely as the most likely ones. Note that the length of the arrows, within each plot, is significant and represents the length of the corridors, drawn to scale. It is important to note that the figures do not provide a complete representation of the models. First, they lack observation and orientation information. We stress that the figures serve more as a visual aid than as a plot of the true model. We are looking for a good topological model rather than a geometrical model. The figures provide a geometrical embedding of the topological model. However, even when the geometry, as described by the relation matrix, is different, the topology, as described by the transition and observation matrices, can still be valid.

Traditionally, in simulation experiments, the learnt model is quantitatively compared to the actual model that generated the data. Each of the models induces a probability distribution on strings of observations; the asymmetric Kullback-Leibler divergence (Kullback & Leibler, 1951) between the two distributions is a measure of how good the learnt model is with respect to the true model.
Given a true probability distribution P = {p_1, ..., p_n} and a learnt one Q = {q_1, ..., q_n}, the KL divergence of Q with respect to P is:

\[ D(P \,\|\, Q) \stackrel{\text{def}}{=} \sum_{i=1}^{n} p_i \log_2 \frac{p_i}{q_i}. \]

We report our results in terms of a sampled version of the KL divergence, as described by Juang and Rabiner (1985). It is based on generating sequences of sufficient length (5 sequences of 1000 observations in our case) according to the distribution induced by the true model, and comparing their log-likelihood according to the learnt model with the true-model log-likelihood. The total difference in log-likelihood is then divided by the total number of observations, accumulated over all the sequences, giving a number that roughly measures the difference in log-likelihood per observation. Formally stated, let M_1 be the true model and M_2 a learnt one. By generating K sequences S_1, ..., S_K, each of length T, from the true model, M_1, the sampled KL divergence, D_s, is:

\[ D_s(M_1 \,\|\, M_2) = \frac{\sum_{i=1}^{K} \left[ \log(\Pr(S_i \mid M_1)) - \log(\Pr(S_i \mid M_2)) \right]}{K \, T}. \]

[Figure 13: Sequence gathered by Ramona, perpendicularity assumed. Figure 14: Sequence generated by our simulator, perpendicularity assumed.]

We ignore the odometric information when applying the KL measure, thus allowing comparison between purely topological models that are learnt with and without odometry.

6.3 Results within a Global Framework

We let Ramona go around the path depicted in Figure 11 and collect a sequence of about 300 observations, while assuming perpendicularity of the environment; that is, at every turning point the angle of turn is ±90°. Thus at each turn Ramona realigns its odometric readings with its initial X and Y axes.
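Returning to the sampled KL divergence of Section 6.2, it can be sketched in a few lines (our own function names; the log base follows the base-2 convention of the D(P||Q) definition, and the log-likelihood functions stand in for HMM forward-algorithm scores):

```python
import math

def sampled_kl(loglik_true, loglik_learnt, sequences):
    """Sampled KL divergence D_s(M1 || M2).

    loglik_true / loglik_learnt return log2 Pr(sequence | model); the
    sequences are assumed drawn from the true model M1.  The result is
    the per-observation difference in log-likelihood.
    """
    total = sum(loglik_true(s) - loglik_learnt(s) for s in sequences)
    n_obs = sum(len(s) for s in sequences)
    return total / n_obs
```

For i.i.d. observation models the sampled estimate converges to the exact KL divergence as the sequences grow; with sampled sequences whose empirical frequencies match the true distribution exactly, the two coincide.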
Figure 13 plots the sequence of metric coordinates gathered in this way, accumulating consecutive odometric readings, projected on ⟨x, y⟩. We applied the learning algorithm to the data 30 times: 10 of these runs were started from a k-means-based initial model, 10 from a tag-based initial model, and 10 from a random initial model. In addition we also ran the standard Baum-Welch algorithm, ignoring the odometric information, 10 times. (Note that there is non-determinism even when using biased initial models, since the k-means clustering starts from random seeds, and low⁷ random noise is added to the data in all algorithms to avoid numerical instabilities; thus multiple runs give multiple results.) We report here the results obtained using the tag-based method, which is the most appropriate initialization method in the general case. These results are contrasted with those obtained when odometric information is not used at all. For a comparison of all four settings the reader is referred to the complete report of this work (Shatkay, 1999).

7. A random number between -1cm and 1cm is added to recorded distances that are typically several meters long.

Figure 15 shows the essential representations of typical learnt models starting from a tag-based initial model. The geometry of the learnt model strongly corresponds to that of the true environment, and most of the states' positions were learnt correctly. Although the figure does not show it, the learnt observation distributions at each state usually match well with the true observations. To demonstrate the effect of odometry on the quality of the learnt topological model, we contrast the plotted models learnt using odometry with a representative topological model learnt without
the use of odometric information.

[Figure 15: Learnt model of the corridors Ramona traversed. Figure 16: The topology of a model learnt without the use of odometry.]

Figure 16 shows the topology of a typical model learnt without the use of odometric information. In this case, the arcs represent only topological relationships, and their length is not meaningful. The initial state is shown as a bold circle. It is clear that the topology learnt does not match the characteristic loop topology of the true environment.

For obtaining statistically sufficient information, we generated 5 data sequences, each of length 1000, using Monte Carlo sampling from the hidden Markov model whose projection is shown in Figure 12. One of these sequences is depicted in Figure 14. The figure demonstrates that the noise model used in the simulation is indeed compatible with the noise pattern associated with real robot data. We used four different settings of the learning algorithm:

- starting from a biased, tag-based initial model and using odometric information;
- starting from a biased, k-means-based initial model and using odometric information;
- starting from an initial model picked uniformly at random, while using odometric information;
- starting from a random initial model without using odometric information (standard Baum-Welch).

For each sequence and each of the four algorithmic settings we ran the algorithm 10 times. To keep the discussion focused, we concentrate here on the first and the last of these settings, and the reader is referred to a more extensive report (Shatkay, 1999) for a complete discussion.
In all the experiments, N was set to 44, which is the "correct" number of states; for generalization, it will be necessary to use cross-validation or regularization methods to select model complexity. Section 5 also suggests one possible heuristic for obtaining an estimate of the number of states. Figure 17 shows an essential version of one learnt model, obtained from the sequence shown in Figure 14, using tag-based initialization.

[Figure 17: Learnt model of the simulated hallway environment.]

We note that the learnt model is not completely accurate with respect to the true model. However, there is an obvious correspondence between groups of states in the learnt and true models, and most of the transitions (as well as the observations, which are not shown) were learnt correctly. The quality of the geometry of the learnt model in this simulated large environment varies, and the geometrical results are not as uniformly good as was the case when learning the smaller environment from real robot data. As the environment gets larger, the global relations between remote states, which are reflected in the geometrical consistency constraints, become harder to learn. Still, the topology of the learnt model, as demonstrated by our statistical experiments, is good.

Table 1 lists the KL divergence between the true and learnt models, as well as the number of iterations until convergence was reached, for each of the 5 sequences, both for the setting that uses odometric information under tag-based initialization and for the learning algorithm that does not use odometric information, averaged over 10 runs per sequence.
We stress that each KL divergence measure is calculated based on new data sequences that are generated from the true model, as described in Section 6.2. The 5 sequences from which the models were learnt do not participate in the testing process. The KL divergence with respect to the true model for models learnt using odometry is about 5-6 times smaller than for models learnt without odometric data. The standard deviation around the means is about 0.2 for the KL distances of models learnt with odometry and 1.5 for the no-odometry setting. To check the significance of our results we used the simple two-sample t-test. The models learnt using odometric information have a statistically significantly (p < 0.0005) lower average KL divergence than the others.

Seq. #              1        2        3        4        5
With Odo  KL        0.981    1.290    1.115    1.241    1.241
          Iter #    16.70    20.90    22.30    12.70    27.50
No Odo    KL        6.351    4.863    5.926    6.261    4.802
          Iter #    124.1    126.0    113.0    107.4    122.9

Table 1: Average results of two learning settings with five training sequences.

In addition, the number of iterations required for convergence when learning using odometric information is roughly 4-5 times smaller than that required when ignoring such information. Again, the t-test verifies the significance of this result. Under all three initialization settings, the models learnt are topologically somewhat inferior (and this is with high statistical significance), in terms of the KL divergence, to those learnt without enforcing additivity, reported in earlier papers (Shatkay & Kaelbling, 1997, 1998). This is likely to be a result of the very strong constraints enforced during the learning process, which prevent the algorithm from searching better areas of the learning space and restrict it to poor local maxima. The geometry looks superior in some cases, but it is not significantly better.
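The significance test can be reproduced directly from the per-sequence averages in Table 1. The snippet below uses Welch's unequal-variance variant of the two-sample t-test, which is our choice for this illustration (the two settings have very different spreads, about 0.2 vs. 1.5); the paper does not specify which variant it used.

```python
from statistics import mean, variance

# Per-sequence average KL divergences from Table 1.
kl_with_odo = [0.981, 1.290, 1.115, 1.241, 1.241]
kl_no_odo = [6.351, 4.863, 5.926, 6.261, 4.802]

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances)."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / se2 ** 0.5

t = welch_t(kl_no_odo, kl_with_odo)
# t comes out around 13; for roughly 4 effective degrees of freedom the
# one-sided critical value at the 0.0005 level is about 8.6, so the gap is
# significant at p < 0.0005, matching the text.
```
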
However, there seems to be less variability in the quality of the geometrical models across multiple runs when additivity is enforced. While the details of an extensive comparison between the different initialization methods are beyond the scope of this paper, we point out that our studies of both small and large models show that when large models and long data sequences are involved, random initialization often results in lower KL divergence than tag-based initialization. This again has to do with the strong bias of tag-based initialization, which can lead to very peaked models compared with the less-peaked distributions associated with the true model. Random initialization leads to flatter models. As the KL divergence strongly penalizes models that are much more peaked than the true ones, randomly initialized models are often closer, in terms of this measure, to the true models than the very peaked ones learnt from other initial models. When learning small models, where sufficient training data is available, tag-based initialization results in models that are clearly superior to the random ones. Again, the reader is referred to the complete report of this work (Shatkay, 1999) for a comparative study of all initialization methods under the various settings.

6.4 Results within a Relative Framework

We applied the algorithm described in Section 4.3, extended to accommodate the state-relative constraints (as listed in Section 3.3.2). The data used was gathered by the robot from the same environment, and generated from the same simulated model as before (Figures 11, 12). However, here the data is generated without assuming perpendicularity. This means that the x and y coordinates are not realigned after each turn with the global x and y axes, but rather recorded "as-is." The evaluation methods stay as described above.
Figure 18 shows the projection of the odometric readings that Ramona recorded along the x and y dimensions while traversing this environment. For obtaining statistically sufficient information, we generated 5 data sequences, each of length 800, using Monte Carlo sampling from the hidden Markov model whose projection is shown in Figure 12. One of these sequences is depicted in Figure 19.

Figure 18: Sequence gathered by Ramona, no perpendicularity assumed.

Figure 19: Sequence generated by our simulator, no perpendicularity assumed.

Figure 20 shows a typical model obtained by applying the algorithm, enforcing the complete geometrical consistency, to the robot data shown in Figure 18, using tag-based initialization.

Figure 20: Learnt model of the corridors Ramona traversed. Initialization is tag-based.

We note that the rectangular geometry of the environment is preserved, although state 0 does not participate in the loop. This is explained by observing the corresponding area of the true environment as depicted in Figure 11, consisting of the 4 states clustered at the bottom-left corner (0, 14, 15 and 16). Due to the relatively large number of states that are close together in that area of the true environment, it was not recognized that we ever returned to state 0 in particular during the loop. Therefore, there was only one transition recorded from state 0 to state 1 according to the expected transition counts calculated by the algorithm. When projecting the angles to maintain additivity (as described in Section 4.3.2), the angle from state 0 to 1 was therefore compromised, allowing geometrical consistency to maintain the rectangular geometry among the more regularly visited states.
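The idea of projecting relations onto an additive space can be loosely illustrated with a least-squares fit: if every pairwise relation is expressed as a difference of per-state coordinates, the fitted differences are additive by construction. The snippet below is a simplified, unweighted sketch for a single coordinate with made-up offsets; it is not the paper's actual projection (Section 4.3.2), which is weighted and also handles angular relations.

```python
import numpy as np

# Noisy pairwise x-offsets between 4 states around a loop (made-up numbers).
# pairs[(i, j)] is the estimated offset from state i to state j.
pairs = {(0, 1): 10.2, (1, 2): 0.1, (2, 3): -9.8, (3, 0): -0.4,
         (0, 2): 9.7}

n = 4
rows, rhs = [], []
for (i, j), d in pairs.items():
    row = np.zeros(n)
    row[j] += 1.0   # offset from i to j is p[j] - p[i]
    row[i] -= 1.0
    rows.append(row)
    rhs.append(d)
# Anchor state 0 at the origin to remove the translational degree of freedom.
rows.append(np.eye(n)[0])
rhs.append(0.0)

# Least-squares positions, then exactly-additive projected offsets.
p, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
additive = {(i, j): p[j] - p[i] for (i, j) in pairs}
```

By construction, the projected offsets satisfy relations such as (0 to 1) + (1 to 2) = (0 to 2), and they sum to zero around the loop.
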
For the purpose of quantitatively evaluating the learning algorithm, we list in Table 2 the KL divergence between the true and learnt model, as well as the number of iterations until convergence was reached, for each of the 5 simulation sequences with and without odometric information, averaged over 10 runs per sequence. The table demonstrates that the KL divergence with respect to the true model for models learnt using odometric data is about 8 times smaller than for models learnt without it. To check the significance of our results we again use the simple two-sample t-test. The models learnt using odometric information have a highly statistically significantly (p < 0.0005) lower average KL divergence than the others. In addition, the number of iterations required for convergence when learning using odometric information is smaller than that required when ignoring such information. Again, the t-test verifies the significance (p < 0.005) of this result.

Seq. #              1        2        3        4        5
With Odo  KL        1.46     1.18     1.20     1.02     1.22
          Iter #    11.8     36.8     30.7     24.6     33.3
No Odo    KL        6.91     9.93     10.03    9.54     12.43
          Iter #    113.3    113.1    102.0    104.2    112.5

Table 2: Average results of 2 learning settings with 5 training sequences.

It is important to point out that the number of iterations, although much lower, does not automatically imply that our algorithm runs in less time than the non-odometric Baum-Welch. The major bottleneck is caused by the need to compute, within the forward-backward calculations described in Section 4.2.1, the values of the normal and the von Mises densities. These require the calculation of exponent terms rather than simple multiplications, slowing down each iteration under the current naive implementation. However, we can solve this by augmenting the program with look-up tables for obtaining the relevant values rather than calculating them.
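The look-up table idea can be sketched as follows. This is an illustrative reconstruction (the paper does not give a table resolution or indexing scheme); it tabulates the von Mises density f(theta) = exp(kappa cos(theta - mu)) / (2 pi I0(kappa)) on a grid of wrapped angle differences, so each evaluation becomes an index lookup instead of an exponentiation.

```python
import numpy as np

def von_mises_pdf(theta, mu, kappa):
    """Exact von Mises density; np.i0 is the modified Bessel function I0."""
    return np.exp(kappa * np.cos(theta - mu)) / (2.0 * np.pi * np.i0(kappa))

class VonMisesTable:
    """Tabulated density over the wrapped difference (theta - mu), trading a
    small approximation error for avoiding exp() on every evaluation."""
    def __init__(self, kappa, resolution=4096):
        self.resolution = resolution
        grid = np.linspace(-np.pi, np.pi, resolution, endpoint=False)
        self.table = von_mises_pdf(grid, 0.0, kappa)

    def pdf(self, theta, mu):
        # Map the wrapped difference to its grid cell.
        d = np.mod(theta - mu + np.pi, 2.0 * np.pi) - np.pi
        idx = np.floor((d + np.pi) / (2.0 * np.pi) * self.resolution).astype(int)
        return self.table[idx % self.resolution]

table = VonMisesTable(kappa=4.0)
angles = np.linspace(-np.pi, np.pi, 1000)
approx = table.pdf(angles, mu=0.5)
exact = von_mises_pdf(angles, mu=0.5, kappa=4.0)
```

At this resolution the lookup agrees with the exact density to well under 0.01 everywhere, which is ample for the forward-backward weighting described above.
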
In addition, we can take advantage of the symmetry in the relations table to cut down on the amount of calculation required. It is also possible to use the fact that many odometric relations remain unchanged from one iteration to the next (particularly in the later iterations of the algorithm), so values can be cached and shared between iterations rather than recalculated at each iteration.

6.5 Reducing the Amount of Data

Learning HMMs obviously requires visiting states and transitioning between them multiple times, to gather sufficient data for robust statistical estimation. Intuitively, exploiting odometric data can help reduce the number of visits needed for obtaining a reliable model. To examine the influence of a reduction in the length of the data sequences on the quality of the learnt models, we took one of the 5 sequences and used its prefixes of length 100 to 800 (the complete sequence), in increments of 100, as training sequences. We ran the two algorithmic settings over each of the 8 prefix sequences, 10 times each. We then used the KL divergence as described above to evaluate each of the resulting models with respect to the true model. For each prefix length we averaged the KL divergence over the 10 runs. The plot in Figure 21 depicts the average KL divergence as a function of the sequence length for each of the two settings. It demonstrates that, in terms of the KL divergence, our algorithm, which uses odometric information, is robust in the face of data reduction (down to 200 data points). In contrast, learning without the use of odometry quickly deteriorates as the amount of data is reduced. We note that the data sequence is twice as "wide" when odometry is used than when it is not; that is, there is more information in each element of the sequence when odometry data is recorded.
However, the effort of recording this additional odometric information is negligible, and is well rewarded by the fact that fewer observations and less exploration are required for obtaining a data sequence sufficient for adequate learning.

Figure 21: Average KL divergence as a function of sequence length (curves: Odometry Used, No Odometry).

7 Conclusions

Odometric information, which is often readily available in the robotics domain, makes it possible to learn hidden Markov models efficiently and effectively, while using shorter training sequences. More importantly, in contrast to the traditional perception that views the topological and the geometric models as two distinct types of entities, we have shown that odometric information can be directly incorporated into the traditional topological HMM, while maintaining convergence of the reestimation algorithm to a local maximum of the likelihood function.

Our method uses the odometric information in two ways. We first choose an initial model based on the odometric information. An iterative procedure, which extends the Baum-Welch algorithm, is then used to learn the topological model of the environment while learning an additional set of constrained geometric parameters. This additional set of constrained parameters constitutes an extension to the basic HMM/POMDP model of transitions and observations. Even though we are primarily interested in the underlying topological model (transition and observation probabilities), our experiments demonstrate that the use of odometric relations can reduce the number of iterations and the amount of data required by the algorithm, and improve the resulting model.
The initialization procedure and the enforcement of the additivity constraint over relatively small models prove helpful both topologically and geometrically. An extensive study (Shatkay, 1999) shows that for long data sequences generated from large models, enforcing only anti-symmetry, rather than additivity, leads to better topological models. This is because in these cases initialization is not always good, and additivity may over-constrain the learning to an unfavorable area. Learning large models may benefit from enforcing only anti-symmetry during the first few iterations, and complete additivity in later iterations. Alternatively, we may use our algorithm, enforcing additivity, to learn separate models for small portions of the environment, later combining them into one complete model. A similar idea of combining small model fragments into a complete map of an environment was applied, in the context of geometrical maps, in recent work by Leonard and Feder (2000).

The work presented here demonstrates how domain-specific information and constraints can be enforced as part of the statistical estimation process, resulting in better models while requiring shorter data sequences. We strongly believe that this idea can be applied in domains other than robotics. In particular, the acquisition of HMMs for use in molecular biology may greatly benefit from exploiting geometrical (and other) constraints on molecular structures. Similarly, temporal constraints may be exploited in domains in which POMDPs are appropriate for decision support, such as air-traffic control and medicine.
Acknowledgments

We thank Sebastian Thrun for his insightful comments throughout this work, John Hughes and Luis Ortiz for their helpful advice, Anthony Cassandra for his code for generating random distributions, Bill Smart for sustaining Ramona, and Jim Kurien for providing the low-level code for driving her. The presentation in this paper has benefited from the comments made by the anonymous referees, to whom we are grateful. This work was done while both authors were at the Computer Science Department at Brown University, and was supported by DARPA/Rome Labs Planning Initiative grant F30602-95-1-0020, by NSF grants IRI-9453383 and IRI-9312395, and by the Brown University Graduate Research Fellowship.

Appendix A. An Overview of the Odometric Learning Algorithm

The algorithm takes as input an experience sequence E = ⟨r, V⟩, consisting of the odometric sequence r and the observation sequence V, as defined in the beginning of Section 4.2. The number of states is also assumed to be given.

Learn_Odometric_HMM(E)
 1  Initialize matrices A, B, R                        (see Section 5)
 2  max_change ← 1
 3  while (max_change > ε)
 4      do Calculate forward probabilities, α          (Equation 4)
 5         Calculate backward probabilities, β         (Equation 5)
 6         Calculate state-occupation probabilities, γ (Equation 6)
 7         Calculate state-transition probabilities, ξ (Equation 7)
 8         Old_A ← A; Old_B ← B
 9         A ← Reestimate(A)                           (Equation 8, left)
10         B ← Reestimate(B)                           (Equation 8, right)
11         R ← Reestimate(R)                           (Equations 12 and 13)
12         ⟨R_x, R_y⟩ ← Reestimate(R_x, R_y)           (Equations 10 and 11)
13         max_change ← MAX(Get_Max_Change(A, Old_A), Get_Max_Change(B, Old_B))

The equations referenced in Step 12 correspond to updates under the perpendicularity assumption, where a global framework is used. See (Shatkay, 1999) for the update formulae within a state-relative framework.
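The topological part of this control flow can be rendered executable. The sketch below is our illustrative reconstruction of one reestimation iteration (steps 4-10) and of Get_Max_Change for a plain discrete HMM; the odometric reestimation of R (steps 11-12) and the initialization of Section 5 are omitted, and the per-step scaling is the standard scheme from the HMM literature rather than anything specified in the paper.

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One EM iteration (steps 4-10 above) for a discrete HMM.

    Returns reestimated (A, B) and the log-likelihood of obs under the
    parameters passed in. pi is held fixed in this sketch.
    """
    N, T = A.shape[0], len(obs)
    # Forward pass with per-step normalization (step 4).
    alpha, c = np.zeros((T, N)), np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    # Backward pass reusing the scaling constants (step 5).
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t+1]] * beta[t+1])) / c[t+1]
    # State-occupation and state-transition probabilities (steps 6-7).
    gamma = alpha * beta
    xi = np.zeros((N, N))
    for t in range(T - 1):
        xi += np.outer(alpha[t], B[:, obs[t+1]] * beta[t+1]) * A / c[t+1]
    # Reestimation (steps 9-10).
    new_A = xi / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, np.log(c).sum()

def get_max_change(M, old_M):
    """Maximal element-wise absolute difference (Get_Max_Change above)."""
    return np.abs(M - old_M).max()

# Toy run: the EM log-likelihood should be non-decreasing across iterations.
obs = np.array([0, 0, 1, 1, 0, 1, 1, 1, 0, 0] * 10)
pi = np.array([0.5, 0.5])
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
lls = []
for _ in range(20):
    A, B, ll = baum_welch_step(pi, A, B, obs)
    lls.append(ll)
```

Because each iteration is an EM step, the log-likelihood sequence is non-decreasing, which gives a sanity check alongside the max_change convergence test of step 3.
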
If additivity is enforced, step 11 is followed by a projection of the reestimated R onto an additive affine space, as described in Section 4.3.2. In addition, step 12 is substituted by the procedure described in Section 4.3.1. The reader is referred again to (Shatkay, 1999) for further detail. Get_Max_Change is a function that takes two matrices and returns the maximal element-wise absolute difference between them. ε is a constant denoting the margin of error on changes in parameters. When the change in parameters is "small enough", the model is regarded as "unchanged".

References

Abe, N., & Warmuth, M. K. (1992). On the computational complexity of approximating distributions by probabilistic automata. Machine Learning, 9(2), 205-260.

Angluin, D. (1987). Learning regular sets from queries and counterexamples. Information and Computation, 75, 87-106.

Asada, M. (1991). Map building for a mobile robot from sensory data. In Iyengar, S. S., & Elfes, A. (Eds.), Autonomous Mobile Robots, pp. 312-322. IEEE Computer Society Press.

Bartels, R. (1984). Estimation in a bidirectional mixture of von Mises distributions. Biometrics, 40, 777-784.

Basye, K., Dean, T., & Kaelbling, L. P. (1995). Learning dynamics: System identification for perceptually challenged agents. Artificial Intelligence, 72(1).

Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1), 164-171.

Cassandra, A. R., Kaelbling, L. P., & Kurien, J. A. (1996). Acting under uncertainty: Discrete Bayesian models for mobile-robot navigation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems.

DeGroot, M. H. (1986). Probability and Statistics (2nd edition). Addison-Wesley.

Dempster, A. P., Laird, N. M., & Rubin, D.
B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1), 1-38.

Dissanayake, G., Newman, P., Clark, S., Durrant-Whyte, H. F., & Csorba, M. (2001). A solution to the simultaneous localization and map building (SLAM) problem. IEEE Transactions on Robotics and Automation, 17(3).

Duda, R. O., & Hart, P. E. (1973). Unsupervised Learning and Clustering, chap. 6. John Wiley and Sons.

Elfes, A. (1989). Using occupancy grids for mobile robot perception and navigation. Computer, Special Issue on Autonomous Intelligent Machines, 22(6), 46-57.

Engelson, S. P., & McDermott, D. V. (1992). Error correction in mobile robot map learning. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 2555-2560, Nice, France.

Gold, E. M. (1978). Complexity of automaton identification from given data. Information and Control, 37, 302-320.

Gumbel, E. G., Greenwood, J. A., & Durand, D. (1953). The circular normal distribution: Theory and tables. American Statistical Society Journal, 48, 131-152.

Hopcroft, J. E., & Ullman, J. D. (1979). Introduction to Automata Theory, Languages, and Computation. Addison-Wesley.

Juang, B. H. (1985). Maximum likelihood estimation for mixture multivariate stochastic observations of Markov chains. AT&T Technical Journal, 64(6).

Juang, B. H., & Rabiner, L. R. (1985). A probabilistic distance measure for hidden Markov models. AT&T Technical Journal, 64(2), 391-408.

Koenig, S., & Simmons, R. G. (1996a). Passive distance learning for robot navigation. In Proceedings of the Thirteenth International Conference on Machine Learning, pp. 266-274.

Koenig, S., & Simmons, R. G. (1996b). Unsupervised learning of probabilistic models for robot navigation.
In Proceedings of the IEEE International Conference on Robotics and Automation.

Kuipers, B., & Byun, Y.-T. (1991). A robot exploration and mapping strategy based on a semantic hierarchy of spatial representations. Journal of Robotics and Autonomous Systems, 8, 47-63.

Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22(1), 79-86.

Leonard, J., Durrant-Whyte, H. F., & Cox, I. J. (1991). Dynamic map building for an autonomous mobile robot. In Iyengar, S. S., & Elfes, A. (Eds.), Autonomous Mobile Robots, pp. 331-338. IEEE Computer Society Press.

Leonard, J. J., & Feder, H. J. S. (2000). A computationally efficient method for large-scale concurrent mapping and localization. In Hollerbach, J., & Koditschek, D. (Eds.), Proceedings of the Ninth International Symposium on Robotics Research.

Liporace, L. A. (1982). Maximum likelihood estimation for multivariate observations of Markov sources. IEEE Transactions on Information Theory, 28(5).

Mardia, K. V. (1972). Statistics of Directional Data. Academic Press.

Mataric, M. J. (1990). A distributed model for mobile robot environment-learning and navigation. Master's thesis, MIT, Artificial Intelligence Laboratory.

McLachlan, G. J., & Krishnan, T. (1997). The EM Algorithm and Extensions. John Wiley & Sons.

Moravec, H. P. (1988). Sensor fusion in certainty grids for mobile robots. AI Magazine, 9(2), 61-74.

Moravec, H. P., & Elfes, A. (1985). High resolution maps from wide angle sonar. In Proceedings of the International Conference on Robotics and Automation, pp. 116-121.

Nourbakhsh, I., Powers, R., & Birchfield, S. (1995). Dervish: An office-navigating robot. AI Magazine, 16(1), 53-60.

Pierce, D., & Kuipers, B. (1997). Map learning with uninterpreted sensors and effectors. Artificial Intelligence, 92(1-2), 169-227.

Rabiner, L. R.
(1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-285.

Rivest, R. L., & Schapire, R. E. (1987). Diversity based inference of finite automata. In Proceedings of the IEEE Twenty-Eighth Annual Symposium on Foundations of Computer Science, pp. 78-87, Los Angeles, California.

Rivest, R. L., & Schapire, R. E. (1989). Inference of finite automata using homing sequences. In Proceedings of the Twenty-First Annual Symposium on Theory of Computing, pp. 411-420, Seattle, Washington.

Ron, D., Singer, Y., & Tishby, N. (1994). Learning probabilistic automata with variable memory length. In Proceedings of the Seventh Annual Workshop on Computational Learning Theory, pp. 35-46.

Ron, D., Singer, Y., & Tishby, N. (1995). On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of the Eighth Annual Workshop on Computational Learning Theory, pp. 31-40.

Ron, D., Singer, Y., & Tishby, N. (1998). On the learnability and usage of acyclic probabilistic finite automata. Journal of Computer and Systems Science, 56(2).

Shatkay, H. (1999). Learning Models for Robot Navigation. Ph.D. thesis, Department of Computer Science, Brown University, Providence, RI.

Shatkay, H., & Kaelbling, L. P. (1997). Learning topological maps with weak local odometric information. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, Nagoya, Japan.

Shatkay, H., & Kaelbling, L. P. (1998). Heading in the right direction. In Proceedings of the Fifteenth International Conference on Machine Learning, Madison, Wisconsin.

Simmons, R. G., & Koenig, S. (1995). Probabilistic navigation in partially observable environments. In Proceedings of the International Joint Conference on Artificial Intelligence.

Smith, R., Self, M., & Cheeseman, P. (1991).
A stochastic map for uncertain spatial relationships. In Iyengar, S. S., & Elfes, A. (Eds.), Autonomous Mobile Robots, pp. 323-330. IEEE Computer Society Press.

Thrun, S. (1999). Learning metric-topological maps for indoor mobile robot navigation. AI Journal, 1, 21-71.

Thrun, S., & Bücken, A. (1996a). Integrating grid-based and topological maps for mobile robot navigation. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pp. 944-950.

Thrun, S., & Bücken, A. (1996b). Learning maps for indoor mobile robot navigation. Tech. rep. CMU-CS-96-121, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.

Thrun, S., Burgard, W., & Fox, D. (1998a). A probabilistic approach to concurrent map acquisition and localization for mobile robots. Machine Learning, 31, 29-53.

Thrun, S., Gutmann, J.-S., Fox, D., Burgard, W., & Kuipers, B. J. (1998b). Integrating topological and metric maps for mobile robot navigation: A statistical approach. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pp. 989-995.

Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer.