Axioms for Rational Reinforcement Learning

Peter Sunehag and Marcus Hutter
Research School of Computer Science
Australian National University
Canberra, ACT, 0200, Australia
{Peter.Sunehag,Marcus.Hutter}@anu.edu.au

July 2011

Abstract

We provide a formal, simple and intuitive theory of rational decision making including sequential decisions that affect the environment. The theory has a geometric flavor, which makes the arguments easy to visualize and understand. Our theory is for complete decision makers, which means that they have a complete set of preferences. Our main result shows that a complete rational decision maker implicitly has a probabilistic model of the environment. We have a countable version of this result that brings light on the issue of countable vs. finite additivity by showing how it depends on the geometry of the space which we have preferences over. This is achieved through fruitfully connecting rationality with the Hahn-Banach Theorem. The theory presented here can be viewed as a formalization and extension of the betting odds approach to probability of Ramsey and De Finetti [Ram31, deF37].

Contents

1 Introduction
2 Rational Decisions for Accepting or Rejecting Contracts
3 Countable Sets of Events
4 Rational Agents for Classes of Environments
5 Conclusions

Keywords

Rationality; Probability; Utility; Banach Space; Linear Functional.

1 Introduction

We study complete decision makers that can take a sequence of actions to rationally pursue any given task. We suppose that the task is described in a reinforcement learning framework where the agent takes actions and receives observations and rewards. The aim is to maximize total reward in some given sense. Rationality is meant in the sense of internal consistency [Sug91], which is how it has been used in [NM44] and [Sav54].
In [NM44], it is proven that preferences together with rationality axioms and probabilities for possible events imply the existence of utility values for those events that explain the preferences as arising through maximizing expected utility. Their rationality axioms are

1. Completeness: Given any two choices we either prefer one of them to the other or we consider them to be equally preferable;
2. Transitivity: A preferable to B and B to C imply A preferable to C;
3. Independence: If A is preferable to B and $t \in [0,1]$ then $tA + (1-t)C$ is preferable (or equal) to $tB + (1-t)C$;
4. Continuity: If A is preferable to B and B to C then there exists $t \in [0,1]$ such that B is equally preferable to $tA + (1-t)C$.

In [Sav54] the probabilities are not given but it is instead proven that preferences together with rationality axioms imply the existence of probabilities and utilities. We are here interested in the case where one is given utility (rewards) and preferences over actions and then derives the existence of a probabilistic world model. We put an emphasis on extensions to sequential decision making with respect to a countable class of environments. We set up simple axioms for a rational decision maker, which imply that the decisions can be explained (or defined) from probabilistic beliefs.

The theory of [Sav54] is called subjective expected utility theory (SEUT) and was intended to provide statistics with a strictly behavioral foundation. The behavioral approach stands in stark contrast to approaches that directly postulate axioms that "degrees of belief" should satisfy [Cox46, Hal99, Jay03]. Cox's approach [Cox46, Jay03] has also been found [Par94] to need technical assumptions beyond the common sense axioms originally listed by Cox. The original proof by [Cox46] has been exposed as not mathematically rigorous and his theorem as wrong [Hal99].
An alternative approach by [Ram31, deF37] is interpreting probabilities as fair betting odds.

The theory of [Sav54] has greatly influenced economics [Sug91] where it has been used as a description of rational agents. Seemingly strange behavior was explained as having beliefs (probabilities) and tastes (utilities) that were different from those of the person to whom it looked irrational. This has turned out to be insufficient as a description of human behavior [All53, Ell61] and it is better suited as a normative theory or design principle in artificial intelligence. In this article, we are interested in studying the necessity for rational agents (biological or not) to have a probabilistic model of their environment. To achieve this, and to have as simple common sense axioms of rationality as possible, we postulate that given any set of values (a contract) associated with the possible events, the decision maker needs to have an opinion on whether he prefers these values to a guaranteed zero outcome or not (or equal). From this setting and our other rationality axioms we deduce the existence of probabilities that explain all preferences as maximizing expected value. There is an intuitive similarity to the idea of explaining/deriving probabilities as a bookmaker's betting odds as done in [deF37] and [Ram31]. One can argue that the theory presented here (in Section 2) is a formalization and extension of the betting odds approach. Geometrically, the result says that there is a hyperplane in the space of contracts that separates accept from reject. We generalize this statement, by using the Hahn-Banach Theorem, to the countable case where the set of hyperplanes (the dual space) depends on the space of contracts. The answers for different cases can then be found in the Banach space theory literature. This provides a new approach to understanding issues like finite vs. countable additivity. We take advantage of this to formulate rational agents that can deal successfully with countable (possibly universal as in all computable environments) classes of environments.

Our presentation begins in Section 2 by first looking at a fundamental case where one has to accept or reject certain contracts defining positive and negative rewards that depend on the outcome of an event with finitely many possibilities. To draw the conclusion that there are implicit unique probabilistic beliefs, it is important that the decision maker has an opinion (acceptable, rejectable or both) on every possible contract. This is what we mean when we say complete decision maker.

In a more general setting, we consider sequential decision making where given any contract on the sequence of observations and actions, the decision maker must be able to choose a policy (i.e. an action tree). Note that the actions may affect the environment. A contract on such a sequence can e.g. be viewed as describing a reward structure for a task. An example of a task is a cleaning robot that gets positive rewards for collecting dust and negative rewards for falling down the stairs. A prerequisite for being able to continue to collect dust can be to recharge the battery before running out. A specialized decision maker that deals only with one contract/task does not always need to have implicit probabilities; qualitative beliefs can suffice to take reasonable decisions. A qualitative belief can be that one pizza delivery company (e.g. Pizza Hut vs Dominos) is more likely to arrive on time than the other. If one believes the pizzas are equally good and the price is the same, one will choose the company one believes is more often delivering on time.
Considering all contracts (reward structures) on the actions and events leads to a situation where having a way of making rational (coherent) decisions implies that the decision maker has implicit probabilistic beliefs. We say that the probabilities are implicit because the decision maker, which might e.g. be a human, a dog, a computer or just a set of rules, might have a non-probabilistic description of how the decisions are made.

In Section 3, we investigate extensions to the case with countably many possible outcomes and the interesting issue of countable versus finite additivity. Savage's axioms are known to only lead to finite additivity while [Arr70] showed that adding a monotone continuity assumption guarantees countable additivity. We find that in our setting, it depends on the space of contracts in an interesting way. In Section 4, we discuss a setting where we have a class of environments.

2 Rational Decisions for Accepting or Rejecting Contracts

We consider a setting where we observe a symbol (letter) from a finite alphabet and we are offered a form of bet we call a contract that we can accept or not.

Definition 1 (Passive Environment, Event). A passive environment is a sequence of symbols (letters) $j_t$, called events, being presented one at a time. At time $t$ the symbols $j_1, \dots, j_t$ are available. We can equivalently say that a passive environment is a function $\nu$ from finite strings to $\{0,1\}$ where $\nu(j_1, \dots, j_t) = 1$ if and only if the environment begins with $j_1, \dots, j_t$.

Definition 2 (Contract). Suppose that we have a passive environment with symbols from an alphabet with $m$ elements. A contract for an event is an element $x = (x_1, \dots, x_m)$ in $\mathbb{R}^m$ and $x_j$ is the reward received if the event is the $j$:th symbol, under the assumption that the contract is accepted (see next definition).
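As a small illustration of Definitions 1 and 2, the following sketch encodes a passive environment as an indicator function on finite strings and a contract as a vector of rewards. The helper names `make_passive_environment` and `contract_reward`, the finite symbol sequence and the numbers are assumptions made for this example only; they do not appear in the text.

```python
from typing import Callable, Sequence


def make_passive_environment(symbols: Sequence[int]) -> Callable[[Sequence[int]], int]:
    """Build nu with nu(j_1, ..., j_t) = 1 iff the environment begins with j_1, ..., j_t."""
    def nu(prefix: Sequence[int]) -> int:
        # A prefix longer than the stored (finite, hypothetical) sequence never matches.
        return int(list(symbols[:len(prefix)]) == list(prefix))
    return nu


def contract_reward(x: Sequence[float], j: int) -> float:
    """Reward x_j of an accepted contract when the event is the j-th symbol (1-based)."""
    return x[j - 1]


# Hypothetical environment over the alphabet {1, 2, 3} that presents 2, 1, 3.
nu = make_passive_environment([2, 1, 3])
x = (1.0, -2.0, 0.5)  # a contract in R^3

print(nu([2, 1]), nu([1]))    # a consistent prefix and an inconsistent one
print(contract_reward(x, 2))  # reward if the event is the 2nd symbol
```

The contract is just a point in $\mathbb{R}^m$; whether it is accepted is decided separately by the decision maker.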
Definition 3 (Decision Maker, Decision). A decision maker (for some unknown environment) is a set $Z \subset \mathbb{R}^m$ which defines exactly the contracts that are acceptable. In other words, a decision maker is a function from $\mathbb{R}^m$ to {accepted, rejected, either}. The function value is called the decision.

If $x \in Z$ and $\lambda \ge 0$ then we want $\lambda x \in Z$ since it is simply a multiple of the same contract. We also want the sum of two acceptable contracts to be acceptable. If we cannot lose money we are prepared to accept the contract. If we are guaranteed to win money we are not prepared to reject it. We summarize these properties in the definition below of a rational decision maker.

Definition 4 (Rationality I). We say that the decision maker ($Z \subset \mathbb{R}^m$) is rational if

1. Every contract $x \in \mathbb{R}^m$ is either acceptable or rejectable or both;
2. $x$ is acceptable if and only if $-x$ is rejectable;
3. If $x, y \in Z$ and $\lambda, \gamma \ge 0$ then $\lambda x + \gamma y \in Z$;
4. If $x_k \ge 0 \ \forall k$ then $x = (x_1, \dots, x_m) \in Z$ while if $x_k < 0 \ \forall k$ then $x \notin Z$.

If we want to compare these axioms to rationality axioms for a preference relation on contracts we will say that $x$ is better or equal (as in equally good) than $y$ if $x - y$ is acceptable while it is worse or equal if $x - y$ is rejectable. The first axiom is completeness. The second says that if $x$ is better or equal than $y$ then $y$ is worse or equal to $x$. The third implies transitivity since $(x - y) + (y - z) = (x - z)$. The fourth says that if $x$ has a better (or equal) reward than $y$ for any event, then $x$ is better (or equal) than $y$.

2.1 Probabilities and Expectations

Theorem 5 (Existence of Probabilities). Given a rational decision maker, there are numbers $p_i \ge 0$ that satisfy

$$\{x \mid \textstyle\sum_i x_i p_i > 0\} \subset Z \subseteq \{x \mid \textstyle\sum_i x_i p_i \ge 0\}. \tag{1}$$

Assuming $\sum_i p_i = 1$ makes the numbers unique and we will use the notation $Pr(i) = p_i$.

Proof. See the proof of the more general Theorem 23. It tells us that the closure $\bar{Z}$ of $Z$ is a closed half space and can be written as $\{x \mid \sum_i x_i p_i \ge 0\}$ for some vector $p = (p_i)$ (since every linear functional on $\mathbb{R}^m$ is of the form $f(x) = \sum_i x_i p_i$) and not every $p_i$ is 0. The fourth property tells us that $p_i \ge 0 \ \forall i$.

Definition 6 (Expectation). We will refer to the function $g(x) = \sum_i p_i x_i$ from (1) as the decision maker's expectation. In this terminology, a rational decision maker has an expectation function and accepts a contract $x$ if $g(x) > 0$ and rejects it if $g(x) < 0$.

Remark 7. Suppose that we have a contract $x = (x_i)$ where $x_i = 1$ for all $i$. If we want $g(x) = 1$, we need $\sum_i p_i = 1$.

We will write $E(x)$ instead of $g(x)$ (assuming $\sum_i p_i = 1$) from now on and call it the expected value or expectation of $x$.

2.2 Multiple Events

Suppose that the contract is such that we can view the symbol to be drawn as consisting of two (or several) symbols from smaller alphabets. That is, we can write a drawn symbol as $(i, j)$ where all the possibilities can be found through $1 \le i \le m$, $1 \le j \le n$. In this way of writing, a contract is defined by real numbers $x_{i,j}$. Theorem 5 tells us that for a rational decision maker there exist unique $r_{i,j} \ge 0$ such that $\sum_{i,j} r_{i,j} = 1$ and an expectation function $g(x) = \sum_{i,j} r_{i,j} x_{i,j}$ such that contracts are accepted if $g(x) > 0$ and rejected if $g(x) < 0$.

2.3 Marginals

Suppose that we can take rational decisions on bets for a pair of horse races, while the person that offers us bets only cares about the first race. Then we are still equipped to respond since the bets that only depend on the first race are a subset of all bets on the pair of races.

Definition 8 (Marginals). Suppose that we have a rational decision maker ($Z$) for contracts on the events $(i, j)$. Then we say that the marginal decision maker for the first symbol ($Z_1$) is the restriction of the decision maker $Z$ to the contracts $x_{i,j}$ that only depend on $i$, i.e. $x_{i,j} = x_i$. In other words, given a contract $y = (y_i)$ on the first event, we extend that contract to a contract on $(i, j)$ by letting $y_{i,j} = y_i$ and then the original decision maker can decide.

Suppose that $x_{i,j} = x_i$. Then the expectation $\sum_{i,j} r_{i,j} x_{i,j}$ can be rewritten as $\sum_i p_i x_i$ where $p_i = \sum_j r_{i,j}$. We write that

$$Pr(i) = \sum_j Pr(i, j).$$

These are the marginal probabilities for the first variable that describe the marginal decision maker for that variable. Naturally we can also define a marginal for the second variable (considering contracts $x_{i,j} = x_j$) by letting $q_j = \sum_i r_{i,j}$ and $Pr(j) = \sum_i Pr(i, j)$. The marginals define sets $Z_1 \subset \mathbb{R}^m$ and $Z_2 \subset \mathbb{R}^n$ of acceptable contracts on the first and second variables separately.

2.4 Conditioning

Again suppose that we are taking decisions on bets for a pair of horse races, but this time suppose that the first race is already over and we know the result. We are still equipped to respond to bets on the second race by extending the bet to a bet on both where there is no reward for (pairs of) events that are inconsistent with what we know.

Definition 9 (Conditioning). Suppose that we have a rational decision maker ($Z$) for contracts on the events $(i, j)$. We define the conditional decision maker $Z_{j=j_0}$ for $i$ given $j = j_0$ by restricting the original decision maker $Z$ to contracts $x_{i,j}$ which are such that $x_{i,j} = 0$ if $j \neq j_0$. In other words, if we start with a contract $y = (y_i)$ on $i$ we extend it to a contract on $(i, j)$ by letting $y_{i,j_0} = y_i$ and $y_{i,j} = 0$ if $j \neq j_0$. Then the original decision maker can make a decision for that contract.

Suppose that $x_{i,j} = 0$ if $j \neq j_0$.
The unconditional expectation of this contract is $\sum_{i,j} r_{i,j} x_{i,j}$ as usual, which equals $\sum_i r_{i,j_0} x_{i,j_0}$. This leads to the same decisions (i.e. the same $Z$) as using $\sum_i \frac{r_{i,j_0}}{\sum_k r_{k,j_0}} x_{i,j_0}$, which is of the form in Theorem 5. We write that

$$Pr(i \mid j_0) = \frac{Pr(i, j_0)}{\sum_k Pr(k, j_0)} = \frac{Pr(i, j_0)}{Pr(j_0)}. \tag{2}$$

From this it follows that

$$Pr(i_0)\,Pr(j_0 \mid i_0) = Pr(j_0)\,Pr(i_0 \mid j_0) \tag{3}$$

which is one way of writing Bayes rule.

2.5 Learning

In the previous section we defined conditioning, which leads us to a definition of what it means to learn. Given that we have probabilities for events that are sequences of a certain number of symbols and we have observed one or several of them, we use conditioning to determine what our belief regarding the remaining symbols should be.

Definition 10 (Learning). Given a rational decision maker, defined by $p_{i_1,\dots,i_T}$ for the events $(i_t)_{t=1}^T$, and the first $t-1$ symbols $i_1, \dots, i_{t-1}$, we define the informed rational decision maker for $i_t$ by conditioning on the past $i_1, \dots, i_{t-1}$ and marginalizing over the future $i_{t+1}, \dots, i_T$. Formally,

$$P^{\text{informed}}_{i_t}(i) = Pr(i \mid i_1, \dots, i_{t-1}) = \frac{\sum_{j_{t+1},\dots,j_T} p_{i_1,\dots,i_{t-1},i,j_{t+1},\dots,j_T}}{\sum_{j_t,\dots,j_T} p_{i_1,\dots,i_{t-1},j_t,\dots,j_T}}.$$

2.6 Choosing between Contracts

Definition 11 (Choosing contract). We say that to rationally prefer contract $x$ over $y$ is (equivalent) to rationally consider $x - y$ to be acceptable.

As before we assume that we have a decision maker that takes rational decisions on accepting or rejecting contracts $x$ that are based on an event that will be observed. Hence there exist implicit probabilities that represent all choices and an expectation function.
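The marginalization and conditioning rules of Sections 2.3 and 2.4, together with Bayes rule (3), can be checked on a small table. The sketch below is only an illustration: the joint beliefs `r` and all function names are assumptions chosen for this example, and exact fractions are used so that the identities hold without rounding.

```python
from fractions import Fraction as F

# Hypothetical joint beliefs r[i][j] = Pr(i, j); the numbers are illustrative only.
r = [[F(1, 10), F(2, 10)],
     [F(3, 10), F(4, 10)]]


def marginal_first(r):
    """Pr(i) = sum_j Pr(i, j)."""
    return [sum(row) for row in r]


def marginal_second(r):
    """Pr(j) = sum_i Pr(i, j)."""
    return [sum(row[j] for row in r) for j in range(len(r[0]))]


def cond_first_given_second(r, j0):
    """Pr(i | j0) = Pr(i, j0) / sum_k Pr(k, j0), as in equation (2)."""
    pj0 = sum(row[j0] for row in r)
    return [row[j0] / pj0 for row in r]


def cond_second_given_first(r, i0):
    """Pr(j | i0) = Pr(i0, j) / sum_k Pr(i0, k)."""
    pi0 = sum(r[i0])
    return [v / pi0 for v in r[i0]]


p, q = marginal_first(r), marginal_second(r)

# Bayes rule (3): Pr(i0) Pr(j0 | i0) = Pr(j0) Pr(i0 | j0) for every pair (i0, j0).
bayes_holds = all(
    p[i] * cond_second_given_first(r, i)[j] == q[j] * cond_first_given_second(r, j)[i]
    for i in range(2) for j in range(2)
)
print(p, q, cond_first_given_second(r, 0), bayes_holds)
```

Dividing by $\sum_k r_{k,j_0}$ only rescales the expectation by a positive constant, which is why the conditional decision maker accepts and rejects exactly the same extended contracts.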

Suppose that an agent has to choose between action $a_1$ that leads to receiving reward $x_i$ if $i$ is drawn and action $a_2$ that leads to receiving $y_i$ in the case of seeing $i$. Let $z_i = x_i - y_i$. We can now go back to choosing between accepting and rejecting a contract by saying that choosing (preferring) $a_1$ over $a_2$ means accepting the contract $z$. In other words, if $E(x) > E(y)$ choose $a_1$ and if $E(x) < E(y)$ choose $a_2$.

Remark 12. We note that if we postulate that choosing between contract $x$ and the zero contract is the same as choosing between accepting or rejecting $x$, then being able to choose between contracts implies the ability to choose between accepting and rejecting one contract. We, therefore, can say that the ability to choose between a pair of contracts is equivalent to the ability to choose to accept or reject a single contract.

We can also choose between several contracts. Suppose that action $a_k$ gives us the contract $x^k = (x^k_i)_{i=1}^m$. If $E(x^j) > E(x^k) \ \forall k \neq j$ then we strictly prefer $a_j$ over all other actions. In other words, a contract $x^j - x^k$ would for all $k \neq j$ be accepted and not rejected by a rational decision maker.

Remark 13. If we have a rational decision maker for accepting or rejecting contracts, then there are implicitly probabilities $p_i$ for symbol $i$ that characterize the decisions. A rational choice between actions $a_k$ leading to contracts $x^k$ is taken by choosing action

$$a^* = \arg\max_k \sum_i p_i x^k_i. \tag{4}$$

2.7 Choosing between Environments

In this section, we assume that the event that the contracts are concerned with might be affected by the choice of action.

Definition 14 (Reactive environment). An environment is a tree with symbols $j_t$ (percepts) on the nodes and actions $a_t$ on the edges. We provide the environment with an action $a_t$ at each time $t$ and it presents the symbol $j_t$ at the node we arrive at by following the edge chosen by the action. We can also equivalently say that a reactive environment $\nu$ is a function from strings $a_1 j_1, \dots, a_t j_t$ to $\{0,1\}$ which equals 1 if and only if $\nu$ would produce $j_1, \dots, j_t$ given the actions $a_1, \dots, a_t$.

We will define the concept of a decision maker for the case where one decision will be taken in a situation where not only the contract, but also the outcome can depend on the choice. We do this by defining the choice as being between two different environments.

Definition 15 (Active decision maker). Consider a choice between having contract $x$ for passive environment $env_1$ or contract $y$ for passive environment $env_2$. A decision maker is a set $Z \subset \mathbb{R}^{m_1} \times \mathbb{R}^{m_2}$ which defines exactly the pairs $(x, y)$ for which we choose $env_1$ with $x$ over $env_2$ with $y$.

Definition 16 (Rational active choice). To choose between action $a_1$ with contract $x$ and $a_2$ with contract $y$ in a situation where the action may affect the event, we consider two separate environments, namely the environments that result from the two different actions. We would then have a situation where we will have one observation from each environment. Preferring $a_1$ with $x$ to $a_2$ with $y$ is (equivalent) to considering $x - y$ to be an acceptable contract for the pair of events.

Remark 17. Definition 16 means that $a_1$ with $x$ is preferred over $a_2$ with $y$ if $a_1$ with $x - y$ is preferred over $a_2$ with the zero contract.

Proposition 18 (Probabilities for reactive setting). Suppose that we have a reactive environment and a rational active decision maker that will make one choice between action $a_1$ and $a_2$ as described in Definitions 15 and 16. Then there exist $p_i \ge 0$ and $q_i \ge 0$ such that action $a_1$ with contract $x$ is preferred over action $a_2$ with contract $y$ if $\sum p_i x_i > \sum q_i y_i$ and the reverse if $\sum p_i x_i < \sum q_i y_i$. This means that the decision maker acts according to probabilities $Pr(\cdot \mid a_1)$ and $Pr(\cdot \mid a_2)$.

Proof. Let $\tilde{Z}$ be all contracts that, when combined with action $a_1$, are preferred over $a_2$ with the zero contract. Theorem 5 guarantees the existence of $p_i$ such that $\sum p_i x_i > 0$ implies that $x \in \tilde{Z}$ and $\sum p_i x_i < 0$ implies that $x \notin \tilde{Z}$. In the same way we find $q_i$ that describe when we prefer $a_2$ with $y$ to $a_1$ with the zero contract. That these probabilities ($p_i$ and $q_i$) explain the full decision maker as stated in the proposition now follows directly from Definition 16 understood as in Remark 17.

Suppose that we are going to make a sequence of $T < \infty$ decisions where at every point of time we will have a finite number of actions to choose between. We will consider contracts which can pay out some reward at each time step and that can depend on everything (actions chosen and symbols observed) that has happened up until this time, and we want to maximize the accumulated reward at time $T$. We can view the choice as just making one choice, namely choosing an action tree. We will sometimes call an action tree a policy.

Definition 19 (Action tree). An action tree is a function from histories of symbols $j_1, \dots, j_t$ and decisions $a_1, \dots, a_{t-1}$ to new decisions, given that the decisions were made according to the function. Formally,

$$f(a_1, j_1, \dots, a_{t-1}, j_{t-1}) = a_t.$$
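To make the idea of choosing an action tree as a single decision concrete, the following sketch enumerates all action trees for a toy problem with two time steps and picks the one with the largest expected accumulated reward. The transition probabilities `prob`, the reward rule `reward`, and the constants are assumptions invented for this example. For simplicity the sketch also assigns actions to histories the tree itself can never reach, which Definition 19 does not require.

```python
from itertools import product

ACTIONS = (0, 1)
SYMBOLS = (0, 1)
T = 2  # number of decisions


def prob(symbol, history, action):
    """Hypothetical Pr(symbol | history, action): action 1 makes symbol 1 more likely."""
    p_one = 0.8 if action == 1 else 0.3
    return p_one if symbol == 1 else 1.0 - p_one


def reward(history, action, symbol):
    """Hypothetical contract: pays the observed symbol minus a cost of 0.2 for action 1."""
    return symbol - 0.2 * action


def histories(T):
    """All histories (a_1, j_1, ..., a_k, j_k) with k < T, as flat tuples."""
    hs, frontier = [()], [()]
    for _ in range(T - 1):
        frontier = [h + (a, j) for h in frontier for a in ACTIONS for j in SYMBOLS]
        hs.extend(frontier)
    return hs


def all_policies(T):
    """Every function from histories to actions, represented as a dict."""
    hs = histories(T)
    for choice in product(ACTIONS, repeat=len(hs)):
        yield dict(zip(hs, choice))


def value(policy, history=(), t=0):
    """Expected accumulated reward at time T when following the action tree."""
    if t == T:
        return 0.0
    a = policy[history]
    return sum(prob(j, history, a) * (reward(history, a, j)
                                      + value(policy, history + (a, j), t + 1))
               for j in SYMBOLS)


best = max(all_policies(T), key=value)
print(best[()], value(best))
```

Enumerating trees is exponential in the horizon, so this is only meant to show that a sequence of choices reduces to one choice among finitely many policies.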

An action tree will assign exactly one action for any of the circumstances that one can end up in. That is, given the history up to any time $t < T$ of actions and events, we have a chosen action. We can, therefore, choose an action tree at time 0 and receive a total accumulated reward at time $T$. This brings us back to the situation of one event and one rational choice.

Definition 20 (Sequential decisions). Given a rational decision maker for the events $(j_t)_{t=1}^T$ and the first $t-1$ symbols $j_1, \dots, j_{t-1}$ and decisions $a_1, \dots, a_{t-1}$, we define the informed rational decision maker at time $t$ by conditioning on the past $a_1, j_1, \dots, a_{t-1}, j_{t-1}$.

Proposition 21 (Beliefs for sequential decisions). Suppose that we have a reactive environment and a rational decision maker that will take $T < \infty$ decisions. Furthermore, suppose that the decisions $0 \le t < T$ have been taken and resulted in history $a_1, j_1, \dots, a_{t-1}, j_{t-1}$. Then the decision maker's preferences at this time can be explained (through expected utility maximization) by probabilities

$$Pr(j_t, \dots, j_T \mid a_1, j_1, \dots, a_{t-1}, j_{t-1}, a_t, a_{t+1}, \dots, a_T).$$

Proof. Definition 20 and Proposition 18 immediately lead us to the conclusion that given a past up to a point $t-1$ and a policy for the time $t$ to $T$ we have probabilistic beliefs over the possible future sequences from time $t$ to $T$ and the choice is characterized by maximizing expected accumulated reward at time $T$.

3 Countable Sets of Events

Instead of a finite set of possible outcomes, we will in this section assume a countable set. We suppose that the set of contracts is a vector space of sequences $x_k$, $k = 0, 1, 2, \dots$ where we use pointwise addition and multiplication with scalar.
We will define a space by choosing a norm and let the space consist of the sequences that have finite norm, as is common in Banach space theory. If the norm makes the space complete it is called a Banach sequence space [Die84]. Interesting examples are $\ell^\infty$ of bounded sequences with the maximum (supremum) norm $\|(\alpha_k)\|_\infty = \sup_k |\alpha_k|$, $c_0$ of sequences that converge to 0 equipped with the same norm, and $\ell^p$ which for $1 \le p < \infty$ is defined by the norm

$$\|(\alpha_k)\|_p = \Big(\sum_k |\alpha_k|^p\Big)^{1/p}.$$

For all of these spaces we can consider weighted versions ($w_k > 0$) where

$$\|(\alpha_k)\|_{p,w_k} = \|(\alpha_k w_k)\|_p.$$

This means that $\alpha \in \ell^p(w)$ iff $(\alpha_k w_k) \in \ell^p$, e.g. $\alpha \in \ell^\infty(w)$ iff $\sup_k |\alpha_k w_k| < \infty$. Given a Banach (sequence) space $X$ we use $X'$ to denote the dual space that consists of all continuous linear functionals $f : X \to \mathbb{R}$. It is well known that a linear functional on a Banach space is continuous if and only if it is bounded, i.e. that there is $C < \infty$ such that $\frac{|f(x)|}{\|x\|} \le C$ for all nonzero $x \in X$. Equipping $X'$ with the norm $\|f\| = \sup_{x \neq 0} \frac{|f(x)|}{\|x\|}$ makes it into a Banach space. Some examples are $(\ell^1)' = \ell^\infty$, $c_0' = \ell^1$ and for $1 < p < \infty$ we have that $(\ell^p)' = \ell^q$ where $1/p + 1/q = 1$. These identifications are all based on formulas of the form

$$f(x) = \sum_i x_i p_i$$

where the dual space is the space that $(p_i)$ must lie in to make the functional both well defined and bounded. It is clear that $\ell^1 \subset (\ell^\infty)'$ but $(\ell^\infty)'$ also contains "stranger" objects.

The existence of these other objects can be deduced from the Hahn-Banach theorem (see e.g. [Kre89] or [NB97]) that says that if we have a linear functional defined on a subspace $Y \subset X$ and if it is bounded on $Y$ then there is an extension to a bounded linear functional on $X$. If $Y$ is dense in $X$ the extension is unique but in general it is not.
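The weighted norms and the pairing $f(x) = \sum_i x_i p_i$ can be tried out on finite truncations of sequences. The sketch below is only that, a finite illustration: genuine elements of $\ell^p$ or $\ell^\infty$ are infinite sequences, and the function names and numbers are assumptions made for this example.

```python
def weighted_norm(alpha, w, p):
    """||alpha||_{p,w} = ||(alpha_k w_k)||_p for a finite sequence.

    Passing p = float('inf') gives the maximum norm.
    """
    scaled = [abs(a * wk) for a, wk in zip(alpha, w)]
    if p == float("inf"):
        return max(scaled)
    return sum(s ** p for s in scaled) ** (1.0 / p)


def pairing(x, p):
    """The functional f(x) = sum_i x_i p_i on a finite truncation."""
    return sum(xi * pi for xi, pi in zip(x, p))


ones = [1.0, 1.0, 1.0]
x = [1.0, -2.0, 0.5]     # a bounded contract (truncated)
p = [0.5, 0.25, 0.25]    # summable weights (truncated)

# Boundedness of f on the bounded sequences: |f(x)| <= ||p||_1 * ||x||_inf.
bound = weighted_norm(p, ones, 1) * weighted_norm(x, ones, float("inf"))
print(pairing(x, p), bound)
```

The inequality checked here is the reason a summable $(p_i)$ defines a bounded functional on bounded sequences, which is the inclusion $\ell^1 \subset (\ell^\infty)'$ mentioned above.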

One can use the Hahn-Banach theorem by first looking at the subspace of all sequences in $\ell^\infty$ that converge and letting $f(\alpha) = \lim_{k \to \infty} \alpha_k$. The Hahn-Banach theorem guarantees the existence of extensions to bounded linear functionals that are defined on all of $\ell^\infty$. These are called Banach limits. The space $(\ell^\infty)'$ can be identified with the so-called ba space of bounded and finitely additive measures with the variation norm $\|\nu\| = |\nu|(A)$ where $A$ is the underlying set. Note that $\ell^1$ can be identified with the smaller space of countably additive bounded measures with the same norm. The Hahn-Banach Theorem has several equivalent forms. One of these identifies the hyperplanes with the bounded linear functionals [NB97].

Definition 22 (Rationality II). Given a Banach sequence space $X$ of contracts, we say that the decision maker (subset $Z$ of $X$ defining acceptable contracts) is rational if

1. Every contract $x \in X$ is either acceptable or rejectable or both;
2. $x$ is acceptable if and only if $-x$ is rejectable;
3. If $x, y \in Z$ and $\lambda, \gamma \ge 0$ then $\lambda x + \gamma y \in Z$;
4. If $x_k \ge 0 \ \forall k$ then $x = (x_k)$ is acceptable while if $x_k > 0 \ \forall k$ then $x$ is not rejectable.

Theorem 23 (Linear separation). Suppose that we have a space of contracts $X$ that is a Banach sequence space. Given a rational decision maker there is a positive continuous linear functional $f : X \to \mathbb{R}$ such that

$$\{x \mid f(x) > 0\} \subset Z \subseteq \{x \mid f(x) \ge 0\}. \tag{5}$$

Proof. The third property tells us that $Z$ and $-Z$ are convex cones. The second and fourth properties tell us that $Z \neq X$. Suppose that there is a point $x$ that lies in both the interior of $Z$ and of $-Z$. Then the same is true for $-x$ according to the second property and for the origin. That a ball around the origin lies in $Z$ means that $Z = X$, which is not true. Thus the interiors of $Z$ and $-Z$ are disjoint open convex sets and can, therefore, be separated by a hyperplane (according to the Hahn-Banach theorem) which goes through the origin (since according to the second and fourth properties the origin is both acceptable and rejectable). The first two properties tell us that $Z \cup -Z = X$. Given a separating hyperplane (between the interiors of $Z$ and $-Z$), $Z$ must contain everything on one side. This means that $Z$ is a half space whose boundary is a hyperplane that goes through the origin and the closure $\bar{Z}$ of $Z$ is a closed half space and can be written as $\{x \mid f(x) \ge 0\}$ for some $f \in X'$. The fourth property tells us that $f$ is positive.

Corollary 24 (Additivity).
1. If $X = c_0$ then a rational decision maker is described by a countably additive (probability) measure.
2. If $X = \ell^\infty$ then a rational decision maker is described by a finitely additive (probability) measure.

It seems from Corollary 24 that we pay the price of losing countable additivity for expanding the space of contracts from $c_0$ to $\ell^\infty$, but we can expand the space even more by looking at $c_0(w)$ where $w_k \to 0$, which contains $\ell^\infty$, and $X'$ is then $\ell^1((1/w_k))$. This means that we get countable additivity back but we instead have a restriction on how fast the probabilities $p_k$ must tend to 0. Note that a bounded linear functional on $c_0$ can always be extended to a bounded linear functional on $\ell^\infty$ by the formula $f(x) = \sum p_i x_i$ but that is not the unique extension. Note also that every bounded linear functional on $\ell^\infty$ can be restricted to $c_0$ and there be represented as $f(x) = \sum p_i x_i$. Therefore, a rational decision maker on $\ell^\infty$ contracts has probabilistic beliefs (unless $p_i = 0 \ \forall i$), though it might also take asymptotic behavior of a contract into account.
F or example (and here p i = 0 ∀ i ), the decision mak er that mak es decisions based on asymptotic av erag es lim n →∞ 1 n P n i =1 x i when they exist. That strategy can b e extended to all of ℓ ∞ (a Bana ch limit). Th e follo wing prop osition will help us decide whic h decision mak er on ℓ ∞ is described with coun tably additiv e proba bilities. Prop osition 25 Supp ose that f ∈ ( ℓ ∞ ) ′ . F or any x ∈ ℓ ∞ , let x j i = x i if i ≤ j and x j i = 0 otherwise. If for any x , lim j →∞ f ( x j ) = f ( x ) , then f c an b e written as f ( x ) = P p i x i wher e p i ≥ 0 and P ∞ i =1 p i < ∞ . Pro of. The restriction of f to c 0 giv es us n um b ers p i ≥ 0 suc h tha t P ∞ i =1 p i < ∞ and f ( x ) = P p i x i for x ∈ c 0 . This means that f ( x j ) = P j i =1 p i x i for an y x ∈ ℓ ∞ and j < ∞ . Thus lim j →∞ f ( x j ) = P ∞ i =1 p i x i . Deﬁnition 26 (Monotone decisions) We deﬁne the c on c ept of a monotone deci- sion mak er in the fo llo wing wa y . Suppose that for ev ery x ∈ ℓ ∞ there is N < ∞ suc h that the decision is the same for all x j , j ≥ N (See Prop osition 25 f or deﬁnition) as for x . Then w e sa y that the decision ma ker is monotone. Example 27 L et f ∈ ℓ ∞ b e such that if lim α k → L then f ( α ) = L (i.e. f is a Banach limit). F urthermor e deﬁne a r ational de cision maker b y letting the set of ac c eptable c on tr acts b e Z = { x | f ( x ) ≥ 0 } . Then f ( x j ) = 0 (wher e we use notation fr o m Pr op osition 25) for al l j < ∞ and r e gar d less of which x we deﬁne x j fr om. Ther efor e, al l se quenc es that ar e eventual ly zer o ar e a c c ep tabl e c ontr acts. This me ans that this d e cision m a ker is not monotone sinc e ther e ar e c ontr acts that ar e not ac c eptable. Theorem 28 (Monotone rationality) Given a monotone r ational de cision maker for ℓ ∞ c ontr acts, ther e ar e p i ≥ 0 such that P p i < ∞ and { x | X x i p i > 0 } ⊂ Z ⊆ { x | X x i p i ≥ 0 } . (6) 12 Pro of. 
According to Theorem 23 there is $f \in (\ell^\infty)'$ such that the closure of $Z$ is $\bar{Z} = \{x \mid f(x) \ge 0\}$. Let $p_i \ge 0$ be such that $\sum p_i < \infty$ and such that $f(x) = \sum x_i p_i$ for $x \in c_0$. Remember that $x^j$ (notation as in Proposition 25) is always in $c_0$. Suppose that there is $x$ such that $x$ is accepted but $\sum x_i p_i < 0$. This violates monotonicity, since there exists $N < \infty$ such that $\sum_{i=1}^{n} x_i p_i < 0$ for all $n \ge N$ and, therefore, $x^j$ is not accepted for $j \ge N$ but $x$ is accepted. We conclude that if $x$ is accepted then $\sum p_i x_i \ge 0$, and if $\sum p_i x_i > 0$ then $x$ is accepted.

4 Rational Agents for Classes of Environments

We will here study agents that are designed to deal with a large range of situations. Given a class of environments, we want to define agents that can learn to act well when placed in any of them, assuming it is at all possible.

Definition 29 (Universality for a class) We say that a decision maker is universal for a class of environments $\mathcal{M}$ if for any outcome sequence $a_1 j_1 a_2 j_2 \ldots$ that, given the actions, would be produced by some environment in the class, there is $c > 0$ (depending on the sequence) such that the decision maker has probabilities that satisfy

$$\Pr(j_1, \ldots, j_t \mid a_1, \ldots, a_t) \ge c \quad \forall t.$$

This is obviously true if the decision maker's probabilistic beliefs are a convex combination $\sum_{\nu \in \mathcal{M}} w_\nu \nu$, $w_\nu > 0$ and $\sum_\nu w_\nu = 1$.

We will next discuss how to define some large classes of environments and agents that can succeed for them. We assume that the total accumulated reward from the environment will be finite regardless of our actions, since we want any policy to have finite utility. Furthermore, we assume that rewards are positive and that it is possible to achieve strictly positive rewards in any environment. We would like the agent to perform well regardless of which environment from the chosen class it is placed in.
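
As an illustration of the convex-combination remark after Definition 29, the following minimal sketch checks the bound $\Pr(j_1, \ldots, j_t \mid a_1, \ldots, a_t) \ge c$ with $c = w_\nu$ for a mixture of deterministic environments. The three toy environments, their names, and the weights are hypothetical and chosen only for this example; they do not come from the paper.

```python
# Hypothetical deterministic environments: each maps the action history
# (a list of 0/1 actions) to the next outcome.
def env_echo(actions):
    return actions[-1]          # outcome repeats the latest action

def env_flip(actions):
    return 1 - actions[-1]      # outcome is the opposite of the latest action

def env_zero(actions):
    return 0                    # outcome is always 0

ENVS = {"echo": env_echo, "flip": env_flip, "zero": env_zero}
WEIGHTS = {"echo": 0.5, "flip": 0.3, "zero": 0.2}   # w_nu > 0, sums to 1

def env_probability(env, actions, outcomes):
    """Probability (0 or 1) that a deterministic env yields `outcomes` given `actions`."""
    for t in range(len(actions)):
        if env(actions[: t + 1]) != outcomes[t]:
            return 0.0
    return 1.0

def mixture_probability(actions, outcomes):
    """Belief of the convex combination sum_nu w_nu * nu."""
    return sum(WEIGHTS[name] * env_probability(env, actions, outcomes)
               for name, env in ENVS.items())

# Generate the outcomes that env_flip would produce for some action sequence.
actions = [1, 0, 1, 1, 0, 1]
outcomes = [env_flip(actions[: t + 1]) for t in range(len(actions))]

# Mixture probability of each prefix of the sequence.
probs = [mixture_probability(actions[:t], outcomes[:t])
         for t in range(1, len(actions) + 1)]
print(probs)
```

Under these assumptions every prefix probability stays at or above the weight of the generating environment, which plays the role of the constant $c$ in the definition.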

For any possible policy (action tree) $\pi$ and environment $\nu$, there is a total reward $V^\pi_\nu$ that following $\pi$ in $\nu$ would result in. This means that for any $\pi$ there is a contract sequence $(V^\pi_\nu)_\nu$, assuming we have enumerated our set of environments. Let $V^*_\nu = \max_\pi V^\pi_\nu$. We know that $V^*_\nu > 0$ for all $\nu$. Every contract sequence $(V^\pi_\nu)_\nu$ lies in $X = \ell^\infty((1/V^*_\nu))$ and $\|(V^\pi_\nu)\|_X \le 1$. The rational decision makers are the positive, continuous linear functionals on $X$. $X'$ contains the space $\ell^1(V^*_\nu)$. In other words, if $w_\nu \ge 0$ and $\sum w_\nu V^*_\nu < \infty$ then the sequence $(w_\nu)$ defines a rational decision maker for the contract space $X$. These are exactly the monotone rational decision makers. Letting (which is the AIXI agent from [Hut05])

$$\pi^* \in \arg\max_\pi \sum_\nu w_\nu V^\pi_\nu \tag{7}$$

we have a choice with the property that, for any other $\pi$ with

$$\sum_\nu w_\nu V^\pi_\nu < \sum_\nu w_\nu V^{\pi^*}_\nu,$$

the contract $(V^{\pi^*}_\nu - V^\pi_\nu)_\nu$ is not rejectable. In other words, $\pi^*$ is strictly preferable to $\pi$. By letting $p_\nu = w_\nu V^*_\nu$, we can rewrite (7) as

$$\pi^* \in \arg\max_\pi \sum_\nu p_\nu \frac{V^\pi_\nu}{V^*_\nu}. \tag{8}$$

If one further restricts the class of environments by assuming $V^*_\nu \le 1$ for all $\nu$, then for every $\pi$, $(V^\pi_\nu) \in \ell^\infty$. Therefore, by Theorem 28, the monotone rational agents for this setting can be formulated as in (7) with $(w_\nu) \in \ell^1$, i.e. $\sum_\nu w_\nu < \infty$. However, since $(p_\nu) \in \ell^1$, a formulation of the form (8) is also possible. Normalizing $p$ and $w$ individually to probabilities makes (7) into a maximum expected utility criterion and (8) into a maximum expected relative utility criterion. As long as $w$ and $p$ are related by $p_\nu = w_\nu V^*_\nu$, both criteria lead to the same decisions. If we instead based both expectations on the same probabilistic beliefs, the two criteria would differ.
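
The relation between (7) and (8) can be sketched on a small finite example. The two environments, two policies, values, and weights below are hypothetical and chosen only for illustration; the point is that (7) with weights $w$ and (8) with $p_\nu = w_\nu V^*_\nu$ select the same policy, whereas using the same weights in both criteria can select different ones.

```python
def expected_utility(w, values):
    """Criterion (7): sum_nu w_nu * V^pi_nu."""
    return sum(w_nu * v for w_nu, v in zip(w, values))

def expected_relative_utility(p, values, v_star):
    """Criterion (8): sum_nu p_nu * V^pi_nu / V*_nu."""
    return sum(p_nu * v / vs for p_nu, v, vs in zip(p, values, v_star))

# Hypothetical total rewards V^pi_nu for two policies in two environments.
V = {
    "pi_a": [1.0, 0.0],
    "pi_b": [0.5, 0.1],
}
n_envs = 2

# V*_nu = max_pi V^pi_nu, here [1.0, 0.1], so V*_nu <= 1 holds.
V_star = [max(V[pi][nu] for pi in V) for nu in range(n_envs)]

w = [0.5, 0.5]                                  # weights used in (7)
p = [w_nu * vs for w_nu, vs in zip(w, V_star)]  # p_nu = w_nu * V*_nu

best_7 = max(V, key=lambda pi: expected_utility(w, V[pi]))
best_8 = max(V, key=lambda pi: expected_relative_utility(p, V[pi], V_star))
# Same beliefs w plugged into the relative criterion: a different criterion.
best_same_beliefs = max(V, key=lambda pi: expected_relative_utility(w, V[pi], V_star))

print(best_7, best_8, best_same_beliefs)
```

In this toy setting the first two choices coincide, while the third differs because the relative criterion with unchanged weights gives more influence to the environment with small $V^*_\nu$.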
When we have an upper bound $V^*_\nu < b < \infty$ $\forall \nu$ we can always translate expected utility to expected relative utility in this way, while we need a lower bound $0 < a < V^*_\nu$ to rewrite an expected relative utility as an expected utility. Note that the different criteria will start to deviate from each other after updating the probabilistic beliefs.

4.1 Asymptotic Optimality

Denote a chosen countable class of environments by $\mathcal{M}$. Let $V^\pi_{\nu,k}$ be the rewards achieved after time $k$ using policy $\pi$ in environment $\nu$. We suppress the dependence on the history so far. Let

$$W^\pi_{\nu,k} = \frac{V^\pi_{\nu,k}}{V^*_{\nu,k}}$$

denote the skill (relative reward) of $\pi$ in environment $\nu$ from time $k$. The maximum possible skill is 1. We would like to have a policy $\pi$ such that

$$\lim_{k \to \infty} W^\pi_{\nu,k} = 1 \quad \forall \nu \in \mathcal{M}.$$

This would mean that the agent asymptotically achieves maximum skill when placed in any environment from $\mathcal{M}$. Let $I(h_k, \nu) = 1$ if $\nu$ is consistent with history $h_k$ and $I(h_k, \nu) = 0$ otherwise. Furthermore, let

$$p_{\nu,k} = \frac{p_{\nu,0}}{\sum_{\mu \in \mathcal{M}} p_{\mu,0} I(h_k, \mu)}$$

be the agent's weight for environment $\nu$ at time $k$ (for $\nu$ consistent with $h_k$; the proof of Theorem 30 uses $p_{\nu,k} = 0$ for inconsistent $\nu$), and let $\pi_p$ be a policy that at time $k$ acts according to a policy in

$$\arg\max_\pi \sum_\nu p_{\nu,k} \frac{V^\pi_{\nu,k}}{V^*_{\nu,k}}. \tag{9}$$

In the following theorem, we prove that for every environment $\nu \in \mathcal{M}$, the policy $\pi_p$ will asymptotically achieve perfect relative rewards. We have to assume that there exists a sequence of policies $\pi_k$ ($k > 0$) with this property (as for the similar Theorem 5.34 in [Hut05], which dealt with discounted values). The convergence in $W$-values is the relevant sense of optimality for our setting, since the $V$-values converge to zero for any policy.

Theorem 30 (Asymptotic optimality) Suppose that we have a decision maker that is universal (i.e.
$p_\nu > 0$ $\forall \nu$) with respect to the countable class $\mathcal{M}$ of environments (which can be stochastic) and that there exist policies $\pi_k$ such that for all $\nu$, $W^{\pi_k}_{\nu,k} \to 1$ if $\nu$ is the actual environment (or the sequence is consistent with $\nu$). This implies that $W^{\pi_p}_{\mu,k} \to 1$ where $\mu$ is the actual environment.

The proof technique is similar to that of Theorem 5.34 in [Hut05].

Proof. Let

$$0 \le 1 - W^{\pi_k}_{\nu,k} =: \Delta^k_\nu, \qquad \Delta^k = \sum_\nu p_{\nu,k} \Delta^k_\nu. \tag{10}$$

The assumptions tell us that $\Delta^k_\nu = 1 - W^{\pi_k}_{\nu,k} \to 0$ for all $\nu$ that are consistent with the sequence ($p_{\nu,k} = 0$ if $\nu$ is inconsistent with the history at time $k$), and since $\Delta^k_\nu \le 1$, it follows that

$$\Delta^k = \sum_\nu p_{\nu,k} \Delta^k_\nu \to 0.$$

Note that

$$p_{\mu,k}\big(1 - W^{\pi_p}_{\mu,k}\big) \le \sum_\nu p_{\nu,k}\big(1 - W^{\pi_p}_{\nu,k}\big) \le \sum_\nu p_{\nu,k}\big(1 - W^{\pi_k}_{\nu,k}\big) = \sum_\nu p_{\nu,k} \Delta^k_\nu = \Delta^k.$$

Since we also know that $p_{\mu,k} \ge p_{\mu,0} > 0$, it follows that $(1 - W^{\pi_p}_{\mu,k}) \to 0$.

5 Conclusions

We studied complete rational decision makers, including the cases of actions that may affect the environment and sequential decision making. We set up simple common-sense rationality axioms that imply that a complete rational decision maker has preferences that can be characterized as maximizing expected utility. Of particular interest is the countable case, where our results follow from identifying the Banach space dual of the space of contracts.

Acknowledgement. This work was supported by ARC grant DP0988049.

References

[All53] M. Allais. Le comportement de l'homme rationnel devant le risque: Critique des postulats et axiomes de l'ecole americaine. Econometrica, 21(4):503–546, 1953.
[Arr70] K. Arrow. Essays in the Theory of Risk-Bearing. North-Holland, 1970.
[Cox46] R. T. Cox. Probability, frequency and reasonable expectation. Am. Jour. Phys, 14:1–13, 1946.
[deF37] B. de Finetti. La prévision: Ses lois logiques, ses sources subjectives. In Annales de l'Institut Henri Poincaré 7, pages 1–68.
Paris, 1937.
[Die84] Joseph Diestel. Sequences and series in Banach spaces. Springer-Verlag, 1984.
[Ell61] Daniel Ellsberg. Risk, Ambiguity, and the Savage Axioms. The Quarterly Journal of Economics, 75(4):643–669, 1961.
[Hal99] Joseph Y. Halpern. A counterexample to theorems of Cox and Fine. Journal of AI research, 10:67–85, 1999.
[Hut05] Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005.
[Jay03] E. T. Jaynes. Probability theory: the logic of science. Cambridge University Press, 2003.
[Kre89] Erwin Kreyszig. Introductory Functional Analysis With Applications. Wiley, 1989.
[NB97] Lawrence Narici and Edward Beckenstein. The Hahn-Banach theorem: the life and times. Topology and its Applications, 77(2):193–211, 1997.
[NM44] J. Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1944.
[Par94] J. B. Paris. The uncertain reasoner's companion: a mathematical perspective. Cambridge University Press, New York, NY, USA, 1994.
[Ram31] Frank P. Ramsey. Truth and probability. In R. B. Braithwaite, editor, The Foundations of Mathematics and other Logical Essays, chapter 7, pages 156–198. Brace & Co., 1931.
[Sav54] L. Savage. The Foundations of Statistics. Wiley, New York, 1954.
[Sug91] Robert Sugden. Rational choice: A survey of contributions from economics and philosophy. Economic Journal, 101(407):751–85, July 1991.
