Asymptotically Optimal Agents

Artificial general intelligence aims to create agents capable of learning to solve arbitrary interesting problems. We define two versions of asymptotic optimality and prove that no agent can satisfy the strong version while in some cases, depending on discounting, there does exist a non-computable weak asymptotically optimal agent.

Authors: Tor Lattimore, Marcus Hutter

Tor Lattimore¹ and Marcus Hutter¹,²
¹ Research School of Computer Science, Australian National University  ² ETH Zürich
{tor.lattimore,marcus.hutter}@anu.edu.au

25 July 2011

Abstract

Artificial general intelligence aims to create agents capable of learning to solve arbitrary interesting problems. We define two versions of asymptotic optimality and prove that no agent can satisfy the strong version while in some cases, depending on discounting, there does exist a non-computable weak asymptotically optimal agent.

Contents

1 Introduction
2 Notation and Definitions
3 Non-Existence of Asymptotically Optimal Policies
4 Existence of Weak Asymptotically Optimal Policies
5 Discussion
A Technical Proofs
B Table of Notation

Keywords: Rational agents; sequential decision theory; artificial general intelligence; reinforcement learning; asymptotic optimality; general discounting.

1 Introduction

The dream of artificial general intelligence is to create an agent that, starting with no knowledge of its environment, eventually learns to behave optimally. This means it should be able to learn chess just by playing, or Go, or how to drive a car or mow the lawn, or any task we could conceivably be interested in assigning it.

Before considering the existence of universally intelligent agents, we must be precise about what is meant by optimality. If the environment and goal are known, then subject to computation issues, the optimal policy is easy to construct using an expectimax search from sequential decision theory [NR03]. However, if the true environment is unknown then the agent will necessarily spend some time exploring, and so cannot immediately play according to the optimal policy. Given a class of environments, we suggest two definitions of asymptotic optimality for an agent.
1. An agent is strongly asymptotically optimal if for every environment in the class it plays optimally in the limit.

2. It is weakly asymptotically optimal if for every environment in the class it plays optimally on average in the limit.

The key difference is that a strong asymptotically optimal agent must eventually stop exploring, while a weak asymptotically optimal agent may explore forever, but with decreasing frequency.

In this paper we consider the (non-)existence of weak/strong asymptotically optimal agents in the class of all deterministic computable environments. The restriction to deterministic environments is for the sake of simplicity and because the results for this case are already sufficiently non-trivial to be interesting. The restriction to computable environments is more philosophical. The Church-Turing thesis is the unprovable hypothesis that anything that can intuitively be computed can also be computed by a Turing machine. Applying this to physics leads to the strong Church-Turing thesis that the universe is computable (possibly stochastically computable, i.e. computable when given access to an oracle of random noise). Having made these assumptions, the largest interesting class then becomes the class of computable (possibly stochastic) environments.

In [Hut04], Hutter conjectured that his universal Bayesian agent, AIXI, was weakly asymptotically optimal in the class of all computable stochastic environments. Unfortunately this was recently shown to be false in [Ors10], where it is proven that no Bayesian agent (with a static prior) can be weakly asymptotically optimal in this class.¹ The key idea behind Orseau's proof was to show that AIXI eventually stops exploring. This is somewhat surprising because it is normally assumed that Bayesian agents solve the exploration/exploitation dilemma in a principled way. This result is a bit reminiscent of Bayesian (passive induction)
inconsistency results [DF86a, DF86b], although the details of the failure are very different.

¹ Or even the class of computable deterministic environments.

We extend the work of [Ors10], where only Bayesian agents are considered, to show that non-computable weak asymptotically optimal agents do exist in the class of deterministic computable environments for some discount functions (including geometric), but not for others. We also show that no asymptotically optimal agent can be computable, and that for all "reasonable" discount functions there does not exist a strong asymptotically optimal agent. The weak asymptotically optimal agent we construct is similar to AIXI, but with an exploration component similar to ǫ-learning for finite state Markov decision processes or the UCB algorithm for bandits. The key is to explore sufficiently often and deeply to ensure that the environment used for the model is an adequate approximation of the true environment. At the same time, the agent must explore infrequently enough that it actually exploits its knowledge. Whether or not it is possible to get this balance right depends, somewhat surprisingly, on how forward-looking the agent is (determined by the discount function). That it is sometimes not possible to explore enough to learn the true environment without damaging even a weak form of asymptotic optimality is surprising and unexpected.

Note that the exploration/exploitation problem is well understood in the bandit case [ACBF02, BF85] and for (finite-state stationary) Markov decision processes [SL08]. In these restrictive settings, various satisfactory optimality criteria are available. In this work, we do not make any assumptions like Markov, stationarity, or ergodicity, or anything else besides computability of the environment.
So far, no satisfactory optimality definition is available for this general case.

2 Notation and Definitions

We use similar notation to [Hut04, Ors10], where the agent takes actions and the environment returns an observation/reward pair.

Strings. A finite string a over alphabet A is a finite sequence a_1 a_2 a_3 ⋯ a_{n−1} a_n with a_i ∈ A. An infinite string ω over alphabet A is an infinite sequence ω_1 ω_2 ω_3 ⋯. A^n, A^* and A^∞ are the sets of strings of length n, strings of finite length, and infinite strings respectively. Let x be a string (finite or infinite); then substrings are denoted x_{s:t} := x_s x_{s+1} ⋯ x_{t−1} x_t, where s, t ∈ N and s ≤ t. Strings may be concatenated. Let x, y ∈ A^* be of length n and m respectively, and ω ∈ A^∞. Then define xy := x_1 x_2 ⋯ x_{n−1} x_n y_1 y_2 ⋯ y_{m−1} y_m and xω := x_1 x_2 ⋯ x_{n−1} x_n ω_1 ω_2 ω_3 ⋯. A useful shorthand is x_{<t} := x_{1:t−1}.

An infinite sequence of rewards starting at time t, r_t, r_{t+1}, r_{t+2}, ⋯, is given a value of (1/Γ_t) Σ_{i=t}^∞ γ_i r_i. The term 1/Γ_t is a normalisation term to ensure that values scale in such a way that they can still be compared in the limit. A discount function is computable if there exists a Turing machine computing it. All well-known discount functions, such as geometric, fixed horizon and hyperbolic, are computable. Note that H_t(p) exists for all p ∈ [0, 1) and represents the effective horizon of the agent: after H_t(p) time-steps into the future, starting at time t, the agent stands to gain/lose at most 1 − p.

Definition 6 (Values and Optimal Policy). The value of policy π when starting from history yx_{<t} in environment µ is V^π_µ(yx_{<t}) := (1/Γ_t) Σ_{i=t}^∞ γ_i r_i, where r_t, r_{t+1}, ⋯ are the rewards generated by following π in µ from yx_{<t} onwards, and V^*_µ := sup_π V^π_µ is the optimal value.

3 Non-Existence of Asymptotically Optimal Policies

... and 0 otherwise. Since π is computable, µ is as well. Therefore µ ∈ M. Now V^*_µ(yx_{<t}) ... c_p with c_p > 0 for infinitely many t, but the proof will likely be messy.
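Collecting the discounting quantities described above in one place may help. The display below is a reconstruction consistent with the surrounding text: the normalised value is stated explicitly above, while the formalisation of H_t(p) and of the strong/weak optimality conditions of Definitions 1–2 follows the paper's wording, with the exact symbols chosen by the editor where the extraction is ambiguous.

```latex
% Normalised discounted value of policy \pi in environment \mu
\Gamma_t := \sum_{i=t}^{\infty} \gamma_i, \qquad
V^{\pi}_{\mu}(yx_{<t}) := \frac{1}{\Gamma_t} \sum_{i=t}^{\infty} \gamma_i r_i .

% Effective horizon: after H_t(p) steps at most 1-p of the
% normalised value remains at stake
H_t(p) := \min\bigl\{ h : \Gamma_{t+h} / \Gamma_t \le 1 - p \bigr\}.

% Strong vs. weak asymptotic optimality, required for every
% environment \mu in the class
\lim_{t\to\infty} \bigl( V^{*}_{\mu}(yx^{\pi,\mu}_{<t})
    - V^{\pi}_{\mu}(yx^{\pi,\mu}_{<t}) \bigr) = 0
  \quad \text{(strong)},
\qquad
\lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n}
  \bigl( V^{*}_{\mu}(yx^{\pi,\mu}_{<t})
    - V^{\pi}_{\mu}(yx^{\pi,\mu}_{<t}) \bigr) = 0
  \quad \text{(weak)}.
```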
4 Existence of Weak Asymptotically Optimal Policies

In the previous section we showed there did not exist a strong asymptotically optimal policy (for most discount functions) and that any weak asymptotically optimal policy must be incomputable. In this section we show that a weak asymptotically optimal policy exists for geometric discounting (and is, of course, incomputable). The policy is reminiscent of ǫ-exploration in finite state MDPs (or UCB for bandits) in that it spends most of its time exploiting the information it already knows, while still exploring sufficiently often (and for sufficiently long) to detect any significant errors in its model. The idea will be to use a model-based policy that chooses its current model to be the first environment in the model class (all computable deterministic environments) consistent with the history seen so far. With increasing probability it takes the best action according to this policy, while still occasionally exploring randomly. When it explores it always does so in bursts of increasing length.

Definition 9 (History Consistent). A deterministic environment µ is consistent with history yx_{<t} ...

... > 0 and α_i := [[a_i > ǫ/2]], then α_i = χ_i = 1 for infinitely many i.

Proof. 1. Let i ∈ N, ǫ > 0 and E^ǫ_i be the event that #1(χ̇^h_{1:2^i}) > 2^i ǫ. Using the definition of χ̇^h to compute the expectation gives E[#1(χ̇^h_{1:2^i})] < i(i+1)h, and applying the Markov inequality gives that P(E^ǫ_i) < i(i+1)h 2^{−i}/ǫ. Therefore Σ_{i∈N} P(E^ǫ_i) < ∞, and the Borel-Cantelli lemma gives that E^ǫ_i occurs for only finitely many i with probability 1. We now assume that limsup_{n→∞} (1/n) #1(χ̇^h_{1:n}) > 2ǫ > 0 and show that E^ǫ_i must occur infinitely often. By the definition of limsup and our assumption, there exists a sequence n_1, n_2, ⋯ such that #1(χ̇^h_{1:n_i}) > 2 n_i ǫ for all i ∈ N.
Let n^+ := min{2^k : k ∈ N, 2^k ≥ n} and note that #1(χ̇^h_{1:n_i^+}) > n_i^+ ǫ, which is exactly the event E^ǫ_{log₂ n_i^+}. Therefore E^ǫ_i would occur for infinitely many i, which the Borel-Cantelli argument above rules out with probability 1. Since ǫ > 0 was arbitrary, limsup_{n→∞} (1/n) #1(χ̇^h_{1:n}) = 0 with probability 1.

2. The probability that α_i = 1 ⟹ χ_i = 0 for all i ≥ T is

P(α_i = 1 ⟹ χ_i = 0 for all i ≥ T) = Π_{i=T}^∞ (1 − α_i/i) =: p = 0,

by Lemma 13. Therefore the probability that α_i = χ_i = 1 for only finitely many i is zero, and so there exist infinitely many i with α_i = χ_i = 1 with probability 1, as required.

Lemma 15 (Approximation Lemma). Let π_1 and π_2 be policies, µ an environment and h ≥ H_t(1 − ǫ). Let yx_{<t} ...

... ǫ, then µ is H_t(1 − ǫ)-different to ν on yx^{π,µ}_{<t}.

Proof. Follows from the Approximation Lemma.

We are now ready to prove the main theorem.

Proof of Theorem 11. Let π be the policy defined in Definition 10 and µ be the true (unknown) environment. Recall that ν_t = µ_{i_t} with i_t = min{i : µ_i consistent with history yx^{π,µ}_{<t}}.  (16)

Let α ∈ B^∞ be defined by α_t := 1 if and only if V^*_µ(yx^{π,µ}_{<t}) ..., and P(∀k ∃i ∈ [t_k, t_k + h] with ψ_i ≠ π^*_µ(yx^{π,µ}_{<t})) ... Let h := H_t(1 − ǫ) and t ≥ T. If χ̇^h_t = 0 then by the definition of π and the approximation lemma we obtain V^{π^*_ν}_µ(yx^{π,µ}_{<t}) ...

A Technical Proofs

Lemma 17. Let A be a collection of n numbers in [0, 1] with Σ_{a∈A} a ≥ nǫ. Then |{a ∈ A : a ≥ ǫ/2}| > nǫ/2.

Proof. Let A_> := {a ∈ A : a ≥ ǫ/2} and A_< := A − A_>. Therefore

nǫ ≤ Σ_{a∈A} a = Σ_{a∈A_<} a + Σ_{a∈A_>} a ≤ Σ_{a∈A_<} ǫ/2 + Σ_{a∈A_>} 1 = |A_<| ǫ/2 + |A_>|.

By rearranging and algebra, |{a ∈ A : a ≥ ǫ/2}| ≡ |A_>| > nǫ/2, as required.

Proof of Lemma 13. First,

Π_{i=1}^∞ [1 − α_i/i] ≤ exp[− Σ_{i=1}^∞ α_i/i].  (25)

Equation (25) follows since 1 − a ≤ exp(−a) for all a. Now since limsup_{n→∞} (1/n) Σ_{i=1}^n a_i = ǫ, for any N there exists an n > N such that (1/n) Σ_{i=1}^n a_i > ǫ/2.
Let n_1 = 0; then inductively choose

n_i := min{ n : n > 8(n_{i−1} + 1)/ǫ  and  (1/n) Σ_{k=1}^n a_k > ǫ/2 }.

By Lemma 17,

|{i ≤ n_j : a_i ≥ ǫ/4}| ≥ n_j ǫ/4.  (26)

Therefore

Σ_{i=n_j+1}^{n_{j+1}} α_i/i ≥ Σ_{i=(1−ǫ/4)n_{j+1}+n_j+1}^{n_{j+1}} 1/n_{j+1}  (27)
 ≥ Σ_{i=(1−ǫ/8)n_{j+1}}^{n_{j+1}} 1/n_{j+1} = ǫ/8.  (28)

Equation (27) follows from (26) and because 1/i is a decreasing function. (28) follows from the definition of n_j and algebra. Therefore

Σ_{i=1}^∞ α_i/i = lim_{k→∞} Σ_{j=1}^k Σ_{i=n_j+1}^{n_{j+1}} α_i/i ≥ lim_{k→∞} Σ_{j=1}^k ǫ/8 = ∞.  (29)

Finally, substituting Equation (29) into (25) gives Π_{i=1}^∞ [1 − α_i/i] = 0, as required.

B Table of Notation

Symbol        Description
Y             Set of possible actions
O             Set of possible observations
R             Set of possible rewards
µ, ν          Environments
y             An action
x             An observation/reward pair
r             A reward
o             An observation
[[expr]]      The delta function: [[expression]] = 1 if expression is true and 0 otherwise
¬b            The not function: ¬0 = 1 and ¬1 = 0
π             A policy
χ             An infinite binary string; χ_k = 1 if the agent starts exploring at time-step k
χ̄             An infinite binary string; χ̄_k = 1 if the agent is exploring at time-step k
χ̇^h           An infinite binary string; χ̇^h_k = 0 if the agent will not explore for the next h time-steps
α             An infinite binary string
ψ             An infinite random binary string sampled from the coin-flip measure
t, n, i, j, k Time indices
yx            A history of interleaved actions and observation/reward pairs

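Lemma 17 in the appendix is a purely combinatorial counting fact: if n numbers in [0, 1] sum to at least nǫ, then at least nǫ/2 of them are at least ǫ/2. A quick randomised sanity check is easy to write; the sketch below is the editor's, not part of the paper, and the helper names (`count_large`, `lemma17_holds`) are invented for illustration.

```python
import random

def count_large(values, eps):
    """Count how many of the values are at least eps/2."""
    return sum(1 for a in values if a >= eps / 2)

def lemma17_holds(values, eps):
    """Check the counting lemma: if n values in [0, 1] sum to at
    least n*eps, then at least n*eps/2 of them are >= eps/2.
    Returns True vacuously when the premise fails."""
    n = len(values)
    if sum(values) < n * eps:
        return True  # premise fails, the lemma asserts nothing
    return count_large(values, eps) >= n * eps / 2

# Randomised sanity check over many instances.
random.seed(0)
for _ in range(1000):
    n = random.randint(1, 50)
    eps = random.uniform(0.01, 1.0)
    values = [random.random() for _ in range(n)]
    assert lemma17_holds(values, eps)
print("Lemma 17 held on all sampled instances")
```

The check only exercises the lemma on random instances; the one-line proof in the appendix (bounding the small values by ǫ/2 and the large ones by 1) is what actually establishes it.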