Asymptotically Optimal Agents

Artificial general intelligence aims to create agents capable of learning to solve arbitrary interesting problems. We define two versions of asymptotic optimality and prove that no agent can satisfy the strong version while in some cases, depending on discounting, there does exist a non-computable weak asymptotically optimal agent.

Authors: Tor Lattimore, Marcus Hutter

Tor Lattimore¹ and Marcus Hutter¹,²
¹ Research School of Computer Science, Australian National University  ² ETH Zürich
{tor.lattimore,marcus.hutter}@anu.edu.au

25 July 2011

Abstract

Artificial general intelligence aims to create agents capable of learning to solve arbitrary interesting problems. We define two versions of asymptotic optimality and prove that no agent can satisfy the strong version while in some cases, depending on discounting, there does exist a non-computable weak asymptotically optimal agent.

Contents

1 Introduction
2 Notation and Definitions
3 Non-Existence of Asymptotically Optimal Policies
4 Existence of Weak Asymptotically Optimal Policies
5 Discussion
A Technical Proofs
B Table of Notation

Keywords: Rational agents; sequential decision theory; artificial general intelligence; reinforcement learning; asymptotic optimality; general discounting.

1 Introduction

The dream of artificial general intelligence is to create an agent that, starting with no knowledge of its environment, eventually learns to behave optimally. This means it should be able to learn chess just by playing, or Go, or how to drive a car or mow the lawn, or any task we could conceivably be interested in assigning it.

Before considering the existence of universally intelligent agents, we must be precise about what is meant by optimality. If the environment and goal are known, then subject to computation issues, the optimal policy is easy to construct using an expectimax search from sequential decision theory [NR03]. However, if the true environment is unknown then the agent will necessarily spend some time exploring, and so cannot immediately play according to the optimal policy. Given a class of environments, we suggest two definitions of asymptotic optimality for an agent.
1. An agent is strongly asymptotically optimal if for every environment in the class it plays optimally in the limit.

2. It is weakly asymptotically optimal if for every environment in the class it plays optimally on average in the limit.

The key difference is that a strong asymptotically optimal agent must eventually stop exploring, while a weak asymptotically optimal agent may explore forever, but with decreasing frequency.

In this paper we consider the (non-)existence of weak/strong asymptotically optimal agents in the class of all deterministic computable environments. The restriction to deterministic environments is for the sake of simplicity and because the results for this case are already sufficiently non-trivial to be interesting. The restriction to computable environments is more philosophical. The Church-Turing thesis is the unprovable hypothesis that anything that can intuitively be computed can also be computed by a Turing machine. Applying this to physics leads to the strong Church-Turing thesis that the universe is computable (possibly stochastically computable, i.e. computable when given access to an oracle of random noise). Having made these assumptions, the largest interesting class then becomes the class of computable (possibly stochastic) environments.

In [Hut04], Hutter conjectured that his universal Bayesian agent, AIXI, was weakly asymptotically optimal in the class of all computable stochastic environments. Unfortunately this was recently shown to be false in [Ors10], where it is proven that no Bayesian agent (with a static prior) can be weakly asymptotically optimal in this class.¹ The key idea behind Orseau's proof was to show that AIXI eventually stops exploring. This is somewhat surprising because it is normally assumed that Bayesian agents solve the exploration/exploitation dilemma in a principled way. This result is a bit reminiscent of Bayesian (passive induction)
inconsistency results [DF86a, DF86b], although the details of the failure are very different.

¹ Or even the class of computable deterministic environments.

We extend the work of [Ors10], where only Bayesian agents are considered, to show that non-computable weak asymptotically optimal agents do exist in the class of deterministic computable environments for some discount functions (including geometric), but not for others. We also show that no asymptotically optimal agent can be computable, and that for all "reasonable" discount functions there does not exist a strong asymptotically optimal agent. The weak asymptotically optimal agent we construct is similar to AIXI, but with an exploration component similar to ǫ-learning for finite state Markov decision processes or the UCB algorithm for bandits. The key is to explore sufficiently often and deeply to ensure that the environment used for the model is an adequate approximation of the true environment. At the same time, the agent must explore infrequently enough that it actually exploits its knowledge. Whether or not it is possible to get this balance right depends, somewhat surprisingly, on how forward-looking the agent is (determined by the discount function). That it is sometimes not possible to explore enough to learn the true environment without damaging even a weak form of asymptotic optimality is surprising and unexpected.

Note that the exploration/exploitation problem is well understood in the bandit case [ACBF02, BF85] and for (finite-state stationary) Markov decision processes [SL08]. In these restrictive settings, various satisfactory optimality criteria are available. In this work, we do not make any assumptions like Markov, stationarity, or ergodicity, or anything else besides computability of the environment.
So far, no satisfactory optimality definition is available for this general case.

2 Notation and Definitions

We use similar notation to [Hut04, Ors10], where the agent takes actions and the environment returns an observation/reward pair.

Strings. A finite string a over alphabet A is a finite sequence a_1 a_2 a_3 ⋯ a_{n−1} a_n with a_i ∈ A. An infinite string ω over alphabet A is an infinite sequence ω_1 ω_2 ω_3 ⋯. A^n, A^* and A^∞ are the sets of strings of length n, strings of finite length, and infinite strings respectively. Let x be a string (finite or infinite); then substrings are denoted x_{s:t} := x_s x_{s+1} ⋯ x_{t−1} x_t, where s, t ∈ N and s ≤ t. Strings may be concatenated. Let x, y ∈ A^* be of length n and m respectively, and ω ∈ A^∞. Then define xy := x_1 x_2 ⋯ x_{n−1} x_n y_1 y_2 ⋯ y_{m−1} y_m and xω := x_1 x_2 ⋯ x_{n−1} x_n ω_1 ω_2 ω_3 ⋯. A useful shorthand is x_{<t} := x_{1:t−1}.

An infinite sequence of rewards starting at time t, r_t, r_{t+1}, r_{t+2}, ⋯, is given a value of (1/Γ_t) Σ_{i=t}^∞ γ_i r_i. The term 1/Γ_t is a normalisation term to ensure that values scale in such a way that they can still be compared in the limit. A discount function is computable if there exists a Turing machine computing it. All well-known discount functions, such as geometric, fixed horizon and hyperbolic, are computable. Note that H_t(p) exists for all p ∈ [0, 1) and represents the effective horizon of the agent: after H_t(p) time-steps into the future, starting at time t, the agent stands to gain/lose at most 1 − p.

Definition 6 (Values and Optimal Policy). The value of policy π when starting from history yx_{<t} in environment µ is V^π_µ(yx_{<t}) := (1/Γ_t) Σ_{i=t}^∞ γ_i r_i, where r_t, r_{t+1}, ⋯ are the rewards generated by following π in µ from yx_{<t} onwards, and V^*_µ := sup_π V^π_µ is the optimal value.

3 Non-Existence of Asymptotically Optimal Policies

... and 0 otherwise. Since π is computable, µ is as well. Therefore µ ∈ M. Now V^*_µ(yx_{<t}) ... c_p with c_p > 0 for infinitely many t, but the proof will likely be messy.
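Collecting the discounting quantities described above in one place may help. The display below is a reconstruction consistent with the surrounding text: the normalised value is stated explicitly above, while the formalisation of H_t(p) and of the strong/weak optimality conditions of Definitions 1–2 follows the paper's wording, with the exact symbols chosen by the editor where the extraction is ambiguous.

```latex
% Normalised discounted value of policy \pi in environment \mu
\Gamma_t := \sum_{i=t}^{\infty} \gamma_i, \qquad
V^{\pi}_{\mu}(yx_{<t}) := \frac{1}{\Gamma_t} \sum_{i=t}^{\infty} \gamma_i r_i .

% Effective horizon: after H_t(p) steps at most 1-p of the
% normalised value remains at stake
H_t(p) := \min\bigl\{ h : \Gamma_{t+h} / \Gamma_t \le 1 - p \bigr\}.

% Strong vs. weak asymptotic optimality, required for every
% environment \mu in the class
\lim_{t\to\infty} \bigl( V^{*}_{\mu}(yx^{\pi,\mu}_{<t})
    - V^{\pi}_{\mu}(yx^{\pi,\mu}_{<t}) \bigr) = 0
  \quad \text{(strong)},
\qquad
\lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n}
  \bigl( V^{*}_{\mu}(yx^{\pi,\mu}_{<t})
    - V^{\pi}_{\mu}(yx^{\pi,\mu}_{<t}) \bigr) = 0
  \quad \text{(weak)}.
```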
4 Existence of Weak Asymptotically Optimal Policies

In the previous section we showed there did not exist a strong asymptotically optimal policy (for most discount functions) and that any weak asymptotically optimal policy must be incomputable. In this section we show that a weak asymptotically optimal policy exists for geometric discounting (and is, of course, incomputable). The policy is reminiscent of ǫ-exploration in finite state MDPs (or UCB for bandits) in that it spends most of its time exploiting the information it already knows, while still exploring sufficiently often (and for sufficiently long) to detect any significant errors in its model. The idea will be to use a model-based policy that chooses its current model to be the first environment in the model class (all computable deterministic environments) consistent with the history seen so far. With increasing probability it takes the best action according to this policy, while still occasionally exploring randomly. When it explores it always does so in bursts of increasing length.

Definition 9 (History Consistent). A deterministic environment µ is consistent with history yx_{<t} ...

... > 0 and α_i := [[a_i > ǫ/2]], then α_i = χ_i = 1 for infinitely many i.

Proof. 1. Let i ∈ N, ǫ > 0 and E^ǫ_i be the event that #1(χ̇^h_{1:2^i}) > 2^i ǫ. Using the definition of χ̇^h to compute the expectation gives E[#1(χ̇^h_{1:2^i})] < i(i+1)h, and applying the Markov inequality gives that P(E^ǫ_i) < i(i+1)h 2^{−i}/ǫ. Therefore Σ_{i∈N} P(E^ǫ_i) < ∞, and the Borel-Cantelli lemma gives that E^ǫ_i occurs for only finitely many i with probability 1. We now assume that limsup_{n→∞} (1/n) #1(χ̇^h_{1:n}) > 2ǫ > 0 and show that E^ǫ_i must occur infinitely often. By the definition of limsup and our assumption, there exists a sequence n_1, n_2, ⋯ such that #1(χ̇^h_{1:n_i}) > 2 n_i ǫ for all i ∈ N.
Let n^+ := min{2^k : k ∈ N, 2^k ≥ n} and note that #1(χ̇^h_{1:n_i^+}) > n_i^+ ǫ, which is exactly the event E^ǫ_{log₂ n_i^+}. Therefore E^ǫ_i would occur for infinitely many i, which the Borel-Cantelli argument above rules out with probability 1. Since ǫ > 0 was arbitrary, limsup_{n→∞} (1/n) #1(χ̇^h_{1:n}) = 0 with probability 1.

2. The probability that α_i = 1 ⟹ χ_i = 0 for all i ≥ T is

P(α_i = 1 ⟹ χ_i = 0 for all i ≥ T) = Π_{i=T}^∞ (1 − α_i/i) =: p = 0,

by Lemma 13. Therefore the probability that α_i = χ_i = 1 for only finitely many i is zero, and so there exist infinitely many i with α_i = χ_i = 1 with probability 1, as required.

Lemma 15 (Approximation Lemma). Let π_1 and π_2 be policies, µ an environment and h ≥ H_t(1 − ǫ). Let yx_{<t} ...

... ǫ, then µ is H_t(1 − ǫ)-different to ν on yx^{π,µ}_{<t}.

Proof. Follows from the Approximation Lemma.

We are now ready to prove the main theorem.

Proof of Theorem 11. Let π be the policy defined in Definition 10 and µ be the true (unknown) environment. Recall that ν_t = µ_{i_t} with i_t = min{i : µ_i consistent with history yx^{π,µ}_{<t}}.  (16)

Let α ∈ B^∞ be defined by α_t := 1 if and only if V^*_µ(yx^{π,µ}_{<t}) ..., and P(∀k ∃i ∈ [t_k, t_k + h] with ψ_i ≠ π^*_µ(yx^{π,µ}_{<t})) ... Let h := H_t(1 − ǫ) and t ≥ T. If χ̇^h_t = 0 then by the definition of π and the approximation lemma we obtain V^{π^*_ν}_µ(yx^{π,µ}_{<t}) ...

A Technical Proofs

Lemma 17. Let A be a collection of n numbers in [0, 1] with Σ_{a∈A} a ≥ nǫ. Then |{a ∈ A : a ≥ ǫ/2}| > nǫ/2.

Proof. Let A_> := {a ∈ A : a ≥ ǫ/2} and A_< := A − A_>. Therefore

nǫ ≤ Σ_{a∈A} a = Σ_{a∈A_<} a + Σ_{a∈A_>} a ≤ Σ_{a∈A_<} ǫ/2 + Σ_{a∈A_>} 1 = |A_<| ǫ/2 + |A_>|.

By rearranging and algebra, |{a ∈ A : a ≥ ǫ/2}| ≡ |A_>| > nǫ/2, as required.

Proof of Lemma 13. First,

Π_{i=1}^∞ [1 − α_i/i] ≤ exp[− Σ_{i=1}^∞ α_i/i].  (25)

Equation (25) follows since 1 − a ≤ exp(−a) for all a. Now since limsup_{n→∞} (1/n) Σ_{i=1}^n a_i = ǫ, for any N there exists an n > N such that (1/n) Σ_{i=1}^n a_i > ǫ/2.
Let n_1 = 0; then inductively choose

n_i := min{ n : n > 8(n_{i−1} + 1)/ǫ  and  (1/n) Σ_{k=1}^n a_k > ǫ/2 }.

By Lemma 17,

|{i ≤ n_j : a_i ≥ ǫ/4}| ≥ n_j ǫ/4.  (26)

Therefore

Σ_{i=n_j+1}^{n_{j+1}} α_i/i ≥ Σ_{i=(1−ǫ/4)n_{j+1}+n_j+1}^{n_{j+1}} 1/n_{j+1}  (27)
 ≥ Σ_{i=(1−ǫ/8)n_{j+1}}^{n_{j+1}} 1/n_{j+1} = ǫ/8.  (28)

Equation (27) follows from (26) and because 1/i is a decreasing function. (28) follows from the definition of n_j and algebra. Therefore

Σ_{i=1}^∞ α_i/i = lim_{k→∞} Σ_{j=1}^k Σ_{i=n_j+1}^{n_{j+1}} α_i/i ≥ lim_{k→∞} Σ_{j=1}^k ǫ/8 = ∞.  (29)

Finally, substituting Equation (29) into (25) gives Π_{i=1}^∞ [1 − α_i/i] = 0, as required.

B Table of Notation

Symbol        Description
Y             Set of possible actions
O             Set of possible observations
R             Set of possible rewards
µ, ν          Environments
y             An action
x             An observation/reward pair
r             A reward
o             An observation
[[expr]]      The delta function: [[expression]] = 1 if expression is true and 0 otherwise
¬b            The not function: ¬0 = 1 and ¬1 = 0
π             A policy
χ             An infinite binary string; χ_k = 1 if the agent starts exploring at time-step k
χ̄             An infinite binary string; χ̄_k = 1 if the agent is exploring at time-step k
χ̇^h           An infinite binary string; χ̇^h_k = 0 if the agent will not explore for the next h time-steps
α             An infinite binary string
ψ             An infinite random binary string sampled from the coin-flip measure
t, n, i, j, k Time indices
yx            A history of interleaved actions and observation/reward pairs

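Lemma 17 in the appendix is a purely combinatorial counting fact: if n numbers in [0, 1] sum to at least nǫ, then at least nǫ/2 of them are at least ǫ/2. A quick randomised sanity check is easy to write; the sketch below is the editor's, not part of the paper, and the helper names (`count_large`, `lemma17_holds`) are invented for illustration.

```python
import random

def count_large(values, eps):
    """Count how many of the values are at least eps/2."""
    return sum(1 for a in values if a >= eps / 2)

def lemma17_holds(values, eps):
    """Check the counting lemma: if n values in [0, 1] sum to at
    least n*eps, then at least n*eps/2 of them are >= eps/2.
    Returns True vacuously when the premise fails."""
    n = len(values)
    if sum(values) < n * eps:
        return True  # premise fails, the lemma asserts nothing
    return count_large(values, eps) >= n * eps / 2

# Randomised sanity check over many instances.
random.seed(0)
for _ in range(1000):
    n = random.randint(1, 50)
    eps = random.uniform(0.01, 1.0)
    values = [random.random() for _ in range(n)]
    assert lemma17_holds(values, eps)
print("Lemma 17 held on all sampled instances")
```

The check only exercises the lemma on random instances; the one-line proof in the appendix (bounding the small values by ǫ/2 and the large ones by 1) is what actually establishes it.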