Information-Theoretic Bounded Rationality
Pedro A. Ortega, ope@seas.upenn.edu, University of Pennsylvania, Philadelphia, PA 19104, USA
Daniel A. Braun, daniel.braun@tuebingen.mpg.de, Max Planck Institute for Intelligent Systems and Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany
Justin Dyer, jsdyer@google.com, Google Inc., Mountain View, CA 94043, USA
Kee-Eung Kim, kekim@cs.kaist.ac.kr, Korea Advanced Institute of Science and Technology, Daejeon, Korea 305-701
Naftali Tishby, tishby@cs.huji.ac.il, The Hebrew University, Jerusalem, 91904, Israel

Abstract

Bounded rationality, that is, decision-making and planning under resource limitations, is widely regarded as an important open problem in artificial intelligence, reinforcement learning, computational neuroscience and economics. This paper offers a consolidated presentation of a theory of bounded rationality based on information-theoretic ideas. We provide a conceptual justification for using the free energy functional as the objective function for characterizing bounded-rational decisions. This functional possesses three crucial properties: it controls the size of the solution space; it has Monte Carlo planners that are exact, yet bypass the need for exhaustive search; and it captures model uncertainty arising from lack of evidence or from interacting with other agents having unknown intentions. We discuss the single-step decision-making case, and show how to extend it to sequential decisions using equivalence transformations. This extension yields a very general class of decision problems that encompass classical decision rules (e.g. Expectimax and Minimax) as limit cases, as well as trust- and risk-sensitive planning.

© December 2015 by the authors. All rights reserved.

Contents

1 Introduction
  1.1 A Short Algorithmic Illustration
  1.2 Outlook
2 Preliminaries in Expected Utility Theory
  2.1 Variational principles
  2.2 Subjective expected utility
  2.3 Two open problems
3 The Mathematical Structure of Boundedness
  3.1 Meta-reasoning
  3.2 Incomplete Information and Interrupted Deliberation
  3.3 Commensurability of utility and information
4 Single-Step Decisions
  4.1 Bounded-rational decisions
  4.2 Stochastic choice
  4.3 Equivalence
  4.4 Comparison
5 Sequential Decisions
  5.1 Bounded-rational decision trees
  5.2 Derivation of the free energy functional
  5.3 Bellman recursion and its solution
  5.4 Recursive rejection sampling
6 Discussion
  6.1 Relation to literature
  6.2 Conclusions
A Proofs
  A.1 Proof of Theorem 2

1. Introduction

It is hard to overstate the influence that the economic idea of perfect rationality has had on our way of designing artificial agents [Russell and Norvig, 2010]. Today, many of us in the fields of artificial intelligence, control theory, and reinforcement learning design our agents by encoding the desired behavior into an objective function that the agents must optimize in expectation. By doing so, we are relying on the theory of subjective expected utility (SEU), the standard economic theory of decision making under uncertainty [Von Neumann and Morgenstern, 1944, Savage, 1954]. SEU theory has an immense intuitive appeal, and its pervasiveness in today's mindset is reflected in many widespread beliefs: e.g. that probabilities and utilities are orthogonal concepts; that two options with the same expected utility are equivalent; and that randomizing can never improve upon an optimal deterministic choice. Put simply, if we found ourselves violating SEU theory, we would feel strongly compelled to revise our choice.

Simultaneously, it is also well understood that SEU theory prescribes policies that are intractable to calculate save for very restricted problem classes. This was recognized soon after expected utility theory was formulated [Simon, 1956]. In agent design, it became especially apparent more recently, as we continue to struggle in tackling problems of moderate complexity in spite of our deeper understanding of the planning problem [Duff, 2002, Hutter, 2004, Legg, 2008, Ortega, 2011] and the vast computing power available to us.
For instance, there are efficient algorithms to calculate the optimal policy of a known Markov decision process (MDP) [Bertsekas and Tsitsiklis, 1996], but no efficient algorithm to calculate the exact optimal policy of an unknown MDP or a partially observable MDP [Papadimitriou and Tsitsiklis, 1987]. Due to this, in practice we either make severe domain-specific simplifications, as in linear-quadratic-Gaussian control problems [Stengel, 1994]; or we approximate the "gold standard" prescribed by SEU theory, as exemplified by the reinforcement learning algorithms based on stochastic approximations [Sutton and Barto, 1998, Szepesvári, 2010] and Monte Carlo tree search [Kocsis and Szepesvári, 2006, Veness et al., 2011, Mnih et al., 2013].

Recently, there has been renewed interest in models of bounded rationality [Simon, 1972]. Rather than approximating perfect rationality, these models seek to formalize decision-making with limited resources such as the time, energy, memory, and computational effort allocated for arriving at a decision. The specific way in which this is achieved varies across these accounts. For instance, epsilon-optimality only requires policies to be "close enough" to the optimum [Dixon, 2001]; metalevel rationality proposes optimizing a trade-off between utilities and computational costs [Zilberstein, 2008]; bounded optimality restricts the computational complexity of the programs implementing the optimal policy [Russell and Subramanian, 1995]; an approach that we might label procedural bounded rationality attempts to explicitly model the limitations in the decision-making procedures [Rubinstein, 1998]; and finally, the heuristics approach argues that general optimality principles ought to be abandoned altogether in favor of collections of simple heuristics [Gigerenzer and Selten, 2001].
Here we are concerned with a particular flavor of bounded rationality, which we might call "information-theoretic" due to its underlying rationale. While this approach solves many of the shortcomings of perfect rationality in a simple and elegant way, it has not yet attained widespread acceptance from the mainstream community, in spite of roughly a decade of research in the machine learning literature. As is the case with many emerging fields of research, this is partly due to the lack of consensus on the interpretation of the mathematical quantities involved. Nonetheless, a great deal of the basics is well established and ready for widespread adoption; in particular, some of the algorithmic implications are much better understood today. Our goal here is to provide a consolidated view of some of the basic ideas of the theory and to sketch the intimate connections to other fields.

1.1 A Short Algorithmic Illustration

Perfect Rationality. Let Π = {π₁, π₂, ..., π_N} be a finite set of N ∈ ℕ candidate choices or policies, and let U : Π → [0, 1] be a utility function mapping each policy into the unit interval. Consider the problem of finding a maximizing element π* ∈ Π. For simplicity we assume that, given an element π ∈ Π, its utility U(π) can be evaluated in a constant number of computation steps. Imposing no particular structure on Π, we find π* by sequentially evaluating each utility U(π) and returning the best element found in the end.

Procedure PerfectlyRationalChoice(Π, U)
    π* ← π₁ ∈ Π
    foreach π ∈ {π₂, π₃, ..., π_N} do
        if U(π) ≥ U(π*) then π* ← π
    end
    return π*

This exhaustive evaluation algorithm works well if N is small, but it does not scale to the case when N is very large.
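The exhaustive procedure can be transcribed directly into runnable form; the following Python sketch is illustrative, with a toy policy set and random utilities standing in for a concrete problem:

```python
import random

def perfectly_rational_choice(policies, U):
    """Exhaustive evaluation: score every policy once, return the best found."""
    best = policies[0]
    for pi in policies[1:]:
        if U(pi) >= U(best):
            best = pi
    return best

# Toy instance: N unstructured policies with utilities in [0, 1].
N = 1000
utilities = [random.random() for _ in range(N)]
best = perfectly_rational_choice(list(range(N)), lambda i: utilities[i])
```

Note that the number of utility evaluations is always N, regardless of how the utilities are distributed; this is the cost that the bounded-rational scheme below avoids.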
In real-world tasks, such very large decision spaces are not the exception but rather the norm, and an agent faces them in most stages of its information-processing pipeline (e.g. attention focus, model selection, action selection). In these cases, the agent does not have enough resources to exhaustively evaluate each element in Π; rather, it only manages to inspect a negligible subset of K ≪ N elements in Π, where K depends on the agent's limitations.

Bounded Rationality. We model this limitation as a constraint on the agent's information capacity for relating utility functions with decisions. This limitation changes the nature of the decision problem in a fundamental way; indeed, information constraints yield optimal choice algorithms that look very unusual to someone who is used to the traditional optimization paradigm.

One simple and general way of implementing the decision process is as follows. A bounded-rational agent inspects the choices until it finds one that is satisficing. The order of inspection is random: the agent draws (without replacement) the candidate policies from a distribution Q that reflects its prior knowledge about the good choices. Then, for each proposed element π ∼ Q, the agent decides whether π is good enough using a stochastic criterion that depends both on the utility and the capacity. Importantly, the agent can inspect the policies in parallel, as is illustrated in the following pseudo-code.

Procedure BoundedRationalChoice(Π, U, α, U*)
    repeat in parallel
        π ∼ Q
        u ∼ U[0, 1]
        if u ≤ exp{α(U(π) − U*)} then return π
    end

The stochastic criterion is parametrized by α ∈ ℝ and U* ∈ [0, 1], which jointly determine the size of the solution space and thereby the choice difficulty.
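The sampler can be sketched in Python as follows (sequentially rather than in parallel, and drawing with replacement for simplicity; the quadratic utility and uniform prior mirror the simulation described next, and all names are illustrative):

```python
import math
import random

def bounded_rational_choice(U, Q_sample, alpha, U_star):
    """Propose policies from the prior Q until one is satisficing.

    A proposal pi is accepted with probability exp(alpha * (U(pi) - U_star)),
    which is at most 1 whenever U_star upper-bounds the attainable utility.
    """
    while True:
        pi = Q_sample()            # draw from prior knowledge Q
        u = random.random()        # u ~ Uniform[0, 1]
        if u <= math.exp(alpha * (U(pi) - U_star)):
            return pi

# Toy choice space: utilities are N uniformly spaced points mapped through x^2.
N = 10_000
U = lambda i: (i / (N - 1)) ** 2
pi = bounded_rational_choice(U, lambda: random.randrange(N),
                             alpha=50.0, U_star=1.0)
```

With α = 50 the accepted policies concentrate near the aspiration level U* = 1; lowering α toward 0 makes almost every proposal acceptable.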
The parameter α is known as the inverse temperature, and it plays the role of the degree of rationality; whereas U* is an aspiration level, which is typically chosen to be the maximum utility that can be attained in principle, i.e. U* = 1 in our example.

The behavior of this algorithm is exemplified in Fig. 1. In this simulation, the size N of the policy space Π was set to one million. The utilities were constructed by first mapping N uniformly spaced points in the unit interval through f(x) = x², and then randomly assigning these values to the policies in Π. For the prior Q we chose a uniform distribution. Together, the utility and the prior determine the shape of the choice space illustrated in the left panel. The remaining panels show the performance of the algorithm in terms of the utility (center panel) and the number of inspected policies before a satisficing policy was found (right panel). This simulation shows two important aspects of the algorithm. First, the performance in utility is marginally decreasing in the search effort, as illustrated by the concavity of U(α) and the roughly proportional dependency between search effort and α. Second, the parameter α essentially controls the performance of the algorithm. In contrast, it is easy to see that this performance is unaffected by the size of the set of policies, as long as the shape of the choice space is preserved.

Properties. Although the algorithm seems counter-intuitive at first glance, it has the following advantages:

1. Use of prior knowledge. If the agent possesses a distribution Q over Π, where Q(π) denotes the probability of π being the best policy, then it can exploit this knowledge. In particular, if it already knows the maximizer π*, i.e. if Q is a degenerate distribution given by a Kronecker delta function δ[π = π*], then it will only query said policy.

2. Protection from adversarial enumeration.
If the agent were to inspect the policies using a fixed deterministic order, then it could fall prey to an adversarial shuffling of the utilities. The above randomization technique protects the agent from such an attack.

3. Control of complexity. The agent can control the size of the solution space by choosing an appropriate value for the α parameter. More precisely, |α| ≈ 0 corresponds to an easy problem in which most proposals π ∼ Q will be accepted, and |α| ≫ 0 to a hard problem.

Figure 1: Simulation of bounded-rational choices. The left panel shows the shape of the utility function for 100 randomly chosen policies out of one million. These are shown both in the way the agent sees them (◦-markers) and sorted in ascending order of the utility (•-markers). The shaded area under the curve is an important factor in determining the difficulty of the policy-search problem. The center and right panels illustrate the performance as a function of the inverse temperature, measured in terms of the utility (center panel) and the number of inspected policies before acceptance (right panel). Both panels show the mean (black curve) and the bands capturing 80% of the choices, excluding the first and last decile (shaded area). The values were calculated from 300 samples per setting of the inverse temperature.

4. Parallelism. The policies can be inspected in parallel, in contrast to perfect rationality.¹ Technically, this is possible because "π* is satisficing" is a statement in propositional logic. Instead, "π* is a maximizer", which can be rewritten as "for every policy π, π* has larger utility than π", is a statement in first-order logic with a universal quantifier that depends on exhaustive inspection.
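The properties above can be probed with a small simulation (an illustrative Python sketch, reusing the acceptance rule of BoundedRationalChoice on the same quadratic utility shape as Fig. 1, at a smaller scale): as α grows, the accepted utility rises, but so does the number of inspected policies.

```python
import math
import random

def satisficing_search(U, Q_sample, alpha, U_star=1.0):
    """Run the satisficing loop; return (accepted policy, inspections used)."""
    inspected = 0
    while True:
        pi = Q_sample()
        inspected += 1
        if random.random() <= math.exp(alpha * (U(pi) - U_star)):
            return pi, inspected

N = 100_000
U = lambda i: (i / (N - 1)) ** 2        # utility shape as in the Fig. 1 setup
Q = lambda: random.randrange(N)         # uniform prior over policies

results = {}
for alpha in (1.0, 10.0, 100.0):
    trials = [satisficing_search(U, Q, alpha) for _ in range(200)]
    mean_utility = sum(U(p) for p, _ in trials) / len(trials)
    mean_inspections = sum(k for _, k in trials) / len(trials)
    results[alpha] = (mean_utility, mean_inspections)
```

The diminishing returns in utility and the roughly proportional growth in inspections reproduce the qualitative behavior of the center and right panels of Fig. 1.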
The above basic choice algorithm is very general. Apart from modeling intractability, it can be used to represent choices under model uncertainty, risk and ambiguity sensitivity, and other players' attitudes towards the agent. In the main text, we will see how to derive the algorithm from first principles, how to model general information constraints in decision making, and how to extend the basic decision-making scheme to sequential scenarios.

1.2 Outlook

In the next section we briefly review the ideas behind SEU theory. We touch upon two-player games in order to motivate a central concept: the certainty-equivalent. The section then finishes with two open problems of SEU theory: intractability and model uncertainty.

1. The running time of exhaustive search can be brought down to O(log N) through dynamic programming and parallelism, but the number of inspected policies is still N.

Section 3 lays out the basic conceptual framework of information-theoretic bounded rationality. We analyze the operational characterization of having limited resources for deliberation, and state the main mathematical assumptions. From these, we then derive the free energy functional as the replacement for the expected utility in decision making. Section 4 applies the free energy functional to single-step decision problems. It serves the purpose of further sharpening the intuition behind the free energy functional, both from a theoretical and a practical view. In particular, the notion of equivalence of bounded-rational decision problems is introduced, along with practical rejection sampling algorithms. Section 5 extends the free energy functional to sequential decisions. We will see that the notion of equivalence plays a crucial role in the construction and interpretation of bounded-rational decision trees.
We show how such decision trees subsume many special decision rules that represent risk-sensitivity and model uncertainty. The section finishes with a recursive rejection sampling algorithm for solving bounded-rational decision trees. The last section discusses the relation to the literature and concludes.

2. Preliminaries in Expected Utility Theory

Before we introduce the ideas that underlie bounded-rational decision-making, in this section we will set the stage by briefly reviewing SEU theory.

Notation. Sets are denoted with calligraphic letters such as in X. The set Δ(X) corresponds to the simplex over X, i.e. the set of all probability distributions over X. The symbols Pr and E stand for the probability measure and expectation respectively, relative to some probability space with sample space Ω. Conditional probabilities are defined as

    Pr(A|B) := Pr(A ∩ B) / Pr(B)    (1)

where A, B ⊂ Ω are measurable subsets, and where B is such that Pr(B) ≠ 0.

2.1 Variational principles

The behavior of an agent is typically described in one of two ways: either dynamically, by directly specifying the agent's actions under any contingency; or teleologically, by specifying a variational problem (e.g. a convex objective function) that has the agent's policy as its optimal solution. While these two descriptive methods can be used to represent the same policy, the teleological description has a greater explanatory power because in addition it encodes a preference relation which justifies why the policy was preferred over the alternatives.²

Because of the explanatory power of variational principles, they are widely used to justify design choices. For example, virtually every sequential decision-making algorithm is conceived as a maximum expected utility problem.
This encompasses popular problem classes such as multi-armed bandits, Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs)—see Russell and Norvig [2010] for a discussion. In learning theory, regret minimization [Loomes and Sugden, 1982] is another popular variational principle to choose between learning algorithms [Cesa-Bianchi and Lugosi, 2006].

2. Notice also that in this sense, the qualifier "optimal" is a statement that only holds relative to the objective function. That is, "being optimal" is by no means an absolute statement, because given any policy, one can always engineer a variational principle that is extremized by it.

2.2 Subjective expected utility

Today, the theory of subjective expected utility [Savage, 1954] is the de facto theory of rationality in artificial intelligence and reinforcement learning [Russell and Norvig, 2010]. The bedrock of the theory is a representation theorem stating that the preferences of a rational agent can be described in terms of comparisons between expected utilities. The qualifiers "subjective" and "expected" in its name derive from the fact that both the utility function and the belief distribution are assumed to be properties that are specific to each agent, and that the utility of a random realization is equated with the expected utility over the individual realizations.

Decision problem. Let X and Y be two finite sets, the former corresponding to the set of actions available to the agent and the latter to the set of observations generated by the environment in response to an action. A realization is a pair (x, y) ∈ X × Y.
Furthermore, let U : X × Y → ℝ be a utility function, such that U(x, y) represents the desirability of the realization (x, y) ∈ X × Y; and let Q(·|·) be a conditional probability distribution, where Q(y|x) represents the probability of the observation y ∈ Y given the action x ∈ X.

Optimal policy. The agent's goal is to choose a policy that yields the highest expected utility; in other words, a probability distribution P ∈ Δ(X) over actions that maximizes the functional (see Fig. 2)

    EU[P̃] := Σ_x P̃(x) E[U|x] = Σ_x P̃(x) Σ_y Q(y|x) U(x, y).    (2)

Thus, an optimal policy is any distribution P with no support over suboptimal actions, that is, P(x) = 0 whenever E[U|x] < max_z E[U|z]. In particular, because the expected utility is linear in the policy's probabilities, one can always choose a solution that is a vertex of the probability simplex Δ(X):

    P(x) = δ_{x,x*} = 1 if x = x* := arg max_x E[U|x], and 0 otherwise,    (3)

where δ is the Kronecker delta. Hence, there always exists a deterministic optimal policy.

Utility distribution. Given the optimal policy, the utility U becomes a well-defined random variable with probability distribution

    Pr(U = u) = Pr({(x, y) : U(x, y) = u}) = Σ_{(x,y) ∈ U⁻¹(u)} P(x) Q(y|x).

Even though the optimal policy might be deterministic, the utility of the ensuing realization is in general a non-degenerate random variable. Notably, perfectly rational agents are insensitive to the higher-order moments of the utility: so two actions yielding utilities having very different variances are regarded as being equal if their expectations are the same.

Figure 2: Expected Utility Theory. Left: A decision problem can be represented as a tree with two levels.
Choosing a policy amounts to assigning transition probabilities for the decision node located at the root (△), subject to the fixed transition probabilities in the chance nodes (○) and the utilities at the leaves. Right: Mathematically, this is equivalent to choosing a member P of the simplex over actions Δ(X) that maximizes the convex combination over conditional expected utilities E[U|x].

Friends and foes. What happens when the environment is yet another agent? According to game theory [Von Neumann and Morgenstern, 1944], we can model this situation again as a decision problem, but with the crucial difference that the agent lacks the conditional probability distribution Q(·|·) over observations that it needs in order to evaluate expected utilities. Instead, the agent possesses a second utility function V : X × Y → ℝ representing the desires of the environment. Game theory then invokes a solution concept, most notably the Nash equilibrium, in order to propose a substitute for the missing distribution Q(·|·) based on U and V, thereby transforming the original problem into a well-defined decision problem.

To simplify, we assume that the agent chooses first and the environment second, and we restrict ourselves to two special cases: (a) the fully adversarial case U(x, y) = −V(x, y); and (b) the fully friendly case U(x, y) = V(x, y). The Nash equilibrium then yields the decision rules [Osborne and Rubinstein, 1999]

    P = arg max_{P̃} Σ_x P̃(x) min_{Q̃} Σ_y Q̃(y|x) U(x, y),    (4)
    P = arg max_{P̃} Σ_x P̃(x) max_{Q̃} Σ_y Q̃(y|x) U(x, y)    (5)

for the two cases respectively. Comparing these to (2), we observe two properties. First, the overall objective function appears to arise from a modular composition of two nested choices: the outer for the agent and the inner for the environment.
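Since the inner objectives in (4) and (5) are linear in Q̃, the inner optimum is attained at a deterministic response, so for finite sets it suffices to aggregate over y with min or max. This can be sketched on a toy payoff matrix (illustrative Python; the actions, observations, and utilities are invented for the example):

```python
def conditional_values(U, X, Y, aggregate):
    """Value of each action x under an environment that reacts with
    `aggregate` over outcomes: min (adversarial) or max (friendly)."""
    return {x: aggregate(U[(x, y)] for y in Y) for x in X}

X, Y = ["a1", "a2"], ["y1", "y2"]
U = {("a1", "y1"): 3, ("a1", "y2"): -1,
     ("a2", "y1"): 1, ("a2", "y2"): 1}

adversarial = conditional_values(U, X, Y, min)   # inner min, as in (4)
friendly = conditional_values(U, X, Y, max)      # inner max, as in (5)

best_vs_foe = max(X, key=adversarial.get)     # a2: guarantees utility 1
best_vs_friend = max(X, key=friendly.get)     # a1: can reach utility 3
```

The two rules recommend different actions on the same utilities: against a foe the agent prefers the safe action, against a friend the risky one.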
Second, there are essentially three ways in which the utilities of a choice are aggregated—namely through maximization, expectation, and minimization—depending on the nature of the choice.

Certainty-equivalent. An agent plans ahead by recursively aggregating future value into present value. As we have seen, it does so by summarizing the value of each choice using one of three aggregation types: minimization (▽), representing an adversarial choice; expectation (○), representing an indifferent (random) choice; and maximization (△), representing a friendly choice. This summarized value is known as the certainty-equivalent, because if the agent were to substitute a choice with multiple alternatives by an equivalent (degenerate) choice with a single alternative, the latter would have this value (see Fig. 3). Notice that for planning purposes, it is irrelevant whether the decision is made by the agent or not; to the agent, what matters is the degree to which an outcome (be it an action or an observation) contributes positively to its objective function (as encoded by one of the three possible aggregation operators). Indeed, another (concise) way to describe subjective expected utility theory is that the certainty-equivalent of random outcomes is given by the expectation.

Figure 3: Decision problems (top row) and their certainty-equivalents (bottom row).

2.3 Two open problems

We finish this section by presenting two very common situations in which the application of perfect rationality does not seem to be the right thing to do: very large choice spaces and model uncertainty.

Very large choice spaces. Consider the problem of choosing the longest straw in a given collection.
Expected utility theory works well when the number of elements is small enough so that the agent can exhaustively measure all their lengths and report the longest found. This is the case in the situation depicted in Fig. 4a, where it is easily seen that the top-most straw is the longest. The situation of Fig. 4b however appears to be very different. Here, the number of straws is very large and their arrangement is unstructured. The lack of structure prevents us from grouping the straws to simplify our search, i.e. there are no symmetries for defining equivalence classes that would reduce the cardinality of our search space. Consequently, there is no obvious improvement over testing some elements by picking at random until finding one that is sufficiently good.

Figure 4: Very large choice spaces. a) Perfect rationality can capture decision problems involving a small number of choices. b) However, perfect rationality is intractable in very large and unstructured choice spaces.

Model uncertainty. Consider a task in which you must choose one of two boxes containing 100 black and white balls. After your choice is announced, a ball is drawn at random from the box, and you win $100 if the color is black (and nothing otherwise). If you get to see the contents of the two boxes, then you would choose the box containing the larger proportion of black balls. For instance, in Fig. 5a you would choose the left box. However, which box would you choose in the situation shown in Fig. 5b? In his seminal paper, Ellsberg [1961] showed that most people bet on the left box. A simple explanation for this phenomenon is that people fill in the missing information about the right box by enumerating all the possible combinations of 100 black and white balls, attaching prior probabilities to them in a risk-averse way.
Thus, such prior beliefs assign a larger marginal probability to the losing color—in this case, "white". Now, assume that you place your bet, but no ball is drawn. Instead, you are asked to revise your bet. The contents of the boxes are assured to stay exactly the same, but now you are told that this time you win for "white" rather than "black". Would you change your bet? Empirically, most people stick with their initial bet. However, there is no single prior distribution over combinations that can predict this particular preference pattern, as this would contradict expected utility theory. Are people thus irrational? Perhaps, but it is worth pointing out that L. J. Savage, who formulated subjective expected utility theory, was himself among the people who violated the theory [Ellsberg, 1961].

Figure 5: The Ellsberg paradox. a) Two boxes with known proportions of black and white balls (50/50 and 25/75). b) Two boxes where only the proportions of the left box (50/50) are known.

One explanation for this apparent paradox is that the right box of the second experiment involves unknown probabilities,³ which affect the net value of a choice. In this case, people might actually have a uniform model over combinations of black and white balls, but they nevertheless discount the value of the box as an expression of distrust in the model. On the other hand, there are people who perceive increased net values in some situations. Such could be the case, for instance, if they knew and trusted the person who set up the experiment. We will see later in the text how this interaction between the trust in the beliefs and the utility can be modeled using the bounded-rational framework.

3. The Mathematical Structure of Boundedness

We now tackle the problem of formalizing boundedness. First, we will pursue an obvious route, namely that of meta-reasoning.
While meta-reasoning models can capture important aspects of complex decision-making (e.g. preliminary resource allocation and reasoning about other agents), the approach ultimately falls short of adequately modeling bounded rationality.

3.1 Meta-reasoning

One straightforward attempt to address the problem of intractability of expected utility theory is through meta-reasoning [Zilberstein, 2008], i.e. by letting the agent reason about the very costs of choosing a policy. The rationale is that an agent can avoid prohibitive costs by finding a trade-off between the utility of a policy and the cost of evaluating said policy.

Formal solution. The argument proceeds as follows. Let U : Π → ℝ be a bounded utility function that maps each policy π ∈ Π into a value U(π) in the unit interval. A perfectly-rational agent then solves the problem

    max_{π ∈ Π} U(π),    (6)

and uses any solution π* as its policy. Rather than solving (6), which is deemed to be too hard, a metalevel-rational agent solves the problem

    max_{π ∈ Π} { U(π) − C(π) },    (7)

where C(π) ∈ ℝ₊ is a positive penalization term due to the cost of evaluating the policy π that is spent by the agent's policy-search algorithm, e.g. the time or space complexity of a Turing machine.

Criticism. While the idea of meta-reasoning is intuitively appealing, it fails to simplify the original decision problem. This is seen by defining U′(π) := U(π) − C(π) and noting that the agent's meta-level reasoning maximizes the objective function

    max_{π ∈ Π} U′(π),    (8)

3. Importantly, the economic literature distinguishes between risk (= known probabilities) and ambiguity (= unknown probabilities) [Knight, 1921, Ellsberg, 1961]. Savage defined subjective expected utility theory as a theory of decision making under risk.
There is currently no widely agreed-upon definition of ambiguity, although there exist proposals, such as the framework of variational preferences [Maccheroni et al., 2006].

which is itself another perfectly-rational decision problem. But then, the rationale of meta-reasoning tells us that evaluating the meta-level utilities U′(π) should come with penalizations C′(π) due to the costs incurred by the policy-search algorithm used at the meta-level, and so on.

There are two problems with this scheme. First, this line of reasoning, carried to its logical conclusion, leads to an infinite regress: every time the agent instantiates a meta-level to reason about lower-level costs, it generates new costs at the meta-level. Second, the problem at a meta-level is typically harder than the lower-level one. How can we circumvent these problems?

3.2 Incomplete Information and Interrupted Deliberation

The central problem with meta-reasoning lies in its self-referential logic: whenever an agent attempts to reason about its own costs of reasoning, it creates a new perfectly-rational decision problem at the meta-level that comes with additional costs that are left out. We therefore conclude that it is impossible for an agent to fully apprehend its own resources of deliberation. Notice that this does not prevent agents from using meta-reasoning; it just excludes meta-reasoning as a formal solution to the bounded-rationality problem.

Formal solution. The problem can be solved by modeling the agent's deliberation as a game of incomplete information, or Bayesian game [Harsanyi, 1967]. Such a game allows us to represent the agent's ignorance about the very objective function it is supposed to optimize. Loosely speaking, it lets us model the "unknown uncertainties" (as opposed to the "known uncertainties") that the agent has about its goals.
We refer the reader to the texts on game theory by Osborne and Rubinstein [1999] or Leyton-Brown and Shoham [2008] for an introduction to Bayesian games.

We model the agent's decision as a single-player game of incomplete information. For this, it is useful to distinguish between the agent's alleged objective and the perceived objective as seen by an external observer. The agent is equipped with an objective function U(π) that it attempts to optimize with respect to the policy π ∈ Π. However, it unexpectedly runs out of resources, effectively being interrupted at an indeterminate point in its deliberation, and is forced to commit to a suboptimal policy π° ∈ Π. From an external observer's point of view, π° appears to be the result of an intentional deliberation seeking to optimize an alternative objective function U(π) − C(π) that trades off utilities and deliberation costs.

Formally, let 𝒞 be a discrete set of penalization functions of the form C : Π → ℝ that model different interruptions. Then the agent does not solve a particular problem, but rather a collection of problems given by

\[ \forall C \in \mathcal{C}, \quad \max_{\pi \in \Pi} \bigl\{ U(\pi) - C(\pi) \bigr\}. \tag{9} \]

In other words, the agent finds an optimal policy π*_C for every penalization function C ∈ 𝒞.

This abstract multi-valued optimization might appear unusual. However, one simple way of thinking about it is as an anytime optimization. In this scheme, the agent's computation generates "intermediate" solutions π₁, π₂, … to U − C₁, U − C₂, U − C₃, … respectively, where C₁, C₂, … is an exhaustive enumeration of 𝒞. This ordering of 𝒞 is not specified by (9); however, one can link it to computational resources by demanding an ascending ordering C₁(π₁) ≤ C₂(π₂) ≤ … of the penalizations.
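The anytime reading of (9) can be made concrete with a minimal Python sketch. All names and numbers below are illustrative assumptions (a toy policy set, a made-up complexity score, and a decreasing cost-rate schedule); the paper itself only specifies the abstract scheme.

```python
def anytime_solutions(policies, U, complexity, lambdas):
    """For each cost rate lam in the schedule, solve max_pi U(pi) - C(pi)
    with C(pi) = lam * complexity(pi), and record the intermediate winner.
    This mimics the collection-of-problems view of equation (9)."""
    solutions = []
    for lam in lambdas:
        solutions.append(max(policies, key=lambda p: U[p] - lam * complexity[p]))
    return solutions

# Illustrative setup: four policies; better policies are costlier to evaluate.
policies = ["a", "b", "c", "d"]
U = {"a": 0.2, "b": 0.45, "c": 0.75, "d": 1.0}
complexity = {"a": 0.0, "b": 1.0, "c": 2.0, "d": 4.0}
lambdas = [0.5, 0.3, 0.15, 0.0]   # later meta-levels charge less per unit

print(anytime_solutions(policies, U, complexity, lambdas))
```

As the cost rate decreases (more resources become available), the intermediate solutions move toward the perfectly-rational choice, the policy maximizing U alone.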
Nature secretly chooses a penalization function C° ∈ 𝒞, which is only revealed when the agent produces the solution π° to U − C°. This causes the computation to stop and the agent to return π°. Now, notice that since the agent does not know C°, its computational process can be regarded as an a priori specification of the solution to the multi-valued optimization problem (9).

3.3 Commensurability of utility and information

The fact that finding a good policy comes at a cost that changes the net value of the agent's pay-off suggests that utilities and search costs should be translatable into each other, i.e. they should be commensurable. The aim of this section is to present a precise relationship between the two quantities. This relationship, while straightforward, has rich and conceptually non-trivial consequences that significantly challenge our familiar understanding of decision-making.

Search. We first introduce a search model that will serve as a concrete example for our exposition. In this model, the agent searches by repeatedly obtaining a random point from a very large and unstructured domain until a desired target is hit. For this, we consider a finite sample space Ω and a probability measure Pr that assigns a probability Pr(S) to every subset S ⊂ Ω; a reference set B ⊂ Ω with Pr(B) ≠ 0; and a target set A ⊂ B ⊂ Ω. In each round, the agent samples a point ω from the conditional probability distribution Pr(·|B). If ω falls within the target set A, the agent stops. Otherwise, the next round begins and the sampling procedure is repeated.

The components of this model are interpreted as follows. The agent's prior knowledge is modeled using the reference set B. The points that lie outside of B are those that are never inspected by the agent (e.g. because it knows that these points are undesirable or because they are simply inaccessible).
Every choice of a reference set induces a prior distribution Pr(·|B). Furthermore, the difficulty of finding the target is relative: the number of samples N that the agent has to inspect until hitting the target depends upon the relative size of A with respect to B, which is given by the conditional probability Pr(A|B). Thereby, N is a random variable that follows a geometric distribution N ∼ G(p) with success probability p := Pr(A|B), having the expected value

\[ \mathbb{E}[N] = \frac{1}{\Pr(A|B)}. \tag{10} \]

Decision complexity. Based on the previous search model, we now define the decision complexity C(A|B) as a measure of the cost required to specify a target set A given a reference B, where A and B are arbitrary measurable subsets of the sample space Ω with Pr(B) ≠ 0. We impose the following properties on the measure C:

a) Functional form: for every A, B such that A ⊂ B ⊂ Ω, C(A|B) := f ∘ Pr(A|B), where f is a continuous, real-valued mapping.

b) Additivity: for every A, B, C such that A ⊂ B ⊂ C ⊂ Ω, C(A|C) = C(B|C) + C(A|B).

c) Monotonicity: for every A, B, C, D such that A ⊂ B ⊂ Ω and C ⊂ D ⊂ Ω, Pr(A|B) > Pr(C|D) ⟺ C(A|B) < C(C|D).

Figure 6: Decision complexity. In each panel, the reference and target sets are depicted by light and dark-colored areas respectively. a) The complexity varies continuously with the target A and the reference B. b) The complexity can be decomposed additively. c) The complexity decreases monotonically when increasing the conditional probability.

What does C look like? We first observe that desideratum (a) allows us to restrict our attention to continuous functions f that map the unit interval [0, 1] into the real numbers.
Then, desiderata (b) and (c) imply that any f must fulfill the functional equation

\[ f(pq) = f(p) + f(q) \tag{11} \]

for any p, q ∈ [0, 1], subject to the constraint f(p) < f(q) whenever p > q. It is well known that any solution to this functional equation must be of the form

\[ f(p) = -\frac{1}{\alpha} \log p, \tag{12} \]

where α > 0 is an arbitrary positive number. Thus, the complexity measure is given by

\[ C(A|B) = -\frac{1}{\alpha} \log \Pr(A|B) = \frac{1}{\alpha} \log \mathbb{E}[N], \tag{13} \]

that is, proportional to the Shannon information −log Pr(A|B) of A given B. This shows that the decision complexity is proportional to the minimal number of bits necessary to specify a choice, and proportional to the logarithm of the expected number of points E[N] that the agent has to sample before finding the target.

Figure 7: Deliberation as search. a) An agent's deliberation is a transformation of prior choice probabilities Q(x) into posterior choice probabilities P(x). The relative changes reflect the agent's preferences. b) To understand the complexity of this transformation, we model it as a search process with the convention Q(x) = Pr(x|Q) and P(x) = Pr(x|P), where P ⊂ Q ⊂ Ω.

Utility. Let us take one step back to revisit the concept of utility. Utilities were originally envisioned as auxiliary quantities for conceptualizing preferences in terms of simple numerical comparisons. In turn, a preference is an empirical tendency for choosing one course of action over another that the agent repeats under similar circumstances. In the economic literature, this rationale linking probability of choice and preferences is known as revealed preferences [Samuelson, 1938].
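The identities (10) and (13) are easy to check numerically. The following sketch (with an illustrative target probability; the function names are ours, not the paper's) estimates E[N] for the search model by Monte Carlo and compares the decision complexity against the logarithm of the expected search length.

```python
import math
import random

def expected_tries(p_target_given_ref, trials=200_000, seed=0):
    """Monte Carlo estimate of E[N]: repeatedly sample from Pr(.|B) until
    the point lands in the target A, where Pr(A|B) = p_target_given_ref."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        n = 1
        while rng.random() >= p_target_given_ref:   # miss the target: retry
            n += 1
        total += n
    return total / trials

p = 0.25                    # Pr(A|B): relative size of the target in B
est = expected_tries(p)
print(est)                  # close to 1/p = 4, as in equation (10)

alpha = 1.0
C = -(1.0 / alpha) * math.log(p)        # decision complexity, equation (13)
print(C)                                # equals (1/alpha) * log E[N] = log 4
```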
It is interesting to note that subjective expected utility is a theory that is particularly stringent in its behavioral demands, because it assumes that an agent will unequivocally choose the exact same course of action whenever it faces an equivalent situation, rather than just displaying a tendency towards repeating a choice. Intuitively though, it seems natural to admit weaker forms of preference. For instance, consider the situation in Fig. 7a depicting an agent's choice probabilities Q(x) and P(x) before and after deliberation respectively. Even though the agent does not commit to any particular choice, it is plausible that its preference relation is such that x₂ ≻ x₃ ≻ x₁, meaning that U(x₂) > U(x₃) > U(x₁) for an appropriately defined notion of utility.

Free energy. We have discussed several concepts such as decisions, choice probabilities, utilities, and so forth. Our next step consists in reducing these to a single primitive concept: decision complexity. This synthesis is desirable both for its theoretical parsimony and for the integrated picture of decision-making that it delivers. To proceed, we first note that an agent's deliberation process transforming a prior Q into a posterior P can be cast in terms of a search process where Q and P are the reference and target sets respectively. As illustrated in Fig. 7b, the set of available choices X forms a partition of the sample space Ω. The distributions P and Q are encoded as nested subsets P ⊂ Q ⊂ Ω using the notational convention

\[ P(x) := \Pr(x|P) \quad \text{and} \quad Q(x) := \Pr(x|Q), \tag{14} \]

that is, the two sets induce distributions over the choices through conditioning.
With these definitions in place, we can now derive an expression for the complexity C(P|Q) in terms of the choice-specific complexities C(x ∩ P | x ∩ Q):

\begin{align*}
C(P|Q) &= -\frac{1}{\alpha} \log \Pr(P|Q) \\
&= -\frac{1}{\alpha} \sum_x \Pr(x|P) \log \left[ \Pr(P|Q)\, \frac{\Pr(x|P)}{\Pr(x|Q)}\, \frac{\Pr(x|Q)}{\Pr(x|P)} \right] \\
&= -\frac{1}{\alpha} \sum_x \Pr(x|P) \log \frac{\Pr(x \cap P|Q)}{\Pr(x|Q)} + \frac{1}{\alpha} \sum_x \Pr(x|P) \log \frac{\Pr(x|P)}{\Pr(x|Q)} \\
&= \sum_x \Pr(x|P)\, C(x \cap P \mid x \cap Q) + \frac{1}{\alpha} \sum_x \Pr(x|P) \log \frac{\Pr(x|P)}{\Pr(x|Q)}. \tag{15}
\end{align*}

The first equality is an application of the definition (13). The second equality is obtained by multiplying the constant term Pr(P|Q) by a factor equal to one and then taking the expectation with respect to Pr(x|P). Separating the terms in the logarithm into two different sums and using the product rule Pr(x ∩ P|Q) = Pr(x|P) · Pr(P|Q) gives the third equality (note that P ∩ Q = P). The last step is another application of (13). If we now use the convention (14), we get

\[ C(P|Q) = \sum_x P(x)\, C(x \cap P \mid x \cap Q) + \frac{1}{\alpha} \sum_x P(x) \log \frac{P(x)}{Q(x)}. \tag{16} \]

How can we interpret this last expression? In essence, it relates the decision complexities of two different states of knowledge. If the agent's state of knowledge is (x ∩ Q), then the complexity of finding the target (x ∩ P) is equal to C(x ∩ P | x ∩ Q). This assumes that the agent knows that the resulting choice will be x for sure. However, since the agent does not know the final choice, (16) says that the total complexity of deliberation is equal to the average choice-specific complexity plus a penalty (given by the KL-divergence) due to not knowing the final outcome. Put differently, the certainty-equivalent complexity of an uncertain choice is larger than the expected complexity. This constitutes the most important deviation from subjective expected utility theory.
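The decomposition (16) can be verified numerically. Below is a minimal sketch: we represent the nested sets P ⊂ Q by per-choice cell masses Pr(x ∩ P) ≤ Pr(x ∩ Q) (the specific masses are made-up illustration), and check that the total complexity equals the average choice-specific complexity plus the KL penalty.

```python
import math

alpha = 2.0
# Cell masses Pr(x ∩ Q) and Pr(x ∩ P) for each choice x (illustrative),
# with P ⊂ Q enforced cell-wise by p_mass[x] <= q_mass[x].
q_mass = {"x1": 0.30, "x2": 0.20, "x3": 0.10}
p_mass = {"x1": 0.03, "x2": 0.10, "x3": 0.08}

ZQ, ZP = sum(q_mass.values()), sum(p_mass.values())
Q = {x: m / ZQ for x, m in q_mass.items()}   # Q(x) = Pr(x|Q), convention (14)
P = {x: m / ZP for x, m in p_mass.items()}   # P(x) = Pr(x|P)

# Left-hand side of (16): C(P|Q) = -(1/alpha) log Pr(P|Q)
lhs = -(1 / alpha) * math.log(ZP / ZQ)

# Right-hand side: expected choice-specific complexity + KL-divergence penalty,
# using Pr(x ∩ P | x ∩ Q) = p_mass[x] / q_mass[x].
avg_complexity = sum(P[x] * (-(1 / alpha) * math.log(p_mass[x] / q_mass[x]))
                     for x in P)
kl_penalty = (1 / alpha) * sum(P[x] * math.log(P[x] / Q[x]) for x in P)
rhs = avg_complexity + kl_penalty

print(lhs, rhs)   # agree up to floating-point error
```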
Equation (16) derived above can now be turned into a variational principle by observing that the r.h.s. is convex in the posterior choice probabilities. Thus, define utilities as quantities that, up to a constant C ∈ ℝ, are equal to negative complexities: U(x) := −C(x ∩ P | x ∩ Q) + C. Then, subtracting C from (16) and taking the negative gives the functional

\[ F[\tilde{P}] := \sum_x \tilde{P}(x)\, U(x) - \frac{1}{\alpha} \sum_x \tilde{P}(x) \log \frac{\tilde{P}(x)}{Q(x)}, \tag{17} \]

which now is concave in P̃ and, when maximized, minimizes the complexity of transforming Q into P subject to the utilities U(x). Equation (17) is the so-called free energy functional, or simply the free energy,⁴ and it will serve as the foundation for the approach to bounded-rational decision making of this work.

4. Single-Step Decisions

4.1 Bounded-rational decisions

In this section we take the free energy functional (17) as the objective function for modeling bounded-rational decision making. We first focus on simple one-step decisions and explore their conceptual and algorithmic implications.

Decision problem. A bounded-rational agent, when deliberating, transforms prior choice probabilities into posterior choice probabilities in order to maximize the expected utility, but it does so subject to information constraints. Formally, a bounded-rational decision problem is a tuple (α, X, Q, U), where: α ∈ ℝ is the inverse temperature, which acts as a rationality parameter; X is a finite set of possible outcomes; Q ∈ Δ(X) is a prior probability distribution over X representing a prior policy; and U : X → ℝ is a real-valued mapping of the outcomes called the utility function.

Goal.
Given a bounded-rational decision problem (α, X, Q, U), the goal consists in finding the posterior policy P ∈ Δ(X) that extremizes the free energy functional

\[ F[\tilde{P}] := \underbrace{\sum_x \tilde{P}(x)\, U(x)}_{\text{Expected Utility}} - \underbrace{\frac{1}{\alpha} \sum_x \tilde{P}(x) \log \frac{\tilde{P}(x)}{Q(x)}}_{\text{Information Cost}}. \tag{18} \]

Thus, the free energy functional captures a fundamental decision-theoretic trade-off: it corresponds to the expected utility, regularized by the information cost of representing the final distribution P using the base distribution Q. The functional is illustrated in Fig. 8.

Inverse temperature. The inverse temperature α controls the trade-off between utilities and information costs by setting the exchange rate between units of information (in bits) and units of utility (in utiles). An agent's deliberation process can be affected by a number of disparate factors imposing information constraints; nonetheless, the central assumption here is that all of them can ultimately be condensed into the single parameter α. Furthermore, notice that we have extended the domain of α to the real values ℝ. The sign of α determines the type of optimization: when α > 0, the free energy functional F is concave in P̃ and the posterior policy P is the maximizer; when α < 0, F is convex in P̃ and P is the minimizer. We will further elaborate on the meaning of negative values later when analyzing sequential decisions.

Optimal choice. The optimal solution to (18) is given by the Gibbs distribution

\[ P(x) = \frac{1}{Z_\alpha}\, Q(x)\, \exp\{\alpha U(x)\}, \qquad Z_\alpha = \sum_x Q(x)\, \exp\{\alpha U(x)\}, \tag{19} \]

4. Here we adopt this terminology to relate to existing work on statistical mechanical approaches to control. To be precise, however, in the statistical mechanical literature this functional corresponds to (a shifted version of) the negative free energy difference.
This is because utilities are negative energies, and because (17) characterizes the difference between two free energy potentials.

where the normalizing constant Z_α is the partition function. Inspecting (19), we see that the optimal choice probabilities P(x) form a standard Bayesian posterior⁵ obtained by multiplying the prior Q(x) with a likelihood term that grows monotonically with the utility U(x). The inverse temperature controls the balance between the prior Q(x) and the modifier exp{αU(x)}.

Figure 8: The Free Energy Functional. A bounded-rational decision problem combines a linear and a non-linear cost function, the first being the expected utility and the latter the KL-divergence of the posterior from the prior choice probabilities. The optimal distribution is the point on the linear subspace (defined by the expected utility and the inverse temperature) that minimizes the KL-divergence to the prior. In information geometry, this point is known as the information projection of the prior onto the linear subspace [Csiszár and Shields, 2004].

Certainty-equivalent. To understand the value that the agent assigns to a given decision problem, we need to calculate its certainty-equivalent. To do so, we insert the optimal choice probabilities (19) into the free energy functional (18), obtaining the expression

\[ F := F[P] = \frac{1}{\alpha} \log Z_\alpha = \frac{1}{\alpha} \log \sum_x Q(x)\, e^{\alpha U(x)}. \tag{20} \]

In the text, we will always use the notation F[·] (with functional brackets) for the free energy and F (without brackets) for the certainty-equivalent. An interesting property of the certainty-equivalent is revealed when we analyze its change with the inverse temperature α (Fig. 9).
Obviously, the more the agent is in control, the more effectively it can steer the outcome, and thus the higher it values the decision problem. In particular, the value and the choice probabilities take the following limits:

\begin{align*}
\alpha \to +\infty: &\quad \frac{1}{\alpha} \log Z_\alpha \to \max_x U(x), & P(x) \to U_{\max}(x), \\
\alpha \to 0: &\quad \frac{1}{\alpha} \log Z_\alpha \to \sum_x Q(x)\, U(x), & P(x) \to Q(x), \\
\alpha \to -\infty: &\quad \frac{1}{\alpha} \log Z_\alpha \to \min_x U(x), & P(x) \to U_{\min}(x),
\end{align*}

where U_max and U_min are the uniform distributions over the maximizing and minimizing subsets

\[ X_{\max} := \{ x \in X : U(x) = \max_{x'} U(x') \}, \qquad X_{\min} := \{ x \in X : U(x) = \min_{x'} U(x') \} \]

respectively. Here we see that the inverse temperature α plays the role of a boundedness parameter, and that the single expression (1/α) log Z_α is a generalization of the classical concept of value in reinforcement learning (see Fig. 10).

Figure 9: Certainty-equivalent and optimal choice. a) The certainty-equivalent (1/α) log Z_α, seen as a function of the inverse temperature α ∈ ℝ, has a sigmoidal shape that moves between min U and max U, passing through E[U] at α = 0. Panel (b) shows three optimal choice distributions for selected values of α. Notice that for α₂ = 0, P(x) = Q(x).

5. We will further elaborate on this connection later in the text.

4.2 Stochastic choice

A perfectly-rational agent must always choose the alternative with the highest expected utility. In general, this operation cannot be done without exhaustive enumeration, which is more often than not intractable. In contrast, we expect a bounded-rational agent to inspect only a subset of alternatives until it finds one that is good enough, that is, a satisficing choice. Furthermore, the effort put into analyzing the alternatives should scale with the agent's level of rationality. A central feature of information-theoretic bounded rationality is that this "algorithmic" property of analyzing just a subset is built right into the theory.
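The Gibbs posterior (19), the certainty-equivalent (20), and the three limits above can be checked with a few lines of Python. The distribution and utilities below are illustrative; large finite values of α stand in for the limits.

```python
import math

def gibbs(Q, U, alpha):
    """Optimal bounded-rational choice, equation (19):
    P(x) proportional to Q(x) * exp(alpha * U(x))."""
    w = {x: Q[x] * math.exp(alpha * U[x]) for x in Q}
    Z = sum(w.values())
    return {x: wx / Z for x, wx in w.items()}, Z

def certainty_equivalent(Q, U, alpha):
    """(1/alpha) log Z_alpha, equation (20)."""
    _, Z = gibbs(Q, U, alpha)
    return math.log(Z) / alpha

Q = {"x1": 0.5, "x2": 0.3, "x3": 0.2}
U = {"x1": 1.0, "x2": 2.0, "x3": 4.0}

for alpha in (50.0, 1e-6, -50.0):
    print(alpha, certainty_equivalent(Q, U, alpha))
# alpha >> 0: value approaches max U = 4 and P concentrates on x3;
# alpha near 0: value approaches E_Q[U] = 1.9 and P approaches Q;
# alpha << 0: value approaches min U = 1 and P concentrates on x1.
```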
Figure 10: Approximation of the classical decision rules using bounded-rational decision problems. a) We represent a bounded-rational decision problem using a colored square node, where the color encodes the inverse temperature. b–d) Classical decision rules and their bounded-rational approximations.

More precisely, this simplification is achieved by noticing that acting optimally amounts to obtaining just one random sample from the posterior choice distribution. Any other choice scheme that does not conform to the posterior choice probabilities, such as picking the mode of the distribution, violates the agent's information constraints modeled by the objective function. We review a basic sampling scheme that is readily suggested by the specific shape of the posterior distribution.

Rejection sampling. The simplest sampling scheme is immediately suggested by the form of the posterior choice distribution (19), illustrated in Fig. 11. If we interpret Q as prior knowledge that is readily available to the agent in the form of random samples, then the agent can filter them to generate samples from P using rejection sampling. This works as follows: the agent first draws a sample x from Q and then accepts it with probability

\[ A(x|U^*) = \min\bigl\{ 1,\ e^{\alpha [U(x) - U^*]} \bigr\}, \tag{21} \]

where U* ∈ ℝ is a target value set by the agent. This is repeated until a sample is accepted. Notice that this sampling scheme does not require the agent to draw the samples sequentially: indeed, rejection sampling can be parallelized by drawing many samples and returning any of the accepted choices. The next theorem guarantees that rejection sampling does indeed generate a sample from P.

Theorem 1.
Rejection sampling with acceptance probability (21) produces the correct distribution as long as U* ≥ max_x {U(x)} when α ≥ 0 and U* ≤ min_x {U(x)} when α ≤ 0.

Proof. First, we need a constant c such that for all x, P(x) ≤ c · Q(x). The smallest such constant is given by

\[ \frac{P(x)}{Q(x)} = \frac{e^{\alpha U(x)}}{\sum_{x'} Q(x')\, e^{\alpha U(x')}} \leq \frac{e^{\alpha U^*}}{\sum_{x'} Q(x')\, e^{\alpha U(x')}} = c. \]

These inequalities hold whenever U* is chosen as U* = max_x U(x) if α ≥ 0 and U* = min_x U(x) if α ≤ 0. Hence, given a sample x from Q, the acceptance probability is

\[ \frac{P(x)}{c \cdot Q(x)} = \frac{\frac{1}{Z} Q(x)\, e^{\alpha U(x)}}{\frac{1}{Z} Q(x)\, e^{\alpha U^*}} = \frac{e^{\alpha U(x)}}{e^{\alpha U^*}}. \]

Figure 11: Choosing optimally amounts to sampling from the posterior choice distribution P(x) ∝ Q(x) exp{αU(x)} using rejection sampling. a) A utility function U(x) and a target utility U*. b) The choice is phrased as a search process where the agent attempts to land a sample in the target set (dark area). Generating any choice corresponds to obtaining a successful Bernoulli random variate Z ∼ B(p_α), where the probability of success is equal to p_α = Z_α / exp{αU*}.

Efficiency. Notice that rejection sampling is equal to the search process defined in Section 3.3, where the probability of the target set is equal to

\[ p_\alpha = \sum_x Q(x)\, e^{\alpha [U(x) - U^*]} = \frac{Z_\alpha}{e^{\alpha U^*}}. \tag{22} \]

In other words, obtaining any sample is equivalent to obtaining a successful Bernoulli random variate B(p_α). Thus, the number of samples until acceptance N_α follows a geometric distribution G(p_α), and the expected value is E[N_α] = 1/p_α. Furthermore, the number of samples N_α(δ) needed to guarantee acceptance with a small failure probability δ > 0 is given by

\[ N_\alpha(\delta) = \frac{\log \delta}{\log(1 - p_\alpha)}. \tag{23} \]

This function is plotted in Fig. 12a.

Limit efficiency. What is the most efficient sampler? Assume w.l.o.g.
that the inverse temperature α is fixed and strictly positive. From the definition of utilities (14) and the probability-complexity equivalence (13) we get

\[ U(x) = -C(x \cap P \mid x \cap Q) + C = \frac{1}{\alpha} \log \Pr(x \cap P \mid x \cap Q) + C. \]

Inspecting Fig. 11b, we see that in rejection sampling the probability of finding any posterior choice is such that Pr(x ∩ P | x ∩ Q) = exp{α[U(x) − U*]}. Using this substitution implies

\[ U(x) = U(x) - U^* + C \quad \Longrightarrow \quad U^* = C, \tag{24} \]

that is, the target utility U* is exactly equal to the offset C that transforms complexities into utilities. Thus, choosing an inverse temperature α and a target utility U* indirectly determines the decision complexities, and thereby also the probability p_α of accepting a sample. Since decision complexities are positive and their absolute differences are fixed through the utilities, the optimal choice of the target utility must be

\[ U^* = \max_x \{U(x)\}. \tag{25} \]

Picking a smaller value for U* is not permitted, as it would violate the assumptions about the underlying search model (Sec. 3.3).

Figure 12: a) Number of samples until acceptance. b) Granularity. Increasing the resolution of the choice set from X to (X × Y) does not change the success probability.

Granularity. Typically, one would expect that the number of samples to be inspected before making a decision depends on the number of available options. However, in the bounded-rational case this is not so. Indeed, we can augment the granularity of the choice set without affecting the agent's decision effort. This is seen as follows. Assume that we extend the choice set from X to (X × Y), with the understanding that every pair (x, y) ∈ (X × Y) is a sub-choice of x ∈ X.
Then the utilities U(x) correspond to the certainty-equivalents of the utilities U(x, y), that is,

\[ U(x) = \frac{1}{\alpha} \log Z_\alpha(x) = \frac{1}{\alpha} \log \sum_{y} Q(y|x)\, e^{\alpha U(x,y)}. \tag{26} \]

Inserting this into the partition function for x, we get

\[ Z_\alpha = \sum_x Q(x)\, e^{\alpha U(x)} = \sum_{x,y} Q(x,y)\, e^{\alpha U(x,y)}, \]

that is, the partition function Z_α is independent of the level of resolution of the choice set. This, in turn, guarantees that the success probability of rejection sampling, p_α = Z_α / exp{αU*}, stays the same no matter how we partition the choice set.

4.3 Equivalence

There is more than one way to represent a given choice pattern. Two different decision problems can lead to identical transformations of prior choice probabilities Q into posterior choice probabilities P, and have the same certainty-equivalent. When these two conditions are fulfilled, we say that the decision problems are equivalent. The concept of equivalence is important because a given decision problem can be re-expressed in a more convenient form when necessary. This will prove to be essential when analyzing sequential decision problems later in the text.

Formal relation. Consider two equivalent bounded-rational decision problems (α, X, Q, U) and (β, X, Q, V) with non-zero inverse temperatures α, β ≠ 0. Then their certainty-equivalents are equal, that is,

\[ \frac{1}{\alpha} \log Z_\alpha = \frac{1}{\beta} \log Z_\beta, \tag{27} \]

where the partition functions are Z_α = Σ_x Q(x) exp{αU(x)} and Z_β = Σ_x Q(x) exp{βV(x)} respectively. The optimal choice probabilities of the second decision problem are equal to

\[ P(x) = \frac{1}{Z_\beta}\, Q(x)\, e^{\beta V(x)} = Q(x)\, \exp\Bigl\{ \beta \Bigl[ V(x) - \tfrac{1}{\beta} \log Z_\beta \Bigr] \Bigr\} = Q(x)\, \exp\Bigl\{ \beta \Bigl[ V(x) - \tfrac{1}{\alpha} \log Z_\alpha \Bigr] \Bigr\}, \]

where the last equality substitutes one certainty-equivalent for the other. Since these probabilities are equal to those of the first decision problem, we have

\[ \exp\Bigl\{ \beta \Bigl[ V(x) - \tfrac{1}{\alpha} \log Z_\alpha \Bigr] \Bigr\} = \exp\Bigl\{ \alpha \Bigl[ U(x) - \tfrac{1}{\alpha} \log Z_\alpha \Bigr] \Bigr\}.
\]
Then, taking the logarithm and rearranging gives

\[ V(x) = \frac{\alpha}{\beta}\, U(x) + \Bigl( \frac{1}{\alpha} - \frac{1}{\beta} \Bigr) \log Z_\alpha. \tag{28} \]

Thus, (28) is an explicit formula for the relation between the inverse temperatures α, β and the utilities U(x), V(x) of two equivalent bounded-rational decision problems. Essentially, equivalent decision problems have utilities that are scaled versions of each other around the axis defined by the certainty-equivalent (see Fig. 13). From the figure, we see that increasing the inverse temperature by a factor c requires decreasing the distance of the utilities to the certainty-equivalent by a factor 1/c, so that the product is maintained at all times.

Figure 13: Equivalent bounded-rational decision problems. The plot shows the utility curves for three equivalent decision problems with inverse temperatures α = 1, 1/2 and −1 respectively. The resulting utilities are scaled versions of each other with respect to the symmetry axis given by the certainty-equivalent (1/α) log Z_α.

Sampling from equivalent decision problems. In rejection sampling, the success probabilities p_α and p_β of two equivalent decision problems (α, X, Q, U) and (β, X, Q, V) are the same, that is,

\[ p_\alpha = \frac{Z_\alpha}{e^{\alpha U^*}} = \frac{Z_\beta}{e^{\beta V^*}} = p_\beta, \tag{29} \]

as long as their target utilities U* and V* obey the relation (28), i.e.

\[ V^* = \frac{\alpha}{\beta}\, U^* + \Bigl( \frac{1}{\alpha} - \frac{1}{\beta} \Bigr) \log Z_\alpha. \tag{30} \]

An important problem is to sample directly from a decision problem (β, X, Q, V) given the inverse temperature α and the target utility U* of an equivalent decision problem. A naive way of doing so consists in using (30) and (1/α) log Z_α = (1/β) log Z_β to derive an explicit formula for V*:

\[ V^* = \frac{\alpha}{\beta}\, U^* + \Bigl( 1 - \frac{\alpha}{\beta} \Bigr) \frac{1}{\beta} \log Z_\beta. \]

This formula requires integrating over the choice set to obtain the partition function Z_β.
However, this is a costly operation that does not scale to very large choice spaces: recall that all we are allowed to do is inspect the utilities V(x) for a few samples obtained from Q(x). Instead, we can relate the success probability p_α to the samples obtained from (β, X, Q, V). This is seen by rewriting p_α as follows:

\[ p_\alpha = \exp\Bigl\{ \alpha \Bigl[ \frac{1}{\alpha} \log Z_\alpha - U^* \Bigr] \Bigr\} = \exp\Bigl\{ \alpha \Bigl[ \frac{1}{\beta} \log Z_\beta - U^* \Bigr] \Bigr\}. \]

Then we express the inverse temperature as α = (α/β) · β and simplify the previous expression to

\[ p_\alpha = p^{\alpha/\beta}, \quad \text{where} \quad p := \frac{Z_\beta}{e^{\beta U^*}}. \tag{31} \]

This result has a convenient operational interpretation. The original problem, which consisted in obtaining one successful Bernoulli sample with success probability p_α, has been rephrased as the problem of obtaining α/β successful Bernoulli samples with probability of success p. Intuitively, the reason behind this change is that the new success probability p can be larger/smaller than the original success probability p_α, resulting in a more/less challenging search problem: therefore, (31) equalizes the search complexity by demanding less/more successful samples. Note that this conversion only works if the resulting rejection sampling problem fulfills the conditions of Theorem 1, that is: U* ≥ max_x {V(x)} for strictly positive β, or U* ≤ min_x {V(x)} for strictly negative β.

To use (31) effectively, we need to identify algorithms to sample an arbitrary, possibly non-integer number ξ ∈ ℝ of consecutive Bernoulli successes based only on a sampler for B(p), which in turn depends on the choice sampler Q. We first consider three base cases and then explain the general case.

Case ξ ∈ ℕ. If ξ is a natural number, then the Bernoulli random variate Z ∼ B(p^ξ) is obtained in the obvious way by attempting to generate ξ consecutive Bernoulli B(p) successes. If all of them succeed, then Z is a success; otherwise Z is a failure (see Algorithm 1).
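One trial of this integer case is straightforward to implement. The sketch below follows the logic of Algorithm 1; the function name `trial_integer_xi` and the three-choice setup are our own illustrative assumptions, and a full sampler simply repeats trials until one succeeds.

```python
import math
import random

def trial_integer_xi(xi, beta, V, U_star, sample_q, rng):
    """One rejection-sampling trial for integer xi: attempt xi consecutive
    Bernoulli successes, each with per-draw acceptance exp(beta [V(x) - U*]).
    Returns the last accepted choice x, or None if any sub-trial fails."""
    x = None
    for _ in range(xi):
        x = sample_q()
        if rng.random() > math.exp(beta * (V(x) - U_star)):
            return None          # one failed sub-trial fails the whole trial
    return x

# Illustrative setup: uniform prior Q over three choices.
choices = ["x1", "x2", "x3"]
V = lambda x: {"x1": 0.1, "x2": 0.6, "x3": 1.0}[x]
rng = random.Random(1)
sample_q = lambda: rng.choice(choices)

beta, xi, U_star = 1.0, 2, 1.0   # U* >= max V, as Theorem 1 requires
accepted = []
while len(accepted) < 5000:
    x = trial_integer_xi(xi, beta, V, U_star, sample_q, rng)
    if x is not None:
        accepted.append(x)
# Accepted samples follow the posterior P(x) proportional to Q(x) exp(beta V(x));
# demanding xi consecutive successes only lowers the per-trial success
# probability from p to p**xi, matching the search complexity p_alpha.
```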
Algorithm 1: Rejection-sampling trial for ξ ∈ ℕ
    input : A target U*, a number ξ ∈ ℕ, and a choice sampler Q
    output: A sample x or the failure symbol ∅
    for n = 1, ..., ξ do
        Draw x ∼ Q(x) and u ∼ U(0, 1)
        if u > exp{β[V(x) − U*]} then return ∅
    end
    return x

Case ξ ∈ (0, 1). If ξ is in the unit interval, then the Bernoulli success for B(p^ξ) is easier to generate than for B(p). Thus, we can tolerate a certain number of failures from B(p). The precise number is based on the following theorem.

Theorem 2. Let Z be a Bernoulli random variate with bias (1 − f_N), where

    f_N = Σ_{n=1}^{N} b_n,  and  b_n = (−1)^{n+1} ξ(ξ − 1)(ξ − 2)···(ξ − n + 1) / n!

for 0 < ξ < 1, and where N is a Geometric random variate with probability of success p. Then, Z is a Bernoulli random variate with bias p^ξ.

An efficient use of this sampling scheme is as follows. First, we generate an upper bound f* ∼ U(0, 1) that will fix the maximum number of tolerated failures, and initialize the "trial counter" to f = 0. Then, we repeatedly attempt to generate a successful Bernoulli sample Z ∼ B(p) as long as f < f*. If it succeeds, we return the sample; otherwise, we add a small penalty to the counter f and repeat (see Algorithm 2).

Algorithm 2: Rejection-sampling trial for ξ ∈ (0, 1)
    input : A target U*, a number ξ ∈ (0, 1), and a choice sampler Q
    output: A sample x or the failure symbol ∅
    Draw f* ∼ U(0, 1)
    Set b ← −1, f ← 0, and k ← 1
    Draw x ∼ Q(x) and u ∼ U(0, 1)
    while u > exp{β[V(x) − U*]} do
        Set b ← −b · (ξ − k + 1)/k, f ← f + b, and k ← k + 1
        Draw x ∼ Q(x) and u ∼ U(0, 1)
        if f* ≤ f then return ∅
    end
    return x

Case ξ = −1. If ξ is −1, then the interpretation of the target utility U* is flipped: an upper bound becomes a lower bound and vice versa. The resulting success probability is then equal to

    p_α = e^{βU*} / Z_β.
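The failure-budget scheme of Theorem 2 and Algorithm 2 can be sketched in Python. The snippet below strips away the utility-based acceptance test and works directly with a raw Bernoulli B(p) sampler; the function name and the test values are illustrative assumptions, not from the paper.

```python
import random

def bernoulli_power(p, xi, rng):
    """Draw a Bernoulli with bias p**xi for 0 < xi < 1, using only B(p)
    draws, following the failure-budget construction of Theorem 2."""
    f_star = rng.random()            # uniform bound on tolerated failures
    b, f, k = -1.0, 0.0, 1
    while rng.random() >= p:         # a failed B(p) trial
        b = -b * (xi - k + 1) / k    # next series coefficient b_k >= 0
        f += b                       # accumulate the failure penalty
        k += 1
        if f_star <= f:
            return False             # failure budget exhausted
    return True                      # B(p) succeeded within the budget

# Monte Carlo sanity check: the empirical mean approaches p**xi.
rng = random.Random(0)
n = 50_000
est = sum(bernoulli_power(0.4, 0.5, rng) for _ in range(n)) / n
assert abs(est - 0.4 ** 0.5) < 0.02
```

Since the coefficients b_k are non-negative for ξ ∈ (0, 1), the counter f increases monotonically toward 1, so the budget check against the uniform f* reproduces exactly the bias (1 − f_N) of Theorem 2.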
For ξ = −1, we do not know how to efficiently generate a sample from the inverse partition function. Instead, we use the following trick: we invert our acceptance criterion by basing our comparison on reciprocal probabilities, as shown in Algorithm 3.

Algorithm 3: Rejection-sampling trial for ξ = −1
    input : A target U* and a choice sampler Q
    output: A sample x or the failure symbol ∅
    Draw x ∼ Q(x) and u ∼ U(0, 1)
    if 1/u < exp{β[V(x) − U*]} then return ∅
    return x

General case. The rejection sampling algorithm for an arbitrary value of ξ ∈ ℝ is constructed from the preceding three cases. First, the sign of ξ determines whether we base our acceptance criterion on probabilities or on reciprocal probabilities. Then, we decompose the absolute value |ξ| into its integer and unit-interval parts, applying the associated algorithms to generate a choice.

4.4 Comparison

We finish this section with a brief comparison of decision-making based upon expected utility versus free energy. Table 1 tabulates the differences according to several criteria explained in the following. The rationality paradigm refers to the general decision-making rationale of the agent. We have seen that perfect rationality and bounded rationality can be regarded as choice models that are valid approximations at different scales or domain sizes. The different scales, in turn, suggest different search strategies and search scopes: the perfectly rational agent can use a fixed, deterministic rule that exploits the symmetries in the choice set in order to single out the optimal choice, while the bounded-rational agent has to settle on a satisficing choice found through random inspection, as if it were a search through a haystack. The two objective functions also differ in their dependency on the choice probabilities.
This dependency is linear in the expected utility case and non-linear in the free energy case. Because of this, the perfectly rational agent's preferences depend only upon the expected value of the utility, whereas a bounded-rational agent also takes into account the higher-order moments of the utility's distribution (utility sensitivity), which is easily seen through a Taylor expansion of the KL-divergence term of the free energy.

Table 1: Comparison of Decision Rules.

    Objective function      Expected Utility    Free Energy
    Rationality paradigm    perfect             bounded
    Preferred domain size   small               large
    Search strategy         deterministic       randomized
    Search scope            complete            incomplete
    Functional form         linear              non-linear
    Utility sensitivity     first moment        all moments

5. Sequential Decisions

In the previous section we have fleshed out the basic theory for single-step bounded-rational decision making. In real-world applications, however, agents have to plan ahead over multiple time steps and interact with another system called the environment. In these sequential decision problems, agents have to devise a policy, that is, a decision plan that prescribes how to act in any situation that the agent might encounter in the future.

Figure 14: Decision trees.

As in single-step decision problems, policies are chosen using a decision rule. The classical decision rules that are used in the literature depend upon the type of system the agent is interacting with. Three such popular decision rules are: Expectimax, when the agent is interacting with a stochastic environment; Minimax, where the agent is playing against an adversary; and Expectiminimax, in games with both an adversary and chance elements, such as Backgammon.
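As a point of reference for the bounded-rational generalization that follows, the three classical rules reduce to one short recursion over a decision tree. The node encoding below is ours: a leaf is a number, an internal node is a pair of a node kind and a list of children.

```python
# Classical decision rules (Expectimax / Minimax / Expectiminimax) as one
# recursion; the tree encoding and example values are illustrative.

def value(node):
    """Backward-induction value of a decision-tree node."""
    if isinstance(node, (int, float)):
        return float(node)            # leaf utility
    kind, children = node
    vals = [value(c) for c in children]
    if kind == "max":                 # agent's (or a friendly) move
        return max(vals)
    if kind == "min":                 # adversarial move
        return min(vals)
    return sum(vals) / len(vals)      # "exp": chance node (uniform here)

# Expectiminimax example: chance and adversarial nodes below the agent.
tree = ("max", [("exp", [("min", [3, 7]), ("min", [5, 1])]),
                ("exp", [4, 2])])
assert value(tree) == 3.0
```

With only `"max"` and `"exp"` kinds this is Expectimax; with `"max"` and `"min"` it is Minimax.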
The dynamical structure of the interactions between the agent and the environment is captured in a decision tree (or a game tree) like those illustrated in Figure 14. These decision trees are built by composing the primitives ▽, ◯, and △ (discussed in Section 2.2), representing an adversarial, a stochastic, and a friendly transition respectively. The optimal solution is then obtained using dynamic programming⁶ by recursively calculating the certainty-equivalent, and then taking the transition promising the highest value. Notice that, for planning purposes, it is immaterial whether the transitions are eventually taken by the agent or by the environment: all that matters is the degree to which a particular transition contributes towards the agent's overall objective. Thus, a max-node △ can stand both for the agent's action and for another player's cooperative move, for instance.

Figure 15: A bounded-rational approximation. Classical decision trees, such as the Expectiminimax decision tree (a), can be approximated with a bounded-rational decision tree (b) by substituting the classical nodes with bounded-rational counterparts. Recall that the nodes are color-coded, with darker nodes corresponding to higher inverse temperatures.

These classical decision trees can be approximated by bounded-rational decision trees. Recall from Section 2.2 that the classical certainty-equivalent operators ▽, ◯, and △ can be approximated by bounded-rational decision problems with appropriately chosen inverse temperatures. The very same idea can be used to substitute the nodes in a classical decision tree (see e.g. Figure 15). This approximation immediately suggests a much broader class of bounded-rational sequential decision problems that is interesting in its own right.
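A small sketch of this substitution: the soft certainty-equivalent F = (1/α) log Σ_x Q(x) e^{αU(x)} recovers the max, min, and expectation operators in the limits α → ∞, α → −∞, and α → 0. The prior, utilities, and tolerances below are illustrative assumptions.

```python
import math

def soft_ce(alpha, Q, U):
    """Bounded-rational certainty-equivalent (1/alpha) log sum Q exp(alpha U)."""
    return math.log(sum(q * math.exp(alpha * u) for q, u in zip(Q, U))) / alpha

Q = [0.25, 0.25, 0.25, 0.25]   # uniform prior over four transitions
U = [1.0, 2.0, 3.0, 4.0]       # transition utilities

assert abs(soft_ce(50.0, Q, U) - max(U)) < 0.1    # alpha >> 0: max node
assert abs(soft_ce(-50.0, Q, U) - min(U)) < 0.1   # alpha << 0: min node
assert abs(soft_ce(1e-6, Q, U) - 2.5) < 1e-3      # alpha ~ 0: expectation
```

Replacing each ▽, ◯, or △ node of a classical tree by `soft_ce` with a finite inverse temperature yields exactly the bounded-rational trees discussed next.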
The decision trees in this class are such that each node can have its own inverse temperature to model a variety of information constraints that result from resource limitations, risk, and trust sensitivity. Figure 16 illustrates some example decision types. Our next goal is to formalize these bounded-rational decision trees, explain how they follow from the single-step case, and present a sampling algorithm to solve them.

Figure 16: Map of bounded-rational decision rules. Panels (a–d) depict several two-step decision problems, where the agent and environment interact with inverse temperatures α and β respectively: a) the case α ≫ 0, β ≈ 0 approximates expected utility; b) the case α ≫ 0, β ≪ 0 is an approximation of a minimax/robust decision; c) the case α > 0, β ≈ 0 is a bounded-rational control of a stochastic environment; and d) α > 0, β > 0 corresponds to risk-seeking bounded-rational control. Panel (e) shows a map of the decision rules, spanning anti-rational to rational behavior and risk-averse/adversarial to risk-seeking/cooperative attitudes.

6. Also known as "solving the Bellman optimality equations" in the control and reinforcement learning literature, and as backtracking in the economics literature.

5.1 Bounded-rational decision trees

Definition. A bounded-rational decision tree is a tuple (T, X, α, Q, R, F) with the following components. T ∈ ℕ is the horizon, i.e. the depth of the tree. X is the set of interactions or transitions of the tree. We assume that it is finite, but potentially very large. Together with the horizon, it gives rise to an associated set of states (or nodes) of the tree S, defined as

    S := ∪_{t=0}^{T} X^t.

Each member s ∈ S is a path of length t ≤ T that uniquely identifies a node in the tree. Excepting the (empty) root node ε ∈ S, every other member s' ∈ S can be reached from a preceding node s ∈ S as long as sx = s' for a transition x ∈ X.
The function α : S → ℝ is the inverse temperature function; it assigns an inverse temperature α(s) to each node s ∈ S in the tree. We will assume that α(s) ≠ 0 for all s ∈ S. The conditional probability distribution Q(·|·) defines the prior transition probabilities: thus, Q(x|s) corresponds to the probability of moving to node s' = sx ∈ S from s ∈ S via the transition x ∈ X. Obviously, Σ_x Q(x|s) = 1 for each node s ∈ S. Every transition has, in turn, an associated conditional reward R(x|s) ∈ ℝ. We assume that these rewards are additive, so that R(uv|w) = R(u|w) + R(v|uw) for any strings u, v, w. Finally, F : S → ℝ is the terminal certainty-equivalent function, which attaches a value F(s) to each terminal state s ∈ X^T. Fig. 17 shows an example.

Sequence of interactions. The bounded-rational decision tree models a (prior) probability distribution over random sequences X_1, ..., X_T of length T. It suggests a chronological generative model: at time t, i.e. after having observed the t − 1 previous interactions X_{<t}, [...]

First, the target utility U*(s) must be larger than max_x{R(x|s) + F(sx)} if α(s) > 0, or smaller than min_x{R(x|s) + F(sx)} if α(s) < 0, respectively. Second, the free energy and the partition function are related as

    F(s) = (1/α(s)) log Z(s).    (43)

If s ∈ S is an internal node, then we can use the latter equation to substitute the free energies in expression (42). This in turn allows us to rewrite the success probability recursively:

    Z(s) e^{−α(s)U*(s)} = Σ_x Q(x|s) exp{ α(s) [ R(x|s) + (1/α(sx)) log Z(sx) ] } exp{ −α(s) U*(s) }
                        = Σ_x Q(x|s) exp{ (α(s)/α(sx)) log Z(sx) } exp{ −α(s) [ U*(s) − R(x|s) ] }
                        = Σ_x Q(x|s) [ Z(sx) e^{−α(sx)U*(sx)} ]^{α(s)/α(sx)},    (44)

where we have defined U*(sx) := U*(s) − R(x|s) as the target utility for the subtree rooted at sx ∈ S.
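The algebra behind (44) can be checked numerically on a one-level tree. All quantities below are illustrative; the child partition functions are obtained from the terminal certainty-equivalents via Z(sx) = e^{α(sx)F(sx)}.

```python
import math

# Sketch: verify the recursion (44) for the success probability on a
# one-level bounded-rational tree (all numbers and names are ours).

alpha_root = 1.0
children = {            # x: (Q(x|s), R(x|s), alpha(sx), F(sx))
    "a": (0.6, 1.0, 2.0, 0.5),
    "b": (0.4, -1.0, 0.5, 2.0),
}
U_star = 5.0            # target utility for the whole problem

# Left-hand side: Z(s) e^{-alpha(s) U*(s)}, with
# Z(s) = sum_x Q(x|s) exp{alpha(s) [R(x|s) + F(sx)]}.
Z_root = sum(q * math.exp(alpha_root * (r + F))
             for q, r, a, F in children.values())
lhs = Z_root * math.exp(-alpha_root * U_star)

# Right-hand side of (44): child success probabilities raised to the
# exponent alpha(s)/alpha(sx), with U*(sx) = U*(s) - R(x|s).
rhs = 0.0
for q, r, a, F in children.values():
    Z_child = math.exp(a * F)
    p_child = Z_child * math.exp(-a * (U_star - r))
    rhs += q * p_child ** (alpha_root / a)

assert abs(lhs - rhs) < 1e-12
```

Each right-hand term collapses to Q(x|s) e^{α(s)[R(x|s)+F(sx)]} e^{−α(s)U*(s)}, so the identity holds exactly, not just approximately.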
Thus, (44) says that obtaining a sample from an internal no de s ∈ S amoun ts to first pic king a random transition x ∈ X and then obtaining α ( sx ) /α ( s ) successful samples from the subtree rooted at sx ∈ S , whic h w e kno w is an equiv alent subtree ha ving in v erse temp erature α ( sx )—see Figure 19b. W e ha ve already seen in Section 4.3 how to sample this op eration using (mo difications of ) the rejection sampling algorithm. If s ∈ S is a terminal no de instead, then we can treat it as a normal single-step decision problem and sample a c hoice using rejection sampling. A detailed pseudoco de for this recursive rejection sampling algorithm is listed in Algo- rithm 4. The function sample ( s, U ∗ , σ ) tak es as argumen ts a no de s ∈ S ; a target utility U ∗ for the whole path; and a sign-flag σ ∈ {− 1 , +1 } , whic h keeps track of whether we are using normal ( σ = +1) or recipro cal probabilities ( σ = − 1) in the rejection sampling step due to the p ossible c hange of sign in the in v erse temp erature. It returns either an accepted path τ ∈ X T or if the prop osal is rejected. Planning is inv ok ed on the ro ot no de by executing sample ( , U ∗ , 1), where U ∗ is the global target utilit y . F or this recursiv e sampler to return a path drawn from the correct distribution, the v alue of U ∗ m ust b e chosen so as to b e larger than the rewards of an y path, i.e. for all x ≤ T ∈ X T , U ∗ ≥ T X t =1 R ( x t | x 0 and u ≤ p ) then return sx if ( σ < 0 and 1 /u ≥ p ) then return sx return end R e cursion: σ ← σ · sign( α ( s ) /α ( sx )) ξ ← abs( α ( s ) /α ( sx )) b egin A ttempt to gener ate b ξ c suc c essful samples. for 1 , . . . , b ξ c do τ ← sample ( sx, U ∗ − R ( x | s ) , σ ) if τ = then return end if ξ = b ξ c then return τ end b egin A ttempt to gener ate ξ − b ξ c suc c essful samples. 
            Draw u ∼ U(0, 1)
            a ← ξ − ⌊ξ⌋
            Set b ← −1, f ← 0, and k ← 1
            τ ← sample(sx, U* − R(x|s), σ)
            while τ = ∅ do
                Set b ← −b · (a − k + 1)/k, f ← f + b, and k ← k + 1
                τ ← sample(sx, U* − R(x|s), σ)
                if u ≤ f then return ∅
            end
            return τ
        end

The condition above on U* applies whenever the root node's inverse temperature α(ε) > 0 is strictly positive; conversely, U* must be smaller than the rewards of any path whenever α(ε) < 0. Finally, we note that, as in the single-step case, the agent can parallelize the recursive rejection sampler by simultaneously exploring many stochastically generated paths, and then returning any of the accepted ones.

6. Discussion

6.1 Relation to literature

In this paper we have presented a summary of an information-theoretic model of bounded-rational decision-making that has precursors in the economic literature [McKelvey and Palfrey, 1995, Mattsson and Weibull, 2002, Wolpert, 2004, Sims, 2003, 2005, 2006, 2011] and that has emerged through the application of information-theoretic methods to stochastic control and to perception-action systems [see e.g. Mitter and Newton, 2005, Kappen, 2005, Todorov, 2006, 2009, Theodorou et al., 2010, Theodorou, 2015, Still, 2009, Still and Precup, 2012, van den Broek et al., 2010, Friston, 2010, Peters et al., 2010, Tishby and Polani, 2011, Kappen et al., 2012, Neumann and Peters, 2012, Rawlik and Toussaint, 2012, Rubin et al., 2012, Fox and Tishby, 2012, Neumann and Peters, 2013, Grau-Moya et al., 2013, Tishby and Zaslavsky, 2015, Tanaka et al., 2015, Mohamed and Rezende, 2015, Genewein et al., 2015, Leibfried and Braun, 2015]. In particular, the connection between the free energy functional and bounded rationality, the identification of the free energy extremum as the certainty-equivalent in sequential games, and the implementation of the utility-complexity trade-off based on sampling algorithms were developed in a series of publications within this community [see e.g.
Braun et al., 2011, Ortega and Braun, 2011, 2013, Ortega et al., 2014, 2015]. The most distinguishing feature of this approach to bounded rationality is the formalization of resources in terms of information.

Historical roots. The problem of bounded-rational decision-making gained popularity in the 1950s, originating in the work of Herbert Simon [Simon, 1956, 1972, 1984]. Simon proposed that bounded-rational agents do not optimize but satisfice; that is, they do not search for the absolute best option, but rather settle for an option that is good enough. Since then, the research field of bounded-rational decision-making has diversified considerably, leading to a split between optimization-based approaches and approaches that dismiss optimization as a misleading concept altogether [Lipman, 1995, Russell, 1995, Russell and Subramanian, 1995, Aumann, 1997, Rubinstein, 1998, Gigerenzer and Selten, 2001]. In particular, the formulation as a constrained optimization problem is argued to lead to an infinite regress, and to the paradoxical situation that a bounded-rational agent would have to solve a more complex (i.e. constrained) optimization problem than a perfectly rational agent. Information-theoretic bounded rationality provides a middle ground here, as the agent randomly samples choices but does not incur a meta-optimization, because the random search simply stops when resources run out. In fact, the equations of the information-theoretic bounded-rational agent can be interpreted as a stochastic satisficing procedure instantiated by rejection sampling.

KL-Control.
From the vantage point of an external observer, information-theoretic bounded-rational agents appear to trade off any gains in utility against the additional information-theoretic complexity as measured by the KL-divergence between a prior decision strategy and a posterior decision strategy after deliberation. Recently, a number of studies have suggested the use of the relative entropy as a cost function for control, which is sometimes referred to as KL-Control [Todorov, 2006, 2009, Kappen et al., 2012]. In the work by Todorov [2006], the transition probabilities of a Markov decision process are controlled directly, and the control costs are given by the KL-divergence between the controlled dynamics and the passive dynamics described by a baseline distribution. This framework has also been extended to the continuous case, leading to the formulation of path integral control [Kappen, 2005, Theodorou et al., 2010]. Conceptually, the most important difference to the bounded-rational interpretation is that in KL-Control the stochasticity of choice is thought to arise from the environmental passive dynamics rather than being a direct consequence of limited information capacities.

Psychology. In the psychological and econometric literature, stochastic choice rules have been studied extensively, starting with Luce [1959] and extending through McFadden [1974], Meginnis [1976], Fudenberg and Kreps [1993], McKelvey and Palfrey [1995], and Mattsson and Weibull [2002]. The vast majority of models have concentrated on logit choice models based on the Boltzmann distribution, which includes the softmax rule that is popular in the reinforcement learning literature [Sutton and Barto, 1998]. McFadden [1974] has shown that such Boltzmann-like choice rules can arise, for example, when utilities are contaminated with additive noise following an extreme value distribution.
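A minimal sketch of such a generalized logit choice rule, p(x) ∝ q(x) e^{βU(x)}: with a uniform prior it collapses to the standard softmax, while a non-uniform prior shifts probability toward habitual options. All names and values here are ours.

```python
import math

def logit_choice(beta, prior, U):
    """Generalized logit choice rule p(x) ∝ prior(x) exp(beta * U(x))."""
    w = [q * math.exp(beta * u) for q, u in zip(prior, U)]
    Z = sum(w)
    return [wi / Z for wi in w]

U = [1.0, 2.0, 0.5]
uniform = [1 / 3] * 3

# Uniform prior: the rule reduces to the familiar softmax.
softmax = [math.exp(2.0 * u) / sum(math.exp(2.0 * v) for v in U) for u in U]
assert all(abs(a - b) < 1e-12
           for a, b in zip(logit_choice(2.0, uniform, U), softmax))

# A non-uniform prior biases choice toward the habitual first option.
p = logit_choice(2.0, [0.8, 0.1, 0.1], U)
assert p[0] > softmax[0]
```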
From the physics literature, it is also well known that Boltzmann distributions arise in the context of variational principles in the free energy, implying a trade-off between utility and entropic resource costs. The information-theoretic model of bounded rationality generalizes these ubiquitous logit choice models by allowing arbitrary prior distributions, as for example in McKelvey and Palfrey [1995]. This corresponds to a trade-off between utility gains and the additional resource costs incurred by deviating from the prior, which ultimately entails a variational principle in a free energy difference.

Economics. In the economic literature, variational principles for choice have been suggested in the context of variational preference models [Maccheroni et al., 2006]. In variational preference models, the certainty-equivalent value of a choice consists of two terms: the expected utility and an ambiguity index. A particular instance of the variational preference model is the multiplier preference model, where the ambiguity index is given by a Kullback-Leibler divergence [Hansen and Sargent, 2008]. In particular, it has been proposed that multiplier preference models allow dealing with model uncertainty, where the KL-divergence indicates the degree of model uncertainty. Free energy variational principles also appear in variational Bayesian inference. In this case the utility function is given by the negative log-likelihood, and the free energy trade-off captures the transformation from Bayesian prior to posterior. The variational Bayes framework has recently also been proposed as a theoretical framework for understanding brain function [Friston, 2009, 2010], where perception is modeled as variational Bayesian inference over the hidden causes of observations.

Computational approaches.
While the free energy functional does not depend on domain-specific assumptions, the exact relationship between information-theoretic constraints and the standard measures of algorithmic complexity in computer science (e.g. space and time) is not known. In contrast, the notion of bounded optimality defines the optimal policy as the program that achieves the highest utility score on a particular machine given complexity constraints, and oftentimes relies on meta-reasoning for practical implementations [Horvitz, 1989, Russell and Subramanian, 1995]. This view has recently experienced a revival in computational neuroscience under the name of computational rationality [Lieder et al., 2014, Lewis et al., 2014, Griffiths et al., 2015, Gershman et al., 2015]. Another recent approach to modeling bounded resources, with profound implications, is space-time embedded intelligence, in which agents are treated as local computation patterns within a global computation of the world [Orseau and Ring, 2012]. It remains an interesting challenge for the future to extend the framework of information-theoretic bounded rationality to the realm of programs and to relate it to notions of algorithmic complexity.

6.2 Conclusions

The original question we have addressed is: how do agents make decisions in very large and unstructured choice spaces? The need for solving this question is becoming increasingly critical, as such decision spaces are ubiquitous in modern agent systems. To answer it, we have formalized resource limitations as information constraints, and then replaced the objective function of subjective expected utility theory with the free energy functional. An advantage of the free energy functional is that its optimal solution has a clear operational interpretation.
As a result, the optimal solution is a fully parallelizable stochastic choice strategy that strikes a trade-off between utility and search effort. Perhaps more fundamentally, though, the theory lays out a general method to model reasoning under information constraints that arise as symptoms of intractability, model uncertainty, or other causes. This feature becomes especially apparent in the sequential decision case, where a bounded-rational decision tree captures an agent's dynamics of trust, both in its own ability to shape and predict the future and in the other players' intentions. We have seen that model uncertainty biases the value estimates of an agent, forcing it to pay attention to the higher-order moments of the utility.

For the sake of parsimony of the exposition, in this work we have refrained from elaborating on specific applications or extensions. There are obvious connections to Bayesian statistics that we have not fleshed out. Furthermore, the ideas outlined here can be applied to any choice process that is typically subject to information constraints, among them: attention focus and the generation of random features; model selection and inference in probabilistic programming; and planning in active learning, Bayesian optimization, multi-agent systems, and partially observable Markov decision processes. The success of the theory will ultimately depend on its usefulness in these application domains.

Acknowledgments

The authors would like to thank Daniel Polani, Bert J. Kappen, and Evangelos Theodorou, who provided innumerable insights during many discussions. This study was funded by the Israeli Science Foundation Center of Excellence, the DARPA MSEE Project, the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI), and the Emmy Noether Grant BR 4164/1-1.

Appendix A.
Proofs

A.1 Proof of Theorem 2

Proof. By expanding h(p) = p^ξ around p_0 = 1, Taylor's theorem asserts that

    p^ξ = 1 − ξ(1 − p) + (ξ(ξ − 1)/2!)(1 − p)² − (ξ(ξ − 1)(ξ − 2)/3!)(1 − p)³ + ···

Since 0 < ξ < 1, each term after the first is negative, hence

    p^ξ = 1 − Σ_{n=1}^{∞} b_n (1 − p)^n,  where  b_n := (−1)^{n+1} ξ(ξ − 1)(ξ − 2)···(ξ − n + 1) / n!    (45)

and where the 0 ≤ b_n ≤ 1 are known a priori. Hence,

    1 − p^ξ = Σ_{n=1}^{∞} b_n (1 − p)^n
            = Σ_{n=1}^{∞} b_n Σ_{k=0}^{∞} (1 − p)^{n+k} p
            = Σ_{n=1}^{∞} ( Σ_{k=1}^{n} b_k ) (1 − p)^n p,

where the second equality follows from multiplying the term (1 − p)^n with

    1 = p Σ_{k=0}^{∞} (1 − p)^k

and the third from a diagonal enumeration of the summands. Define f_0 := 0 and f_n := Σ_{k=1}^{n} b_k for n ≥ 1. Since from (45)

    1 = 1 − 0^ξ = Σ_{n=1}^{∞} b_n,

we know that 0 ≤ f_n ≤ 1 for all n ≥ 0 as well. Finally,

    p^ξ = 1 − Σ_{n=1}^{∞} f_n (1 − p)^n p = Σ_{n=0}^{∞} (1 − f_n)(1 − p)^n p

corresponds to the expectation of (1 − f_N), where N follows a Geometric distribution with probability of success p. To obtain a Bernoulli random variable Z with bias p^ξ, we can thus sample N first and then generate a Bernoulli random variable Z|N with bias 1 − f_N.

References

R. J. Aumann. Rationality and Bounded Rationality. Games and Economic Behavior, 28:42–67, 1997.

D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996. ISBN 1886529108.

D. A. Braun and P. A. Ortega. Information-theoretic bounded rationality and epsilon-optimality. Entropy, 16(8):4662–4676, 2014.

D. A. Braun, P. A. Ortega, E. Theodorou, and S. Schaal. Path integral control and bounded rationality. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 202–209, 2011.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

I. Csiszár and P. C. Shields.
Information Theory and Statistics: A Tutorial. NOW Publishers, 2004.

H. Dixon. Some thoughts on economic theory and artificial intelligence. In Surfing Economics: Essays for the Enquiring Economist. Palgrave, 2001.

M. O. Duff. Optimal learning: computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, University of Massachusetts Amherst, 2002. Director: Andrew Barto.

D. Ellsberg. Risk, Ambiguity and the Savage Axioms. The Quarterly Journal of Economics, 75:643–669, 1961.

R. Fox and N. Tishby. Bounded Planning in Passive POMDPs. In ICML, pages 1775–1782, 2012.

K. Friston. The free-energy principle: a rough guide to the brain? Trends in Cognitive Sciences, 13:293–301, 2009.

K. Friston. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11:127–138, 2010.

D. Fudenberg and D. Kreps. Learning mixed equilibria. Games and Economic Behavior, 5:320–367, 1993.

T. Genewein, F. Leibfried, J. Grau-Moya, and D. A. Braun. Bounded rationality, abstraction and hierarchical decision-making: an information-theoretic optimality principle. Frontiers in Robotics and AI, 2(27), 2015.

S. J. Gershman, E. J. Horvitz, and J. B. Tenenbaum. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science, 349(6245):273–278, 2015.

G. Gigerenzer and R. Selten. Bounded Rationality: The Adaptive Toolbox. MIT Press, Cambridge, MA, 2001.

J. Grau-Moya, E. Hez, G. Pezzulo, and D. A. Braun. The effect of model uncertainty on cooperation in sensorimotor interactions. Journal of The Royal Society Interface, 10(87), 2013.

T. L. Griffiths, F. Lieder, and N. D. Goodman. Rational use of cognitive resources: Levels of analysis between the computational and the algorithmic. Topics in Cognitive Science, 7(2):217–229, 2015.

L. P. Hansen and T. J. Sargent. Robustness.
Princeton University Press, Princeton, 2008.

J. Harsanyi. Games with Incomplete Information Played by "Bayesian" Players. Management Science, 14(3):159–182, 1967.

E. J. Horvitz. Reasoning about beliefs and actions under computational resource constraints. In Uncertainty in Artificial Intelligence, 1989.

M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2004.

H. J. Kappen. A linear theory for control of non-linear stochastic systems. Physical Review Letters, 95:200201, 2005.

H. J. Kappen, V. Gómez, and M. Opper. Optimal control as a graphical model inference problem. Machine Learning, 1:1–11, 2012.

F. H. Knight. Risk, Uncertainty, and Profit. Houghton Mifflin, Boston, 1921.

L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo Planning. In Proceedings of ECML, pages 282–293, 2006.

S. Legg. Machine Super Intelligence. PhD thesis, Department of Informatics, University of Lugano, June 2008.

F. Leibfried and D. A. Braun. A Reward-Maximizing Spiking Neuron as a Bounded Rational Decision Maker. Neural Computation, 27(8):1686–1720, 2015.

R. L. Lewis, A. Howes, and S. Singh. Computational rationality: linking mechanism and behavior through bounded utility maximization. Topics in Cognitive Science, 6(2):279–311, 2014.

K. Leyton-Brown and Y. Shoham. Essentials of Game Theory: A Concise Multidisciplinary Introduction. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2008.

F. Lieder, D. Plunkett, J. B. Hamrick, S. J. Russell, N. J. Hay, and T. L. Griffiths. Algorithm selection by rational metareasoning as a model of human strategy selection. In Advances in Neural Information Processing Systems 27, 2014.

B. Lipman. Information processing and bounded rationality: a survey. Canadian Journal of Economics, 28:42–67, 1995.

G. Loomes and R. Sugden.
Regret theory: An alternative approach to rational choice under uncertainty. Economic Journal, 92:805–824, 1982.

R. D. Luce. Individual choice behavior. Wiley, Oxford, 1959.

F. Maccheroni, M. Marinacci, and A. Rustichini. Ambiguity aversion, robustness, and the variational representation of preferences. Econometrica, 74:1447–1498, 2006.

L.-G. Mattsson and J. W. Weibull. Probabilistic choice and procedurally bounded rationality. Games and Economic Behavior, 41(1):61–78, 2002.

D. McFadden. Conditional logit analysis of qualitative choice behavior. In P. Zarembka, editor, Frontiers in Econometrics. Academic Press, New York, 1974.

R. D. McKelvey and T. R. Palfrey. Quantal Response Equilibria for Normal Form Games. Games and Economic Behavior, 10(1):6–38, July 1995.

J. R. Meginnis. A new class of symmetric utility rules for gambles, subjective marginal probability functions, and a generalized Bayes' rule. In Proceedings of the American Statistical Association, Business and Economic Statistics Section, pages 471–476, 1976.

S. K. Mitter and N. J. Newton. Information and Entropy Flow in the Kalman-Bucy Filter. Journal of Statistical Physics, 118:145–176, 2005.

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

S. Mohamed and D. J. Rezende. Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning. In NIPS, 2015.

D. C. Neumann and J. Peters. Hierarchical relative entropy policy search. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2012.

D. C. Neumann and J. Peters. Autonomous reinforcement learning with hierarchical REPS. In Proceedings of the International Joint Conference on Neural Networks, 2013.

L. Orseau and M. Ring.
Space-Time em b edded in telligence. A rtificial Gener al Intel li- genc e , pages 209–218, 2012. P . A. Ortega. A unifie d fr amework for r esour c e-b ounde d autonomous agents inter acting with unknown envir onments . PhD thesis, Department of Engineering, Univ ersit y of Cam- bridge, UK, 2011. P . A. Ortega and D. A. Braun. Information, utility and b ounded rationality. In L e ctur e notes on artificial intel ligenc e , volume 6830, pages 269–274, 2011. P . A. Ortega and D. A. Braun. F ree Energy and the Generalized Optimality Equations for Sequential Decision Making. In Eur op e an Workshop on R einfor c ement L e arning (EWRL’10) , 2012. P . A. Ortega and D. A. Braun. Thermo dynamics as a Theory of Decision-Making with Information Pro cessing Costs. Pr o c e e dings of the R oyal So ciety A 20120683 , 2013. P . A. Ortega and D. D. Lee. An Adversarial Interpretation of Information-Theoretic Bounded Rationalit y . In Twenty-Eight AAAI Confer enc e on Artificial Intel ligenc e (AAAI) , 2014. 44 Informa tion-Theoretic Bounded Ra tionality P . A. Ortega, D. A. Braun, and N. Tishb y . Monte Carlo Metho ds for Exact & Efficient Solution of the Generalized Optimality Equations. In IEEE International Confer enc e on R ob otics and A utomation (ICRA) , 2014. P . A. Ortega, K.-E. Kim, and D. D. Lee. Reactive Bandits with A ttitude. In 18th Interna- tional Confer enc e on A rtificial Intel ligenc e and Statistics (AIST A TS) , 2015. M. J. Osb orne and A. Rubinstein. A Course in Game The ory . MIT Press, 1999. C. H. P apadimitriou and J. N. Tsitsiklis. The Complexit y of Marko v Decision Pro cesses. Mathematics of Op er ations R ese ar ch , 12(3):441–450, 1987. J. Peters, K. M¨ ulling, and Y. Alt ¨ un. Relativ e entrop y p olicy searc h. In AAAI , 2010. K. Rawlik and M. Vijay akumar S. T oussaint. On stochastic optimal control and reinforce- men t learning b y appro ximate inference. In Pr o c e e dings R ob otics: Scienc e and Systems . MIT Press, 2012. J. Rubin, O. 
Shamir, and N. Tishb y . T rading v alue and information in MDPs. In De cision making with imp erfe ct de cision makers . Springer, 2012. A. Rubinstein. Mo deling Bounde d R ationality . MIT Press, Cam bridge, MA, 1998. S. J. Russell. Rationalit y and Intelligence. In Chris Mellish, editor, Pr o c e e dings of the F ourte enth International Joint Confer enc e on Artificial Intel ligenc e , pages 950–957, San F rancisco, 1995. Morgan Kaufmann. S. J. Russell and P . Norvig. A rtificial Intel ligenc e: A Mo dern Appr o ach . Pren tice-Hall, Englew o o d Cliffs, NJ, 3rd edition edition, 2010. S. J. Russell and D. Subramanian. Prov ably b ounded-optimal agen ts. Journal of Artificial Intel ligenc e R ese ar ch , 3:575–609, 1995. P . Samuelson. A Note on the Pure Theory of Consumers’ Behaviour. Ec onomic a , 5(17): 61–71, 1938. L. J. Sav age. The F oundations of Statistics . John Wiley and Sons, New Y ork, 1954. ISBN 0-486-62349-1. H. A. Simon. Rational choice and the structure of the environmen t. Psycholo gic al R eview , 63(2):129–38, 1956. H. A. Simon. Theories of Bounded Rationalit y. In C.B. Radner and R. Radner, editors, De cision and Or ganization , pages 161–176. North Holland Publ., Amsterdam, 1972. H. A. Simon. Mo dels of Bounde d R ationality . MIT Press, Cam bridge, MA, 1984. C. A. Sims. Implications of rational inattention. Journal of Monetary Ec onomics , 50(3): 665–690, April 2003. C. A. Sims. Rational inattention: A research agenda. In Pr o c e e dings of the Deutsche Bundesb ank . Deutsc he Bundesbank Research Cen ter, 2005. 45 Or tega et al. C. A. Sims. Rational inattention: Bey ond the linear-quadratic case. A meric an Ec onomic R eview , 96(2):158–163, 2006. C. A. Sims. Rational inatten tion and monetary economics. In Handb o ok of monetary e c onomics . Elsevier, 2011. R. F. Stengel. Optimal Contr ol and Estimation . Dov er Bo oks on Mathematics. Do ver Publications, 1994. S. Still. An information-theoretic approac h to in teractive learning. 
Eur ophysics L etters , 85: 28005, 2009. S. Still and D. Precup. An information-theoretic approac h to curiosity-driv en reinforcemen t learning. The ory in Bioscienc es , 131(3):139–148, 2012. R. S. Sutton and A. G. Barto. R einfor c ement L e arning: An Intr o duction . MIT Press, Cam bridge, MA, 1998. C. Szep esv´ ari. Algorithms for R einfor c ement L e arning . Synthesis Lectures on Artificial In telligence and Mac hine Learning. Morgan and Cla yp ool Publishers, 2010. T. T anak a, P . M. Esfahani, and S. K. Mitter. LQG Con trol with Minimal Information: Three-Stage Separation Principle and SDP-based Solution Synthesis. , 2015. E. Theo dorou, J. Buc hli, and S. Sc haal. A generalized path integral approach to reinforce- men t learning. Journal of Machine L e arning R ese ar ch , 11:3137–3181, 2010. E. A. Theo dorou. Nonlinear Sto c hastic Control and Information Theoretic Dualities: Con- nections, Interdependencies and Thermo dynamic Interpretations. Entr opy , 17(5):3352– 3375, 2015. N. Tish b y and D. Polani. Per c eption-A ction Cycle , chapter Information Theory of Decisions and Actions, pages 601–636. Springer New Y ork, 2011. N. Tishb y and N Zasla vsky . Deep learning and the information b ottlenec k principle. In ITW , pages 1–5, 2015. E. T o doro v. Linearly solv able Marko v decision problems. In A dvanc es in Neur al Information Pr o c essing Systems , v olume 19, pages 1369–1376, 2006. E. T o doro v. Efficien t computation of optimal actions. Pr o c e e dings of the National A c ademy of Scienc es U.S.A. , 106:11478–11483, 2009. B. v an den Broek, W. Wiegerinck, and H. J. Kapp en. Risk Sensitive Path In tegral Control. In UAI , pages 615–622, 2010. J. V eness, M. Ng, M. Hutter, W. Uther, and D. Silv er. A Mon te-Carlo AIXI Approximation. Journal of Artificial Intel ligenc e R ese ar ch , 40:95–142, 2011. 46 Informa tion-Theoretic Bounded Ra tionality J. V on Neumann and O. Morgenstern. The ory of Games and Ec onomic Behavior . 
Princeton Univ ersity Press, Princeton, 1944. ISBN 0691119937. D. H. W olp ert. Information theory - the bridge connecting bounded rational game theory and statistical physics. In D. Braha and Y. Bar-Y am, editors, Complex Engine ering Sys- tems , chapter Information theory - the bridge connecting bounded rational game theory and statistical physics. P erseus Bo oks, 2004. S. Zilb erstein. Metareasoning and Bounded Rationality. Aso ciation for the A dvanc ement of A rtificial Intel ligenc e , 2008. 47