Value Under Ignorance in Universal Artificial Intelligence

Reading time: 5 minutes
...

📝 Original Info

  • Title: Value Under Ignorance in Universal Artificial Intelligence
  • ArXiv ID: 2512.17086
  • Date: 2025-12-18
  • Authors: Cole Wyeth, Marcus Hutter

📝 Abstract

We generalize the AIXI reinforcement learning agent to admit a wider class of utility functions. Assigning a utility to each possible interaction history forces us to confront the ambiguity that some hypotheses in the agent's belief distribution only predict a finite prefix of the history, which is sometimes interpreted as implying a chance of death equal to a quantity called the semimeasure loss. This death interpretation suggests one way to assign utilities to such history prefixes. We argue that it is as natural to view the belief distributions as imprecise probability distributions, with the semimeasure loss as total ignorance. This motivates us to consider the consequences of computing expected utilities with Choquet integrals from imprecise probability theory, including an investigation of their computability level. We recover the standard recursive value function as a special case. However, our most general expected utilities under the death interpretation cannot be characterized as such Choquet integrals.

💡 Deep Analysis

📄 Full Content

The AIXI reinforcement learning (RL) agent [Hut00] is a clean and nearly parameter-free description of general intelligence. However, because of its focus on the RL setting, it does not natively model arbitrary decision-theoretic agents, but only those that maximize an external reward signal. A generalization to other utility functions, provided in this paper, is interesting for decision theory and potentially important for AI alignment. We further show how our generalization of AIXI naturally leads to imprecise probability theory, while assigning utilities to events in an extended space under a certain associated probability distribution returns us to the domain of von Neumann-Morgenstern rationality [vNMR44].

Motivation. The AIXI policy decision-theoretically maximizes total expected returns (discounted reward sum) over its lifetime with respect to the universal distribution, which has the potential to encode essentially arbitrary tasks. Arguably, AIXI’s drive to maximize expected returns would lead it to instrumentally seek power even at the expense of its creators, no matter what form of rewards they might choose to administer [CHO22]. Therefore, it is reasonable to seek a more general class of agents whose terminal goals are parameterized [Mil23]. Indeed, choosing the expected returns as the optimization target was motivated by the promise of reinforcement learning, while the primary (pre-)training method for frontier AI systems is now next-token prediction [HDN+24], though RL still plays an important role [BBE25]. From a general decision-theoretic standpoint, we should allow as wide a class of utility functions as possible (perhaps this is also preferable for modeling human cognition); and for AI alignment, we may desire a modular and user-specifiable utility function.

Our focus is on the history-based setting of universal artificial intelligence, where agents learn to pursue their goals by exchanging actions and percepts with the environment; but without including rewards in general, we move beyond the RL paradigm. Because we are interested in universal agents that can succeed across a vast array of environments, we cannot rely on common simplifying assumptions (such as the Markov property) and it is difficult to prove objective optimality or convergence guarantees. We cannot even rely on the usual additivity of probabilities, but are forced to work with “defective” semimeasures (Definition 5). In this work, we aim to develop the mathematical tools to rigorously extend AIXI to more general utility functions and investigate the properties (e.g. computability level) that carry over from the standard case.
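
To make the "defective" semimeasures concrete before they appear below, here is a minimal Python sketch. It is our own illustration with a made-up toy semimeasure, not a definition from the paper: conditional probabilities are allowed to sum to less than one, and the shortfall is the semimeasure loss that the paper interprets either as a chance of death or as total ignorance about how the history continues.

```python
# Toy illustration (not from the paper): a semimeasure over binary strings.
# A semimeasure nu satisfies nu("") <= 1 and nu(x) >= nu(x + "0") + nu(x + "1");
# the gap is the "semimeasure loss" at x.

def nu(x: str) -> float:
    """A hypothetical semimeasure: each extra symbol keeps only 45% of the mass,
    split evenly, so 10% of the conditional mass 'leaks' at every step."""
    return 0.45 ** len(x)

def semimeasure_loss(x: str) -> float:
    """Mass that nu assigns to x but to none of its one-symbol extensions."""
    return nu(x) - (nu(x + "0") + nu(x + "1"))

if __name__ == "__main__":
    for prefix in ["", "0", "01"]:
        print(prefix or "(empty)", nu(prefix), semimeasure_loss(prefix))
    # The loss nu(x) - sum_a nu(xa) is the probability mass on histories that
    # "end" at x: a chance of death, or (as the paper argues) total ignorance
    # about how the history continues.
```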

Main contribution. We introduce the basics of semimeasure theory intended to model filtrations with a chance of terminating at a finite time. In the context of history-based reinforcement learning [Hut00], we prove equivalence between the Choquet integral of the returns (with respect to the history distribution) and the recursive value function. This suggests a generalized version of AIXI which optimizes any (continuous) utility function. We prove the existence of an optimal policy under the resulting generalized value functions, and investigate their computability level, obtaining slightly better results for the Choquet integral than the ordinary expected utility. We use these results to analyze the consequences of treating semimeasure loss as a chance of death.
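
For readers unfamiliar with Choquet integrals, the following Python sketch shows the standard discrete (finite-outcome) form with respect to a capacity, i.e. a monotone but not necessarily additive set function. The outcome set, utilities, and capacities are toy assumptions of ours, not the paper's infinite-horizon construction; when the capacity is an ordinary (additive) probability measure, the Choquet integral reduces to the usual expected utility.

```python
# Minimal sketch (standard textbook form, our own toy data): the discrete
# Choquet integral of a utility function with respect to a capacity, i.e. a
# monotone set function v with v(empty set) = 0 that need not be additive.

def choquet(utilities: dict, capacity) -> float:
    """Choquet integral of `utilities` (outcome -> value, values >= 0) with
    respect to `capacity`, a function from frozensets of outcomes to [0, 1]."""
    items = sorted(utilities.items(), key=lambda kv: kv[1], reverse=True)
    total, upper, prev_cap = 0.0, frozenset(), 0.0
    for outcome, value in items:
        upper = upper | {outcome}           # upper level set {u >= value}
        cap = capacity(upper)
        total += value * (cap - prev_cap)   # weight by the marginal capacity
        prev_cap = cap
    return total

probs = {"a": 0.5, "b": 0.3, "c": 0.2}
utils = {"a": 1.0, "b": 4.0, "c": 2.0}

def additive(S):       # an ordinary probability measure
    return sum(probs[o] for o in S)

def subadditive(S):    # 10% of the mass is "missing" (total ignorance)
    return additive(S) * 0.9 if len(S) < len(probs) else 1.0

print(choquet(utils, additive))                  # 2.1, the ordinary expectation
print(sum(probs[o] * utils[o] for o in utils))   # 2.1
print(choquet(utils, subadditive))               # 1.99: the missing mass acts
                                                 # as if assigned to the worst outcome
```

With the sub-additive capacity, the integral treats the missing mass pessimistically, one simple illustration of evaluating value under ignorance.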

AIXI. Our exposition of AIXI is sufficient to understand our results but does not go into detail, since AIXI is the standard approach to general reinforcement learning and many good introductions are available in the literature [Hut00,Hut05].
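
As a point of reference for the recursive value function mentioned above, here is a hedged finite-horizon sketch of an AIXI-style expectimax value recursion in Python. The toy environment model, reward encoding, and horizon are our own illustrative choices, with a plain conditional distribution standing in for the universal mixture.

```python
# Hedged sketch, not the paper's definitions: a finite-horizon expectimax
# value recursion in the spirit of AIXI's standard value function,
#   V(h) = max_a  sum_e  nu(e | h, a) * [ r(e) + gamma * V(h a e) ],
# with a plain conditional probability model standing in for the universal
# mixture, and reward read directly off the percept.

ACTIONS = ["0", "1"]
PERCEPTS = ["g", "b"]          # "g" carries reward 1, "b" carries reward 0
GAMMA = 0.9

def env_prob(percept: str, history: str, action: str) -> float:
    """Toy environment model (an assumption for illustration): action "1"
    makes the good percept slightly more likely."""
    p_good = 0.7 if action == "1" else 0.4
    return p_good if percept == "g" else 1.0 - p_good

def reward(percept: str) -> float:
    return 1.0 if percept == "g" else 0.0

def value(history: str, horizon: int) -> float:
    """Expectimax value of `history` with `horizon` remaining steps."""
    if horizon == 0:
        return 0.0
    best = float("-inf")
    for a in ACTIONS:
        q = sum(
            env_prob(e, history, a) * (reward(e) + GAMMA * value(history + a + e, horizon - 1))
            for e in PERCEPTS
        )
        best = max(best, q)
    return best

print(value("", horizon=3))   # a small horizon keeps the expectimax tree tractable
```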

Utility functions for AIXI. There have been several suggestions for alternative utility functions in AIXI-like models [Men12,Hib12,Ors14]. Orseau’s (square/Shannon) knowledge-seeking agents (KSA) are a particularly interesting example, motivated to explore by an intrinsic desire for surprise. His general setting of universal agents Aρ hints at (but does not rigorously develop) a wide class of utility functions. Our work subsumes all of these examples. Computability aspects of (slight) variations on AIXI’s value function have been investigated by Leike et al. [LH18]. We seem to be the first to rigorously formulate a general class of utility functions and show how to define optimal policies in terms of the mathematical expectation of utility in the history-based RL framework.
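
As one concrete example of such an alternative utility over histories, the sketch below is in the spirit of a Shannon-style knowledge-seeking agent: the utility of a history is the surprise (negative log-probability) the agent's model assigned to the percepts it observed. The predictive model and history encoding are our own toy assumptions, not Orseau's definitions.

```python
import math

# Hedged sketch in the spirit of a Shannon-style knowledge-seeking agent:
# the utility of a history is the model's surprise at the observed percepts,
#   u(h) = -log2 rho(e_1 ... e_n | a_1 ... a_n).
# `rho` below is a toy stand-in for the agent's belief distribution.

def rho(percept: str, history: list, action: str) -> float:
    """Toy predictive model (our assumption): percept "1" has probability 0.8
    after action "a"; otherwise the two percepts are equally likely."""
    if action == "a":
        return 0.8 if percept == "1" else 0.2
    return 0.5

def surprise_utility(history: list) -> float:
    """history is a list of (action, percept) pairs; returns -log2 of the
    probability the model assigned to that percept sequence."""
    total = 0.0
    for i, (action, percept) in enumerate(history):
        total += -math.log2(rho(percept, history[:i], action))
    return total

# An unlikely percept sequence is more "interesting" than a likely one.
print(surprise_utility([("a", "1"), ("a", "1")]))   # low surprise, about 0.64
print(surprise_utility([("a", "0"), ("b", "0")]))   # higher surprise, about 3.32
```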

Relating utility functions and rewards. The classical RL literature contains extensive discussions of Sutton’s reward hypothesis [Sut04]: “That all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward).” In particular, see [BMAD23] for axioms on a preference ordering that allow it to be represented by a (discounted) reward sum. Typically, these results are based on Markov assumptions that do not apply to our setting.

The semime

Reference

This content is AI-processed based on open access ArXiv data.
