A Geometric Traversal Algorithm for Reward-Uncertain MDPs
Markov decision processes (MDPs) are widely used to model decision-making problems in stochastic environments. However, precisely specifying the reward function of an MDP is often very difficult. Recent approaches have therefore computed policies under the minimax-regret criterion to obtain robustness to uncertainty in the reward function. One of the core tasks in computing the minimax-regret policy is obtaining the set of all policies that can be optimal for some candidate reward function. In this paper, we propose an efficient algorithm that exploits the geometric properties of the reward functions associated with these policies. We also present an approximate version of the method for further speed-up. We experimentally demonstrate that our algorithm improves performance by orders of magnitude.
💡 Research Summary
The paper tackles the computational bottleneck in solving reward‑uncertain Markov decision processes (RUMDPs) under the minimax‑regret criterion. In a RUMDP the reward function is not a single vector but belongs to a convex polytope R defined by linear constraints. The minimax‑regret policy requires enumerating all nondominated (i.e., potentially optimal) deterministic policies Γ, because the worst‑case regret can be expressed as a linear program whose size grows with |Γ|. Existing exact methods, most notably the π‑Witness algorithm, generate Γ by repeatedly solving linear programs that contain a number of constraints proportional to the current size of Γ. As Γ can be exponential in the number of state‑action pairs, the overall runtime becomes prohibitive.
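The regret LP mentioned above can be made concrete. A minimal sketch, assuming policies are represented by their occupancy-frequency vectors x (so a policy's value at reward r is x·r and pairwise regret is linear in r); the function name and interface are our own illustration, not the paper's API:

```python
import numpy as np
from scipy.optimize import linprog

def max_regret(x_f, Gamma_x, A_poly, b_poly):
    """Worst-case regret of policy f over the reward polytope {r : A r <= b}.

    x_f and each x_g in Gamma_x are occupancy-frequency vectors, so the
    value of a policy at reward r is x . r and the regret of f against g
    is (x_g - x_f) . r -- a linear objective, hence one LP per g."""
    best = 0.0
    for x_g in Gamma_x:
        c = -(x_g - x_f)                       # linprog minimizes, so negate
        res = linprog(c, A_ub=A_poly, b_ub=b_poly, method="highs")
        if res.success:
            best = max(best, -res.fun)
    return best
```

Note that the loop runs once per policy in Γ, which is exactly why enumerating Γ efficiently matters.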
The authors observe that a deterministic policy π remains optimal for a whole region of reward vectors. By applying the optimality condition Vπ ≥ Qπ,a and rewriting it in terms of the reward vector, they derive a set of linear inequalities (hyperplanes) that define the “reward region” of π. Each hyperplane separates the reward polytope into two half‑spaces; crossing a hyperplane leads to a neighboring region that corresponds to a different nondominated policy. Consequently, the collection of reward regions forms a convex partition of R, which can be represented as an undirected graph whose nodes are policies and edges connect adjacent regions.
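Because V^π and Q^{π,a} are both linear in the reward vector, the hyperplane normals are straightforward to compute. A minimal numpy sketch under assumed conventions (rewards indexed by s·|A| + a; the function name is ours):

```python
import numpy as np

def reward_region_hyperplanes(P, pi, gamma):
    """Hyperplane normals c_{s,a} with V^pi(s) - Q^{pi,a}(s) = c_{s,a} . r,
    where r is the reward vector indexed by s*|A| + a.

    P:  (A, S, S) transition tensor, P[a, s] = distribution over next states
    pi: (S,) deterministic policy (action index per state)"""
    A, S, _ = P.shape
    P_pi = P[pi, np.arange(S)]                   # (S, S) transitions under pi
    # E_pi selects r(s, pi(s)); then V^pi = (I - gamma P_pi)^{-1} E_pi r = M r
    E_pi = np.zeros((S, S * A))
    E_pi[np.arange(S), np.arange(S) * A + pi] = 1.0
    M = np.linalg.solve(np.eye(S) - gamma * P_pi, E_pi)
    # Q^{pi,a}(s) = r(s,a) + gamma P[a,s] . V^pi, also linear in r
    C = {}
    for s in range(S):
        for a in range(A):
            row = M[s].copy()                    # V^pi(s) as a function of r
            row -= gamma * P[a, s] @ M           # minus gamma * E[V^pi | s, a]
            row[s * A + a] -= 1.0                # minus r(s, a)
            C[(s, a)] = row
    return C
```

The policy π is optimal exactly for those r with C[(s,a)]·r ≥ 0 for all (s,a), which is the reward region described above.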
The proposed Geometric Traversal Algorithm exploits this structure:
- Start from an arbitrary reward r ∈ R and compute its optimal deterministic policy f (using any standard MDP solver).
- From f and r, construct the set H of hyperplanes that constitute the boundary of f’s reward region (findRewardOptRgn).
- For each hyperplane h∈H, solve a small linear program that forces the inequality of h to be reversed (by a tiny margin δ) while keeping all other hyperplanes unchanged. This yields a reward vector r′ located in the adjacent region.
- Compute the optimal policy f′ for r′. If f′ is new, add it to Γ and push (h, f′) onto a processing agenda.
- Repeat until the agenda is empty; at that point every reachable region has been visited and Γ is complete.
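The hyperplane-crossing step above, reversing one inequality by a margin δ while keeping the rest, is a small feasibility LP. A hedged sketch with scipy, assuming the reward polytope is given as {r : A_poly r ≤ b_poly}; the function name and signature are ours:

```python
import numpy as np
from scipy.optimize import linprog

def cross_hyperplane(H, h_idx, A_poly, b_poly, delta=1e-4):
    """Find a reward r' just across hyperplane H[h_idx], staying inside the
    reward polytope {r : A_poly r <= b_poly} and on the >= 0 side of every
    other hyperplane in H. Returns None if no such neighbor exists."""
    H = np.asarray(H, dtype=float)
    others = np.delete(H, h_idx, axis=0)
    A_ub = np.vstack([H[h_idx][None, :],   # flip the chosen inequality: c . r <= -delta
                      -others,             # keep the rest on the >= 0 side
                      A_poly])             # stay inside the reward polytope
    b_ub = np.concatenate([[-delta], np.zeros(len(others)), b_poly])
    # Pure feasibility problem: zero objective, free variables
    res = linprog(np.zeros(H.shape[1]), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * H.shape[1], method="highs")
    return res.x if res.success else None
```

Each call involves only the current policy's hyperplanes plus the polytope constraints, which is why the LP size stays independent of |Γ|.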
Because each policy contributes at most |S|·|A| hyperplanes, the total number of LP solves is O(|Γ|·|S|·|A|). Each LP has dim(R) variables and at most |S|·|A| constraints, independent of |Γ|. Hence the overall complexity is polynomial in the MDP size and linear in the number of nondominated policies, a dramatic improvement over π‑Witness, whose runtime scales quadratically with |Γ|.
To further accelerate computation for large‑scale problems, the authors introduce an Approximate Geometric Traversal. Instead of exploring every hyperplane, the algorithm selects a random line through the current reward point, computes the two intersection points of this line with the boundary polytope, and only follows those two directions. This yields a subset of Γ that is often sufficient for a good minimax‑regret solution while reducing the number of LP calls by an order of magnitude.
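The random-line step reduces to a standard ray–polytope intersection. A minimal sketch (assuming the current reward r is strictly interior to a bounded polytope {x : A x ≤ b}; the helper name is ours):

```python
import numpy as np

def line_polytope_endpoints(r, A, b, rng=None):
    """Pick a random direction through r (assumed interior to the bounded
    polytope {x : A x <= b}) and return the two points where the line
    meets the polytope boundary."""
    rng = np.random.default_rng(rng)
    d = rng.standard_normal(len(r))
    d /= np.linalg.norm(d)
    Ad = A @ d
    slack = b - A @ r                    # all >= 0 since r is feasible
    # Along r + t d, constraint i binds at t = slack[i] / Ad[i]
    t_hi = np.min(slack[Ad > 1e-12] / Ad[Ad > 1e-12])
    t_lo = np.max(slack[Ad < -1e-12] / Ad[Ad < -1e-12])
    return r + t_lo * d, r + t_hi * d
```

The two endpoints identify which of the current region's hyperplanes the line exits through, so only those two neighbors are explored.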
Empirical evaluation on synthetic MDPs (states ranging from 10 to 1000, actions 2–5) and various reward‑uncertainty specifications demonstrates that the exact geometric traversal is 10–100× faster than π‑Witness, and the approximate version is 5–20× faster than the exact method. Despite exploring only a fraction of Γ, the approximate algorithm's regret values differ from the exact optimum by less than 2%, confirming its practical usefulness.
In summary, the paper contributes:
- A novel geometric characterization of policy optimality in reward space.
- An exact enumeration algorithm for nondominated policies whose runtime is linear in |Γ|.
- An anytime/approximate variant that offers substantial speed‑ups with negligible loss of solution quality.
- Extensive experiments validating orders‑of‑magnitude performance gains.
The work opens avenues for further research on high‑dimensional reward spaces (e.g., dimensionality reduction, sampling of hyperplanes) and integration with online learning settings where the reward polytope evolves over time.