Speeding Up Planning in Markov Decision Processes via Automatically Constructed Abstractions
In this paper, we consider planning in stochastic shortest path (SSP) problems, a subclass of Markov decision processes (MDPs). We focus on medium-size problems whose state space can be fully enumerated. This problem has numerous important applications, such as navigation and planning under uncertainty. We propose a new approach for constructing a multi-level hierarchy of progressively simpler abstractions of the original problem. Once computed, the hierarchy can be used to speed up planning by first finding a policy for the most abstract level and then recursively refining it into a solution to the original problem. This approach is fully automated and delivers a speed-up of two orders of magnitude over a state-of-the-art MDP solver on sample problems while returning near-optimal solutions. We also prove theoretical bounds on the loss of solution optimality resulting from the use of abstractions.
💡 Research Summary
The paper addresses planning in stochastic shortest‑path (SSP) problems, a subclass of Markov decision processes (MDPs) where the objective is to reach a designated goal state with minimum expected cumulative cost. While exact dynamic‑programming methods (e.g., value iteration, policy iteration) can solve SSPs when the state space is modest, they become infeasible for medium‑size domains that must be enumerated but still contain thousands of states, especially when planning must be performed repeatedly under tight time constraints (e.g., video‑game navigation, robotic manipulators).
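To make the cost of exact dynamic programming concrete, here is a minimal sketch of value iteration on a toy SSP. The chain dynamics, costs, and function names are illustrative assumptions, not taken from the paper; the point is only that every sweep touches every state, which is what becomes infeasible at scale.

```python
# Illustrative toy SSP, not the paper's benchmark: states 0..n-1 form a
# chain; the single action moves one step toward the goal with probability
# p_move, otherwise stays put; each step costs step_cost; the goal is
# absorbing and cost-free.
def value_iteration(n_states=5, goal=4, p_move=0.9, step_cost=1.0,
                    tol=1e-9, max_iters=100_000):
    V = [0.0] * n_states
    for _ in range(max_iters):
        delta = 0.0
        for s in range(n_states):
            if s == goal:
                continue  # V(goal) stays 0
            nxt = min(s + 1, goal)
            new_v = step_cost + p_move * V[nxt] + (1 - p_move) * V[s]
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:  # stop once no state's value changes appreciably
            break
    return V
```

Each sweep is O(|X| · |A| · |support of p|); with thousands of states and repeated replanning under time pressure, this direct computation is exactly what the abstraction hierarchy is meant to avoid.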
The authors propose a fully automated method for constructing a multi‑level hierarchy of abstractions. The core idea is to cluster concrete states into abstract “macro‑states” and to represent transitions between clusters by options—a concept from hierarchical reinforcement learning that bundles a low‑level policy, an initiation set, and a termination condition into a single high‑level action. Each option is designed to start from any state within its source cluster, follow a deterministic low‑level policy, and terminate when the agent first reaches any state in the target cluster. The expected cost and the probability distribution over target clusters are estimated by sampling; the abstraction is accepted only if these quantities are sufficiently uniform across all states in the source cluster.
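The option triple described above (initiation set, low-level policy, termination condition) can be sketched as a small data structure. The field names and the use of integer state IDs are assumptions for illustration, not the paper's notation:

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

# Hypothetical sketch of an option as used above: it may start from any
# state of its source cluster and terminates on first entry to the target
# cluster.
@dataclass(frozen=True)
class Option:
    initiation: FrozenSet[int]       # states of the source cluster
    policy: Callable[[int], int]     # deterministic low-level policy
    target: FrozenSet[int]           # states of the target cluster

    def can_start(self, state: int) -> bool:
        return state in self.initiation

    def terminates(self, state: int) -> bool:
        return state in self.target
```

At the abstract level, one such object plays the role of a single action, hiding however many concrete steps the low-level policy needs to cross between clusters.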
Formally, an option‑based abstraction is defined as a tuple (X̃, Ã, p̃, c̃) together with a mapping S from abstract states to concrete subsets of X and a mapping Ψ that assigns each abstract action a concrete option. The authors prove a novel bound (Theorem 1) on the difference between the value function of any proper concrete policy π and the value function of an abstract policy π̃ after extending the latter back to the concrete state space. The bound consists of three components: (i) loss due to state aggregation, measured by the deviation of concrete values within each cluster; (ii) loss due to action abstraction, captured by ε_{π,π̃}, which aggregates the maximum discrepancy in expected immediate costs and transition probabilities between the concrete policy and the abstract option; and (iii) a factor λ_π that quantifies how much extra time is incurred when, upon entering a new cluster, the concrete state is randomly perturbed according to the cluster's internal distribution. When the abstract policy is optimal for the abstract MDP, the bound translates directly into a guarantee on how close the induced concrete policy is to the true optimal concrete policy.
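The definitional part of this setup can be written out as follows; the second line is only a schematic shape showing how the three loss components combine, not the paper's exact statement of Theorem 1, and δ_S is placeholder notation for the within-cluster value deviation:

```latex
% Abstraction tuple and grounding maps (as defined in the summary above).
\tilde{M} = (\tilde{X}, \tilde{A}, \tilde{p}, \tilde{c}), \qquad
S : \tilde{X} \to 2^{X}, \qquad
\Psi : \tilde{A} \to \text{options}

% Schematic shape only -- the precise statement is Theorem 1 in the paper.
V^{\tilde{\pi}}(x) - V^{\pi}(x) \;\lesssim\;
\underbrace{\delta_S}_{\text{state aggregation}}
\;+\;
\underbrace{\lambda_{\pi}}_{\text{cluster-entry perturbation}}
\cdot
\underbrace{\varepsilon_{\pi,\tilde{\pi}}}_{\text{action abstraction}}
```

The practical reading is that each term can be driven down independently: finer clusters shrink δ_S, while more uniform options shrink ε_{π,π̃}.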
The construction algorithm proceeds in three stages. First, a graph of the concrete MDP is built where edge weights combine transition probabilities and immediate costs. A community‑detection / graph‑partitioning heuristic is applied to produce a partition of X into clusters that respect both transition structure and cost similarity. Second, for each ordered pair of neighboring clusters (C_i, C_j) the algorithm samples trajectories from random start states in C_i using the original dynamics, records the empirical distribution of exit states in C_j, and computes the average cost incurred before exit. If the empirical variance of costs and exit probabilities across start states is below a user‑specified threshold, an option (π_ij, I=C_i, ψ_ij) is instantiated; otherwise the pair is left without a direct abstract action. Third, a special “goal‑approach” option is learned for each cluster that lies adjacent to the goal, ensuring that once the abstract planner reaches a “goal‑approach region” the concrete agent can reliably finish at the goal with high probability.
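The sampling step of stage two might be sketched as follows. The `step` and `policy` callables are stand-ins for the concrete dynamics and the candidate low-level policy, and the per-start-state uniformity test that gates acceptance is omitted for brevity; none of these names come from the paper.

```python
import random
from collections import Counter

# Hypothetical sketch: estimate a candidate option's expected cost and its
# exit distribution over target-cluster states by Monte Carlo sampling.
# step(state, action) -> (next_state, cost); policy(state) -> action.
def estimate_option(start_states, target_cluster, policy, step,
                    n_samples=1000, seed=0):
    rng = random.Random(seed)
    costs, exits = [], Counter()
    for _ in range(n_samples):
        state = rng.choice(start_states)
        cost = 0.0
        while state not in target_cluster:   # run until first entry to C_j
            state, c = step(state, policy(state))
            cost += c
        costs.append(cost)
        exits[state] += 1                    # record the exit state in C_j
    total = sum(exits.values())
    exit_dist = {s: k / total for s, k in exits.items()}
    return sum(costs) / len(costs), exit_dist
```

In the algorithm described above, these empirical quantities would additionally be computed per start state, and the option is instantiated only if their variance across start states falls below the user-specified threshold.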
Planning then proceeds hierarchically. At the top level, a standard SSP solver (e.g., Dijkstra‑like value iteration) is run on the abstract MDP to obtain a policy that maps abstract states to abstract actions (options). The abstract policy yields a sequence of clusters from the start cluster to the goal‑approach cluster. The planner then recursively refines each abstract action: when the abstract policy selects option (π_ij), the low‑level policy π_ij is executed in the concrete MDP until termination, thereby moving the agent from some state in C_i to some state in C_j. This process continues down the hierarchy until the concrete level is reached, at which point the concrete actions are executed directly. Because each abstract transition abstracts away many concrete steps, the overall planning time is dramatically reduced.
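The refinement loop above can be sketched in a few lines. All names here are illustrative: options are represented as dicts with a low-level `policy` and a `target` set, and `step` stands in for the concrete dynamics.

```python
# Hypothetical sketch of hierarchical execution: the abstract policy maps
# each cluster to an option; the option's low-level policy runs in the
# concrete MDP until the agent first enters the option's target cluster,
# at which point the next option is selected.
def execute_hierarchically(state, cluster_of, abstract_policy, step,
                           goal_cluster, max_steps=10_000):
    trace = [state]
    steps = 0
    while cluster_of(state) != goal_cluster and steps < max_steps:
        option = abstract_policy[cluster_of(state)]
        # refine the abstract action: follow its low-level policy to C_j
        while state not in option["target"] and steps < max_steps:
            state = step(state, option["policy"](state))
            trace.append(state)
            steps += 1
    return trace
```

With a deeper hierarchy, the inner call would itself consult the next level's abstract policy rather than primitive actions, recursing until the concrete level is reached.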
Empirical evaluation on several benchmark domains (grid worlds of varying size, a robotic arm configuration space, and a commercial video‑game map) demonstrates speed‑ups of roughly two orders of magnitude compared with a state‑of‑the‑art SSP solver that operates directly on the concrete MDP. The sub‑optimality of the resulting policies is consistently below 5% of the optimal cost, indicating that the theoretical bound is informative in practice rather than a loose worst case. Sensitivity analyses show that tighter clustering (smaller clusters) reduces the λ_π factor but increases the number of abstract actions, while looser clustering reduces abstraction size but can increase ε_{π,π̃}. The authors discuss these trade‑offs and provide guidelines for selecting clustering thresholds based on available computation time and desired solution quality.
In the related‑work discussion, the paper positions its contribution relative to deterministic abstraction methods such as PR‑LRTS, hierarchical reinforcement‑learning approaches that use hand‑crafted options, and recent work on state aggregation with fixed action sets. The novelty lies in (1) automatically constructing options that serve as abstract actions, (2) providing a rigorous performance bound that explicitly accounts for both state and action abstraction, and (3) demonstrating that a multi‑level hierarchy can be built without any domain‑specific knowledge of the goal location.
In summary, the authors deliver a complete pipeline—from theoretical foundations to practical algorithms—that enables fast, near‑optimal planning for stochastic shortest‑path problems by automatically generating high‑quality option‑based abstractions and exploiting them in a hierarchical planning framework. This work opens avenues for real‑time decision making in stochastic domains where traditional exact solvers are too slow, and it offers a solid analytical basis for future extensions such as adaptive abstraction refinement or integration with learning‑based option discovery.