Learning Sequential Decisions from Multiple Sources via Group-Robust Markov Decision Processes
We often collect data from multiple sites (e.g., hospitals) that share common structure but also exhibit heterogeneity. This paper aims to learn robust sequential decision-making policies from such offline, multi-site datasets. To model cross-site uncertainty, we study distributionally robust MDPs with a group-linear structure: all sites share a common feature map, and both the transition kernels and expected reward functions are linear in these shared features. We introduce feature-wise (d-rectangular) uncertainty sets, which preserve tractable robust Bellman recursions while maintaining key cross-site structure. Building on this, we then develop an offline algorithm based on pessimistic value iteration that includes: (i) per-site ridge regression for Bellman targets, (ii) feature-wise worst-case (row-wise minimization) aggregation, and (iii) a data-dependent pessimism penalty computed from the diagonals of the inverse design matrices. We further propose a cluster-level extension that pools similar sites to improve sample efficiency, guided by prior knowledge of site similarity. Under a robust partial coverage assumption, we prove a suboptimality bound for the resulting policy. Overall, our framework addresses multi-site learning with heterogeneous data sources and provides a principled approach to robust planning without relying on strong state-action rectangularity assumptions.
💡 Research Summary
This paper tackles the problem of learning robust sequential decision‑making policies from offline datasets that are collected across multiple heterogeneous sites such as hospitals. While the sites share a common application context, they differ in data‑acquisition protocols, patient populations, and environmental conditions, leading to site‑specific transition dynamics and reward functions. To capture this cross‑site uncertainty without sacrificing computational tractability, the authors propose a group‑linear distributionally robust Markov decision process (DR‑MDP). All sites share a known feature map φ: S × A → ℝᵈ, and each site k has its own linear parameters θₖ (for rewards) and μₖ (for transition kernels) such that rₖ(s,a)=φ(s,a)ᵀθₖ and Pₖ(·|s,a)=∑ᵢ φᵢ(s,a) μₖ,ᵢ(·).
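The group-linear structure above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation; all array shapes and function names are our own, and the next-state space is taken to be a finite grid of size S so that each μₖ,ᵢ(·) can be stored as a row vector.

```python
import numpy as np

def site_reward(phi_sa, theta_k):
    """r_k(s,a) = phi(s,a)^T theta_k, with phi_sa of shape (d,) and theta_k of shape (d,)."""
    return phi_sa @ theta_k

def site_transition(phi_sa, mu_k):
    """P_k(.|s,a) = sum_i phi_i(s,a) mu_{k,i}(.), with mu_k of shape (d, S).

    Each row mu_k[i] plays the role of the measure mu_{k,i}(.) over next states.
    """
    return phi_sa @ mu_k
```

If the rows of μₖ are probability distributions and φ(s,a) lies in the simplex, the returned vector is itself a valid next-state distribution.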
The central technical contribution is the introduction of a feature‑wise (d‑rectangular) uncertainty set. For each feature dimension i at time step h, an adversary may choose a mixing weight vector α_{h,i} ∈ Δ^{K‑1} over the K sites, independently across features. This yields a set of plausible (P_h, r_h) pairs where both the reward and transition are convex combinations of the site‑specific parameters using the same α_{h,i}. Because the uncertainty is rectangular with respect to features, the worst‑case Bellman operator decomposes into a simple linear form:
B_h V(s,a) = φ(s,a)ᵀ w_h, where w_{h,i}=min_{k} {θ_{k,h,i}+⟨μ_{k,h,i}, V_{h+1}⟩}.
Thus, for any fixed value function V, the robust Bellman update preserves linearity in φ, enabling efficient dynamic programming. Moreover, the minimization over sites occurs per feature, which captures heterogeneity while avoiding the extreme conservatism of (s,a)‑ or state‑rectangular sets.
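The feature-wise worst-case update can be written directly from the formula above. The sketch below assumes a finite next-state space of size S (so each μ_{k,i} is a length-S vector) and known parameters; shapes and names are illustrative, not the paper's code.

```python
import numpy as np

def robust_bellman_weights(theta, mu, v_next):
    """Compute w_h for the d-rectangular worst case:
        w_{h,i} = min_k { theta[k,i] + <mu[k,i,:], v_next> }.

    theta  : (K, d)    per-site reward parameters at step h
    mu     : (K, d, S) per-site transition parameters (rows over next states)
    v_next : (S,)      value function at step h+1
    Returns w of shape (d,), minimized over sites independently per feature.
    """
    site_terms = theta + mu @ v_next   # (K, d): per-site, per-feature linear terms
    return site_terms.min(axis=0)      # feature-wise (d-rectangular) worst case

def robust_q(phi_sa, w):
    """B_h V(s,a) = phi(s,a)^T w_h for one state-action feature vector."""
    return phi_sa @ w
```

Note that the minimum is taken per feature coordinate, so the adversary may mix different sites across features; an (s,a)-rectangular set would instead minimize the full inner product jointly.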
Building on this formulation, the authors design an offline pessimistic value‑iteration algorithm. The algorithm proceeds backward from horizon H to 1 and, at each step, performs three key operations:
- Per-site ridge regression: Using the offline trajectories Dₖ from site k, they regress the Bellman target y_{k,τ,h}=r_{k,τ,h}+V̂_{h+1}(s′_{k,τ,h}) on the feature vectors φ(s_{k,τ,h}, a_{k,τ,h}). This yields estimates θ̂_{k,h} and μ̂_{k,h}.
- Feature-wise worst-case aggregation: For each feature i, they compute the worst-case site-specific contribution by taking the minimum over k of the estimated linear term plus a data-dependent pessimism penalty.
- Pessimism penalty from design matrices: Let Σ_{k,h}=∑_τ φ_{k,τ,h} φ_{k,τ,h}ᵀ be the design matrix for site k at step h. The diagonal entries of its inverse quantify the uncertainty in each feature direction given the observed data. The algorithm subtracts λ·diag(Σ_{k,h}^{-1})_i from the per-site linear term before taking the minimum, thereby penalizing feature–site pairs that are poorly covered in the data.
The resulting robust Q‑estimate is
Q̂_h(s,a) = φ(s,a)ᵀ ŵ_h, with ŵ_{h,i} = min_k { θ̂_{k,h,i} + ⟨μ̂_{k,h,i}, V̂_{h+1}⟩ − λ·diag(Σ_{k,h}^{-1})_i }.
The policy is then obtained greedily: π̂_h(s) = argmax_a φ(s,a)ᵀ ŵ_h.
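One backward step of this procedure can be sketched as follows. The sketch is our own simplification under stated assumptions: the ridge regression is run on the combined target y = r + V̂_{h+1}(s′), producing a single per-site weight vector rather than separate θ̂ and μ̂ estimates, and all names and shapes are illustrative.

```python
import numpy as np

def pessimistic_backup_step(features, rewards, next_values, lam, ridge=1.0):
    """One backward step of pessimistic value iteration (illustrative sketch).

    features    : list over sites of (N_k, d) arrays phi(s,a)
    rewards     : list over sites of (N_k,) immediate rewards
    next_values : list over sites of (N_k,) values V_{h+1}(s') at observed next states
    lam         : pessimism scale lambda
    ridge       : ridge regularization added to the design matrix
    Returns w_hat of shape (d,): feature-wise minimum of penalized per-site estimates.
    """
    d = features[0].shape[1]
    per_site = []
    for Phi, r, v in zip(features, rewards, next_values):
        y = r + v                                # Bellman regression targets
        Sigma = Phi.T @ Phi + ridge * np.eye(d)  # regularized design matrix
        Sigma_inv = np.linalg.inv(Sigma)
        w_k = Sigma_inv @ (Phi.T @ y)            # per-site ridge estimate
        penalty = lam * np.diag(Sigma_inv)       # data-dependent pessimism term
        per_site.append(w_k - penalty)           # subtract: lower-confidence bound
    return np.min(np.stack(per_site), axis=0)    # feature-wise worst case over sites
```

The greedy policy then scores each candidate action a by φ(s,a)ᵀ ŵ_h and picks the maximizer.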
To improve sample efficiency when many sites have limited data, the authors propose a cluster‑level extension. Prior knowledge (e.g., geographic proximity, patient‑population similarity) or data‑driven similarity metrics are used to group sites into C clusters. Within each cluster, a single mixing weight vector α_{h,i} is shared across the constituent sites, effectively pooling their data for the ridge regressions and for the design‑matrix penalty. This reduces variance of the estimates while still respecting inter‑cluster heterogeneity.
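The pooling step itself is simple: within a cluster, the sites' trajectories are stacked into one dataset before the ridge regression and design-matrix penalty are computed. A minimal sketch, with cluster assignments supplied by the user (e.g., from prior knowledge of site similarity):

```python
import numpy as np

def pool_by_cluster(features, rewards, next_values, clusters):
    """Pool per-site offline data into cluster-level datasets.

    clusters : list of site-index lists, e.g. [[0, 2], [1]] for C = 2 clusters.
    Each cluster's sites then share one dataset for the ridge regression and
    the design-matrix pessimism penalty.
    Returns a list with one (Phi, r, v) triple per cluster.
    """
    pooled = []
    for members in clusters:
        Phi = np.vstack([features[k] for k in members])
        r = np.concatenate([rewards[k] for k in members])
        v = np.concatenate([next_values[k] for k in members])
        pooled.append((Phi, r, v))
    return pooled
```

Pooling grows each design matrix, shrinking diag(Σ^{-1}) and hence the pessimism penalty, which is the source of the variance reduction; the cost is that within-cluster heterogeneity is averaged away.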
Theoretical analysis is carried out under a robust partial coverage assumption: for every state–action pair (s,a), there exists at least one site that visits (s,a) sufficiently often (proportional to the total number of trajectories). Under this assumption, standard concentration results for ridge regression yield high‑probability bounds on the estimation error of θ and μ. The pessimism penalty is calibrated so that the resulting Q‑values are lower‑confidence bounds on the true robust Q‑function. Consequently, the authors prove a sub‑optimality bound for the learned policy: with probability at least 1−δ,
SubOpt(π̂; s) ≤ O( √( d·log(N/δ) / N ) ),
where N = Σₖ Nₖ is the total number of trajectories across all sites. This rate improves over prior multi‑source offline RL methods that depend on |S|·|A| or on the number of sites without exploiting the linear feature structure.
Empirical evaluation (simulated environments and a real multi‑hospital electronic health record dataset) demonstrates that:
- The proposed d‑rectangular robust method achieves higher cumulative reward than single‑site robust baselines and than non‑robust offline RL methods that ignore site heterogeneity.
- The cluster‑level variant substantially reduces the variance of the learned policy when many sites have few samples, confirming the theoretical sample‑efficiency gains.
- Feature‑wise pessimism leads to less conservative policies compared to state‑rectangular robust MDPs, while still providing safety guarantees under distributional shift.
In summary, the paper makes three intertwined contributions: (1) a novel feature‑wise rectangular uncertainty set that captures cross‑site heterogeneity yet retains tractable linear Bellman updates; (2) a data‑dependent pessimistic offline RL algorithm that integrates per‑site ridge regression, worst‑case feature aggregation, and design‑matrix based penalties; and (3) a cluster‑based extension that leverages similarity among sites to improve statistical efficiency. Theoretical guarantees and empirical results together suggest that the framework is a promising direction for robust decision‑making in settings where data are fragmented across multiple, heterogeneous sources.