Mirror Descent Policy Optimisation for Robust Constrained Markov Decision Processes

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Safety is an essential requirement for reinforcement learning systems. The newly emerging framework of robust constrained Markov decision processes allows learning policies that satisfy long-term constraints while providing guarantees under epistemic uncertainty. This paper presents mirror descent policy optimisation for robust constrained Markov decision processes, making use of policy gradient techniques to optimise both the policy (as a maximiser) and the transition kernel (as an adversarial minimiser) on the Lagrangian representing a constrained Markov decision process. Our proposed algorithm obtains an $\tilde{\mathcal{O}}\left(1/T^{1/3}\right)$ convergence rate in the sample-based robust constrained Markov decision process setting. The paper also contributes an algorithm for approximate gradient descent in the space of transition kernels, which is of independent interest for designing adversarial environments in general Markov decision processes. Experiments confirm the benefits of mirror descent policy optimisation in constrained and unconstrained optimisation, and significant improvements are observed in robustness tests when compared to baseline policy optimisation algorithms.


💡 Research Summary

The paper tackles the emerging problem of Robust Constrained Markov Decision Processes (RCMDPs), which combine the safety guarantees of constrained MDPs (CMDPs) with the robustness to model misspecification of robust MDPs (RMDPs). In an RCMDP the learner must find a policy that maximizes expected return while satisfying long‑term constraints for all transition kernels belonging to an uncertainty set. This creates a min–max Lagrangian problem: the policy $\pi$ is a maximizer, the transition kernel $p$ an adversarial minimizer, and the dual variables $\lambda$ enforce the constraints.
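To make the three roles concrete, here is a minimal tabular sketch of evaluating such a Lagrangian. The symbols are generic placeholders rather than the paper's exact formulation: $V_r^{\pi,p}$ is the expected return, $V_c^{\pi,p}$ the expected constraint cost, $d$ a constraint threshold, and the Lagrangian is taken as $L(\pi,p,\lambda)=V_r^{\pi,p}-\lambda(V_c^{\pi,p}-d)$; the uniform initial-state distribution is also an assumption.

```python
import numpy as np

def policy_value(P, R, pi, gamma=0.9):
    """Evaluate V^{pi,p} for a tabular MDP by solving the Bellman equations.

    P: transition kernel, shape (S, A, S); R: per-step rewards/costs, shape (S, A);
    pi: stochastic policy, shape (S, A). Returns the state-value vector V.
    """
    S = R.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    r_pi = np.einsum("sa,sa->s", pi, R)     # expected one-step reward under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def lagrangian(P, R, C, d, pi, lam, gamma=0.9):
    """L(pi, p, lambda) = V_r^{pi,p} - lambda * (V_c^{pi,p} - d),
    averaged over a uniform initial-state distribution (an assumption here).
    pi maximizes L, the adversary's choice of P minimizes it, and lam >= 0
    penalizes constraint violation (V_c > d)."""
    v_r = policy_value(P, R, pi, gamma).mean()
    v_c = policy_value(P, C, pi, gamma).mean()
    return v_r - lam * (v_c - d)
```

With `lam = 0` the Lagrangian reduces to the plain expected return; increasing `lam` trades return against constraint satisfaction, which is exactly the lever the dual variable controls.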

Algorithmic contribution
The authors propose a Mirror‑Descent Policy Optimisation (MDPO) algorithm for RCMDPs. The key ideas are:

  1. Robust Lagrangian formulation – The objective is written as the saddle-point problem
     $$\max_{\pi}\;\min_{p\in\mathcal{P}}\;\min_{\lambda\ge 0}\; L(\pi,p,\lambda) \;=\; V_{r}^{\pi,p}-\lambda\big(V_{c}^{\pi,p}-d\big),$$
     where $V_{r}^{\pi,p}$ and $V_{c}^{\pi,p}$ denote the expected return and expected constraint cost under policy $\pi$ and transition kernel $p$, $d$ is the constraint threshold, and $\mathcal{P}$ is the uncertainty set of transition kernels.
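With a KL-divergence Bregman term, a mirror-descent policy update reduces to an exponentiated-gradient step. The following tabular sketch is an illustration under assumptions, not the paper's exact algorithm: it presumes access to action-value estimates `Q` of the Lagrangian objective (which in the paper come from samples) and leaves the step size `eta` as a free parameter.

```python
import numpy as np

def mirror_descent_step(pi, Q, eta):
    """One KL mirror-descent policy update (exponentiated gradient):
        pi_new(a|s)  proportional to  pi(a|s) * exp(eta * Q(s, a)).

    pi: current policy, shape (S, A); Q: action-value estimates of the
    Lagrangian objective, shape (S, A); eta: step size.
    """
    logits = np.log(pi) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum(axis=1, keepdims=True)
```

In the RCMDP setting this policy step would be interleaved with an adversarial update of the transition kernel and a gradient step on the dual variable $\lambda$; those companion updates and the paper's step-size schedule are omitted here.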
