Markov Decision Processes of the Third Kind: Learning Distributions by Policy Gradient Descent
The goal of this paper is to analyze distributional Markov Decision Processes, a class of control problems in which the objective is to learn policies that steer the distribution of a cumulative reward toward a prescribed target law, rather than to optimize an expected value or a risk functional. To solve the resulting distributional control problem in a model-free setting, we propose a policy-gradient algorithm that combines neural-network parameterizations of randomized Markov policies, defined on an augmented state space, with a sample-based evaluation of a characteristic-function loss. Under mild regularity and growth assumptions, we prove convergence of the algorithm to stationary points using stochastic approximation techniques. Several numerical experiments illustrate the ability of the method to match complex target distributions, recover classical optimal policies when they exist, and reveal non-uniqueness phenomena intrinsic to distributional control.
💡 Research Summary
The paper introduces a novel class of control problems called “distributional Markov Decision Processes of the third kind,” where the objective is neither to maximize the expected cumulative reward nor a risk‑adjusted functional, but to steer the entire distribution of the cumulative reward toward a prescribed target law. To address this problem in a model‑free setting, the authors propose a policy‑gradient algorithm that parameterizes randomized policies with neural networks defined on an augmented state space that includes the cumulative reward (often called the “stock”). The loss function is defined as a weighted L² distance between the characteristic function of the induced return distribution and the characteristic function of the target distribution. This choice yields a differentiable objective, unlike Wasserstein distances, which are typically non‑smooth.
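The weighted characteristic‑function loss can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function name, frequency grid, and uniform quadrature weights are all illustrative choices, and the continuous weighted L² integral is approximated by a finite sum over frequencies.

```python
import numpy as np

def cf_loss(returns, target_cf, freqs, weights):
    """Weighted L2 distance between the empirical characteristic function
    of sampled returns and a target CF, evaluated on a finite frequency
    grid (a quadrature discretization of the integral loss)."""
    # Empirical CF: phi_hat(u) = (1/N) * sum_j exp(i * u * G_j)
    emp_cf = np.exp(1j * np.outer(freqs, returns)).mean(axis=1)
    return np.sum(weights * np.abs(emp_cf - target_cf(freqs)) ** 2)

# Sanity check: returns drawn from N(0, 1) against the N(0, 1) target,
# whose CF is exp(-u^2 / 2); the loss should then be close to zero.
rng = np.random.default_rng(0)
returns = rng.standard_normal(50_000)
freqs = np.linspace(-3.0, 3.0, 61)
weights = np.full_like(freqs, 6.0 / 61)  # uniform quadrature weights
loss = cf_loss(returns, lambda u: np.exp(-u**2 / 2), freqs, weights)
```

Because the empirical characteristic function is a smooth function of the sampled returns, this loss remains differentiable with respect to any parameters that shape those samples.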
The algorithm proceeds by sampling trajectories, computing Monte‑Carlo estimates of the characteristic function at a set of frequencies, and using the re‑parameterization trick to obtain unbiased gradient estimates with respect to the network parameters. A Robbins‑Monro stochastic approximation scheme with diminishing step sizes guarantees convergence under standard smoothness, bounded variance, and ergodicity assumptions. The authors prove that the expected loss converges to a stationary point and, under additional convexity conditions on the weighted characteristic‑function loss, almost‑sure convergence can be established.
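The interaction of these ingredients can be illustrated on a deliberately tiny instance. This is a hedged sketch, not the paper's algorithm: the "policy" is reduced to a single shift parameter `theta`, the return is `G = theta + eps` with Gaussian noise, and the target law, schedule constants, and batch size are illustrative assumptions. It still exercises the three pieces described above: reparameterized samples, a Monte‑Carlo characteristic‑function estimate, and a Robbins‑Monro step‑size schedule.

```python
import numpy as np

def rm_step_sizes(a0=0.5, kappa=0.7):
    """Robbins-Monro schedule a_n = a0 / (n + 1)^kappa; for
    0.5 < kappa <= 1 it satisfies sum a_n = inf and sum a_n^2 < inf."""
    n = 0
    while True:
        yield a0 / (n + 1) ** kappa
        n += 1

# Toy instance: the return is G = theta + eps with eps ~ N(0, 1); we steer
# its law toward the target N(2, 1) by stochastic gradient descent on the
# weighted characteristic-function loss over a finite frequency grid.
rng = np.random.default_rng(1)
freqs = np.linspace(-2.0, 2.0, 21)
weights = np.full_like(freqs, 4.0 / 21)        # uniform quadrature weights
target_cf = np.exp(2j * freqs - freqs**2 / 2)  # CF of N(2, 1)

theta, steps = 0.0, rm_step_sizes()
for _ in range(3000):
    eps = rng.standard_normal(512)
    g = theta + eps                            # reparameterized returns
    emp_cf = np.exp(1j * np.outer(freqs, g)).mean(axis=1)
    # Shifting every sample by theta multiplies the empirical CF by
    # exp(i * u * theta), so d/dtheta of the empirical CF is i * u * CF.
    d_cf = 1j * freqs * emp_cf
    grad = np.sum(weights * 2.0 * np.real(np.conj(emp_cf - target_cf) * d_cf))
    theta -= next(steps) * grad
```

Here `theta` drifts toward 2, the mean of the target law, despite the gradient being estimated from a finite batch at each step; the diminishing schedule averages out this noise, which is the role the Robbins‑Monro conditions play in the convergence analysis.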
A series of experiments illustrates the method’s capabilities. In a pedagogical torus example, the algorithm exactly matches target circular‑Gaussian or multimodal distributions by appropriately shaping the policy’s random component. In continuous control benchmarks (e.g., CartPole, a continuous‑time portfolio allocation problem), the method successfully drives the return distribution to a user‑specified shape that may differ substantially from the distribution induced by the expected‑value optimal policy, thereby highlighting the intrinsic non‑uniqueness of distributional control: multiple distinct policies can generate the same target distribution. The paper also discusses connections to distributional reinforcement learning (e.g., C51, QR‑DQN), optimal transport, and Schrödinger bridge formulations, emphasizing that the proposed approach operates on the distribution of a derived reward variable rather than on the state trajectory itself.
Overall, the contribution lies in (i) formalizing a distribution‑matching objective for MDPs, (ii) designing a practical, fully model‑free policy‑gradient method based on characteristic‑function losses, (iii) providing a convergence analysis rooted in stochastic approximation theory, and (iv) empirically demonstrating that complex target laws can be learned, that classical optimal policies are recovered when the target coincides with the optimal return distribution, and that the solution set may be non‑unique. Limitations include the need for careful frequency sampling, the lack of theoretical sample‑complexity bounds, and potential scalability issues in high‑dimensional state‑action spaces. Future work is suggested on adaptive weighting schemes, extensions to continuous‑time settings, multi‑agent scenarios, and applications in finance and robotics where matching a prescribed risk‑return profile is essential.