Consistent inverse optimal control for infinite time-horizon discounted nonlinear systems under noisy observations
Inverse optimal control (IOC) aims to estimate the underlying cost that governs the observed behavior of an expert system. In practical scenarios, however, the collected data is often corrupted by noise, which poses significant challenges for accurate cost-function recovery. In this work, we propose an IOC framework that effectively addresses the presence of observation noise. In particular, compared to our previous work \cite{wang2025consistent}, we consider the case of discrete-time, infinite-horizon, discounted MDPs whose transition kernel is only weak Feller. By leveraging the occupation-measure framework, we first establish necessary and sufficient optimality conditions for the expert policy and then construct an infinite-dimensional optimization problem based on these conditions. This problem is then approximated over polynomial subspaces to obtain a finite-dimensional, numerically solvable one, which relies on the moments of the state-action trajectory's occupation measure. More specifically, the moments are robustly estimated from the noisy observations by a combined misspecified Generalized Method of Moments (GMM) estimator derived from the observation model and the system dynamics. Consequently, the entire algorithm is based on convex optimization, which alleviates the issues that arise from local minima, and it is asymptotically and statistically consistent. Finally, the performance of the proposed method is illustrated through numerical examples.
💡 Research Summary
This paper addresses the challenging problem of recovering the underlying cost function that governs an expert’s behavior in discrete‑time, infinite‑horizon, discounted Markov Decision Processes (MDPs) when the observed demonstration data are corrupted by measurement noise. Building on the authors’ previous work, the new contribution lies in (i) relaxing the strong Feller assumption on the transition kernel to a weak Feller condition, thereby encompassing deterministic dynamics; (ii) explicitly modeling additive observation noise with a known non‑degenerate distribution; and (iii) developing a robust estimation pipeline that remains statistically consistent despite the noise.
The methodology proceeds in four main stages. First, the optimal control problem is reformulated using the occupation‑measure framework. The infinite‑horizon discounted cost minimization is expressed as an infinite‑dimensional linear program (LP) (equations (4)–(5)) together with its dual. Strong duality guarantees that the Karush‑Kuhn‑Tucker (KKT) conditions are necessary and sufficient for optimality. The cost is assumed to be linear in known feature functions φ(x,a) with unknown parameters θ, i.e., ℓ(x,a)=θᵀφ(x,a).
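In the summary's notation, the primal–dual pair underlying these KKT conditions can be sketched as follows (a schematic rendering of the idea, not necessarily the paper's exact equations (4)–(5); sign conventions may differ):

```latex
\begin{aligned}
\text{(P)}\quad &\min_{\mu \ge 0}\; \int \theta^{\top}\varphi(x,a)\,\mathrm{d}\mu(x,a)
  \quad\text{s.t.}\quad (P_x-\alpha Q)\mu=\mu_{\mathrm{init}},\\[3pt]
\text{(D)}\quad &\max_{V}\; \int V(x)\,\mathrm{d}\mu_{\mathrm{init}}(x)
  \quad\text{s.t.}\quad \theta^{\top}\varphi(x,a)
  + \alpha\int V(x')\,Q(\mathrm{d}x'\mid x,a) - V(x)\;\ge\;0
  \quad \forall\,(x,a).
\end{aligned}
```

Under strong duality, complementary slackness says the expert's occupation measure is supported exactly where the dual (Bellman-type) inequality is tight; these are the optimality conditions the rest of the pipeline enforces.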
Second, the authors introduce a misspecified Generalized Method of Moments (GMM) estimator to recover the moments of the true occupation measure from noisy observations. Each observation yₜ is modeled as the true state‑action pair (xₜ,aₜ) plus independent noise vₜ drawn from a known distribution ν with positive‑definite covariance Σᵥ. By stacking M i.i.d. trajectories of length N+1, the GMM constructs moment equations that simultaneously incorporate the observation model and the system dynamics (the linear constraints (Pₓ−αQ)µ=µ_init). The estimator is shown to be consistent: as M→∞ (with N sufficiently large), the estimated moments converge in probability to the true moments of the occupation measure.
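The following is not the paper's full GMM estimator, but it illustrates the key debiasing step it relies on: with additive zero-mean noise of known covariance, first moments of the observations pass through unchanged, while second moments must be corrected by Σᵥ. All names and numbers here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: true (state, action) samples z_t in R^2, observed as
# y_t = z_t + v_t, with v_t zero-mean and known covariance Sigma_v.
Sigma_v = np.array([[0.04, 0.0],
                    [0.0, 0.01]])
M = 200_000
z = rng.uniform(-1.0, 1.0, size=(M, 2))  # stand-in for the true samples
y = z + rng.multivariate_normal(np.zeros(2), Sigma_v, size=M)

# First moments: unbiased under zero-mean noise, so no correction needed.
m1_hat = y.mean(axis=0)

# Second moments: E[y y^T] = E[z z^T] + Sigma_v, so subtract Sigma_v.
m2_hat = (y.T @ y) / M - Sigma_v

# Compare against the (normally unobservable) noise-free estimate.
m2_true = (z.T @ z) / M
print(np.max(np.abs(m2_hat - m2_true)))
```

The same logic extends to higher-order moments (via the known moments of ν) and is what allows the estimated occupation-measure moments to remain consistent despite the noise.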
Third, to make the infinite‑dimensional LP tractable, the authors approximate the continuous function spaces with polynomial subspaces of degree d. The feature functions φ and the value function V are represented as polynomials, and the moments of the occupation measure become linear functionals of the polynomial coefficients. This yields a finite‑dimensional convex optimization problem—typically a semidefinite program (SDP)—in the variables θ and the polynomial coefficients of V. Because the problem is convex, any solution is globally optimal, eliminating the local‑minimum issues that plague many inverse reinforcement learning approaches.
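The quantities that enter this finite-dimensional program are moments of the discounted occupation measure evaluated on monomials, and these are linear functionals of the data. A minimal numpy sketch (the 1‑D dynamics, feedback gain, and discount factor are illustrative assumptions, not the paper's examples):

```python
import numpy as np

alpha = 0.9   # discount factor (assumed)
N = 2000      # trajectory length (assumed)

# Hypothetical 1-D expert trajectory: linear dynamics with feedback a = -0.5 x.
x = np.empty(N + 1)
a = np.empty(N + 1)
x[0] = 1.0
for t in range(N + 1):
    a[t] = -0.5 * x[t]
    if t < N:
        x[t + 1] = 0.8 * x[t] + a[t]

# Discounted occupation-measure moments of monomials x^i a^j up to degree d:
#   mu(x^i a^j) ~= (1 - alpha) * sum_t alpha^t x_t^i a_t^j
d = 2
w = (1 - alpha) * alpha ** np.arange(N + 1)
moments = {(i, j): float(w @ (x**i * a**j))
           for i in range(d + 1) for j in range(d + 1) if i + j <= d}

# The (0,0) moment is the total mass of the occupation measure, which is
# 1 - alpha**(N+1) and tends to 1 as N grows.
print(moments[(0, 0)])
```

In the actual method these moments are not computed from clean trajectories but recovered from noisy observations via the GMM estimator of the previous stage; the point here is only that each moment is a fixed linear functional, so the resulting program in θ and the coefficients of V stays convex.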
Fourth, the paper provides rigorous theoretical guarantees. It proves (a) the consistency of the GMM‑based moment estimator under the stated noise model; (b) that, for sufficiently large polynomial degree, the solution of the finite‑dimensional SDP converges to the solution of the original infinite‑dimensional LP; and (c) the overall algorithm yields a statistically consistent estimator of the true cost parameters θ. The regularity condition (Assumption 2.5) ensures that the expert’s policy is non‑trivial, avoiding degenerate cases where any cost would be optimal.
Experimental validation is performed on several benchmark systems: a second‑order nonlinear system, a third‑order nonlinear system, and both deterministic (only weak‑Feller) and stochastic transition kernels. The authors vary the noise level and compare their method against a baseline that minimizes the KKT violation directly on the noisy data. Results show that the proposed approach maintains low parameter-estimation error and accurately reproduces the expert's policy even under substantial observation noise, whereas the baseline deteriorates rapidly.
In summary, the paper delivers a comprehensive inverse optimal control framework that (1) works for general nonlinear discounted MDPs with weak‑Feller dynamics, (2) robustly handles additive observation noise via a misspecified GMM, (3) leverages occupation‑measure theory and polynomial approximation to obtain a globally optimal convex program, and (4) provides provable asymptotic and statistical consistency. These contributions significantly advance the state of the art in inverse reinforcement learning and have immediate applicability to robotics, autonomous systems, and any domain where expert demonstrations are noisy and the underlying dynamics are nonlinear.