Causal Imitation Learning Under Measurement Error and Distribution Shift
We study offline imitation learning (IL) when part of the decision-relevant state is observed only through noisy measurements and the distribution may change between training and deployment. Such settings induce spurious state-action correlations, so standard behavioral cloning (BC), whether it conditions on the raw measurements or ignores them, can converge to systematically biased policies under distribution shift. We propose a general framework for IL under measurement error that explicitly models the causal relationships among the variables, yielding a learning target that retains a causal interpretation and is robust to distribution shift. Building on ideas from proximal causal inference, we introduce \texttt{CausIL}, which treats noisy state observations as proxy variables, and we provide identification conditions under which the target policy is recoverable from demonstrations without rewards or interactive expert queries. We develop estimators for both discrete and continuous state spaces; for continuous settings, we use an adversarial procedure over RKHS function classes to learn the required parameters. We evaluate \texttt{CausIL} on semi-simulated longitudinal data from the PhysioNet/Computing in Cardiology Challenge 2019 cohort and demonstrate improved robustness to distribution shift compared to BC baselines.
💡 Research Summary
Imitation learning (IL) seeks to recover a decision policy from expert demonstrations without an explicit reward function. The most common approach, behavioral cloning (BC), simply treats the problem as supervised learning, fitting a policy that predicts the expert’s action from the observed state. This paper points out a critical failure mode of BC when (i) part of the decision‑relevant state is latent and only observed through noisy measurements, and (ii) the distribution of the latent state or the measurement process changes between the training (source) and deployment (target) environments. In such settings, the correlation between the observed measurements and the expert’s actions can be spurious; BC may latch onto these unstable predictors and produce systematically biased policies that do not improve even with more data or longer trajectories.
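The failure mode above can be made concrete with a minimal sketch (toy numbers and variable names are illustrative, not from the paper): a binary latent state \(u\) drives the expert's action, but BC only sees a proxy \(w\) equal to \(u\) flipped with some noise probability. The population BC policy \(P(a = 1 \mid w)\) then bakes in the source-domain noise level, so when the measurement noise changes at deployment, the learned conditional is systematically wrong no matter how much training data was used:

```python
# Toy illustration (hypothetical, not the paper's model): latent u ~ Bernoulli(p_u1),
# proxy w = u flipped with probability eps, expert action a = u.
def policy_from_proxy(eps, p_u1=0.5):
    """Population BC policy P(a=1 | w) when a = u and w is u flipped w.p. eps."""
    policy = {}
    for w in (0, 1):
        # Bayes' rule: P(u=1 | w) = P(w | u=1) P(u=1) / P(w)
        num = ((1 - eps) if w == 1 else eps) * p_u1
        den = num + ((eps) if w == 1 else (1 - eps)) * (1 - p_u1)
        policy[w] = num / den
    return policy

bc_src = policy_from_proxy(0.1)  # BC fit under source-domain measurement noise
bc_tgt = policy_from_proxy(0.4)  # the correct conditional under target-domain noise
print(bc_src[1], bc_tgt[1])      # 0.9 vs 0.6: the source-fit conditional is biased at deployment
```

The gap between the two conditionals is a population-level quantity: it does not shrink with more demonstrations or longer trajectories, which is exactly the sense in which BC's bias is systematic rather than statistical.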
To address this, the authors propose a causal formulation of IL that explicitly models the relationships among the latent state \(U_t\), the observed state \(S_t\), the noisy proxy measurement \(W_t\), and the expert action \(A_t\). The expert's policy is assumed to depend on the current observed state and the previous latent state: \(A_t \sim \pi_E(\cdot \mid S_t, U_{t-1})\). The proxy \(W_{t-1}\) carries information about \(U_{t-1}\) but is corrupted by measurement error that may differ across domains. The causal graph (Figure 1) encodes conditional independences such as \(S_{t-1} \perp\!\!\!\perp A_t \mid (S_t, U_{t-1})\) and \((S_{t-1}, S_t) \perp\!\!\!\perp W_{t-1} \mid U_{t-1}\).
The central object of interest is an interventional optimal imitation policy: the expert's action distribution under an intervention on the observed state. Under the graph above, where \(U_{t-1}\) confounds \(S_t\) and \(A_t\), this takes the backdoor-adjusted form
\[
\pi^{*}(a_t \mid s_t) \;=\; P\big(a_t \mid \mathrm{do}(S_t = s_t)\big) \;=\; \sum_{u_{t-1}} \pi_E(a_t \mid s_t, u_{t-1})\, P(u_{t-1}),
\]
which marginalizes over the latent state with its marginal distribution. This differs from the observational conditional \(P(a_t \mid s_t)\), which mixes in \(P(u_{t-1} \mid s_t)\) and is therefore sensitive to shifts in the latent-state distribution.
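In a small discrete setting, the interventional imitation policy can be computed exactly and compared against the observational conditional that BC on \((s, a)\) pairs would converge to. The numbers below are a hypothetical toy example (not from the paper), chosen so that the latent state correlates with the observed state and the two policies visibly differ:

```python
import numpy as np

# Toy discrete setting: 2 latent states u, 2 observed states s, 2 actions a.
p_u = np.array([0.5, 0.5])                 # P(U_{t-1})
p_s_given_u = np.array([[0.8, 0.2],        # P(S_t | U_{t-1}); rows index u, cols index s
                        [0.3, 0.7]])
pi_E = np.zeros((2, 2, 2))                 # pi_E[s, u, a] = P(A_t = a | S_t = s, U_{t-1} = u)
pi_E[:, 0, :] = [0.9, 0.1]                 # expert prefers a=0 when u=0
pi_E[:, 1, :] = [0.2, 0.8]                 # expert prefers a=1 when u=1

# Joint P(s, u) and conditional P(u | s)
p_su = p_s_given_u.T * p_u                 # p_su[s, u] = P(s | u) P(u)
p_u_given_s = p_su / p_su.sum(axis=1, keepdims=True)

# Interventional policy: marginalize U_{t-1} with its *marginal* distribution
pi_do = np.einsum('sua,u->sa', pi_E, p_u)
# Observational conditional: what BC on (s, a) pairs converges to
pi_obs = np.einsum('sua,su->sa', pi_E, p_u_given_s)

print(pi_do[0], pi_obs[0])                 # the two differ because U_{t-1} confounds S_t and A_t
```

Here `pi_do` is stable under shifts in \(P(U_{t-1})\)-to-\(S_t\) coupling, while `pi_obs` inherits the source-domain correlation between latent and observed state. The paper's contribution is recovering `pi_do` when \(U_{t-1}\) is never observed directly, using the proxy \(W_{t-1}\) in the spirit of proximal causal inference.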