On the Sample Efficiency of Inverse Dynamics Models for Semi-Supervised Imitation Learning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Semi-supervised imitation learning (SSIL) consists of learning a policy from a small dataset of action-labeled trajectories and a much larger dataset of action-free trajectories. Some SSIL methods learn an inverse dynamics model (IDM) that predicts the action from the current state and the next state. An IDM can act as a policy when paired with a video model (VM-IDM), or as a label generator that enables behavior cloning on action-free data (IDM labeling). In this work, we first show that VM-IDM and IDM labeling learn the same policy in a limit case, which we call the IDM-based policy. We then argue that the previously observed advantage of IDM-based policies over behavior cloning stems from the superior sample efficiency of IDM learning, which we attribute to two causes: (i) the ground-truth IDM tends to lie in a lower-complexity hypothesis class than the expert policy, and (ii) the ground-truth IDM is often less stochastic than the expert policy. We support these claims with insights from statistical learning theory and novel experiments, including a study of IDM-based policies using recent architectures for unified video-action prediction (UVA). Motivated by these insights, we finally propose an improved version of the existing LAPO algorithm for latent-action policy learning.


💡 Research Summary

This paper investigates why inverse dynamics models (IDMs) provide a sample‑efficient advantage in semi‑supervised imitation learning (SSIL), where a small set of expert trajectories with action labels is complemented by a large pool of unlabeled state‑transition data. Two dominant IDM‑based approaches are examined: (1) VM‑IDM, which trains a video model (VM) to predict the next state and then samples an action from a learned IDM; and (2) IDM labeling, which uses the IDM to generate pseudo‑actions for the unlabeled transitions and then applies standard behavior cloning (BC). The authors first prove that, under the idealized conditions of infinite unlabeled data and sufficient model capacity, both methods converge to the same deterministic policy, termed the “IDM‑based policy.” This result follows from a KL‑divergence minimization argument that shows the VM learns the true transition distribution v*(s′|s) and the IDM learns the true conditional action distribution h*(a|s,s′). Consequently, the combined policy π̂_{v*,ĥ}(a|s)=∫ĥ(a|s,s′)v*(s′|s)ds′ is identical for both pipelines.
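The composition π̂(a|s)=∫ĥ(a|s,s′)v*(s′|s)ds′ can be made concrete in a tiny tabular setting. The sketch below is illustrative only (the states, actions, and probability tables are made up, not from the paper): it marginalizes a hypothetical IDM over a hypothetical video model's next-state distribution to recover the IDM-based policy.

```python
# Toy illustration of the IDM-based policy: pi(a|s) = sum_{s'} h(a|s,s') * v(s'|s).
# All tables below are hypothetical stand-ins for learned models.
from collections import defaultdict

# Hypothetical video model v(s'|s): distribution over next states.
v = {
    "s0": {"s1": 0.7, "s2": 0.3},
}

# Hypothetical IDM h(a|s,s'): distribution over actions per transition.
h = {
    ("s0", "s1"): {"left": 1.0},
    ("s0", "s2"): {"left": 0.2, "right": 0.8},
}

def idm_based_policy(s):
    """Marginalize the IDM over the video model's next-state distribution."""
    pi = defaultdict(float)
    for s_next, p_next in v[s].items():
        for a, p_a in h[(s, s_next)].items():
            pi[a] += p_a * p_next
    return dict(pi)

# pi("s0"): left = 1.0*0.7 + 0.2*0.3 = 0.76, right = 0.8*0.3 = 0.24
pi = idm_based_policy("s0")
```

Sampling-based VM-IDM (draw s′ from v, then a from h) targets this same marginal distribution, which is why the two pipelines coincide in the limit.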

The core contribution lies in explaining why learning h* is statistically easier than learning π*. Two complementary reasons are offered: (i) Hypothesis‑class complexity – the true IDM often resides in a lower‑complexity function class than the expert policy because it conditions on both current and next states, which can make the mapping more deterministic and structurally simpler. This yields lower variance and bias when fitting h*, whereas π* typically requires a richer class to capture multimodal expert behavior, leading to higher variance. (ii) Stochasticity – the expert policy π* may be highly stochastic (e.g., due to human variability), whereas the IDM h* is frequently near‑deterministic, especially in environments where the next state largely determines the action. The authors formalize these intuitions using statistical learning theory, showing that the expected KL divergence between h* and its estimate is smaller than that between π* and its BC estimate given the same number of labeled samples. They further prove that the KL error of the IDM‑based policy is bounded above by the IDM error, guaranteeing that any advantage in learning h* translates directly into a better policy.
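One way to see why IDM error bounds policy error is through the joint convexity of the KL divergence, assuming both the true and estimated policies marginalize over the same true transition distribution v* (a sketch consistent with the paper's limit-case setting, not its exact proof):

```latex
\pi^*(a \mid s) = \int h^*(a \mid s, s')\, v^*(s' \mid s)\, ds', \qquad
\hat{\pi}(a \mid s) = \int \hat{h}(a \mid s, s')\, v^*(s' \mid s)\, ds'.
% By joint convexity of the KL divergence in its two arguments:
D_{\mathrm{KL}}\!\big(\pi^*(\cdot \mid s) \,\|\, \hat{\pi}(\cdot \mid s)\big)
\;\le\; \mathbb{E}_{s' \sim v^*(\cdot \mid s)}\!\left[
D_{\mathrm{KL}}\!\big(h^*(\cdot \mid s, s') \,\|\, \hat{h}(\cdot \mid s, s')\big)\right].
```

Under this view, any reduction in the IDM's estimation error transfers directly to the induced policy, which is the guarantee the paragraph above describes.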

Empirically, the paper conducts extensive experiments on 16 ProcGen games, the Push‑T robotic manipulation suite, and the Libero benchmark. Results confirm that IDM‑based policies consistently outperform pure BC, especially as environment or goal complexity grows. A notable experiment replaces the VM with a state‑of‑the‑art unified video‑action prediction architecture (UVA), demonstrating that a more accurate v* further amplifies the IDM advantage. Additionally, the authors propose an enhanced version of the LAPO algorithm for latent‑action learning. By initializing the latent action decoder with a pretrained IDM and leveraging the sample‑efficient IDM training on unlabeled data, the modified LAPO achieves higher success rates across ProcGen tasks, with improvements ranging from 4% to 7% absolute.
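The IDM-labeling pipeline referenced above follows three steps: fit an IDM on the labeled transitions, pseudo-label the action-free transitions, then behavior-clone on the pseudo-labels. The sketch below illustrates this flow on toy discrete data; the datasets and the tabular "models" (majority-vote counting) are hypothetical simplifications, not the paper's neural implementation.

```python
# Minimal sketch of IDM labeling on toy discrete data (illustrative only).
from collections import Counter, defaultdict

# Small action-labeled dataset: (state, action, next_state) triples.
labeled = [("s0", "right", "s1"), ("s0", "right", "s1"), ("s1", "up", "s2")]

# Large action-free dataset: (state, next_state) pairs only.
unlabeled = [("s0", "s1"), ("s1", "s2"), ("s0", "s1")]

# Step 1: fit a tabular IDM h(a|s,s') by counting actions per transition.
idm_counts = defaultdict(Counter)
for s, a, s_next in labeled:
    idm_counts[(s, s_next)][a] += 1
idm = {k: c.most_common(1)[0][0] for k, c in idm_counts.items()}

# Step 2: pseudo-label the action-free transitions with the IDM.
pseudo_labeled = [(s, idm[(s, s_next)]) for s, s_next in unlabeled]

# Step 3: behavior cloning on the pseudo-labeled data (here, majority vote).
bc_counts = defaultdict(Counter)
for s, a in pseudo_labeled:
    bc_counts[s][a] += 1
policy = {s: c.most_common(1)[0][0] for s, c in bc_counts.items()}
```

The sample-efficiency argument says step 1 needs far fewer labeled transitions than directly cloning the expert, because the (state, next-state) → action mapping is simpler and closer to deterministic.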

The study also discusses limitations: the theoretical equivalence assumes infinite unlabeled data and perfect optimization, which may not hold in practice; environments with highly stochastic dynamics could diminish the complexity gap; and the reliance on accurate video models may be problematic when visual observations are noisy or occluded. Nonetheless, the work provides a rigorous framework for understanding when and why IDM‑based SSIL is preferable, offering concrete guidelines for practitioners seeking to reduce labeling costs in robotics, video game agents, and other sequential decision‑making domains.

