In-Context Reinforcement Learning From Suboptimal Historical Data
Transformer models have achieved remarkable empirical successes, largely due to their in-context learning capabilities. Inspired by this, we explore training an autoregressive transformer for in-context reinforcement learning (ICRL). In this setting, we first train a transformer on an offline dataset of trajectories collected from various RL tasks, and then freeze this transformer and use it to construct an action policy for new RL tasks. Notably, we consider the setting where the offline dataset contains trajectories sampled from suboptimal behavior policies. In this case, standard autoregressive training corresponds to imitation learning and results in suboptimal performance. To address this, we propose the Decision Importance Transformer (DIT) framework, which emulates the actor-critic algorithm in an in-context manner. In particular, we first train a transformer-based value function that estimates the advantage functions of the behavior policies that collected the suboptimal trajectories. Then we train a transformer-based policy via a weighted maximum likelihood estimation loss, where the weights are constructed based on the trained value function to steer the suboptimal policies toward the optimal ones. We conduct extensive experiments to test the performance of DIT on both bandit and Markov Decision Process problems. Our results show that DIT achieves superior performance, particularly when the offline dataset contains suboptimal historical data.
💡 Research Summary
The paper introduces Decision Importance Transformer (DIT), a novel framework for in‑context reinforcement learning (ICRL) that can be pretrained solely on suboptimal historical trajectories and then deployed to solve unseen RL tasks without any further parameter updates. Traditional ICRL approaches such as Algorithm Distillation (AD) and Decision‑Pretrained Transformer (DPT) rely on either the full learning histories of RL algorithms or optimal action labels, which are often unavailable in real‑world settings where only noisy, suboptimal data from non‑expert users exist.
DIT tackles this limitation in two stages. First, it trains a transformer‑based advantage estimator that predicts the advantage function A_b(s,a) of the behavioral policy that generated each trajectory. Because the pretraining data comprise many tasks with unknown identifiers, the model leverages the self‑attention mechanism to interpolate across trajectories from different tasks, effectively performing in‑context estimation of advantages for each (state, action) pair.
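The quantity being estimated in this first stage, A_b(s, a) = Q_b(s, a) − V_b(s), can be illustrated with a minimal tabular Monte Carlo stand-in. This is only a sketch of the target quantity, not the paper's method: DIT estimates it with a transformer attending over in-context trajectories, whereas the version below simply averages logged returns per (state, action) pair.

```python
def empirical_advantages(transitions):
    """Tabular Monte Carlo estimate of A_b(s, a) = Q_b(s, a) - V_b(s)
    from logged (state, action, return) tuples.

    Illustrative stand-in for DIT's transformer-based advantage
    estimator; the transitions format here is an assumption.
    """
    q_sums, q_counts = {}, {}
    for s, a, g in transitions:
        q_sums[(s, a)] = q_sums.get((s, a), 0.0) + g
        q_counts[(s, a)] = q_counts.get((s, a), 0) + 1

    # Q_b(s, a): mean return observed after taking a in s
    q = {k: q_sums[k] / q_counts[k] for k in q_sums}

    # V_b(s): average of Q_b under the behavior policy's
    # empirical action frequencies (visit-count weighting)
    v_sums, v_counts = {}, {}
    for (s, a), n in q_counts.items():
        v_sums[s] = v_sums.get(s, 0.0) + q[(s, a)] * n
        v_counts[s] = v_counts.get(s, 0) + n
    v = {s: v_sums[s] / v_counts[s] for s in v_sums}

    return {(s, a): q[(s, a)] - v[s] for (s, a) in q}
```

Actions the behavior policy chose in above-average situations get positive advantage, which is exactly what the second stage uses as a training signal.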
Second, using the estimated advantages, DIT performs weighted maximum‑likelihood estimation (WMLE) to train a task‑conditioned policy π(a|s;τ). Each transition (s,a) receives a weight proportional to exp(A_b(s,a)/η), where η controls the trade‑off between aggressive improvement and staying close to the behavior policy. The authors show that this WMLE objective is equivalent to a KL‑regularized policy‑optimization problem that simultaneously maximizes expected advantage and penalizes divergence from the behavior policy. Consequently, actions with higher estimated advantage receive more influence during training, steering the learned policy toward optimal behavior even though the raw data are suboptimal.
The authors evaluate DIT on a suite of bandit problems and four challenging MDP benchmarks, including two sparse‑reward navigation tasks and two continuous‑control environments. In bandits, DIT matches the performance of the theoretically optimal Thompson Sampling algorithm. In the MDP experiments, DIT consistently outperforms baselines that simply imitate the suboptimal data and achieves performance comparable to or better than DPT, despite never seeing optimal action labels during pretraining. Both offline (using a fixed dataset) and online (collecting new trajectories with the learned policy) deployments are tested, demonstrating that DIT can improve over the behavior policies and generalize to new tasks.
The paper also discusses limitations: the quality of the advantage estimator directly impacts the weighting scheme, η must be tuned for each domain, and extremely high‑dimensional action spaces may cause weight explosion. Future work is suggested on more robust advantage estimation, adaptive weighting, and large‑scale validation on real industrial logs. Overall, DIT provides a practical pathway to leverage abundant suboptimal historical data for scalable, zero‑shot reinforcement learning across diverse tasks, potentially reducing the data collection burden and accelerating deployment of RL systems in real‑world applications.