Action-Free Offline-to-Online RL via Discretised State Policies
Most existing offline RL methods presume the availability of action labels within the dataset, but in many practical scenarios, actions may be missing due to privacy, storage, or sensor limitations. We formalise the setting of action-free offline-to-online RL, where agents must learn from datasets consisting solely of $(s,r,s')$ tuples and later leverage this knowledge during online interaction. To address this challenge, we propose learning state policies that recommend desirable next-state transitions rather than actions. Our contributions are twofold. First, we introduce a simple yet novel state discretisation transformation and propose Offline State-Only DecQN (OSO-DecQN), a value-based algorithm designed to pre-train state policies from action-free data. OSO-DecQN integrates the transformation to scale efficiently to high-dimensional problems while avoiding instability and overfitting associated with continuous state prediction. Second, we propose a novel mechanism for guided online learning that leverages these pre-trained state policies to accelerate the learning of online agents. Together, these components establish a scalable and practical framework for leveraging action-free datasets to accelerate online RL. Empirical results across diverse benchmarks demonstrate that our approach improves convergence speed and asymptotic performance, while analyses reveal that discretisation and regularisation are critical to its effectiveness.
💡 Research Summary
The paper introduces a novel framework for “action‑free offline‑to‑online reinforcement learning,” addressing the realistic scenario where offline datasets contain only state‑reward‑next‑state triples (s, r, s′) and no action labels. This situation arises in domains such as healthcare, finance, and robotics due to privacy, storage constraints, or sensor failures. The authors propose to learn a state policy—a mapping from a current state to a desirable next‑state transition—rather than a conventional action policy.
Core Technical Contributions
State Discretisation Transformation
- Each state dimension is first z‑score normalised.
- The change Δs = s′ − s is then discretised into three symbols {‑1, 0, 1} based on a small threshold ε: decrease, stay, or increase.
- This yields a discrete target Δs that is scale‑invariant and eliminates the instability of continuous regression.
- Theoretical analysis (Theorem 1) shows that with k evenly spaced bins per dimension, the value loss between the original MDP and the discretised MDP is bounded by O(H √(M/k!)), where M is the state dimension and H is the range of mean increments. Thus, finer discretisation arbitrarily reduces approximation error.
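The normalise-then-threshold step above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: the function name, the default threshold ε = 0.1, and the use of per-dimension dataset statistics `mean`/`std` are assumptions.

```python
import numpy as np

def discretise_transition(s, s_next, mean, std, eps=0.1):
    """Map a continuous transition (s, s') to per-dimension symbols in {-1, 0, 1}.

    `mean` and `std` are per-dimension statistics of the offline dataset,
    used to z-score normalise states; `eps` is the threshold below which a
    change counts as "stay". (Names and defaults are illustrative.)
    """
    # z-score normalise both states, then take the per-dimension change
    delta = (s_next - mean) / std - (s - mean) / std
    symbols = np.zeros_like(delta, dtype=np.int64)
    symbols[delta > eps] = 1     # increase
    symbols[delta < -eps] = -1   # decrease
    return symbols               # 0 elsewhere: stay
```

Because the target is one of three symbols per dimension, learning reduces to a small classification-style problem per dimension rather than continuous regression.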
Offline State‑Only DecQN (OSO‑DecQN)
- Extends the Decoupled Q‑Network (DecQN) architecture, which decomposes Q‑values across dimensions, to operate on discretised state differences instead of actions.
- The Q‑function is defined as Q(s, Δs) and factorised as the average of per‑dimension utilities U_i(s, Δs_i).
- An ensemble of Q‑networks and double‑Q learning are employed for stability.
- A conservative regularisation term R_θ = E_{(s,Δs)∼D}[·] is added to the training objective to curb overfitting to the offline data.
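The decoupled factorisation described above can be illustrated with a small sketch. It assumes some network has already produced a per-dimension utility table `U` of shape (M, 3), one value per state dimension and per symbol in {-1, 0, 1}; the function names and data layout are hypothetical.

```python
import numpy as np

SYMBOLS = np.array([-1, 0, 1])  # the three discrete change symbols

def q_value(U, delta_symbols):
    """Q(s, Δs) as the average of per-dimension utilities U_i(s, Δs_i).

    `U` has shape (M, 3); `delta_symbols` holds one symbol per dimension.
    """
    cols = delta_symbols + 1  # map symbol {-1, 0, 1} to column {0, 1, 2}
    return U[np.arange(U.shape[0]), cols].mean()

def greedy_delta(U):
    """Greedy Δs: the argmax over the joint Δs space decomposes into
    independent per-dimension argmaxes, avoiding a 3^M enumeration."""
    return SYMBOLS[U.argmax(axis=1)]
```

The key practical benefit of the decomposition is visible in `greedy_delta`: maximisation is linear in the state dimension M instead of exponential in it.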