DOLCE: Decomposing Off-Policy Evaluation/Learning into Lagged and Current Effects


Off-policy evaluation and learning in contextual bandits use logged interaction data to estimate and optimize the value of a target policy. Most existing methods require sufficient action overlap between the logging and target policies, and violations can bias value and policy-gradient estimates. To address this issue, we propose DOLCE (Decomposing Off-policy evaluation/learning into Lagged and Current Effects), which uses only the lagged contexts already stored in bandit logs. DOLCE constructs lag-marginalized importance weights and decomposes the objective into a support-robust lagged correction term and a current, model-based term, yielding bias cancellation when the reward-model residual is conditionally mean-zero given the lagged context and action. With multiple candidate lags, DOLCE softly aggregates lag-specific estimates, and we introduce a moment-based training procedure that promotes the desired invariance using only logged lag-augmented data. We show that DOLCE is unbiased in an idealized setting and yields consistent and asymptotically normal estimates with cross-fitting under standard conditions. Our experiments demonstrate that DOLCE achieves substantial improvements in both off-policy evaluation and learning, particularly as the proportion of individuals who violate support increases.


💡 Research Summary

The paper introduces DOLCE (Decomposing Off‑policy evaluation/learning into Lagged and Current Effects), a novel framework for off‑policy evaluation (OPE) and off‑policy learning (OPL) that remains reliable when the standard overlap (support) assumption is violated. In contextual bandits, most estimators (IPS, DR, DM) require that every action assigned positive probability by the target policy also receives positive probability under the logging policy for each context. Real‑world systems often break this condition due to eligibility rules, safety constraints, or high‑dimensional sparsity, leading to biased and unstable estimates.
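To make the overlap problem concrete, here is a minimal sketch of the standard inverse propensity scoring (IPS) estimator that DR and other weighted estimators build on. The function names and the synthetic policies are illustrative, not from the paper; the point is that the weight is undefined (or explodes) whenever the logging policy assigns near-zero probability to an action the target policy takes.

```python
import numpy as np

def ips_estimate(rewards, actions, contexts, pi_target, pi_logging):
    """Vanilla IPS estimate of the target policy's value.

    pi_target, pi_logging: functions (action, context) -> probability.
    When pi_logging(a, x) ~ 0 for an action the target policy favors
    (an overlap/support violation), the weight diverges and the
    estimate becomes biased and unstable.
    """
    weights = np.array([pi_target(a, x) / pi_logging(a, x)
                        for a, x in zip(actions, contexts)])
    return float(np.mean(weights * np.asarray(rewards)))

# Toy example: uniform logging over two actions, deterministic target.
pi_logging = lambda a, x: 0.5
pi_target = lambda a, x: 1.0 if a == 1 else 0.0
actions = [1, 0, 1, 1]
rewards = [1, 0, 1, 0]
contexts = [None] * 4  # contexts unused in this toy setup
value = ips_estimate(rewards, actions, contexts, pi_target, pi_logging)
```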

Key Insight – Leveraging Lagged Contexts
Logs in many applications contain not only the current context \(X\) but also a set of past contexts \(X^{(1)},\dots,X^{(K)}\) observed at predefined time lags. DOLCE replaces the usual importance weight \(\pi_{\theta}(A\mid X)/\pi_{0}(A\mid X)\) with a lag-marginalized weight \(\pi_{\theta}(A\mid X^{(k)})/\pi_{0}(A\mid X^{(k)})\), where \(\pi(A\mid X^{(k)})=\mathbb{E}\left[\pi(A\mid X)\mid X^{(k)}\right]\) marginalizes the current context given the lagged one.
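The decomposition described in the abstract can be sketched as a doubly-robust-style estimator: a lagged correction term that weights reward-model residuals by the lag-marginalized ratio, plus a current, model-based term evaluated under the target policy. This is a hypothetical rendering under assumed inputs (lag-conditional policies and a reward model `q_hat` keyed on the lagged context), not the paper's implementation.

```python
import numpy as np

def dolce_style_estimate(rewards, actions, contexts, lag_contexts,
                         pi_target, pit_given_lag, pi0_given_lag,
                         q_hat, n_actions):
    """Illustrative DOLCE-style value estimate (hypothetical API).

    pit_given_lag, pi0_given_lag: lag-marginalized policies
        (action, lag_context) -> probability.
    q_hat: reward model (action, lag_context) -> predicted reward.
    The correction term is unbiased under support of the lagged
    weights; the model term's bias cancels when the residual
    r - q_hat is conditionally mean-zero given (lag_context, action).
    """
    rewards = np.asarray(rewards, dtype=float)
    # Support-robust lagged correction: weighted reward-model residuals.
    w = np.array([pit_given_lag(a, xl) / pi0_given_lag(a, xl)
                  for a, xl in zip(actions, lag_contexts)])
    residuals = rewards - np.array([q_hat(a, xl)
                                    for a, xl in zip(actions, lag_contexts)])
    # Current, model-based term: expected q_hat under the target policy.
    model_term = np.array([sum(pi_target(a, x) * q_hat(a, xl)
                               for a in range(n_actions))
                           for x, xl in zip(contexts, lag_contexts)])
    return float(np.mean(w * residuals + model_term))
```

With a perfect reward model the correction term vanishes and the estimate reduces to the model-based term, which is the mechanism behind the bias cancellation the abstract describes.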
