Choice-Model-Assisted Q-learning for Delayed-Feedback Revenue Management
We study reinforcement learning for revenue management with delayed feedback, where a substantial fraction of value is determined by customer cancellations and modifications observed days after booking. We propose \emph{choice-model-assisted RL}: a calibrated discrete choice model is used as a fixed partial world model to impute the delayed component of the learning target at decision time. In the fixed-model deployment regime, we prove that tabular Q-learning with model-imputed targets converges to an $O(\varepsilon/(1-\gamma))$ neighborhood of the optimal Q-function, where $\varepsilon$ summarizes partial-model error, with an additional $O(t^{-1/2})$ sampling term. Experiments in a simulator calibrated from 61{,}619 hotel bookings (1{,}088 independent runs) show: (i) no statistically detectable difference from a maturity-buffer DQN baseline in stationary settings; (ii) positive effects under in-family parameter shifts, with significant gains in 5 of 10 shift scenarios after Holm–Bonferroni correction (up to 12.4%); and (iii) consistent degradation under structural misspecification, where the choice model assumptions are violated (1.4–2.6% lower revenue). These results characterize when partial behavioral models improve robustness under shift and when they introduce harmful bias.
💡 Research Summary
This paper tackles a practical challenge in hotel revenue management: a substantial portion of revenue (20–40%) is determined by cancellations and modifications that occur days after the initial booking. Traditional reinforcement‑learning (RL) approaches suffer from a severe credit‑assignment problem because the delayed component of the reward is unavailable at decision time, forcing the algorithm to wait for the outcome before updating its value estimates. The authors propose “choice‑model‑assisted RL” (CA‑RL), which leverages a discrete choice model (DCM) calibrated on historical booking data to predict the distribution of delayed outcomes from observable order features (customer profile, room type, competitor prices, etc.). By treating the DCM as a fixed partial world model, the expected delayed reward can be imputed instantly, allowing immediate Q‑learning updates without waiting for the actual shocks to materialize.
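The core mechanism can be sketched as follows. This is a minimal tabular illustration; the toy choice model, reward values, and two-state environment below are invented for exposition and are not the paper's actual DCM or simulator:

```python
import random

GAMMA = 0.9   # discount factor (illustrative)
ALPHA = 0.1   # learning rate (illustrative)

def expected_delayed_reward(state, action):
    """Stand-in for the calibrated discrete choice model (DCM): returns the
    expected delayed reward component (e.g., cancellation-adjusted revenue)
    for a booking decision. A real DCM would map rich order features to this
    expectation; here a higher action index raises cancellation risk."""
    cancel_prob = min(0.9, 0.1 * action)
    return -cancel_prob * 10.0  # expected revenue lost to cancellation

def ca_q_update(Q, state, action, immediate_reward, next_state, actions):
    """One choice-model-assisted update: the delayed part of the target is
    imputed by the fixed DCM at decision time, so the Q-value can be updated
    immediately instead of waiting for the outcome to mature."""
    imputed = immediate_reward + expected_delayed_reward(state, action)
    target = imputed + GAMMA * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])
    return Q

# Tiny usage example on a two-state, two-action toy problem.
random.seed(0)
states, actions = [0, 1], [0, 1]
Q = {(s, a): 0.0 for s in states for a in actions}
for _ in range(100):
    s, a = random.choice(states), random.choice(actions)
    ca_q_update(Q, s, a, immediate_reward=10.0 * (a + 1),
                next_state=1 - s, actions=actions)
```

A maturity-buffer baseline would instead hold each transition until its delayed outcome is observed; the imputed target removes that waiting at the cost of inheriting the DCM's bias.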
The paper is organized around two complementary analysis regimes. In the theoretical regime, the DCM is assumed to be pre‑trained and held fixed during deployment. The authors define a model‑error term ε that captures the worst‑case bias between the true delayed reward distribution and the DCM’s prediction. They then prove that tabular Q‑learning with model‑imputed targets converges with high probability to a neighborhood of the optimal Q‑function:
$$\|Q_t - Q^*\|_\infty = O\!\left(\frac{\varepsilon}{1-\gamma}\right) + O\!\left(t^{-1/2}\right).$$
The first term is an irreducible bias floor proportional to the DCM error, while the second term is the usual stochastic‑approximation sampling error. This result extends classic Q‑learning convergence analysis to a setting where a parametric, possibly misspecified, model supplies part of the learning signal.
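To make the bias floor concrete, here is a tiny numerical illustration; the ε and γ values are assumed for exposition, not taken from the paper:

```python
# The bias floor eps / (1 - gamma) amplifies the DCM's model error by the
# effective horizon 1 / (1 - gamma); values below are illustrative only.
def bias_floor(eps: float, gamma: float) -> float:
    return eps / (1.0 - gamma)

for gamma in (0.5, 0.9, 0.99):
    print(f"gamma={gamma}: floor={bias_floor(0.05, gamma):.2f}")
```

With a fixed model error of 0.05, the floor grows by a factor of fifty as γ moves from 0.5 to 0.99, which is why even a small DCM bias matters in long-horizon settings.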
Empirically, the authors build a high‑fidelity simulator calibrated from 61,619 real hotel bookings. They conduct 1,088 independent training runs organized around three research questions: (Q1) does CA‑DQN match a maturity‑buffer DQN (MB‑DQN) when the DCM is correctly specified? (Q2) does CA‑DQN improve robustness under in‑family parameter shifts (e.g., demand scaling, competition intensity, seasonality)? (Q3) does CA‑DQN degrade under structural misspecification (e.g., quadratic price sensitivity, IIA violations, out‑of‑family customer heterogeneity)?
Results show that in a stationary environment (Q1) there is no statistically detectable performance difference between CA‑DQN and MB‑DQN, confirming that the model‑imputed targets do not harm learning when the DCM is accurate. Under in‑family shifts (Q2), CA‑DQN yields significant revenue gains in 5 of 10 scenarios after Holm‑Bonferroni correction, with improvements up to 12.4%—the DCM's immediate predictions enable the policy to adapt faster to changed demand patterns. Conversely, when the DCM's structural assumptions are violated (Q3), CA‑DQN consistently underperforms, losing 1.4–2.6% of revenue, illustrating that a biased partial model can inject harmful systematic error.
The paper also discusses practical deployment: a two‑phase workflow (offline DCM calibration → fixed‑model online Q‑learning) mirrors real‑world constraints where full customer journeys are needed for DCM training, and online systems require low‑latency decisions. The authors provide extensive supplementary material on DCM identifiability, Fisher information, doubly‑robust reward construction, and detailed proofs of convergence and extrapolation properties.
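The two-phase workflow can be sketched as below. This is a deliberately simplified illustration: the binary-logit cancellation model, the synthetic data, and the single lead-time feature are assumptions for exposition; the paper's DCM is considerably richer:

```python
import math
import random

# Phase 1 (offline): calibrate a binary-logit cancellation model on
# completed customer journeys via per-sample gradient ascent on the
# log-likelihood. (Synthetic one-feature data; illustrative only.)
def fit_logit(xs, ys, lr=0.1, epochs=200):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w += lr * (y - p) * x
            b += lr * (y - p)
    return w, b

# Synthetic history: cancellation probability rises with lead time x.
random.seed(0)
xs = [random.uniform(0.0, 1.0) for _ in range(500)]
ys = [1 if random.random() < 0.2 + 0.6 * x else 0 for x in xs]
w, b = fit_logit(xs, ys)

# Phase 2 (online): the fitted model is frozen and queried at decision
# time to impute the delayed component, here expected retained revenue.
def imputed_retained_revenue(price, lead_time, w=w, b=b):
    p_cancel = 1.0 / (1.0 + math.exp(-(w * lead_time + b)))
    return price * (1.0 - p_cancel)
```

Freezing the model after Phase 1 matches the paper's fixed-model deployment regime: the online loop only evaluates the DCM, keeping decision latency low.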
In summary, the work demonstrates that embedding a calibrated choice model as a partial world model can substantially mitigate the credit‑assignment problem in delayed‑feedback revenue management, delivering faster learning and improved robustness to parameter shifts. However, the benefits hinge on the correctness of the choice model’s structural assumptions; when these are violated, the induced bias outweighs the variance reduction, leading to degraded performance. The authors suggest future directions such as periodic DCM re‑training, handling multi‑stage delays, and extending the approach to deep RL and multi‑objective settings.