SACO: Sequence-Aware Constrained Optimization Framework for Coupon Distribution in E-commerce
Coupon distribution is a critical marketing strategy used by online platforms to boost revenue and enhance user engagement. Unfortunately, existing coupon distribution strategies fall far short of effectively leveraging the complex sequential interactions between platforms and users. This oversight, despite the abundance of e-commerce log data, has led to a performance plateau. In this paper, we focus on the scenario in which the platform makes sequential coupon-distribution decisions multiple times for various users, with each user interacting with the platform repeatedly. Based on this scenario, we propose a novel marketing framework, named the Sequence-Aware Constrained Optimization (SACO) framework, to directly devise a coupon distribution policy for long-term revenue boosting. SACO enables optimized online decision-making in a variety of real-world marketing scenarios. It achieves this by seamlessly integrating three key characteristics within a unified framework: support for general scenarios, sequential modeling over more comprehensive historical data, and efficient iterative updates. Furthermore, empirical results on a real-world industrial dataset, alongside public and synthetic datasets, demonstrate the superiority of our framework.
💡 Research Summary
The paper addresses the practical problem of coupon distribution in large‑scale e‑commerce platforms, where a marketing budget must be allocated across many users over multiple interaction rounds. Existing approaches either treat each user‑round as an isolated decision (single‑round bandit or two‑stage prediction + optimization) or rely on reinforcement learning that assumes unconstrained reward maximization. Both families ignore the temporal dependencies of user responses and the hard budget constraint, leading to a performance plateau.
To overcome these limitations, the authors propose SACO (Sequence‑Aware Constrained Optimization), a unified end‑to‑end framework that directly learns a policy for sequential coupon issuance under a global budget. The key ideas are:
- **Problem Reformulation** – The coupon allocation task is cast as a linear program with decision variables $x_{itk}$ (issue coupon $k$ to user $i$ at round $t$). Introducing a Lagrange multiplier $\lambda$ for the budget constraint transforms the objective into maximizing “reward − $\lambda$·cost”. Strong duality guarantees that solving the dual yields the optimal primal solution.
- **Trajectory-Based Dataset** – Historical logs are reorganized into per-user trajectories $\tau_i = \{(s_{it}, a_{it}, \mathrm{RTG}_{it}, \mathrm{CTG}_{it}, \lambda_i)\}_{t=0}^{T}$. Here $s_{it}$ encodes the user context, $a_{it}$ the coupon action, $\mathrm{RTG}_{it}$ and $\mathrm{CTG}_{it}$ are the cumulative discounted reward-to-go and cost-to-go, and $\lambda_i$ is randomly sampled (10 values per trajectory) to expose the model to a wide range of budget levels.
- **Decision-Transformer Architecture** – The authors adapt the causal Decision Transformer (Chen et al., 2021) to the constrained setting. Input embeddings combine state, action, RTG, CTG, time step, and the dual variable $\lambda$. These tokens pass through stacked transformer blocks, allowing the model to capture long-range causal effects of past coupons on future user behavior while simultaneously conditioning on the budget pressure.
- **Direct Policy Learning** – The model predicts the next coupon action $\hat{a}_t$ directly, eliminating the need for a separate uplift predictor and a downstream combinatorial optimizer. Because $\lambda$ is part of the input, the policy automatically adjusts its aggressiveness as the remaining budget shrinks.
- **Efficient Inference** – By leveraging model-parallel inference, SACO can issue coupons for many users in a single forward pass, meeting the real-time latency requirements of production systems.
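The trajectory construction and $\lambda$-conditioned objective described above can be sketched in a few lines of Python. This is an illustrative assumption of how the per-user logs might be reorganized, not the paper's actual implementation; the names (`Step`, `build_trajectory`, `augment_with_lambdas`), the greedy backward pass, and the sampling range for $\lambda$ are all hypothetical.

```python
# Hypothetical sketch of SACO-style trajectory construction.
# Names, the λ sampling range, and the example numbers are illustrative
# assumptions; only the RTG/CTG definitions follow the summary above.
from dataclasses import dataclass
from typing import List
import random

@dataclass
class Step:
    state: List[float]   # user context s_{it}
    action: int          # coupon index a_{it}
    rtg: float           # reward-to-go RTG_{it}
    ctg: float           # cost-to-go CTG_{it}
    lam: float           # sampled dual variable λ_i

def build_trajectory(states, actions, rewards, costs, lam, gamma=1.0):
    """Reorganize one user's log into a trajectory τ_i.

    RTG_t and CTG_t are the (discounted) sums of *future* rewards and
    costs, computed with a single backward pass over the log."""
    T = len(rewards)
    rtg, ctg = [0.0] * T, [0.0] * T
    run_r = run_c = 0.0
    for t in reversed(range(T)):
        run_r = rewards[t] + gamma * run_r
        run_c = costs[t] + gamma * run_c
        rtg[t], ctg[t] = run_r, run_c
    return [Step(states[t], actions[t], rtg[t], ctg[t], lam) for t in range(T)]

def augment_with_lambdas(states, actions, rewards, costs, n_samples=10, seed=0):
    """Replicate one trajectory under several randomly sampled λ values,
    exposing the model to many budget-pressure levels (10x in the paper)."""
    rng = random.Random(seed)
    return [build_trajectory(states, actions, rewards, costs, rng.uniform(0.0, 2.0))
            for _ in range(n_samples)]

# A toy 3-round user log: per-round revenue-like reward and coupon cost.
states = [[0.1], [0.4], [0.9]]
actions = [2, 0, 1]          # coupon indices (0 = no coupon, say)
rewards = [5.0, 0.0, 3.0]
costs = [1.0, 0.0, 0.5]

traj = build_trajectory(states, actions, rewards, costs, lam=0.8)
print(traj[0].rtg, traj[0].ctg)   # RTG_0 = 5+0+3 = 8.0, CTG_0 = 1+0+0.5 = 1.5
# The Lagrangian objective the policy is trained against: reward − λ·cost.
print(traj[0].rtg - traj[0].lam * traj[0].ctg)

dataset = augment_with_lambdas(states, actions, rewards, costs)
print(len(dataset))               # 10 λ-augmented copies of the trajectory
```

Conditioning each token on $\lambda$ in this way is what lets a single trained model serve any budget level at inference time: tightening the budget simply means feeding a larger $\lambda$.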
Experimental Evaluation
The authors evaluate SACO on three datasets: (a) a proprietary ByteDance industrial log, (b) a public e-commerce benchmark, and (c) synthetic data generated to stress-test sequential dependencies. Baselines include classic two-stage pipelines, recent Decision-Focused Learning (DFL) methods, and a single-round Decision Transformer. Across all settings, SACO achieves an average revenue lift of 3.6% over the strongest baseline. Ablation studies show that removing the $\lambda$ embedding reduces performance by ~1.8%, while omitting the cost-to-go (CTG) signal drops it by ~2.3%, confirming the importance of explicit constraint conditioning. Inference is reported to be roughly five times faster than the traditional mixed-integer programming solvers used in the two-stage approach.
Contributions and Impact
The paper makes three primary contributions: (i) a generalized formulation that handles simultaneous multi‑user requests and multi‑round interactions, (ii) an efficient transformer‑based policy that jointly learns reward maximization and budget compliance, and (iii) extensive empirical validation demonstrating both effectiveness and scalability. By integrating causal sequence modeling with constrained optimization, SACO bridges a gap between reinforcement learning theory and the practical needs of large‑scale marketing operations. Future work may extend the framework to multi‑objective settings (e.g., user lifetime value, churn reduction) and to non‑linear, stochastic budget constraints.