DARA: Few-shot Budget Allocation in Online Advertising via In-Context Decision Making with RL-Finetuned LLMs


Optimizing an advertiser's cumulative value of winning impressions under budget constraints is a complex challenge in online advertising, under the paradigm of AI-Generated Bidding (AIGB). Advertisers often have personalized objectives but limited historical interaction data, resulting in few-shot scenarios where traditional reinforcement learning (RL) methods struggle to perform effectively. Large Language Models (LLMs) offer a promising alternative for AIGB by leveraging their in-context learning capabilities to generalize from limited data; however, they lack the numerical precision required for fine-grained optimization. To address this limitation, we introduce GRPO-Adaptive, an efficient LLM post-training strategy that improves both reasoning and numerical precision by dynamically updating the reference policy during training. Built on this foundation, we further propose DARA, a novel dual-phase framework that decomposes the decision-making process into two stages: a few-shot reasoner that generates initial plans via in-context prompting, and a fine-grained optimizer that refines these plans using feedback-driven reasoning. This separation allows DARA to combine LLMs' in-context learning strengths with the precise adaptability required by AIGB tasks. Extensive experiments in both real-world and synthetic data environments demonstrate that our approach consistently outperforms existing baselines in cumulative advertiser value under budget constraints.


💡 Research Summary

The paper addresses the problem of allocating a fixed advertising budget across multiple time periods under the emerging AI‑Generated Bidding (AIGB) paradigm, where advertisers often have personalized objectives but only a few historical interaction records. Traditional reinforcement‑learning (RL) approaches require large amounts of interaction data and extensive training, making them unsuitable for the cold‑start or few‑shot scenarios common in real‑world ad platforms. Recent advances in large language models (LLMs) demonstrate strong in‑context learning capabilities that can generalize from a handful of examples, yet LLMs lack the numerical precision needed for fine‑grained budget optimization.

To bridge this gap, the authors propose DARA (Dual‑phase Adaptive Reasoning and Allocation), a two‑stage framework that explicitly separates the high‑level generalization task from the low‑level numerical refinement task. In the first stage, a “Few‑shot Reasoner” receives a structured prompt containing the objective (minimize variance of marginal ROI while respecting the total budget), a few historical budget‑ROI pairs, prior trial records, and a clear output specification. Using only these few examples, the LLM generates an initial allocation vector, leveraging its language understanding and pattern‑recognition abilities to infer a plausible high‑level plan.
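The prompt structure described above can be sketched as follows. This is a minimal illustration, not the paper's exact template: the field names, the example history, and the output wording are all assumptions.

```python
# Sketch of the few-shot reasoner's structured prompt: objective, a few
# historical budget-ROI pairs, and an explicit output specification.
# All field names and example values below are illustrative assumptions.

def build_reasoner_prompt(total_budget, history, num_periods):
    """Assemble the structured prompt the few-shot reasoner would receive."""
    lines = [
        "Objective: allocate the total budget across periods so that the "
        "variance of marginal ROI is minimized and the budget constraint holds.",
        f"Total budget: {total_budget}",
        f"Number of periods: {num_periods}",
        "Historical (budget, marginal ROI) pairs per episode:",
    ]
    for episode in history:
        pairs = ", ".join(f"({b:.1f}, {roi:.3f})" for b, roi in episode)
        lines.append(f"  {pairs}")
    lines.append(
        f"Output: a list of {num_periods} non-negative numbers summing to "
        f"{total_budget}."
    )
    return "\n".join(lines)

# Two hypothetical historical episodes, three periods each.
history = [
    [(100.0, 0.82), (120.0, 0.75), (80.0, 0.91)],
    [(110.0, 0.79), (105.0, 0.80), (85.0, 0.88)],
]
prompt = build_reasoner_prompt(300.0, history, num_periods=3)
print(prompt)
```

The LLM's reply to such a prompt would then be parsed into the initial allocation vector handed to the second stage.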

The second stage, the "Fine-grained Optimizer," takes the initial plan and real-time feedback (the actual marginal ROI observed in each period) and refines the allocation with numerical precision. This stage is powered by a novel RL fine-tuning algorithm called GRPO-Adaptive. While the original Group Relative Policy Optimization (GRPO) stabilizes policy updates via group-wise normalization and KL regularization, it relies on a fixed reference policy. GRPO-Adaptive periodically replaces the reference policy with a more recent one, allowing the current policy to be compared against an up-to-date baseline. This dynamic reference mechanism yields more flexible and sample-efficient improvements, effectively enhancing both the reasoning quality of the LLM and the numerical accuracy required for budget allocation.
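The two ingredients just described, group-relative advantages and a periodically refreshed reference, can be illustrated with a toy scalar "policy" in place of an LLM. Everything here is a stand-in: the reward function, learning rates, KL surrogate, and refresh interval are assumptions made for illustration, not the paper's actual hyperparameters.

```python
# Toy numerical sketch of GRPO-Adaptive: group-relative advantages plus a
# reference that is periodically replaced by the current policy, rather than
# staying fixed as in vanilla GRPO. All constants are illustrative.
import math

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize rewards within a sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # avoid division by zero for constant groups
    return [(r - mean) / std for r in rewards]

def train(num_steps, refresh_every):
    """A scalar policy chases an optimum at 1.0; a quadratic penalty toward
    the reference stands in for the KL-to-reference term."""
    policy, reference, refreshes = 0.0, 0.0, 0
    offsets = (-0.2, -0.1, 0.1, 0.2)  # stand-in for sampling a rollout group
    for step in range(1, num_steps + 1):
        actions = [policy + d for d in offsets]
        rewards = [-(a - 1.0) ** 2 for a in actions]  # optimum at action = 1.0
        adv = group_relative_advantages(rewards)
        grad = sum(ad * (a - policy) for ad, a in zip(adv, actions)) / len(actions)
        policy += 0.5 * grad - 0.1 * (policy - reference)
        if step % refresh_every == 0:
            reference = policy  # the adaptive twist: refresh the baseline
            refreshes += 1
    return policy, reference, refreshes

policy, reference, refreshes = train(num_steps=50, refresh_every=10)
print(refreshes)  # → 5
```

With a fixed reference (never refreshed), the penalty keeps pulling the policy back toward its starting point; refreshing the reference lets the policy keep improving while still being regularized against a recent, rather than stale, baseline.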

The authors also construct two complementary environments for training and evaluation. The “Real‑World Data Environment” is built directly from enterprise‑scale advertising logs, preserving authentic cost‑consumption patterns and ROI dynamics across 24‑hour cycles. The “Synthetic Data Environment” uses controllable polynomial or exponential functions to model marginal ROI as a decreasing function of allocated budget, enabling unlimited generation of diverse scenarios for robustness testing.
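The synthetic environment's core ingredient, marginal ROI as a decreasing function of allocated budget, is easy to sketch. The functional forms below follow the polynomial/exponential description above, but the specific coefficient ranges are assumptions chosen for illustration.

```python
# Toy version of the synthetic data environment: marginal ROI modeled as a
# decreasing function of the budget allocated to a period, with randomly
# drawn parameters so that unlimited scenarios can be generated.
# Coefficient ranges are illustrative assumptions.
import math
import random

def make_marginal_roi(kind, rng):
    """Return a callable mapping allocated budget -> marginal ROI."""
    if kind == "polynomial":
        a = rng.uniform(0.5, 1.5)      # base ROI level
        p = rng.uniform(0.3, 0.9)      # polynomial decay exponent
        return lambda b: a / (1.0 + b) ** p
    if kind == "exponential":
        a = rng.uniform(0.5, 1.5)      # base ROI level
        k = rng.uniform(0.005, 0.02)   # exponential decay rate
        return lambda b: a * math.exp(-k * b)
    raise ValueError(f"unknown kind: {kind}")

rng = random.Random(0)                 # seeded for a reproducible scenario
roi = make_marginal_roi("exponential", rng)
# Diminishing returns: marginal ROI falls as more budget goes to one period.
print(roi(10.0) > roi(100.0))  # → True
```

Each sampled parameter set defines one scenario, so drawing new parameters yields the "unlimited generation of diverse scenarios" used for robustness testing.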

Extensive experiments in both environments demonstrate that DARA consistently outperforms strong baselines, including Q‑MCKP (a discretized Q‑learning approach), HiBid (a hierarchical RL planner), ABPlanner (an LSTM‑based RL planner), and pure LLM prompting methods. Across a range of budget sizes and few‑shot settings (as few as five historical episodes), DARA achieves 12–18% higher cumulative ROI and significantly lower variance of marginal ROI, confirming its ability to balance long‑term return maximization with budget stability. Ablation studies show that the dual‑phase decomposition is essential: a single‑stage prompt fails to capture both the high‑level pattern and the fine‑grained adjustments, while GRPO‑Adaptive yields a 7% gain over standard GRPO, highlighting the benefit of dynamic reference updates.

In summary, the paper makes three key contributions: (1) a dual‑phase architecture that aligns the strengths of LLMs (few‑shot generalization) with RL (numerical precision) for budget allocation; (2) the GRPO‑Adaptive algorithm that dynamically updates the reference policy during RL fine‑tuning, improving sample efficiency and policy flexibility; and (3) a realistic simulation framework that, together with real‑world data, validates the robustness and generalizability of the proposed method. DARA offers a practical solution for advertisers operating under data‑scarce conditions, delivering interpretable allocation plans while maintaining the numerical rigor needed for optimal ad spend.

