Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization
Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, which aligns poorly with long-horizon task success. To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent. The Expert Agent optimizes multi-turn goal preferences at the dialogue-trajectory level, while the Customer Service Agent generates responses strictly aligned with the selected strategy. We evaluate GOPO on public benchmarks and e-commerce customer service datasets, and introduce Task-focused Sequential Engagement (TSE), a sequence-level metric derived from real e-commerce interaction data. On the Mgshop dataset, GOPO improves TSE by 7.7% and 10.3% over PPO and Memento, respectively, with consistent gains in sequence-level reward and generation quality. Furthermore, a 14B model trained with GOPO achieves 2.7% and 1.5% higher TSE than Qwen-235B and GPT-5.2, respectively. Ablation studies confirm the Expert Agent's critical role in long-horizon optimization, and GOPO demonstrates consistent improvements across other datasets as well. This work establishes a new paradigm for task-oriented dialogue systems in commercial scenarios, with code and datasets to be made public.
💡 Research Summary
The paper tackles a critical gap in task‑oriented dialogue systems: the misalignment between language‑model‑driven fluency and the long‑horizon business objectives that matter in real‑world e‑commerce customer service (conversion rate, first‑contact resolution, SOP compliance). Existing training paradigms (token‑level likelihood maximization, RLHF, and preference‑based methods such as DPO) optimize short‑term linguistic signals but fail to capture multi‑turn strategic planning. To bridge this gap, the authors propose Goal‑Oriented Preference Optimization (GOPO), a hierarchical reinforcement‑learning framework that explicitly separates strategy selection from response generation through two cooperating agents.
Expert Agent (E) operates at the macro level. It receives a global dialogue state comprising recent history, user intent, emotion, and the previously chosen skill. From a predefined skill pool, it selects one or more skills to form a macro‑action (a_E^t). The reward for E is a trajectory‑level preference signal based on a normalized Discounted Cumulative Gain (ESNDCG) that measures how well the predicted skill sequence matches a teacher‑generated reference. This design provides a ranking‑aware, position‑sensitive supervision that aligns with business goals without requiring turn‑by‑turn human labels.
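The summary does not give ESNDCG in closed form, so the sketch below shows one plausible instantiation: a position-discounted match score between the predicted skill sequence and the teacher reference, normalized by the ideal (all-match) DCG. The binary gain and the log2 discount base are assumptions, not details from the paper.

```python
import math

def esndcg_reward(predicted_skills, reference_skills):
    """Sketch of a normalized DCG-style trajectory reward for the Expert Agent.

    Each turn whose predicted skill matches the teacher reference contributes
    a gain of 1, discounted by its position in the dialogue; the sum is then
    normalized by the ideal DCG so the reward lies in [0, 1].
    """
    dcg = sum(
        1.0 / math.log2(t + 2)  # position discount: turn 0 -> 1/log2(2) = 1.0
        for t, (pred, ref) in enumerate(zip(predicted_skills, reference_skills))
        if pred == ref
    )
    ideal = sum(1.0 / math.log2(t + 2) for t in range(len(reference_skills)))
    return dcg / ideal if ideal > 0 else 0.0
```

Because the discount shrinks with position, this formulation is ranking-aware and position-sensitive in the way the text describes: an early mismatch costs more reward than a late one.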
Customer Service Agent (A) works at the micro level. Conditioned on the current user utterance, business context, and the hard constraints derived from the Expert's chosen skill, it generates a token sequence as the system response. A's reward is a weighted sum of four dimensions (fluency, factual correctness, SOP compliance, and response diversity), automatically scored by a GPT‑4 based evaluator. Its loss combines a policy‑gradient term, a compliance loss that penalizes deviation from the hard constraints, and an entropy‑based diversity term.
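As a concrete illustration, the reward and loss composition might look like the following sketch. The weight values (`0.25/0.30/0.30/0.15`, `lam`, `beta`) are illustrative placeholders, not figures from the paper:

```python
def agent_reward(scores, weights=None):
    """Sketch: weighted sum of the four response-quality dimensions.

    `scores` holds evaluator ratings in [0, 1] for fluency, factual
    correctness, SOP compliance, and diversity. The default weights
    are hypothetical and should sum to 1.0.
    """
    weights = weights or {"fluency": 0.25, "factual": 0.30,
                          "sop": 0.30, "diversity": 0.15}
    return sum(weights[k] * scores[k] for k in weights)

def agent_loss(pg_loss, compliance_penalty, entropy, lam=0.5, beta=0.01):
    """Sketch of the composite Agent loss described in the text:
    policy-gradient term + weighted compliance penalty - entropy bonus
    (subtracting entropy encourages response diversity).
    """
    return pg_loss + lam * compliance_penalty - beta * entropy
```

In a real training loop `pg_loss` would come from the actor-critic objective and `compliance_penalty` from a check of the response against the Expert's hard constraints; both are scalars here to keep the composition visible.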
Both agents share a joint reward R_t = R_E^t + R_A^t and are trained with an actor‑critic architecture that computes advantage estimates for each level. The hierarchical dependence reduces exploration complexity and variance, enabling stable long‑horizon optimization.
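A minimal sketch of the joint reward and an advantage estimate follows. The paper's exact estimator is not specified in this summary; a one-step TD advantage is assumed here purely for illustration:

```python
def joint_rewards(expert_rewards, agent_rewards):
    """Per-turn joint reward R_t = R_E^t + R_A^t, as defined in the text."""
    return [r_e + r_a for r_e, r_a in zip(expert_rewards, agent_rewards)]

def advantages(rewards, values, gamma=0.99):
    """Sketch: one-step TD advantage A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` is the critic's state-value estimate per turn, with one extra
    trailing entry for the terminal state. The one-step form is an
    assumption; GAE or Monte Carlo returns would slot in the same way.
    """
    return [r + gamma * values[t + 1] - values[t]
            for t, r in enumerate(rewards)]
```

Each agent would use its own critic and its own slice of this computation, which is how the hierarchy keeps per-level variance low.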
A novel evaluation metric, Task‑focused Sequential Engagement (TSE), is introduced. TSE aggregates real‑world e‑commerce signals such as conversion success, average handling time, and SOP violation count across an entire dialogue, providing a more business‑relevant assessment than BLEU or ROUGE.
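A TSE-style aggregate could be sketched as below. The signal weights, the handling-time cap, and the per-signal normalizations are hypothetical, chosen only to show the shape of the sequence-level computation the text describes:

```python
def tse_score(dialogue, w_conv=0.5, w_time=0.2, w_sop=0.3,
              max_handle_time=600.0):
    """Sketch of a TSE-style sequence-level score in [0, 1].

    Combines three whole-dialogue business signals mentioned in the text:
    conversion success, average handling time (shorter is better, capped
    at `max_handle_time` seconds), and SOP violation count. All constants
    are illustrative placeholders, not values from the paper.
    """
    conv = 1.0 if dialogue["converted"] else 0.0
    time_score = max(0.0, 1.0 - dialogue["handle_time"] / max_handle_time)
    sop_score = 1.0 / (1.0 + dialogue["sop_violations"])
    return w_conv * conv + w_time * time_score + w_sop * sop_score
```

Unlike BLEU or ROUGE, nothing here compares tokens against a reference text; the score depends only on outcomes of the dialogue as a whole, which is the point of the metric.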
Experiments are conducted on four datasets: Mgshop, MultiMoz, and two Tmall brand datasets, all derived from actual customer‑service logs. Baselines include PPO, Memento, DPO‑style methods, ReAct‑style single‑agent LLMs, and large proprietary models (Qwen‑235B, GPT‑5.2). GOPO‑Qwen3‑14B (14B parameters) outperforms PPO by 7.7% and Memento by 10.3% on TSE, while also achieving higher GRE and G‑Eval scores. Remarkably, despite being an order of magnitude smaller, GOPO‑Qwen3‑14B surpasses Qwen‑235B and GPT‑5.2 by 2.7% and 1.5% respectively on TSE, demonstrating the efficiency gains from strategic decoupling. SOP violation rates drop by roughly 45% compared with baselines, and user satisfaction surveys show statistically significant improvements.
Ablation studies reveal that removing the Expert Agent or replacing ESNDCG with a simple accuracy reward degrades TSE by 4–6% and increases SOP breaches. Using only soft‑prompt constraints (no hard transmission) leads to higher linguistic naturalness but substantially worse business alignment, confirming the necessity of hard‑constraint enforcement.
The authors acknowledge limitations: constructing the skill pool and teacher references incurs upfront cost, and the current framework focuses on textual SOPs, leaving multimodal constraints for future work. They also note that reinforcement‑learning stability remains a challenge, suggesting richer offline data and more sophisticated advantage estimators as avenues for improvement.
In conclusion, GOPO establishes a new paradigm for commercial dialogue systems by jointly optimizing long‑horizon strategic preferences and turn‑level response compliance. The hierarchical dual‑agent design, trajectory‑level reward formulation, and the TSE metric together enable smaller models to outperform much larger ones on real business metrics, paving the way for more efficient, reliable, and commercially viable conversational AI. Future research directions include automated skill discovery, cross‑domain transfer, and extending hard‑constraint mechanisms to multimodal interactions.