Harmonizing Dense and Sparse Signals in Multi-turn RL: Dual-Horizon Credit Assignment for Industrial Sales Agents
Optimizing large language models for industrial sales requires balancing long-term commercial objectives (e.g., conversion rate) with immediate linguistic constraints such as fluency and compliance. Conventional reinforcement learning often merges these heterogeneous goals into a single reward, causing high-magnitude session-level rewards to overwhelm subtler turn-level signals, which leads to unstable training or reward hacking. To address this issue, we propose Dual-Horizon Credit Assignment (DuCA), a framework that disentangles optimization across time scales. Its core, Horizon-Independent Advantage Normalization (HIAN), separately normalizes advantages from turn-level and session-level rewards before fusion, ensuring that both immediate and long-term objectives contribute balanced gradients to the policy update. Extensive experiments with a high-fidelity user simulator show that DuCA outperforms the state-of-the-art GRPO baseline, achieving a 6.82% relative improvement in conversion rate, reducing inter-sentence repetition by 82.28%, and lowering identity detection rate by 27.35%. These results demonstrate that DuCA effectively balances the dual demands of strategic performance and naturalistic language generation in an industrial sales scenario.
💡 Research Summary
The paper tackles a fundamental challenge in applying large language models (LLMs) to industrial sales chatbots: the coexistence of dense, turn‑level linguistic rewards (fluency, compliance, style) and sparse, high‑value session‑level business rewards (conversion, regulatory compliance). Traditional reinforcement learning (RL) methods collapse these heterogeneous signals into a single scalar reward, which leads to gradient dominance: high‑variance, large‑magnitude session rewards drown out subtle turn‑level signals, while over‑emphasis on turn‑level rewards can cause reward hacking and neglect of long‑term objectives.
To resolve this, the authors propose Dual‑Horizon Credit Assignment (DuCA), a multi‑turn RL framework that treats the two horizons separately throughout the learning pipeline. The core innovation is Horizon‑Independent Advantage Normalization (HIAN). DuCA maintains two distinct value heads, V_turn and V_session, each estimating expected returns for its respective horizon. Generalized Advantage Estimation (GAE) is applied independently: A_turn uses γ=0.99, λ=0.95 to capture short‑term language quality, while A_session uses γ=1.0, λ=1.0 so that the terminal business reward propagates without decay across the entire episode.
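The two-horizon GAE computation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and the episode layout (a single terminal business reward for the session horizon) are assumptions; only the hyperparameter choices (γ=0.99, λ=0.95 for turns; γ=1.0, λ=1.0 for the session) come from the text.

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Generalized Advantage Estimation over one episode.

    rewards: per-step rewards, length T.
    values:  value estimates, length T+1 (last entry bootstraps the
             final state; 0 for a terminal state).
    Returns the advantage for each of the T steps.
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # One-step TD error, then the exponentially weighted GAE recursion.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# Turn horizon: dense language-quality rewards, short-term credit.
turn_rewards = np.array([0.3, -0.1, 0.5, 0.2])
a_turn = gae(turn_rewards, np.zeros(5), gamma=0.99, lam=0.95)

# Session horizon: a single terminal business reward. With gamma=lam=1.0
# it propagates undecayed to every turn in the episode.
session_rewards = np.array([0.0, 0.0, 0.0, 1.0])
a_session = gae(session_rewards, np.zeros(5), gamma=1.0, lam=1.0)
```

With zero value estimates and γ=λ=1, every turn receives exactly the terminal reward as its session advantage, which is the undecayed propagation the paper relies on.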
HIAN then normalizes the two advantage streams independently within each minibatch:
\[
\hat{A}_{\text{turn}} = \frac{A_{\text{turn}} - \mu_{\text{turn}}}{\sigma_{\text{turn}} + \epsilon}, \qquad
\hat{A}_{\text{session}} = \frac{A_{\text{session}} - \mu_{\text{session}}}{\sigma_{\text{session}} + \epsilon}
\]
where μ and σ are the mean and standard deviation of each advantage stream computed over the minibatch, and ε is a small constant for numerical stability. The normalized streams are then fused into a single advantage for the policy update, so neither horizon's raw reward magnitude dominates the gradient.
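Per-minibatch standardization and fusion can be sketched in a few lines. The function name and the session fusion weight `w_session` are hypothetical illustration choices; the source describes normalization before fusion but does not specify the fusion weighting here.

```python
import numpy as np

def hian_fuse(a_turn, a_session, w_session=1.0, eps=1e-8):
    """Horizon-Independent Advantage Normalization (illustrative sketch).

    Standardizes each advantage stream over the minibatch so the two
    horizons contribute comparable gradient magnitudes, then fuses them.
    w_session is a hypothetical weight; the paper's fusion details may differ.
    """
    def standardize(a):
        return (a - a.mean()) / (a.std() + eps)
    return standardize(a_turn) + w_session * standardize(a_session)

# Turn advantages are small in scale; session advantages are orders of
# magnitude larger. After HIAN both streams have zero mean and unit scale.
a_turn = np.array([0.1, -0.1, 0.2, -0.2])
a_session = np.array([100.0, -100.0, 50.0, -50.0])
fused = hian_fuse(a_turn, a_session)
```

Without the per-stream normalization, the raw session advantages above would dominate the fused signal by roughly three orders of magnitude, which is exactly the gradient-dominance failure mode the framework targets.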