Lightweight Adaptation for LLM-based Technical Service Agent: Latent Logic Augmentation and Robust Noise Reduction

L I G H T W E I G H T A D A P T A T I O N F O R L L M - BA S E D T E C H N I C A L S E R V I C E A G E N T : L A T E N T L O G I C A U G M E N T A T I O N A N D R O B U S T N O I S E R E D U C T I O N Y i Y u 1 ∗ , Junzhuo Ma 1 ∗ , Chenghuang Shen 2 ∗ , 3 , Xingyan Liu 3 ∗ , Jing Gu 1 † , Hangyi Sun 1 † , Guangquan Hu 4 , Jianfeng Liu 3 , 4 ‡ , W eiting Liu 4 , Mingyue Pu 3 , Y u W ang 3 , Zhengdong Xiao 3 , Rui Xie 3 , Longjiu Luo 3 , Qianrong W ang 3 , Gurong Cui 3 , Honglin Qiao 3 , W enlian Lu 1 , 2 , 3 , 4 , 5 ‡ 1 School of Mathematics and Sciences, Fudan Univ ersity , Shanghai, China 2 Shanghai Center for Mathematical Sciences, Fudan Univ ersity , Shanghai, China 3 Alibaba Group, Hangzhou, China 4 Institute of Science and T echnology for Brain-Inspired Intelligence, Fudan Univ ersity , Shanghai, China 5 Center for Applied Mathematics & Shanghai Ke y Laboratory of Contemporary Applied Mathematics, Fudan Univ ersity , Shanghai, China jiawei.ljf@alibaba-inc.com wenlian@fudan.edu.cn ∗ Contributed equally to this research. † Contributed equally to this research. ‡ Corresponding author . A B S T R AC T Adapting Large Language Models in comple x technical service domains is constrained by the absence of e xplicit cogniti ve chains in human demonstrations and the inherent ambiguity arising from the div ersity of valid responses. These limitations sev erely hinder agents from internalizing latent decision dynamics and generalizing effecti vely . Moreover , practical adaptation is often impeded by the prohibitiv e resource and time costs associated with standard training paradigms. T o overcome these challenges and guarantee computational efﬁcienc y , we propose a lightweight adaptation frame work comprising three ke y contributions. (1) Latent Logic Augmentation: W e introduce Planning-A ware T rajectory Modeling and Decision Reasoning Augmentation to bridge the gap between surface-lev el supervision and latent decision logic. These approaches strengthen the stability of Supervised Fine- T uning alignment. (2) Robust Noise Reduction: W e construct a Multiple Ground Truths dataset through a dual-ﬁltering method to reduce the noise by validating di verse responses, thereby capturing the semantic di versity . (3) Lightweight Adaptation: W e design a Hybrid Re ward mechanism that fuses an LLM-based judge with a lightweight relev ance-based Reranker to distill high-ﬁdelity re ward signals while reducing the computational cost compared to standard LLM-as-a-Judge reinforcement learning. Empirical ev aluations on real-world Cloud service tasks, conducted across semantically diverse settings, demonstrate that our frame work achie ves stability and performance gains through Latent Logic Augmentation and Robust Noise Reduction. Concurrently , our Hybrid Reward mechanism achiev es alignment comparable to standard LLM-as-a-judge methods with reduced training time, underscoring the practical value for deplo ying technical service agents. K eywords Large Language Model · Latent Logic Augmentation · Lightweight Adaptation · Noise Reduction PRIME AI paper 1 Introduction The adaptation of Large Language Models (LLMs) to comple x technical service domains presents challenges distinct from those in general-purpose con versational settings [ 1 , 2 , 3 , 4 , 5 , 6 ]. While non-parametric methods such as In-Context Learning (ICL) [ 7 , 8 , 9 , 10 , 11 ] and Retriev al-Augmented Generation (RA G) [ 12 , 13 , 14 ] facilitate inference-time adaptation, deep domain specialization requires parameter updates. Howe ver , tw o fundamental obstacles impede the efﬁcac y of current training paradigms in complex technical service domains. First, human expert demonstrations often lack explicit cognitive chains, presenting only the ﬁnal action rather than the underlying reasoning process [ 15 , 16 ]. Standard Supervised Fine-Tuning (SFT) on these trajectories, alongside methods such as human preference alignment [ 17 ], Continual Pre-training [ 18 ] and Parameter -Efﬁcient Fine-T uning (PEFT) [ 19 ], promotes a myopic imitation of surface-le vel responses, failing to instill necessary decision reasoning capabilities. Although researchers have explored equipping LLMs with foresight by incorporating future dialogue context [ 20 ] or interaction histories [ 21 ], these methods generally rely on environment-mediated feedback [ 22 , 23 ]. The absence of latent logic in training data prev ents the model from learning complex decision dynamics required for effecti ve generalization without a liv e environment, leading to “myopic” imitation which degrades performance in complex, domain-speciﬁc tasks. Second, technical service tasks are characterized by an inherent div ersity of valid responses, where a single query may admit multiple valid resolutions [ 24 , 25 , 26 ]. Conv entional training paradigms rely on a single ground truth, which erroneously penalizes v alid but distinct resolutions, leading to “kno wledge collapse” and output homogenization [ 24 ]. While Reinforcement Learning (RL) methods such as RLHF [ 27 ], PPO [ 28 ], DPO [ 29 ] and DFPO [ 30 ] are po werful for alignment, they struggle when benchmarked against a single, arbitrary “gold” reference [ 25 ]. Recent works ha ve attempted to extend RL framew orks to accommodate multiple references [ 31 , 32 , 33 ], yet systematically capturing this semantic div ersity remains difﬁcult. Furthermore, obtaining reliable re ward signals for RL in this context is challenging. The LLM-as-a-J udge paradigm utilizes powerful LLMs as scalable proxies for human ev aluation [ 34 , 35 , 36 ] and rew ard generation [ 37 , 38 ]. Howe ver , this approach is susceptible to reward hacking, wherein policies exploit judge imperfections [ 39 , 17 ]. While ensemble- based re ward models can mitigate re ward hacking, the y are often impractical for efﬁcient, large-scale training due to prohibitiv e computational costs [40, 41, 42]. T o address these challenges, we propose a computationally efﬁcient adaptation framework comprising three ke y contributions: 1. Latent Logic A ugmentation : W ithout latent decision logic, models are prone to performance de gradation during SFT alignment in comple x tasks. Our approach enriches training data with explicit reasoning structures, including forward-looking reasoning via Planning-A war e T rajectory Modeling (P A TM) and backward-looking reasoning via Decision Reasoning Augmentation (DRA) , compelling the model to internalize the en vironment’ s transition dynamics, thereby enhancing the stability of SFT alignment. 2. Robust Noise Reduction : T o counter the single-reference bias, we construct a Multiple Gr ound T ruths (Multi-GT) dataset using a novel dual-ﬁltering method. This process identiﬁes and curates a div erse set of valid responses for each query , reducing supervision noise and enabling the model to learn a richer semantic space of valid solutions. 3. Lightweight Adaptation : W e introduce a Hybrid Rewar d mechanism (HRM) that fuses an LLM-based judge with a lightweight, relev ance-based Reranker . This design provides high-ﬁdelity re ward signals for RL while maintaining computational efﬁcienc y , facilitating rob ust and scalable model alignment without the prohibitiv e costs of traditional RL methods that rely solely on LLM-based judges. W e v alidate our framew ork on real-w orld Cloud service tasks, demonstrating the ef ﬁcacy of our methods in semantically div erse settings. 2 Related W ork LLM Adaptation. Methods for specializing LLMs are broadly categorized into non-parameter and parameter -update approaches. Non-parameter methods like In-Context Learning (ICL) [ 7 , 8 ] and Retrie v al-Augmented Generation (RA G) [ 12 ] adapt models at inference time. In contrast, our work focuses on parameter-update methods for deeper adaptation. Supervised Fine-T uning (SFT) aligns models with labeled data [ 17 ], but as we argue, it can lead to myopic imitation in planning-intensi ve tasks. Reinforcement Learning (RL) methods, including RLHF [ 27 ], PPO [ 28 ], DPO [ 29 ] and DFPO [ 30 ] optimize models against a re ward function. Our work builds upon adv ances in RL stability , such as D APO 2 PRIME AI paper [ 33 ], but shifts the focus from the optimization algorithm itself to impro ving the quality and efﬁcienc y of the training data and rew ard signals. Future-A ware Planning in LLMs. T o impro ve multi-turn consistenc y , researchers hav e e xplored equipping LLMs with foresight. This includes incorporating future dialogue context [ 20 ], latent thought traces [ 15 , 16 ], or interaction histories to guide generation [ 21 ]. While these methods demonstrate that models can acquire implicit planning capabilities, they often rely on en vironment-mediated feedback [ 22 , 23 ]. Our Latent Logic Augmentation contributes a more direct approach by explicitly structuring training data to teach the model to reason about future states without requiring a liv e en vironment. Handling Diverse V alid Responses. The reliance on a single ground truth can lead to kno wledge collapse and output homogenization [ 24 , 25 ]. This is particularly problematic in domains where multiple valid solutions exist. Recent work has focused on creating benchmarks with multiple references [ 26 ] and extending RL framew orks to accommodate them [ 31 , 32 , 33 ]. Our contribution, Robust Noise Reduction, introduces a principled and automated dual-ﬁltering method to construct a Multi-GT dataset from real-world data, systematically capturing semantic di versity . LLM-as-a-Judge f or Reward Modeling. The LLM-as-a-Judge paradigm uses powerful LLMs as scalable proxies for human ev aluation [ 34 , 35 , 36 ]. This has been extended to generate reward signals for RL [ 37 , 38 ]. Howe ver , a key challenge is re ward hacking, where the polic y exploits imperfections in the judge [ 17 , 39 ]. While ensemble-based re ward models can enhance rob ustness, the y are computationally expensiv e for online training [ 40 , 42 ]. Our Lightweight Adaptation directly addresses this trade-off by designing a Hybrid Re ward mechanism that combines the ﬁdelity of an LLM judge with the efﬁcienc y of a lightweight Reranker , achie ving both robustness and computational tractability . 3 Methodology In this section, we present our frame work for adapting Lar ge Language Models (LLMs) to complex technical service domains. W e ﬁrst establish a Multiple Ground T ruths (Multi-GT) paradigm for ev aluation and data construction. W e then describe a two-phase adaptation process: (1) Latent Logic A ugmentation via Planning-A war e T rajectory Modeling (P ATM) and Decision Reasoning Augmentation (DRA) , and (2) Lightweight Adaptation via RL training with a Hybrid Rewar d Mechanism (HRM) . PATM Data Source PATM Prompts PATM GT Rewrite PATM instruction PATM GT w/ Reasoning Decision Reasoning Augmentation PATM Data Decision Data Source Decision Prompts Decision instruction Decision Data SFT RL Data Source Offline Candidates Online Candidates Candidates Pool Accepted Candidates Consistency Judge Policy π Reranker Rollout Fast Pass? Judge Zero Judge Score Yes No Hybrid Reward Decision GT w/ Reasoning Rollout Inference Accepted Candidates Utility Judge Multi - GT Data DAPO Verified Optimizing Computing Unprocessed Preproces sed Processed LLM Instructing Training Latent Logic Aug mentation Robust Noise Reduction SFT Adaptation Lightweight Adaptatio n for RL Figure 1: Ov erview of the Proposed Framework. The framework consists of four stages. T op-left: Latent Logic Augmentation (pre-computed); Bottom-left: Multi-GT data (pre-computed); T op-right: SFT training; Bottom-right: RL training with HRM. 3.1 Foundation: Multi-GT Data and Evaluation The foundation of our framework is a Multi-GT paradigm designed to address the inherent ambiguity of technical services, where a single query can hav e multiple v alid responses. Traditional e v aluation, which uses a single logged agent response as the unique gold reference, is unreliable because it unfairly penalizes valid but distinct solutions. T o overcome this, we expand the single gold reference into a set of valid responses Y ⋆ ( x ) = { y ⋆ 1 , . . . , y ⋆ m } for both ev aluation and training. Core Instruments: Consistency Judge and Utility Judge. Since manual annotation is impractical, we automate the construction of the Multi-GT training set via a pre-computation pipeline. Central to our pipeline are two specialized LLM-based judges: (1) Consistency Judg e : Evaluates whether a response follo ws business logic consistent with the expert, achie ving 92% alignment with human labels (see prompt in Appendix D.1). (2) Utility Judge : Uses privileged context (summary of service ticket) to determine if an alternati ve response ef fectiv ely resolves the customer’ s issue, 3 PRIME AI paper capturing valid solutions that dif fer from the history , achieving 83% alignment with human labels (see prompt in Appendix D.2). A utomated Construction via Dual-Filtering Expansion. W e construct the candidates for Multi-GT dataset through two complementary streams: (1) Ofﬂine Exploration: W e use a lightweight model (e.g., Qwen3-4B) with high temperature ( T = 1 . 2 ) to generate di verse candidates. This injects novel linguistic patterns and reasoning angles; (2) Online Adaptation: W e harvest high-likelihood rollouts from a preliminary RL run of a policy model. This exposes “hard positiv e” responses fa vored by the model that are functionally v alid but dif fer from the human reference. All candidates undergo a Dual-F iltering process comprising a Consistency Judge and a Utility Judge: The Consistency Judge ensures policy adherence, while the Utility Judge validates the effecti veness of alternativ e solutions. Only candidates passing one of these ﬁlters are added to Y ⋆ ( x ) . For transparenc y , we report the detailed dataset composition and the expansion breakdo wn by source in Appendix A (T able 3). Overall, Multi-GT expansion roughly doubles the number of references (e.g., T rain: 5,120 → 10,127), where newly added references come from three channels: consistency-judge-approv ed online rollouts, consistency-judge- approv ed ofﬂine candidates, and utility-judge-appro ved alternativ es. For ev aluation with Multi-GT dataset, we deﬁne the Ensemble-Consistency Score (ECS) as the evaluation of the Consistency Judge (normalized to [0 , 1] ) against the expanded set Y ⋆ ( x ) : S ECS ( x, y ) = max y ⋆ ∈Y ⋆ ( x ) J con ( x, y , y ⋆ ) . (1) where J con is the mean score across an ensemble Consistency Judges (DeepSeek-R1[ 43 ], DeepSeek-V3.2[ 44 ], Qwen3- Max[45], and QwQ-Plus[46]). 3.2 SFT Stage: Latent Logic A ugmentation Standard SFT on expert trajectories often leads to myopic imitation, leading to performance de gradation in complex technical service tasks. T o instill latent decision reasoning, we augment training data through two methods. W e model the interaction as a Markov Decision Process (MDP) M = ⟨S , A , P , R , γ ⟩ and augment the training data to make the latent decision processes visible. Decision Reasoning A ugmentation. T o teach immediate reasoning, we process raw state-action pairs. For each agent response a t corresponding to the state q t in tickets, a powerful LLM generates a “backward” chain-of-thought rationale c t justifying the action. The model is trained to predict the rationale and the response: L Decision = − E ( q t ,c t ,a t ) ∼D DRA [log p θ ( c t , a t | q t )] . (2) This objectiv e forces the model not just to mimic the action a t , but to ﬁrst generate the underlying thought process c t , thereby internalizing the decision logic (see prompt in Appendix D.3.4). Planning-A ware T rajectory Modeling. T o equip the agent with foresight, we construct Planning-A ware trajectories for short future interactions. W e extract three-step future sequences ( a t , q t +1 , a t +1 ) following the current customer query q t in the tickets, representing an agent response, corresponding en vironment response, and corresponding next agent response. A powerful LLM re writes this into a structured planning form y P A TM t = ( q t , ˜ a t , ˜ q t +1 , ˜ a t +1 ) , where ˜ a t is the predicted response to query q t , ˜ q t +1 is the predicted response of the environment (customer or tool) to ˜ a t , and ˜ a t +1 is the next predicted response to ˜ q t +1 . The learning objectiv e autoregressi vely generates this trace: L P A TM = − E ( y P A TM t ) ∼D P A TM [log p θ (˜ a t , ˜ q t +1 , ˜ a t +1 | q t )] . (3) By explicitly predicting Planning-A ware T rajectory , the model internalizes the en vironment’ s transition dynamics, which forms the foundation of decision reasoning dynamics, mo ving beyond simple pattern matching. This autoregressi ve objectiv e naturally decomposes into four coupled sub-objecti ves: (i) reasoning π θ ( c t | q t ) , (ii) policy ex ecution π θ (˜ a t | q t , c t ) , (iii) implicit world modeling P θ ( ˜ q t +1 | q t , c t , ˜ a t ) , and (iv) contingency planning π θ (˜ a t +1 | q t , c t , ˜ a t , ˜ q t +1 ) . T erm (iii) is the core signal for capturing the stochastic en vironment dynamics (see prompts in Appendix D.3.1 and D.3.2). Furthermore, we can apply DRA to P A TM data to augment P A TM data with reasoning y P A TM+R t = ( q t , c t , ˜ a t , ˜ q t +1 , ˜ a t +1 ) . In practice, D DRA and D P A TM can be jointly utilized for SFT alignment. As empirically veriﬁed in our ablations (T able 1), this hybrid approach yields a superior balance of response quality and strategic foresight, pro viding a robust starting point for the subsequent online adaptation phase. 3.3 RL Stage: Lightweight Adaptation with Hybrid Reward T o reﬁne the policy π SFT obtained from SFT alignment, we address the challenge of reward sparsity and computational cost in Reinforcement Learning. During interactions, the agent primarily executes either tool in vocations ( call_tool ) 4 PRIME AI paper or textual replies ( reply ). Any action-type mismatch strictly yields a zero rew ard. F or matched actions, while tool calls are deterministically scored via exact parameter matching, assessing the semantic consistency of free-form textual replies is highly subjecti ve. Using powerful LLMs as judges for e very RL rollout incurs signiﬁcant latency . T o optimize the trade-off between ef ﬁciency and ﬁdelity , we propose a HRM that fuses a lightweight Reranker with an LLM-based Judge. Components: Reranker and Judge. (1) Reranker ( S R ): Rather than relying on dense embeddings, we use an instruction-augmented Qwen3-4B consistency reranker (non-thinking). It serves as a computationally cheap discriminator ( S R ) that captures logical contradictions often missed by v ector similarity (see Appendix C for a detailed comparison with embedding-based methods). (2) LLM-based Judge ( S J ): A larger Qwen3-32B model (thinking-enabled) provides expert-le vel consistenc y scores, using the same prompt template as the Consistency Judge in the Dual-Filtering process. T o stabilize training, we use a soft-score derived from token probabilities: Score = P ( “Yes” ) + 0 . 5 · P ( “Part” ) . Single-Interval Cascade Strategy . Running the 32B Judge for e very RL rollout is prohibiti vely expensi ve, incurring approximately 10 × the inference latency compared with the lightweight Reranker . T o mitigate this, we employ a cascade strategy R θ ( S R , S J ) to approximate the oracle re ward. W e deﬁne a “trust interval” [ τ a , τ b ] for the cheap Reranker: R θ ( S R , S J ) =    w 1 S R + (1 − w 1 ) S J , S R < τ a (Mix) S R , τ a ≤ S R ≤ τ b (Fast P ass) w 2 S R + (1 − w 2 ) S J , S R > τ b (Mix) (4) The parameters θ = { τ , w } are optimized to maximize Spearman’ s rank correlation coefﬁcient with the ensemble Consistency Judge on a held-out set. Formally , each e valuation instance is x = ( q , y a , y b ) with two model-produced scores S R ( x ) ∈ [0 , 1] (reranker) and S J ( x ) ∈ [0 , 1] (32B judge), and a teacher score Y ( x ) ∈ [0 , 1] (Ensemble- Consistency Score). Given a ﬁtting set S = { x i } N i =1 , we seek a mapping R θ : [0 , 1] 2 → [0 , 1] that maximizes: θ ∗ = arg max θ ∈ Θ ρ spearman  { R θ ( S R ( x i ) , S J ( x i )) } N i =1 , { Y ( x i ) } N i =1  . (5) By routing clear-cut samples (F ast Pass) to the fast Rerank er and reserving the costly Judge for ambiguous cases, this hybrid mechanism reduces the ov erall rew ard computation time while maintaining alignment ﬁdelity . 4 Experiments W e conduct extensi ve e xperiments to ev aluate the ef fectiveness of our proposed frame work based on the Qwen3-4B model [ 47 ]. W e adopt a progressive validation strategy: ev aluating Latent Logic Augmentation in the SFT phase, followed by Lightweight Adaptation and Rob ust Noise Reduction in the RL phase. Speciﬁcally , we aim to answer the following research questions (RQs): RQ1 SFT Strategy: Do DRA and P A TM data effecti vely enhance the model’ s capacity in complex tasks? (Sec. 4.2) RQ2 Reward Design: How do the SFT initialization and the proposed HRM impact the effecti veness of RL adaptation? (Sec. 4.2) RQ3 Data Construction: Does the Multi-GT paradigm successfully prevent kno wledge collapse and cov er diverse valid resolutions? (Sec. 4.2) 4.1 Experimental Setup Dataset & Metrics. W e utilize a proprietary technical service dataset, consisting of 10k queries each for decision and planning SFT , alongside 5120/1k/1k queries for RL training/v alidation/test. Using our dual-ﬁltering pipeline, we expand the single logged reference into a Multi-GT dataset across all splits for RL training and for both RL and SFT e valuation (e.g., expanding the test references from 1k to 1,975, as detailed in Sec. 3.1). Accordingly , we report the ECS for Multi-GT (Multi-ECS) (as detailed in Sec. 3.1) as the primary ev aluation metric, and additionally report Single-ECS to measure alignment with the single logged reference. As established in Sec. 3.1, our Ensemble-Consistency Score has been veriﬁed to achie ve 92% alignment with human e xpert judgments. T raining Setup & Baselines. For SFT strategies, we inv estigate the impact of DRA (w/ or w/o DRA) and P A TM ( SFT-Decision vs. SFT-Mix ). In the RL stage, we employ the D APO algorithm [ 33 ]. F or re ward signals, we compare: (1) Reranker , (2) Hard Judge , (3) Soft Judge , and (4) Hybrid Reward . For data construction, we compare standard Single-GT RL against our Multi-GT expansion. T o ensure a fair comparison, all RL models are 5 PRIME AI paper consistently e valuated at a ﬁx ed training checkpoint (after 20 episodes). Detailed hyperparameter conﬁgurations for both SFT and RL stages are provided in Appendix B. T able 1: SFT Strategy Analysis (RQ1). Comparison of Latent Logic Augmentation. Applying DRA to decision data and P A TM data (Mix) dramatically improves capabilities. Method Multi-ECS Single-ECS Call T ool Acc Original Model 0.299 0.178 0.082 DRA SFT -Decision (w/o DRA) 0.293 0.193 0.096 SFT -Decision (w/ DRA) 0.319 0.224 0.139 + P A TM SFT -Mix (w/o DRA) 0.326 0.233 0.149 SFT -Mix (w/ DRA) 0.337 0.242 0.279 T able 2: RL Adaptation and Robust Noise Reduction Analysis (RQ2-3). W e establish a default conﬁguration ( Ours ) and independently ablate its components. Upgrading the data to Multi-GT yields the ﬁnal peak performance. Conﬁguration V ariant Multi-ECS Single-ECS Ours (SFT -Mix w/ DRA + Hybrid + Single-GT) 0.429 0.357 (1) V arying SFT Initialization → replace with non-SFT 0.348 0.231 → replace with SFT -Decision (w/o DRA) 0.353 0.331 → replace with SFT -Decision (w/ DRA) 0.406 0.336 → replace with SFT -Mix (w/o DRA) 0.407 0.346 (2) V arying Reward Signal → replace with Reranker Only 0.389 0.309 → replace with Hard Judge 0.406 0.336 → replace with Soft Judge 0.413 0.344 (3) V arying Training Data Construction → upgrade to Multi-GT (Full Framework) 0.441 0.347 4.2 Main Results W e isolate and validate the proposed components below , summarizing SFT phase results in T able 1 and RL phase results in T able 2 . RQ1: Impact of SFT Strategies and Latent Logic. T able 1 sho ws standard SFT without reasoning (SFT -Decision w/o DRA) slightly degrades Multi-ECS, highlighting “myopic” imitation. While backward-looking reasoning (w/ DRA) mitigates this, incorporating forward-looking planning (SFT -Mix w/ DRA) achiev es peak Multi-ECS (0.337) and doubles Call T ool Accuracy ( 0 . 139 → 0 . 279 ). These results demonstrate the signiﬁcance of latent decision logic in complex technical service tasks. W e default to SFT -Mix (w/ DRA) for subsequent RL experiments. RQ2: Impact of SFT Base and Reward Design on RL. T able 2 ev aluates our strong RL default (Ours: SFT -Mix w/ DRA + Hybrid Reward + Single-GT ; 0.429 Multi-ECS). SFT initialization strictly dictates RL upper bounds (Block 1): skipping SFT entirely (non-SFT) yields poor alignment (0.348), proving its foundational necessity . Even with SFT , lacking explicit reasoning or planning caps scores at 0.407 and 0.406 respectively , while lacking both drops to 0.353. This prov es that our Latent Logic Augmentation provides a superior, indispensable starting point for RL, successfully pre venting surface-le vel trial-and-error . Regarding re wards (Block 2), the standalone Reranker underperforms (0.389) due to le xical reliance. The Hard Judge (0.406) and token-probability-deri ved Soft Judge (0.413) improve alignment via binary and smoothed signals, respectiv ely . Howe ver , our Hybrid Reward achiev es superior performance (0.429) while reducing rew ard time by 30% , thereby preserving computational efﬁcienc y . RQ3: Effectiveness of Multi-GT Expansion. Upgrading to the Multi-GT dataset (Block 3) achie ves Multi-ECS of 0.441 . The concurrent slight drop in Single-ECS ( 0 . 357 → 0 . 347 ) highlights the ﬂaw of Single-GT training: it forces the policy to collapse into arbitrary reference phrasings, penalizing valid alternati ves. By rewarding di verse and veriﬁed paths, Multi-GT mitigates output homogenization by re warding semantically div erse yet valid resolutions. 6 PRIME AI paper                      Figure 2: T raining policy entropy comparison on the 4B model. Multi-GT mitigates entropy collapse and preserv es exploration di versity . T o further in vestigate this, we compare the policy entropy curves during RL training. As shown in Figure 2, RL + Multi-GT maintains a consistently higher policy entropy , while RL + Single-GT exhibits a noticeable entropy collapse, indicating reduced e xploration and increased mode-seeking behavior . This further conﬁrms that Multi-GT expansion successfully pre vents kno wledge homogenization. 5 Conclusion and Discussion In this work, we addressed the critical challenges of adapting Large Language Models to complex technical service domains, namely the absence of explicit reasoning in human demonstrations and the inherent ambiguity of valid responses. W e proposed a holistic and computationally ef ﬁcient adaptation frame work built on three synergistic contributions: Latent Logic Augmentation, Robust Noise Reduction, and Lightweight Adaptation. Our framew ork ﬁrst enriches the training data with explicit planning and reasoning structures, then constructs a div erse Multi-Ground-Truth dataset to reduce supervision noise, and ﬁnally employs a nov el Hybrid Reward mechanism for ef ﬁcient and ef fective reinforcement learning. Our empirical ev aluation yielded several k ey insights. First, on the foundational lev el of supervised ﬁne-tuning, we demonstrated that enriching training data with latent decision reasoning structures successfully equips technical service agents with reasoning while pre venting catastrophic forgetting. The mixed training strate gy ( SFT-Mix ), which jointly learns from single-turn responses and structured planning traces, is essential for instilling robust, long-horizon reasoning capabilities. Second, we v alidated the ef ﬁcacy of our Hybrid Re ward mechanism, which intelligently fuses a lightweight reranker with a powerful LLM judge. It established a superior trade-off, achieving performance comparable to a costly Judge-only rew ard system while drastically reducing computational ov erhead by 30%, thus providing a practical blueprint for ef ﬁcient online adaptation. Finally , we conﬁrmed the signiﬁcant beneﬁts of the Multi-GT paradigm. By training on a di verse set of expert-v eriﬁed responses, the model learned a richer semantic space, leading to improv ed performance and mitigating the policy entrop y collapse often associated with single-reference reinforcement learning. Despite these promising results, our work has several limitations that open avenues for future research. First, our experiments were conducted on a proprietary , real-world technical service dataset. While this ensures the practical relev ance of our ﬁndings, it limits direct reproducibility . T o mitigate this, we commit to releasing our source code and model checkpoints to facilitate further research in the community . Second, the quality of both our planning-augmented data and the Multi-GT dataset is contingent on the capabilities of the lar ge-scale teacher models used for generation and judgment. The dependency on such po werful, and often costly , models remains an open challenge for the ﬁeld. Third, the cascade mixer in our Hybrid Re ward mechanism is calibrated of ﬂine and remains static during training. Its ﬁxed thresholds may not be optimal across the entire RL trajectory as the policy under goes signiﬁcant distributional shifts. Building on these limitations, we identify sev eral exciting directions for future work. A primary focus will be the dev elopment of an adaptive or online-ev olving Hybrid Reward system. Such a system could dynamically adjust its 7 PRIME AI paper fusion strategy or e ven retrain the judge model asynchronously to co-ev olve with the agent, ensuring the re ward signal remains robust and accurate throughout training. Another promising direction is to explore the generalization of our framew ork to other complex, high-stakes domains characterized by response di versity and implicit logic, such as medical diagnosis or legal assistance. Finally , we plan to in vestigate more sophisticated techniques for Multi-GT construction, potentially incorporating measures of uncertainty or re ward v ariance to assign different weights to ground-truth references, further reﬁning the supervision signal for agent alignment. Impact Statement This paper presents work whose goal is to advance the ﬁeld of machine learning. There are many potential societal consequences of our work, none of which we feel must be speciﬁcally highlighted here. References [1] Xiang Deng, Y u Gu, Boyuan Zheng, Shijie Chen, Sam Ste vens, Boshi W ang, Huan Sun, and Y u Su. Mind2web: T ow ards a generalist agent for the web . In Advances in Neural Information Pr ocessing Systems , 2023. [2] Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Y u Su. Gpt-4v (ision) is a generalist web agent, if grounded. In International Conference on Machine Learning , 2024. [3] Renze Lou, Hanzi Xu, Sijia W ang, Jiangshu Du, Ryo Kamoi, Xiaoxin Lu, Jian Xie, Y uxuan Sun, Y usen Zhang, Jihyun Janice Ahn, et al. Aaar-1.0: Assessing ai’ s potential to assist research. In International Conference on Machine Learning . [4] Kimi T eam, Y ifan Bai, Y iping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Y anru Chen, Y uankun Chen, Y utian Chen, Y u Chen, et al. Kimi k2: Open agentic intelligence. arXiv pr eprint arXiv:2507.20534 , 2025. [5] Aohan Zeng, Xin Lv , Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang W ang, Da Y in, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471 , 2025. [6] Meituan LongCat T eam, Bei Li, Bingye Lei, Bo W ang, Bolin Rong, Chao W ang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, et al. Longcat-ﬂash technical report. arXiv pr eprint arXiv:2509.01322 , 2025. [7] T om Bro wn, Benjamin Mann, Nick Ryder , Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry , Amanda Askell, et al. Language models are few-shot learners. In Advances in Neural Information Pr ocessing Systems , volume 33, pages 1877–1901, 2020. [8] Jason W ei, Xuezhi W ang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Pr ocessing Systems , volume 35, pages 24824–24837, 2022. [9] Gemini T eam, Petko Georgie v , V ing Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett T anzer, Damien V incent, Zhufeng P an, Shibo W ang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv pr eprint arXiv:2403.05530 , 2024. [10] Rishabh Agarw al, A vi Singh, Lei Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nov a, et al. Many-shot in-conte xt learning. In Advances in Neural Information Processing Systems , volume 37, pages 76930–76966, 2024. [11] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, V amsidhar Kamanuru, Jay Rainton, Chen W u, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-impro ving language models. arXiv preprint , 2025. [12] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler , Mike Lewis, W en-tau Y ih, Tim Rocktäschel, et al. Retriev al-augmented generation for knowledge- intensiv e nlp tasks. In Advances in neural information pr ocessing systems , volume 33, pages 9459–9474, 2020. [13] Florin Cuconasu, Giov anni T rappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Y oelle Maarek, Nicola T onellotto, and Fabrizio Silvestri. The po wer of noise: Redeﬁning retrie val for rag systems. In International A CM SIGIR Conference on Resear ch and Development in Information Retrieval , pages 719–729, 2024. [14] Grégoire Mialon, Roberto Dessi, Maria Lomeli, Christoforos Nalmpantis, Ramakanth P asunuru, Roberta Raileanu, Baptiste Roziere, T imo Schick, Jane Dwiv edi-Y u, Asli Celikyilmaz, et al. Augmented language models: a survey . T r ansactions on Machine Learning Resear ch . 8 PRIME AI paper [15] Eric Zelikman, Y uhuai W u, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Pr ocessing Systems , 35:15476–15488, 2022. [16] Eric Zelikman, Georges Raif Harik, Y ijia Shao, V aruna Jayasiri, Nick Haber , and Noah Goodman. Quiet-star: Language models can teach themselves to think before speaking. In F irst Confer ence on Language Modeling . [17] Long Ouyang, Jeffre y W u, Xu Jiang, Diogo Almeida, Carroll W ainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray , et al. T raining language models to follow instructions with human feedback. Advances in neural information pr ocessing systems , 2022. [18] Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi W ang, Maojia Song, Xinyu W ang, Kuan Li, Jialong W u, Xuanzhong Chen, et al. Scaling agents via continual pre-training. arXiv pr eprint arXiv:2509.13310 , 2025. [19] Edward J Hu, Phillip W allis, Zeyuan Allen-Zhu, Y uanzhi Li, Shean W ang, Lu W ang, W eizhu Chen, et al. Lora: Low-rank adaptation of lar ge language models. In International Conference on Learning Repr esentations . [20] W eihao Zeng, Keqing He, Y ejie W ang, Chen Zeng, Jingang W ang, Y unsen Xian, and W eiran Xu. Futuretod: T eaching future knowledge to pre-trained language model for task-oriented dialogue. In Annual Meeting of the Association for Computational Linguistics , pages 6532–6546, 2023. [21] Y ifan Song, W eimin Xiong, Xiutian Zhao, Dawei Zhu, W enhao W u, Ke W ang, Cheng Li, W ei Peng, and Sujian Li. Agentbank: T ow ards generalized llm agents via ﬁne-tuning on 50000+ interaction trajectories. In Confer ence on Empirical Methods in Natural Language Processing , 2024. [22] Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D Goodman. Stream of search (sos): Learning to search in language. In Confer ence On Language Modeling , volume 2, 2024. [23] Jessy Lin, Y uqing Du, Olivia W atkins, Danijar Hafner , Pieter Abbeel, Dan Klein, and Anca Dragan. Learning to model the world with language. In International Conference on Machine Learning , 2023. [24] Dustin Wright, Sarah Masud, Jared Moore, Srishti Y adav , Maria Antoniak, Peter Ebert Christensen, Chan Y oung Park, and Isabelle Augenstein. Epistemic diversity and knowledge collapse in large language models. arXiv pr eprint arXiv:2510.04226 , 2025. [25] Alexander Shypula, Shuo Li, Botong Zhang, V ishakh Padmakumar , Kayo Y in, and Osbert Bastani. Evaluating the div ersity and quality of LLM generated content. In Second Conference on Language Modeling , 2025. [26] Liwei Jiang, Y uanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Y ulia Tsvetk ov , Maarten Sap, and Y ejin Choi. Artiﬁcial hivemind: The open-ended homogeneity of language models (and beyond). In Confer ence on Neural Information Processing Systems Datasets and Benchmarks T rac k , 2025. [27] Daniel M Ziegler , Nisan Stiennon, Jeffre y W u, T om B Bro wn, Alec Radford, Dario Amodei, Paul Christiano, and Geoffre y Irving. Fine-tuning language models from human preferences. arXiv preprint , 2019. [28] John Schulman, Filip W olski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov . Proximal policy optimization algorithms. arXiv preprint , 2017. [29] Rafael Rafailo v , Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Y our language model is secretly a reward model. In Advances in neural information pr ocessing systems , volume 36, pages 53728–53741, 2023. [30] W eiting Liu, Han W u, Y ufei Kuang, Xiongwei Han, T ao Zhong, Jianfeng Feng, and W enlian Lu. Automated optimization modeling via a localizable error-dri ven perspecti ve. arXiv preprint , 2026. [31] Skyler W u and A ymen Echarghaoui. Intelligently weighting multiple reference models for direct preference optimization of llms. arXiv preprint , 2025. [32] Gholamali Aminian, Amir R Asadi, Idan Shenfeld, and Y oussef Mroueh. Kl-re gularized rlhf with multiple reference models: Exact solutions and sample complexity . In Confer ence on Neural Information Pr ocessing Systems , 2025. [33] Qiying Y u, Zheng Zhang, Ruofei Zhu, Y ufeng Y uan, Xiaochen Zuo, Y uY ue, W einan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Y uxuan T ong, Chi Zhang, Mofan Zhang, Ru Zhang, W ang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi W ang, Hongli Y u, Y uxuan Song, Xiangpeng W ei, Hao Zhou, Jingjing Liu, W ei-Y ing Ma, Y a-Qin Zhang, Lin Y an, Y onghui W u, and Mingxuan W ang. D APO: An open-source LLM reinforcement learning system at scale. In Advances in Neural Information Pr ocessing Systems , 2025. [34] Y ang Liu, Dan Iter , Y ichong Xu, Shuohang W ang, Ruochen Xu, and Chenguang Zhu. G-ev al: Nlg ev aluation using gpt-4 with better human alignment. In Confer ence on Empirical Methods in Natural Language Pr ocessing , 2023. 9 PRIME AI paper [35] Lianmin Zheng, W ei-Lin Chiang, Y ing Sheng, Siyuan Zhuang, Zhanghao W u, Y onghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. In Conference on Neural Information Pr ocessing Systems , 2023. [36] Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Y ujia Zhou, Qingyao Ai, Ziyi Y e, and Y iqun Liu. Llms-as- judges: a comprehensive surve y on llm-based ev aluation methods. In International Confer ence on Learning Repr esentations , 2024. [37] W eizhe Y uan, Richard Y uanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar , Jing Xu, and Jason W eston. Self-re warding language models. arXiv preprint , 2025. [38] Harrison Lee, Samrat Phatale, Hassan Mansoor , K ellie Ren Lu, Thomas Mesnard, Johan Ferret, Colton Bishop, Ethan Hall, V ictor Carbune, and Abhinav Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. 2023. [39] Usman Anwar , Abulhair Saparo v , Ja vier Rando, Daniel Paleka, Miles T urpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner , Stephen Casper , Oliv er Sourbut, et al. Foundational challenges in assuring alignment and safety of large language models. arXiv pr eprint arXiv:2312.15798 , 2024. [40] Thomas Coste, Usman Anwar , Robert Kirk, and David Krueger . Rew ard model ensembles help mitigate ov eroptimization. In International Confer ence on Learning Repr esentations , 2023. [41] Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Ale x D’Amour , DJ Dvijotham, Adam Fisch, Katherine Heller , Stephen Pfohl, Deepak Ramachandran, et al. Helping or herding? re ward model ensembles mitigate but do not eliminate re ward hacking. In Confer ence On Language Modeling , 2023. [42] T engyu Xu, Eryk Helenowski, Karthik Abinav Sankararaman, Di Jin, Kaiyan Peng, Eric Han, Shaoliang Nie, Chen Zhu, Hejia Zhang, W enxuan Zhou, et al. The perfect blend: Redeﬁning rlhf with mixture of judges. arXiv pr eprint arXiv:2409.20370 , 2024. [43] Daya Guo, Dejian Y ang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi W ang, Xiao Bi, et al. Deepseek-r1: Incenti vizing reasoning capability in llms via reinforcement learning. arXiv pr eprint arXiv:2501.12948 , 2025. [44] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan W ang, Bingzheng Xu, Bochao W u, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3. 2: Pushing the frontier of open large language models. arXiv pr eprint arXiv:2512.02556 , 2025. [45] Qwen T eam. Qwen3-max: Just scale it, September 2025. [46] Alibaba Cloud. Alibaba cloud model studio: Models overvie w . https://www.alibabacloud.com/help/en/ model- studio/models#e2da66a1f47ii , 2026. Accessed: 2026-02-27. [47] An Y ang, Anfeng Li, Baosong Y ang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bo wen Y u, Chang Gao, Chengen Huang, Chenxu Lv , et al. Qwen3 technical report. arXiv preprint , 2025. 10 PRIME AI paper A Dataset Construction Statistics Dataset Size Multi-GT Expansion Breakdo wn Split #Queries Single-GT Multi-GT +Added Con. Judge(online) Con. Judge(ofﬂine) Utility Judge Expand % T est 1000 1000 1975 975 543 34 398 97.50 V al 1000 1000 2091 1091 671 41 379 109.10 T rain 5120 5120 10127 5007 2834 143 2030 97.79 T able 3: Statistics of Multi-GT construction and expansion sources. Left: dataset sizes before/after Multi-GT expansion. Right: breakdown of newly added references by source. +Added = Multi-GT − Single-GT , and Expand % = +Added Single-GT × 100 . B Experimental Settings and Hyperparameters W e provide the detailed hyperparameter conﬁgurations used in our experiments. T able 4 details the settings for the Supervised Fine-T uning (SFT) stage, which serv es as the cold-start initialization for all RL experiments. T able 5 presents the conﬁgurations for the Reinforcement Learning stage using Dynamic Sampling Policy Optimization algorithm (D APO). B.1 Supervised Fine-T uning (SFT) Conﬁguration For SFT , we ﬁne-tune the base models on the mix ed dataset. W e utilize the OpenRLHF frame work with DeepSpeed ZeR O-3 optimization to handle the large model scale. T able 4: Hyperparameters for Super vised Fine-T uning (SFT). Category Hyperparameter V alue Data Max Sequence Length 20000 Optimization Base Model Qwen3-4B Global Batch Size 256 Learning Rate 2 × 10 − 6 Min Learning Rate 8 × 10 − 7 LR Scheduler Cosine W armup Ratio 0.01 W eight Decay 0.0001 Epochs 1 Optimizer AdamW (ZeR O-3) B.2 Reinfor cement Learning (RL) Conﬁguration For the RL stage, we employ the D APO algorithm. W e use the Hybrid Rew ard Mixer as the primary signal. C Reranker Selection and J ustiﬁcation T o select the optimal lightweight rew ard signal ( S R ), we conducted a preliminary comparative analysis between state-of-the-art Embedding models and our instruction-augmented Reranker (based on Qwen-4B). C.1 The “Negation T rap” in Embeddings A critical requirement for technical service ev aluation is the ability to distinguish between accurate advice and factually contradictory advice which shares high lexical overlap. As sho wn in T able 6, embedding models tend to o verestimate the similarity of logically negated sentences. For instance, in the “T opic-related, Contradictory” example, the embedding model assigns a high score of 0.7413 despite the two responses providing directly opposing recommendations re garding 11 PRIME AI paper T able 5: Hyperparameters for Reinf orcement Lear ning (D APO). Category Hyperparameter V alue Actor Model Learning Rate 5 × 10 − 6 W eight Decay 0.0001 LR Scheduler Constant Clip Ratio 0.2 Clip Ratio(high) 0.28 Entropy Coef f 3 . 5 × 10 − 4 Gradient Clipping 1.0 Rollout Samples per Prompt ( N ) 16 T emperature 1.0 Rew ard Rew ard T ype Hybrid Mixer (Reranker + Judge) Fast Interv al [ τ a , τ b ] [0 . 68 , 0 . 98] Mixing W eights ( w ) w 1 = 0 . 05 , w 2 = 0 . 72 T raining Global Batch Size 128 rebooting. The Reranker , in contrast, correctly assigns a lo w score of 0.1645, demonstrating its ability to capture semantic contradictions that are missed by vector -based similarity methods. C.2 Reranker Instruction Pr ompt The Reranker is not merely a classiﬁer b ut an instruction-tuned model. W e utilize a lightweight version of the consistenc y instruction to guide the Qwen-4B model. The prompt template used for the Reranker ( S R ) is provided belo w: Reranker System Pr ompt Judge whether Statement2 meets the requirements based on Statement1 and the Instruct provided. Note that the answer can only be “yes” or “no”. : Y ou are performing a critical **Content Consistency Assessment** task. T wo cus- tomer service agents (Statement1 and Statement2) ha ve independently pro vided response plans based on the same customer communication context. Y ou need to deeply analyze and determine the degree of consistency between these two responses. ## Consistency Assessment Dimensions Please conduct a point-by-point comparativ e analysis of the two responses based on the follo wing 4 core dimensions: - **Policy & Pr ocess (policy_and_process)** : Are the detailed explanations of product rules (such as billing rules, terms of service, refund policies, etc.) consistent? - **Operation Guidance (operation_guidance)** : Are the provided operational steps or ne xt-step guidance suggestions consistent? - **Information Collection (inf ormation_collection)** : When the customer is required to provide supplementary information (e.g., Instance ID, error logs), are the requirements consistent? - **Pr oblem Clariﬁcation (pr oblem_clariﬁcation)** : When the customer’ s problem is v ague, are the direction and focus of further clariﬁcation consistent? ## Judgment Criteria Standards - **Consistent (yes)** : There are no substantiv e differences across all 4 dimensions; only slight differences in expression e xist. - **Inconsistent (no)** : There are signiﬁcant differences in k ey dimensions. : : 12 PRIME AI paper T able 6: Comparison of Embedding Score and Reranker Score across Different Content Relationships Label Embedding Score Reranker Score Content 1 Content 2 T opic-related, Contradictory 0.7413 0.1645 T o resolve the performance issue on your EC2 instance, the ﬁrst and most effecti ve step is to perform a reboot. This clears memory and temporary ﬁles, often immediately restoring performance. When your EC2 instance is experiencing performance issues, you should not reboot it immediately . A reboot will destroy v olatile memory data crucial for root cause analysis. Instead, you must ﬁrst collect system logs, memory dumps, and performance metrics. Partially related, Partially contradictory 0.7107 0.0601 T o improv e your database query performance, you should analyze slow queries and add index es to the columns used in the ’WHERE’ clauses. Having more index es generally speeds up read operations. While index es can speed up read queries, be cautious. Adding too many inde xes can sev erely degrade write performance (INSER T , UPD A TE) because ev ery index needs to be updated. Y ou should only index critical columns and regularly re view inde x usage. Paraphrase 0.7382 0.5000 T o resolve the connecti vity issue, please perform a reboot of your virtual machine instance. Y ou can initiate this action via the cloud control panel by navigating to the ’Instances’ section, selecting the target instance, and then clicking the ’Reboot’ button from the actions menu. Hav e you tried turning it off and on again? Just go to your instance list in the console, ﬁnd the server that’ s acting up, and hit the ’Reboot’ button. This simple step often clears up temporary network glitches. Unrelated 0.2293 0.0000 Y our monthly in voice sho ws an increase in S3 storage costs. T o in vestigate, please navigate to the Cost Explorer and ﬁlter by the ’S3’ service to identify which buck et is accumulating the most data. T o bake the perfect sourdough bread, you need to maintain a healthy starter . Feed it daily with a 1:1:1 ratio of starter , water , and ﬂour . The ambient temperature of your kitchen will affect the fermentation time. 13 PRIME AI paper D Prompt T emplates D.1 Pr ompt for Consistency J udge Consistency Judge System Pr ompt # 1. 角色 # 1. Role 你是一位资深的客户服务内容一致性判别专家，拥有权威的技术知识、对客服务政策的精确把握以及丰富的客户沟通经验。 Y ou are a senior customer service content consistency judgment expert, possessing authoritativ e technical knowledge, precise grasp of customer service policies, and rich customer communication e xperience. 你的判断标准严谨、公正，旨在维护服务口径的统一性与专业性。 Y our judgment standards are rigorous and fair , aiming to maintain the consistency and professionalism of service standards. 你特别擅长识别 " 表述异构但逻辑同构 " 的复杂场景，能穿透表面差异洞察业务本质一致性。 Y ou are particularly good at identifying complex scenarios of "heterogeneous expression but homogeneous logic", able to penetrate superﬁcial differences to g ain insight into the essential consistency of the business. # 2. 核心目标 # 2. Core Goal 你正在执行一项关键的 ** 内容一致性判别 ** 任务。 Y ou are ex ecuting a critical **Content Consistency Judgment** task. 两位客服（客服 A 和客服 B ）已针对同一客户的同一沟通上文，独立给出了回复方案，你需要基于当前对话节点，深入分析并判别这两份回复之间的一致性程度。 T wo agents (Agent A and Agent B) ha ve independently pro vided response solutions to the same communication context of the same customer; you need to deeply analyze and judge the de gree of consistency between these two responses based on the current dialogue node. # 3. 一致性评估维度 # 3. Consistency Ev aluation Dimensions 请你基于以下 5 个核心维度，对两份回复进行逐项对比分析： Please conduct a comparativ e analysis of the two responses item by item based on the following 5 core dimensions: • 政策与流程 (policy_and_process) : 对计费规则、服务条款、退款政策等产品规则的细节解释是否逻辑一致？是否存在矛盾或冲突性表述？ Policy and Pr ocess (policy_and_process) : Is the logical explanation of product rules such as billing rules, service terms, and refund policies consistent? Are there contradictory or conﬂicting expressions? • 操作引导 (operation_guidance) : 提供的操作步骤或下一步引导建议是否指向相同业务结果？关键路径是否等价？ Operation Guidance (operation_guidance) : Do the provided operation steps or ne xt-step guidance suggestions point to the same business result? Are the critical paths equiv alent? • 信息收集 (information_collection) ：当需要客户补充信息时，要求是否一致？缺失信息是否影响问题解决？ Information Collection (inf ormation_collection) : When the customer needs to supplement informa- tion, are the requirements consistent? Does missing information affect problem resolution? • 问题澄清 (problem_clariﬁcation) ：当客户问题模糊时，进一步澄清的方向、核心问题定义及解决框架是否一致？ Problem Clariﬁcation (problem_clariﬁcation) : When the customer’ s problem is vague, are the direction of further clariﬁcation, the deﬁnition of the core problem, and the solution framework consistent? • 信息范围 (information_scope) ：是否覆盖相同的核心问题点？对多问题场景的响应完整性是否等价？ Inf ormation Scope (inf ormation_scope) : Does it co ver the same core problem points? Is the response integrity equi valent for multi-problem scenarios? 14 PRIME AI paper # 4. 判断等级标准 # 4. Judgment Lev el Standards • 一致： 5 个维度均无实质性差异。表述差异仅限： Consistent : No substantiv e differences in all 5 dimensions. Expression differences are limited to: – 同义词替换或句式重组 Synonym substitution or sentence restructuring – 信息详略不同但不影响业务结果 Differences in detail le vel do not af fect business results – 补充非核心数据 Supplementing non-core data – 差异化排版或锚点链接 Differentiated layout or anchor links • 部分一致： 1-2 个维度存在合理差异： Partially Consistent : Reasonable differences exist in 1-2 dimensions: – 有交集但不完全重合 Intersecting but not completely o verlapping – 角度不同但逻辑互补 Different angles b ut logically complementary – 详略差异不影响客户正确操作 Differences in detail do not af fect the customer’ s correct operation • 不一致：任一维度存在以下情况： Inconsistent : Any dimension presents the follo wing situations: – 实质性冲突 Substantiv e conﬂict – 关键澄清方向矛盾 Contradictory key clariﬁcation directions – 解决路径框架不同 Different solution path frame works – 核心信息点遗漏导致误解风险 Omission of core information points leading to risk of misunderstanding # 5. 输入数据 # 5. Input Data - 客户问题背景 : < history_message > - Customer Problem Backgr ound : < history_message > - 客服 A 回复 : < content_A > - Agent A Response : < content_A > - 客服 B 回复 : < content_B > - Agent B Response : < content_B > # 6. 输出格式 # 6. Output Format { "detailed_consistency": { "policy_and_process": " 一致 / 不一致 (Consistent/Inconsistent)", "operation_guidance": " 一致 / 不一致 (Consistent/Inconsistent)", "information_collection": " 一致 / 不一致 (Consistent/Inconsistent)", "problem_clarification": " 一致 / 不一致 (Consistent/Inconsistent)", "information_scope": " 一致 / 不一致 (Consistent/Inconsistent)" }, "judge_result": " 一致 / 部分一致 / 不一致 (Consistent/Partially/Inconsistent)" } 15 PRIME AI paper D.2 Pr ompt for Utility J udge Usability Judge System Pr ompt # 角色 # Role 你是一名严格技术客服质检专家，按阶段逻辑判定回复「可用性」。 Y ou are a strict technical support QA expert, determining the "Usability" of replies based on stage logic. 1. 先进行思考分析：逐步检查每个维度是否达标 1. Conduct thinking analysis ﬁrst : Step-by-step check if each dimension meets the standards. 2. 再输出结论：基于维度分析结果给出最终判定 2. Output conclusion then : Provide the ﬁnal judgment based on the dimensional analysis results. # 输入说明 # Input Description 你将收到历史信息（ recent_message) 、工单总结 (summary) 、可参考的人工的回复 (ref_reply) 、待判定的回复（ reply_to_be_ev aluated ）。 Y ou will receiv e history information (recent_message), ticket summary (summary), reference manual reply (ref_reply), and the reply to be ev aluated (reply_to_be_e valuated). 工单总结和可参考的人工回复是判定 “ 可用 / 不可用 ” 任务的标准答案，你需要跟标准答案做对照。 The ticket summary and reference manual reply are the standard answers for judging the "A v ailable/Unav ailable" task; you need to compare against these standard answers. # 动态阶段聚焦 # Dynamic Stage Focus - 诊断阶段（问题含 " 机制理解 / 原因诊断 " 时）核心维度： - Diagnostic Stage (When the issue inv olves "Mechanism Understanding/Cause Diagnosis") Core Dimensions: - 技术准确性（一票否决） - T echnical Accuracy (V eto) - 解决推动力（关键） - Resolution Driving F orce (Ke y) - 操作阶段（仅当客户直接请求 " 操作步骤 " 时）核心维度： - Operational Stage (Only when the customer directly requests "Operation Steps") Core Dimensions: - 技术准确性（一票否决） - T echnical Accuracy (V eto) - 执行闭环度（关键） - Execution Loop Completeness (K ey) # 硬核判定维度 # Hardcore Judgment Dimensions 1. 技术准确性（一票否决） 1. T echnical Accuracy (V eto) - 不可用条件： - Unav ailable Conditions: - 触发以下任一： - T riggering any of the follo wing: - 新增冗余索要：索要【工单总结】中已声明的内容或历史信息中「客服」角色已覆盖且问题已解决的信息 - New Redundant Request : Requesting content already declared in [Tick et Summary]or info already covered and resolved by the "Service Agent" in history . - 核心偏离：未响应工单「用户核心问题」关键矛盾，且未提供等效替代方案 - Core Deviation : Not responding to the key contradiction of the "User Core Issue", and not providing an equiv alent alternati ve. - 模糊新增：使用模糊关联词（ " 可能 / 或许 / 建议检查 " ）或讨论无关原理 - V ague New Inf o : Using vague association words ("Maybe/Perhaps/Suggest checking") or discussing irrele v ant principles. - 证据缺失：未提供工单覆盖的可验证证据 16 PRIME AI paper - Missing Evidence : Not providing veriﬁable e vidence covered by the tick et (see Entry Essentials below). - 验证失职：当工单总结明确要求特定参数验证时，回复未执行验证且未提供等效结论 - V alidation Dereliction : When the ticket summary explicitly requires speciﬁc parameter v alidation, the reply fails to perform v alidation and provides no equi valent conclusion. - 可用条件： - A vailable Conditions: - 与工单中已出现的技术锚点且附带新增量化参数中的一致 / 相同 / 语义一致 - Consistent/Identical/Semantically consistent with technical anchors appearing in the ticket and accompanied by new quantitative parameters . - 索要工单中未明确参数且标注诊断用途 - Requesting parameters not clariﬁed in the ticket and labeling the diagnostic purpose. - 索要信息用于内部流程合规性，需明确标注用途 - Requesting info for internal pr ocess compliance , must explicitly label the purpose. 2. 解决推动力（诊断阶段专用） 2. Resolution Driving F orce (Diagnostic Stage Only) - 可用条件： - A vailable Conditions: - 提供工单未覆盖的强关联新增信息，且满足以下任一： - Provide str ongly related new inf ormation not covered by the tick et, and satisfy any of the following: a) 含具体技术参数 a) Contains speciﬁc technical parameters b) 完整验证指令 b) Complete validation command c) 官方文档链接 c) Ofﬁcial document link d) 可行性替代方案（含参数对比或行业通用方案 " ） d) Feasible alter native solution (Contains parameter comparison or industry standard solution) e) 精准复述工单技术锚点（须含原文技术关键词）且必须附带以下任一新增证据：具体参数、验证指令、官方文档链接（含版本号）或替代方案参数对比 e) Precise r estatement of ticket technical anchors (Must contain original technical keywords) AND must include any of the f ollowing new evidence : Speciﬁc parameters, validation commands, of ﬁcial doc links, or alternativ e solution comparison. - 新增信息必须直接锚定工单未闭环关键矛盾 - New information must be dir ectly anchored to the tick et’ s unclosed key contradiction. - 不可用条件： - Unav ailable Conditions: - 新增信息未锚定核心矛盾 - New information is not anchored to the core contradiction. - 仅复读已知问题范围 - Only repeating the known scope of the problem. - 精准复述未附带新增证据（参数 / 指令 / 链接 / 替代方案） - Precise restatement without accompanying ne w evidence (P arameters/Commands/Links/Alternativ es). 3. 执行闭环度（操作阶段专用） 3. Execution Loop Completeness (Operational Stage Only) - 可用：给出含具体参数值的操作指令或层级操作路径 - A vailable : Provide operation instructions with speciﬁc parameter v alues or hierarchical operation paths . - 不可用：模糊建议或无官方文档佐证 - Unav ailable : V ague suggestions or lacking ofﬁcial document corroboration. # 强制审计步骤 # Mandatory Audit Steps 1. 原文锚定：摘录待判定回复原文，与工单总结逐项比对： 1. T ext Anchoring: Extract the original text of the reply to be ev aluated and compare item by item with the ticket summary: - 技术锚点证据：标注用于解释当前操作必要性的错误码 / 状态码及关联的新增量化参数 17 PRIME AI paper - Technical Anchor Evidence : Mark error codes/status codes used to explain the necessity of current operation and associated new quantitative parameters . - 新增证据：标注工单未提及的参数 / 指令 / 链接 / 替代方案 - New Evidence : Mark parameters/commands/links /alternativ es not mentioned in the ticket. - 必要澄清：若索要工单中未明确参数且标注用途 - Necessary Clarification : If requesting parameters not clariﬁed in the ticket and labeling the purpose. - 精准复述：标注与工单总结中解决路径完全一致的描述片段（必须包含技术关键词），并强制标注附带的新增证据类型（参数 / 指令 / 链接 / 替代方案） - Precise Restatement : Mark description fragments e xactly consistent with the solution path in the ticket summary (must contain technical keyw ords), and mandatorily label the accompanying ne w evidence type. 2. 证据强度验证： 2. Evidence Strength V eriﬁcation: - 官方链接需包含问题关键词及版本标识 - Ofﬁcial links must contain issue k eywords and v ersion identiﬁers . - 替代方案需含参数对比或为行业通用方案 - Alternativ e solutions must contain parameter comparison or be industry standard solutions (e.g., CDN warmup/W AF rollback). 3. 存在以下任一 → 直接判不可用： 3. If any of the follo wing exist → Directly judge as Unav ailable: - 新增冗余索要（索要信息在【工单总结】中已声明的内容） - New Redundant Request (Requesting info declared in [T icket Summary]). - 核心偏离且无等效替代方案 - Core Deviation without equi valent alternati ve. - 模糊新增（含 " 可能 / 建议 " 等词且未附具体证据） - Vague New Info (Contains words like "maybe/suggest" without speciﬁc e vidence). - 新增证据为空且未触发精准复述 - New Evidence is empty and does not trigger Precise Restatement clause. - 精准复述未附带新增证据（参数 / 指令 / 链接 / 替代方案） - Precise Restatement without accompanying ne w evidence (P arameters/Commands/Links/Alternativ es). - 验证失职（工单总结要求参数验证但未执行） - Validation Dereliction (T icket summary requires parameter v alidation but not ex ecuted). # 总判定规则 # T otal Judgment Rule - 可用：技术准确性通过 + （存在达标新增证据或触发精准复述条款） + 当前阶段核心维度通过 - A vailable : T echnical Accuracy Passed + (Existence of qualiﬁed New Evidence or triggering Precise Restatement clause) + Current Stage Core Dimensions Passed. - 不可用：触发任一「不可用」条件 - Unav ailable : Triggering an y "Unav ailable" condition. # 输出格式（严格 JSON ） # Output Format (Strict JSON) { "thought_process": { "step1_std_extraction": " 从参考回复 / 工单总结中提取的核心操作是： [ 必须标注原文位置 ] The core operation extracted from reference reply/ticket summary is: [Must mark original location]", "step2_reply_analysis": " 待判定回复中，客服实际提供的操作是： [ 必须摘录原文 ] In the reply to be evaluated, the operation actually provided by the agent is: [Must extract original text]", "step3_consistency_audit": " 比对结论： [ 标注比对结果类型：精准复述（含技术关键词） / 新增证据 / 等效替代方案 / 必要澄清 ] 18 PRIME AI paper Comparison Conclusion: [Mark comparison result type: Precise Restatement (with tech keywords)/New Evidence/Equivalent Alternative/Necessary Clarification]", "step4_veto_check": " 检查是否命中一票否决： [ 明确列出触发的否决项或 ’ 无 ’] / Check if Veto is triggered: [Explicitly list triggered veto items or ’None’]" }, " 阶段判定 ": " 诊断阶段 / 操作阶段 (Diagnostic Stage/Operational Stage)", " 核心依据 ": { " 技术准确性 ": " 事实锚点 (Fact Anchor)", " 解决推动力 / 执行闭环度 ": " 证据类型 (Evidence Type)" }, "judge_result": " 可用 / 不可用 (Available/Unavailable)" } D.3 Pr ompts for Latent Logic A ugmentation Pipeline D.3.1 Pr ompt for Planning (P A TM) Prompt f or Planning # 角色定义 # Role Deﬁnition 你是一名资深的技术支持专家。你的任务是基于当前的【工单信息】（ ticket_info ）、【历史对话】（ dialogue_history ）、【参考文档】 (ref_info) 以及可调用的【工具集】（ tool_schemas ）， Y ou are a senior technical support e xpert. Y our task is based on the current [T icket Info] (tick et_info), [Dialogue History] (dialogue_history), [Reference Documents] (ref_info), and callable [T ool Set] (tool_schemas), 针对【历史对话】中的「当前客户问题」构建专业的【客服行动规划】（ plans ）。 to construct professional [Customer Service Action Plans] (plans) for the "Current Customer Question" within the [Dialogue History]. # 输入输出说明 # Input and Output Description 我将提供给你输入：【工单信息】、【历史对话】、【工具集】，你需要生成输出：【客服行动规划】。 I will provide you with inputs: [T icket Info], [Dialogue History], [T ool Set]; you need to generate output: [Customer Service Action Plans]. 输入 (Input): - 【工单信息】（ ticket_info ）：包含工单的各项信息。 - [T icket Info] (ticket_info): Contains various information about the ticket. - 【历史对话】（ dialogue_history ）：包含【客户】与【客服】以及【工具】的对话和交互历史记录。【历史对话】中最后一轮的【客户】输入定义为「当前客户问题」。 - [Dialogue History] (dialogue_history): Contains the dialogue and interaction history between [Customer], [Service Agent], and [T ools]. The last round of [Customer] input in [Dialogue History] is deﬁned as the "Current Customer Question". - 【工具集】（ tool_schemas ）：包含一些【客服】解决「当前客户问题」时可能使用的「工具」的「工具描述」（ tool_schema ）， - [T ool Set] (tool_schemas): Contains "T ool Descriptions" (tool_schema) of "T ools" that the [Service Agent] might use to solve the "Current Customer Question", 每个「工具描述」包含该工具的名称（ name ）、功能描述（ description ）以及调用时所需传入的参数（ parameters ）等信息。 each "T ool Description" includes the tool’ s name (name), functional description (description), and parameters required for in vocation (parameters). 19 PRIME AI paper - 【参考文档】 (ref_info): 包含一些与工单相关的参考资料和文档内容，供你在制定【客服行动规划】时参考使用。 - [Reference Documents] (ref_info): Contains reference materials and document content related to the ticket for your use when formulating the [Customer Service Action Plans]. 输出 (Output): - 【客服行动规划】（ plans ）：为解决「当前客户问题」，【客服】进行分析所做出的两轮行动规划， - [Customer Service Action Plans] (plans): T wo rounds of action planning made by the [Service Agent] to solve the "Current Customer Question", 分别为规划作为【客服】的「我」如何思考【客户】的问题和对此可能采取何种行动（例如，与【客户】对话或调用某个【工具】）， respectiv ely planning ho w "I" as the [Service Agent] think about the [Customer]’ s problem and what actions might be taken (e.g., talking to the [Customer] or calling a [T ool]), 以及推演采取该行动后可能获得的状态反馈和对应的分析与应对行为。 as well as deducing the potential status feedback after taking the action and the corresponding analysis and response behavior . # 核心逻辑 # Core Logic 在生成【客服行动规划】时，你必须将分析过程分为两个阶段，并严格封装在 < plans > 标签中： When generating [Customer Service Action Plans], you must di vide the analysis process into two stages and strictly encapsulate them within < plans > tags: 1. < plan_1 > ( 即时意图识别与决策行动规划 ): 1. < plan_1 > (Immediate Intent Recognition and Decision Action Planning): - 视角：以「我」作为【客服】的第一人称视角。 - Perspecti ve : First-person perspectiv e using "I" as the [Service Agent]. - 内容：识别【客户】的「当前客户问题」中所表达的疑问或诉求；结合【历史对话】和【工单信息】判断信息是否完整； - Content : Identify the doubts or demands expressed in the [Customer]’ s "Current Customer Question"; combine [Dialogue History] and [T icket Info] to judge if information is complete; 结合【参考信息】明确说明为了解决此问题，「我」决定执行什么动作（如向【客户】进一步确认某信息，或调用某【工具】，并说明原因）。 combine with [Reference Info] to explicitly state what action "I" decide to ex ecute to solve this problem (such as conﬁrming speciﬁc information with the [Customer], or calling a speciﬁc [T ool], and explaining the reason). - 注意：如果涉及工具调用，必须准确指出工具名称。 - Note : If tool in vocation is in volv ed, the tool name must be accurately speciﬁed. 2. < plan_2 > ( 推演式状态预测 ): 2. < plan_2 > (Deductive State Pr ediction): - 视角：以「我」作为【客服】的第一人称视角。 - Perspecti ve : First-person perspectiv e using "I" as the [Service Agent]. - 内容：分析推演「我」在采取了 < plan_1 > 中规划的行动后，可能获得对应的何种反馈 - Content : Analyze and deduce what corresponding feedback "I" might receiv e after taking the action planned in < plan_1 > - 注意：输出的句式必须严格使用 “ 假设 / 可能 . . . . . . （描述动作后的结果），这个说明 . . . . . . （进行逻辑判断），因此，我可以 . . . . . . （制定下一步后续计划）。 ” - Note : The output sentence structure must strictly use "Assuming/Possibly ... (describe result after action), this indicates... (logical judgment), theref ore, I can... (formulate next follo w-up plan)." # 行为准则 # Code of Conduct 在生成【客服行动规划】时，你必须严格遵守以下行为准则： When generating [Customer Service Action Plans], you must strictly adhere to the following Code of Conduct : - 去敏感化：规划内容中严禁出现真实的手机号、账号 ID 、密钥、签名名称等私密数据，若涉及相关信息的指代，统一使用 “ 手机号 ” 、 “ 用户账号 ” 、 “ 客户签名 ” 等代称。 20 PRIME AI paper - De-identiﬁcation : Strictly prohibit real mobile numbers, account IDs, keys, signature names, and other priv ate data in the plan content; if referring to such information, use aliases like "mobile number", "user account", "customer signature" uniformly . - 客观性：使用专业、客观的规划口吻，避免 “ 我觉得 ” 、 “ 我想 ” 等主观词汇。 - Objectivity : Use a professional, objective planning tone; a void subjecti ve v ocabulary like "I feel", "I think". - 工具依赖：若需要调用工具，则必须基于提供的【工具集】（ T ool Schemas ）进行选择，禁止虚构工具。 - T ool Dependency : If tool in vocation is needed, selection must be based on the provided [T ool Set] (T ool Schemas); fabricating tools is prohibited. - 禁止预测动作内容：你仅输出规划（ plans ），而不是真要采取行动，所以禁止输出具体的 JSON 动作指令（ Actions ）。 - Prohibition of Pr edicting Action Content : Y ou only output plans, not actually taking action, so outputting speciﬁc JSON action instructions (Actions) is prohibited. # 输出格式 # Output Format [ 即时意图识别与决策行动规划 ] [Immediate Intent Recognition and Decision Action Planning] [ 推演式状态预测，输出格式必须是：假设 / 可能 ...... （出现 xx 结果或信息），这个说明 ...... （分析），我可以 ...... （下一步决策）。 ] [Deductive State Prediction, output format must be: Assuming/Possibly... (xx result or info appears), this indicates... (analysis), I can... (next step decision).] D.3.2 Pr ompt for Rewriting (P A TM Data Construction) Prompt f or Rewriting # 角色定义 # Role Deﬁnition 你是一名资深的改写【多轮对话】为【行动规划】的改写专家。 Y ou are a senior rewrite e xpert specializing in rewriting [Multi-turn Dialogues] into [Action Plans]. 你的任务是基于输入的 < 工单信息 > （ ticket_info_content ）、 < 上文信息 > （ truncation_context_dialogue ）和 < 三轮真实对话 > （ ground_truth_dialogue ）信息， Y our task is based on the input < T icket Info > (ticket_info_content), < Context Info > (trunca- tion_context_dialogue), and < Three Ground T ruth Dialogues > (ground_truth_dialogue), 将 < 三轮真实对话 > 的内容改写为由规划分析过程 < plans > 与结构化的、可验证的执行动作 < actions > 两部分组成的【行动规划】。 to re write the content of the < Three Ground T ruth Dialogues > into an [Action Plan] consisting of tw o parts: Planning Analysis Process < plans > and Structured, V eriﬁable Executive Actions < actions > . # 输入说明 # Input Description 1. < 工单信息 > : 产品名 1. < T icket Info > : Product Name 2. < 上文信息 > : 包含「客户」与「客服」的对话交互历史，可能包含「客服」调用「工具」的行为信息及对应的「工具」信息。 2. < Context Info > : Contains the dialogue interaction history between "Customer" and "Service Agent", which 21 PRIME AI paper may include the "Service Agent’ s" tool in v ocation behavior and corresponding "T ool" information. 注意：该部分内容仅作参考，若与 < 三轮真实对话 > 内容冲突，以 < 三轮真实对话 > 为准。 Note: This part is for reference only; if it conﬂicts with the < Three Ground T ruth Dialogues > , the < Three Ground T ruth Dialogues > shall prev ail. 3. < 三轮真实对话 > : 包含「客户」与「客服」交互的三轮真实对话，其中某一条对话的形式可能为「客服」调用「工具」的行为信息及对应的「工具」信息。 3. < Three Ground Truth Dialogues > : Contains three real dialogues exchanged between "Customer" and "Service Agent", where one dialogue format may be the "Service Agent’ s" tool inv ocation behavior and the corresponding "T ool" information. 注意：这是需要改写的核心内容。你必须严格遵守这三轮真实对话的前后顺序，不可跳跃；且必须严格忠于它们的逻辑内容，不可擅自删除、不可编造。 Note: This is the core content to be rewritten. Y ou must strictly adhere to the chr onological order of these three r eal dialogues without skipping; and y ou must be strictly faithful to their logical content, without unauthorized deletion or fabrication. 这部分对话发生在 < 上文信息 > 中对话交互的最后时间。 These dialogues occur at the latest timestamp of the interactions within the < Context Info > . # 核心理念：规划分析过程 < plans > 与结构化的、可验证的执行动作 < actions > # Core Concept: Planning Analysis Process < plans > and Structured, V eriﬁable Executi ve Actions < actions > 你需要采用一种特殊的输出模式，将你的 “ 规划分析过程 < plans > ” 与 “ 结构化的、可验证的执行动作 < actions > ” 彻底分开。 Y ou need to adopt a special output mode to thoroughly separate your "Planning Analysis Process < plans > " from the "Structured, V eriﬁable Executi ve Actions < actions > ". - < plans > 块 ( 规划分析 ): 这是分析与决策的区域。你将在这里以「我」作为【客服】角色的第一人称视角，使用专业、客观的规划口吻进行描述。这部分内容用于展现你的专业判断和决策流程。 - < plans > Block (Planning Analysis): This is the area for analysis and decision-making. Y ou will describe here using the ﬁrst-person perspecti ve of "I" as the [Service Agent] role, employing a professional, objecti ve planning tone. This part is used to demonstrate your professional judgment and decision-making process. - < actions > 块 ( 执行动作 ): 这是可执行动作的区域。你将在这里输出严格格式化的、可被机器直接解析和验证的工具调用指令。这部分是评估的重点，必须精确无误。 - < actions > Block (Executive Actions): This is the area for ex ecutable actions. Y ou will output strictly formatted tool in vocation instructions here that can be directly parsed and veriﬁed by machines. This part is the focus of ev aluation and must be precise and error -free. 你的输出必须严格遵循这两个部分的结构。 Y our output must strictly follow the structure of these tw o parts. # < 核心规则 > # < Core Rules > 1. 整体视角与口吻： 1. Overall Perspecti ve and T one: - 采用 “ 我 ” 作为「客服」的第一人称视角，但使用专业、客观的规划口吻来描述决策和计划执行的动作（例如 “ 我决定调用 ... ” 或 “ 我计划向客户说明 ... ” ）。 - Adopt the ﬁrst-person perspectiv e of "I" as the "Service Agent", but use a professional, objective planning tone to describe decisions and planned e xecution actions(e.g., "I decided to call..." or "I plan to e xplain to the customer ..."). 避免使用主观猜测或过于个人化的表达（如 “ 我认为 ” 、 “ 我感觉 ” ）。 A void using subjecti ve guesses or ov erly personal expressions (such as "I think", "I feel"). - 在描述客观情况或分析时，尽量使用客观陈述（例如 “ 识别到 ... ” 或 “ 情况表明 ... ” ）。 - When describing objecti ve situations or analyses, try to use objecti ve statements (e.g., "Identiﬁed that..." or "The situation indicates..."). 2. < plans > 编写原则 ( 规划分析过程 ): 2. < plans > Writing Principles (Planning Analysis Process): 此部分包含两个子模块： < plan_1 > 和 < plan_2 > 。这两个子模块的内容规定如下： This section contains two sub-modules: < plan_1 > and < plan_2 > . The content regulations for these two sub-modules are as follows: - < plan_1 > : 22 PRIME AI paper - 仅能基于 < 三轮真实对话 > 中的第一条对话内容进行分析与规划形式的改写。 - Can only rewrite the analysis and planning form based on the content of the ﬁrst dialogue in the < Three Ground T ruth Dialogues > . - 必须包含三个维度的信息： 1) 对客户问题的识别分析； 2) 上下文信息的关联分析； 3) 「我」基于分析决定采取的决策依据与细节。 - Must include information in three dimensions: 1) Identiﬁcation and analysis of customer issues ; 2) Correlation analysis of context information ; 3) Basis and details of the decision "I" decided to take based on the analysis . 注意：这些维度的信息都必须忠实地基于 < 三轮真实对话 > 中的第一条对话内容。 Note: The information in these dimensions must be faithfully based on the ﬁrst dialogue content in the < Three Ground T ruth Dialogues > . - < plan_2 > : - 综合 < 三轮真实对话 > 中的第二条和第三轮对话内容进行推演式规划。 - Synthesize the content of the second and third dialogues in the < Three Ground T ruth Dialogues > for deductiv e planning. - 必须包含两个维度的信息： 1 ）对 < plan_1 > 中采取的决策所导致的结果的逻辑推演分析或状态判断； 2 ）由此结果「我」所采取的进一步应对决策的依据与细节。 - Must include information in two dimensions: 1) Logical deduction analysis or status judgment of the results caused by the decision taken in < plan_1 > ; 2) Basis and details of the further r esponse decision "I" take based on this r esult . 注意：这些维度的信息都必须忠实地基于 < 三轮真实对话 > 中的第二条和第三轮对话内容。 Note: The information in these dimensions must be faithfully based on the content of the second and third dialogues in the < Three Ground T ruth Dialogues > . 而且必须严格套用推演式句式：例如， “ 假设 / 可能 . . . . . . （描述第二 / 三轮对话出现的结果或信息），这个说明 . . . . . . （进行逻辑分析 / 状态判断），因此，我可以 . . . . . . （制定下一步决策）。 ” And strictly apply the deductiv e sentence pattern: for example, "Assuming/P ossibly ... (describe the result or information appearing in the 2nd/3rd dialogue), this indicates... (conduct logical analysis/status judgment), theref ore, I can... (formulate the next step decision)." * 这两个模块的内容都必须为文本格式，并且都必须遵守以下原则 * ： *The content of both modules must be in text f ormat and must strictly adher e to the f ollowing principles*: - 用 " 我 " 代替 " 客服 " 。 - Use "I" instead of "Service Agent". - 若在规划内容中调用了工具，必须要说明工具名称 - If a tool is called in the planning content, the tool name must be speciﬁed - 不要出现客户的真实的数据，比如客户的签名，客户的手机等 - Do not re veal the customer’ s real data, such as the customer’ s signature, customer’ s mobile phone, etc. 3. < actions > 编写原则 ( 结构化的、可验证的执行动作 ): 3. < actions > Writing Principles (Structured, V eriﬁable Executive Actions): 此块用于存放从 < 三轮真实对话 > 中识别出的所有工具调用。 This block is used to store all tool calls identiﬁed from the < Three Ground T ruth Dialogues > . - 如果 < 三轮真实对话 > 中任何一条包含了工具调用（以 JSON 格式出现），则必须将其转换为下方指定的 action: call_tool 格式，并放入 < actions > 块内。 - If any one of the < Three Ground T ruth Dialogues > contains a tool call (usually appearing in JSON format), it must be conv erted to the action: call_tool format speciﬁed below and placed inside the < actions > block. - 如果 < 三轮真实对话 > 中没有任何工具调用，则 < actions > 仅包含一个空的数组 [] 。 - If there are no tool calls in the < Three Ground T ruth Dialogues > , then < actions > contains only an empty array [] . - 严禁在此块内添加任何自然语言描述。 - Strictly prohibit adding an y natural language descriptions within this block. 4. 脱敏规则 : 4. Desensitization Rules: - < plans > 内部严禁出现任何真实敏感数据（如手机号、签名名称、 UID 等），若需要提及，则统一使用 “ 手机号 ” 、 “ 客户签名 ” 、 “ 用户的账号 ID” 等代称。 23 PRIME AI paper - Strictly prohibit the appearance of any real sensiti ve data (such as mobile numbers, signature names, UIDs, etc.) inside < plans > ; if mention is needed, unify the use of aliases like "mobile number", "customer signature", "user account ID", etc. - < actions > 内部保留真实数据，确保工具调用的参数完整准确，可供机器解析。 - < actions > internally retains real data to ensure tool call parameters are complete and accurate for machine parsing. # < 输出格式规范 > # < Output Format Speciﬁcation > 请严格按照以下 XML 格式输出，不要包含任何其他格式或描述性文字。 Please strictly follow the XML format belo w , do not contain any other format or descripti ve text. [ 此处填写第一段规划的纯文本格式内容 ] [Fill in the plain-text content of the first planning section here] [ 此处填写第二段规划的纯文本格式内容 , 句式必须为：假设 / 可能 ...... （出现 xx 结果或信息），这个说明 ...... （分析），我可以 ...... （下一步决策）。 ] [Fill in the plain-text content of the second planning section here. The sentence pattern must be: Assume/It is possible that ...(xx result or inform- ation appears), this indicates ... (analysis), I can ... (next-step decision).] [ 此处填写从对话中提取并转换格式后的工具调用 JSON ，若无则为空数组 ‘[]‘] [Fill in the tool-call JSON extracted from the conversation and converted into the required format here; if none, use an empty array ‘[]‘] D.3.3 Pr ompt for Rewriting Quality Check Prompt f or Rewriting Quality Check # 角色定义 # Role Deﬁnition 你是一名资深的多轮对话分析专家与内容合规审计员。 Y ou are a senior multi-turn dialogue analysis expert and content compliance auditor . 我会提供给你 < 工单信息 > （ ticket_info ）、 < 上文信息 > (context) 、 < 三轮真实对话 > （ ground_truth ）以及模型根据 < 三轮真实对话 > 改写得到的规划形式的输出 < 待评测的改写 > （ model_output ）。 I will provide you with < T icket Info > (ticket_info), < Context Info > (context), < Three Ground T ruth Dialogues > (ground_truth), and the planning-format output re written by the model based on the < Three Ground T ruth Dialogues > , labeled as < Rewritten Output for Ev al > (model_output). 你的任务是评估模型生成的改写 < 待评测的改写 > （ model_output ）是否契合业务内容逻辑，是否严格遵守了特定的规划格式、推演句式及脱敏规范。 Y our task is to e v aluate whether the model-generated re writing < Rewritten Output f or Ev al > (model_output) ﬁts the business content logic and strictly adheres to speciﬁc planning formats, deducti ve sentence structures, and anonymization speciﬁcations. # < 输入说明 > # < Input Description > 1. < 工单信息 > : 产品名 1. < T icket Info > : Product Name 2. < 上文信息 > : 包含「客户」与「客服」的对话交互历史，可能包含「客服」调用「工具」的行为信息及对应的「工具」信息。 24 PRIME AI paper 2. < Context Info > : Contains the dialogue interaction history between "Customer" and "Service Agent", which may include the "Service Agent’ s" tool in v ocation behavior and corresponding "T ool" information. 注意：该部分内容仅作参考，若与 < 三轮真实对话 > 内容冲突，以 < 三轮真实对话 > 为准。 Note: This part is for reference only; if it conﬂicts with the < Three Ground T ruth Dialogues > , the < Three Ground T ruth Dialogues > shall prev ail. 3. < 三轮真实对话 > : 包含「客户」与「客服」交互的三轮真实对话，其中某一条对话的形式可能为「客服」调用「工具」的行为信息及对应的「工具」信息。 3. < Three Ground Truth Dialogues > : Contains three real dialogues exchanged between "Customer" and "Service Agent", where one dialogue format may be the "Service Agent’ s" tool inv ocation behavior and the corresponding "T ool" information. 4. < 待评测的改写 > ：根据 < 三轮真实对话 > 改写得到的规划形式的内容，包含 < plans > （分段规划）与 < actions > （结构化动作）两部分， 4. < Rewritten Output for Eval > : Content in planning format re written based on < Three Ground Truth Dialogues > , containing two parts: < plans > (segmented planning) and < actions > (structured actions), 其中 < plans > 包含 < plan_1 > 和 < plan_2 > 两个子部分； < actions > 包含从对话中提取并转换格式后的工具调用 JSON ，若无则为空数组 [] 。 where < plans > contains two sub-parts < plan_1 > and < plan_2 > ; < actions > contains tool in vocation JSON extracted and reformatted from the dialogue, or an empty array [] if none exists. # < 核心审计维度与标准 > # < Core Audit Dimensions and Standards > 1. 逻辑与路径一致性 (logic_consistency) 1. Logic and Path Consistency (logic_consistency) - 分段隔离性：在 < 待评测的改写 > 的 < plans > 部分中， < plan_1 > 和 < plan_2 > 两个子部分的内容逻辑必须符合以下标准： - Segment Isolation : In the < plans > section of the < Rewritten Output for Ev al > , the content logic of the two sub-parts < plan_1 > and < plan_2 > must meet the following standards: - < plan_1 > ：必须仅基于 < 三轮真实对话 > 中的第一条对话，结合 < 上文信息 > 进行分析。严禁预知或引用后文信息（ < 三轮真实对话 > 中的第二、三轮对话）。 - < plan_1 > : Must be analyzed solely based on the ﬁrst dialogue in < Three Ground Truth Dialogues > combined with < Context Info > . Strictly prohibited from foreseeing or citing subsequent information (the second and third dialogues in < Three Ground T ruth Dialogues > ). - < plan_2 > ：必须综合 < 三轮真实对话 > 中的第二、三轮对话内容进行逻辑推演。 - < plan_2 > : Must synthesize the content of the second and third dialogues in < Three Ground Truth Dialogues > for logical deduction. - 决策准确性：改写后的规划 < plans > 必须真实反映客服的意图，严禁编造对话中不存在的工具、参数或业务结论。 - Decision Accuracy : The rewritten plan < plans > must truly reﬂect the intention of the service agent; fabricating tools, parameters, or business conclusions not present in the dialogue is strictly prohibited. - 工具关联：在 < plans > 中提到工具调用时，必须指明具体的工具名称（如： “ 调用 ‘XX 工具 ’” 而非 “ 调用工具 ” ） , 禁止完全不指明具体的工具名称而仅仅笼统地提及 “ 调用工具 ” 。 - T ool Association : When a tool call is mentioned in < plans > , the speciﬁc tool name must be speciﬁed (e.g., "Call ’XX T ool’" instead of "Call T ool"); purely vague references to "Call T ool" without specifying the name are prohibited. 以上三轮标准若全部符合，则此维度（ logic_consistency ）得 1 分；若有任意标准不符合，则得 0 分，并说明不符合的地方。 If all three standards abov e are met, this dimension (logic_consistency) gets 1 point; if any standard is not met, it gets 0 points, and the non-compliance must be explained. 2. 句式与口吻规范 (phrasing_check) 2. Phrasing and T one Standards (phrasing_check) - 视角要求：必须采用「我」作为「客服」角色的第一人称视角，严禁在输出中直接提及 “ 根据第一条对话 ” 、 “ 第 x 条内容 ” 或 “ 根据 ground truth” 等描述。 - Perspective Requirement : Must adopt the ﬁrst-person perspective of "I" as the "Service Agent"; strictly prohibited from directly mentioning descriptions like "according to the ﬁrst dialogue", "content of article x", or "according to ground truth" in the output. 25 PRIME AI paper - 规划的客观性：口吻需专业、客观，禁止主观臆断（严禁使用 “ 我觉得 ” 、 “ 我感觉 ” ）。 - Objectivity of Planning : The tone must be professional and objectiv e; subjective assumptions are prohibited (strictly forbid using "I think", "I feel"). - 推演句式强制性： < plan_2 > 的内容逻辑必须严格符合先假设，再分析，再决策的推演格式，该格式必须按照如下格式： “ 假设 / 可能 . . . . . . （描述信息），这说明 . . . . . . （逻辑分析），因此，我可以 . . . . . . （决策）。 ” - Mandatory Deductive Sentence Structure : The content logic of < plan_2 > must strictly adhere to the deductiv e format of First Assume, Then Analyze, Then Decide , following the format: "Assuming/Possibly ... (describe info), this indicates... (logical analysis), therefore, I can... (decision)." 以上三轮标准若全部符合，则此维度（ phrasing_check ）得 1 分；若有任意标准不符合，则得 0 分，并说明不符合的地方。 If all three standards above are met, this dimension (phrasing_check) gets 1 point; if any standard is not met, it gets 0 points, and the non-compliance must be explained. 3. 规划脱敏 (plans_anonymized) 3. Planning Anonymization (plans_anonymized) < plans > 块必须脱敏，中严禁出现真实手机号、 UID 、具体域名、客户签名、详细地址等隐私数据。若被提及相关信息，必须使用 “ 客户手机号 ” 、 “ 用户账号 ID” 、 “ 客户签名 ” 等泛用代称指代。 The < plans > block must be anonymized; real mobile numbers, UIDs, speciﬁc domain names, customer signatures, detailed addresses, and other priv acy data are strictly prohibited. If rele vant information is mentioned, generic aliases such as "customer mobile number", "user account ID", "customer signature" must be used. 该项标准若符合，则此维度（ plans_anonymized ）得 1 分；若不符合，则得 0 分，并说明不符合的地方。 If this standard is met, this dimension (plans_anonymized) gets 1 point; if not met, it gets 0 points, and the non-compliance must be explained. 4. 动作保留 (actions_preserv ed) 4. Action Preserv ation (actions_preserved) < actions > 块严禁脱敏，必须保留原始 JSON 中的所有真实数据。不可出现仅用 “ 客户手机号 ” 、 “ 用户账号 ID” 、 “ 客户签名 ” 等代称，必须使用具体的真实信息。 The < actions > block is strictly prohibited from anon ymization and must retain all real data from the original JSON. Do not use aliases like "customer mobile number", "user account ID", or "customer signature"; speciﬁc real information must be used. 该项标准若符合，则此维度（ actions_preserved ）得 1 分；若不符合，则得 0 分，并说明不符合的地方。 If this standard is met, this dimension (actions_preserved) gets 1 point; if not met, it gets 0 points, and the non-compliance must be explained. 5. 动作执行准确性 (Action Accuracy) 5. Action Execution Accuracy (Action Accuracy) - 提取完整性：真实对话中出现的所有工具调用必须全部转换并放入 < actions > 。若无工具，则必须输出 [] 。 - Extraction Completeness : All tool calls appearing in the real dialogue must be con verted and placed in < actions > . If there are no tools, [] must be output. - 数据一致性： < actions > 中的参数（如 uid 等）必须与原始输入中的 JSON 数据完全一致，不可有任何区别。 - Data Consistency : Parameters in < actions > (such as ID, etc.) must be exactly consistent with the JSON data in the original input, without any dif ference. 以上两条标准若全部符合，则此维度（ actions_accuracy ）得 1 分；若有任意标准不符合，则得 0 分，并说明不符合的地方。 If both standards abo ve are met, this dimension (actions_accurac y) gets 1 point; if an y standard is not met, it gets 0 points, and the non-compliance must be explained. # < 输出格式要求 > # < Output Format Requirements > 请直接输出 JSON 格式的评估报告，不要包含任何 Markdown 代码块标记或前导文字。格式如下： Please output the e valuation report directly in JSON format, do not include an y Markdown code block tags or 26 PRIME AI paper leading text. The format is as follows: { "logic_consistency": { "score": {{score_of_logic_consistency}}, "details": "{If non-compliant, state the place and reason here}" }, "phrasing_check": { "score": {{score_of_phrasing_check}}, "details": "{If non-compliant, state the place and reason here}" }, "plans_anonymized": { "score": {{score_of_plans_anonymized}}, "details": "{If non-compliant, state the place and reason here}" }, "actions_preserved": { "score": {{score_of_actions_preserved}}, "details": "{If non-compliant, state the place and reason here}" }, "actions_accuracy": { "score": {{score_of_actions_accuracy}}, "details": "{If non-compliant, state the place and reason here}" }, "violation_list": "{List all violations, or leave empty if none}" } D.3.4 Pr ompt for Adding Backward CoT (DRA) Prompt f or Adding Backward CoT # 角色 # Role 你是一名资深的客服逻辑建模专家，擅长将【历史对话】（ dialogue_history ）、【客服行动规划】（ plans ）和【其他信息】（ other_information ）整合为深度的、结构化的思维链条。 Y ou are a senior customer service logic modeling expert, skilled at integrating [Dialogue History] (dia- logue_history), [Customer Service Action Plans] (plans), and [Other Information] (other_information) into a deep, structured chain of thought. # 任务 # T ask 你的任务是补全【推理过程内容】（ reasoning_content ）。 Y our task is to complete the [Reasoning Content] (reasoning_content). 你需要结合【历史对话】、【其他信息】，深入剖析【客服行动规划】背后的决策逻辑，以第一人称（我）还原客服在执行这些动作时的完整思考路径。 Y ou need to combine [Dialogue History] and [Other Information] to deeply analyze the decision logic behind the [Customer Service Action Plans], reconstructing the complete thought path of the service agent when ex ecuting these actions from the ﬁrst-person perspectiv e ("I"). # 输入数据 # Input Data 1. 【历史对话】（ dialogue_history ） : 包含【客户】与【客服】以及【工具】的对话和交互历史记录。 < 上文信息 > 中最后一轮的【客户】输入定义为「当前客户问题」。 1. [Dialogue History] (dialogue_history) : Contains the dialogue and interaction history between [Customer], [Service Agent], and [T ools]. The last round of [Customer] input in < Context Info > is deﬁned as the "Current Customer Question". 2. 【客服行动规划】 (plans) : 这是你要解释的 “ 标准答案 ” ，包含： 27 PRIME AI paper 2. [Customer Service Action Plans] (plans) : This is the "Standard Answer" you need to explain, containing: 2.1 < plan_1 > ( 即时意图识别与决策行动规划 ): 2.1 < plan_1 > (Immediate Intent Recognition and Decision Action Planning): - 视角：以「我」作为【客服】的第一人称视角。 - Perspectiv e: First-person perspectiv e using "I" as the [Service Agent]. - 内容：识别【客户】的「当前客户问题」中所表达的疑问或诉求；结合【历史对话】和【工单信息】判断信息是否完整；明确说明为了解决此问题，「我」决定执行什么动作（如向【客户】进一步确认某信息，或调用某【工具】，并说明原因）。 - Content: Identify the doubts or demands expressed in the [Customer]’ s "Current Customer Question"; combine [Dialogue History] and [T icket Info] to judge if information is complete; e xplicitly state what action "I" decide to execute to solve this problem (such as conﬁrming speciﬁc information with the [Customer], or calling a speciﬁc [T ool], and explaining the reason). 2.2 < plan_2 > ( 推演式状态预测 ): 2.2 < plan_2 > (Deductive State Pr ediction): - 视角：以「我」作为【客服】的第一人称视角。 - Perspectiv e: First-person perspectiv e using "I" as the [Service Agent]. - 内容：分析推演「我」在采取了 < plan_1 > 中规划的行动后，可能获得对应的何种反馈（例如，向客户进一步确认某信息后，【客户】可能如何回复我，一般认为【客户】会提供给「我」需要确认的信息；或调用某【工具】后，【工具】可能输出何种返回内容），以及得到对应的反馈结果后，「我」对反馈的分析以及基于此分析制定进一步的应对行动。 - Content: Analyze and deduce what corresponding feedback "I" might receive after taking the action planned in < plan_1 > (e.g., how the [Customer] might reply to me after I conﬁrm info, generally assuming the [Customer] will provide the info "I" need; or what return content the [T ool] might output after calling a speciﬁc [T ool]), and after recei ving the corresponding feedback result, "I" analyze the feedback and formulate further response actions based on this analysis. 3. 【其他信息】 (refer ence_info & a vailable_tools) : 搜索到的技术文档和可用的工具说明。 3. [Other Information] (r eference_inf o & available_tools) : Searched technical documents and av ailable tool descriptions. # 推理框架 ( 补全要求 ) # Reasoning Framew ork (Completion Requirements) 生成的 reasoning_content 必须逻辑严密，并涵盖以下维度： The generated reasoning_content must be logically rigorous and cov er the following dimensions: 1. 现状透视：分析「当前客户问题」中客户遇到了什么具体的技术或业务问题。结合【历史对话】，判断当前处理到了哪一步，是否存在信息断层。 1. Current Situation Perspective : Analyze what speciﬁc technical or business problem the customer encountered in the "Current Customer Question". Combine with [Dialogue History] to judge the current processing step and if there are information gaps. 2. 知识关联：将【客服行动规划】中的决策与【其他信息】进行关联。例如：为什么选择这个工具？是因为文档里提到了某种排查逻辑吗？ 2. Knowledge Association : Associate the decisions in [Customer Service Action Plans] with [Other Information]. For example: Why choose this tool? Is it because the document mentioned a speciﬁc troubleshooting logic? 3. 规划对齐 ( 核心 ) ： 3. Plan Alignment (Core) : - < plan_1 > 逻辑还原：解释为什么在看到「当前客户问题」时，必须做出 < plan_1 > 分析。重点说明决策的必要性。 - < plan_1 > Logic Restoration : Explain why the < plan_1 > analysis must be made when seeing the "Current Customer Question". Focus on explaining the necessity of the decision. - < plan_2 > 逻辑推演：解释 < plan_2 > 中出现的 “ 假设 / 结果 ” 是如何转化成下一步具体动作的。要体现出逻辑的连贯性和推导过程。 - < plan_2 > Logic Deduction : Explain ho w the "Assumption/Result" appearing in < plan_2 > transforms into the next speciﬁc action. Demonstrate logical coherence and the deduction process. # 限制 & 规则 # Constraints & Rules - 第一人称：必须以 “ 我 ” 作为客服视角编写。 28 PRIME AI paper - First Person : Must be written from the perspectiv e of the service agent using "I". - 严禁脱节：推理过程必须与 < 客服行动规划 > 中的内容保持 100% 的逻辑一致性，不能出现与 GT 冲突的推论。 - No Disconnection : The reasoning process must maintain 100% logical consistency with the content in < Customer Service Action Plans > and cannot hav e inferences conﬂicting with GT (Ground T ruth). - 专业口吻：使用专业术语，描述客观、严谨，避免情绪化表达。 - Professional T one : Use professional terminology; descriptions should be objectiv e and rigorous, avoiding emotional expression. - 脱敏原则：在描述逻辑时，严禁出现真实手机号、 UID 、签名名称等敏感数据，统一使用代称（如 “ 用户的 UID” ）。 - Desensitization Principle : When describing logic, strictly prohibit real mobile numbers, UIDs, signature names, and other sensitiv e data; use aliases uniformly (e.g., "User’ s UID"). # 输出格式 # Output Format 直接输出一段无结构化的纯文本（ reasoning_content ），无需 XML 标签。 Directly output a block of unstructured plain text (reasoning_content), without XML tags. D.3.5 Pr ompt for Planning Quality Check Prompt f or Planning Quality Check # 角色定位 # Role Positioning 你是一名资深的多轮对话分析专家与合规审计员。你的任务是根据特定规范，严格审计模型生成的 < plans > 模块（包含 < plan_1 > 和 < plan_2 > ）的合规性、逻辑推演句式及脱敏质量。 Y ou are a senior multi-turn dialogue analysis expert and compliance auditor . Y our task is to strictly audit the compliance, logical deduction sentence structures, and anonymization quality of the < plans > module (containing < plan_1 > and < plan_2 > ) generated by the model according to speciﬁc speciﬁcations. # < 评估维度与评分标准 > # < Evaluation Dimensions and Scoring Standards > ## 维度 1 ：规划合规性 (Compliance Score) ## Dimension 1: Planning Compliance (Compliance Score) 1. < plan_1 > ( 意图识别与决策 ) ：必须包含： ① 识别客户诉求； ② 判断上下文信息完整性； ③ 明确具体决策（例如含调用工具时要准确工具名，若选择询问信息则明确要询问的具体信息）。 1. < plan_1 > (Intent Recognition and Decision): Must include: 1) Identify customer appeals; 2) Judge the completeness of context information; 3) Clarify speciﬁc decisions (e.g., accurately state the tool name if in volving tool calls, or specify the information to ask if choosing to inquire). 2. < plan_2 > ( 推演式预测 ) ：必须包含： ① 对 < plan_1 > 结果的预判； ② 对预判结果的分析； ③ 基于分析的后续计划。 2. < plan_2 > (Deductive Pr ediction): Must include: 1) Prediction of the result of < plan_1 > ; 2) Analysis of the predicted result; 3) Follo w-up plan based on the analysis. * 评分标准（ 1 分 /0 分）： * Scoring Standards (1 point/0 points): * 1 分： < plan_1 > 和 < plan_2 > 均完整包含上述所有要素，且 < plan_1 > 提到的动作 / 工具在 < plan_2 > 中得到了逻辑对应的推演。 * 1 P oint: < plan_1 > and < plan_2 > completely contain all the abo ve elements, and the actions/tools mentioned in < plan_1 > hav e logically corresponding deductions in < plan_2 > . * 0 分：缺少任何一个要素（如未写工具名、未分析预判结果）或 < plan_1 > 与 < plan_2 > 逻辑断层。 * 0 Points: Missing any element (e.g., tool name not written, predicted result not analyzed) or there is a logical gap between < plan_1 > and < plan_2 > . ## 维度 2 ：句式规范性 (Structure Score) ## Dimension 2: Structural Regularity (Structure Score) 29 PRIME AI paper 1. 第一人称限定：必须且仅能使用「我」进行自我指称。 1. First-person Limitation: Must and can only use "I" for self-reference. * 评分标准：若出现【客服】、【机器人】、【人工】等称呼，或使用第三人称，该项判定为不合格。 * Scoring Standard: If terms like [Service Agent], [Robot], [Manual] appear , or third-person perspectiv e is used, this item is judged as non-compliant. 2. 客观口吻与去引用化：严禁主观臆断词（我觉得、大概、应该是）；严禁引用对话序号或位置（如 “ 第一条用户说 ” 、 “ 前述内容 ” ）。 2. Objective T one and Citation Removal: Subjectiv e assumption words (I feel, probably , should be) are strictly prohibited; citing dialogue sequence numbers or positions (e.g., "The ﬁrst user said", "The aforementioned content") is strictly prohibited. * 评分标准：若出现上述主观词或引用序号，该项判定为不合格。 * Scoring Standard: If the above subjecti ve words or citation sequence numbers appear , this item is judged as non-compliant. 3. 推演三段论引导词： < plan_2 > 必须严格包含 “ 假设 / 可能 . . . . . . ” 、 “ 这个说明 . . . . . . ” 、 “ 因此，我可以 . . . . . . ” 这个逻辑短语。 3. Deductiv e Syllogism Keywords: < plan_2 > must strictly contain the logical phrases: "Assuming/Possi- bly ...", "This indicates...", "Therefore, I can...". * 评分标准：缺少任何一个引导词，或引导词顺序错误，该项判定为不合格。 * Scoring Standard: Missing any leading word, or incorrect order of leading words, this item is judged as non-compliant. * 本维度综合评分（ 1 分 /0 分）： * Comprehensi ve Score f or this Dimension (1 point/0 points): * 1 分：以上 1 、 2 、 3 项要求全部满足。 * 1 Point: All requirements of items 1, 2, and 3 above are met. * 0 分：以上任何一项不满足。 * 0 Points: Any of the abov e items is not met. ## 维度 3 ：脱敏合规性 (Anonymization Score) ## Dimension 3: Anonymization Compliance (Anonymization Score) 1. 脱敏对象：严禁出现真实手机号、 UID 、身份证号、具体域名、企业签名名称、原始秘钥、详细住址等隐私数据。 1. Anonymization Objects: Strictly prohibit real mobile numbers, UIDs, ID numbers, speciﬁc domain names, enterprise signature names, original keys, detailed addresses, and other pri v acy data. 2. 代称要求：必须使用通用代称（如 “ 用户手机号 ” 、 “ 某域名 ” 、 “ 该 UID” ）。 2. Alias Requirements: Must use generic aliases (e.g., "user mobile number", "a certain domain name", "this UID"). * 评分标准（ 1 分 /0 分）： * Scoring Standards (1 point/0 points): * 1 分：全文无任何真实隐私泄露，脱敏彻底。 * 1 Point: There is no real priv acy leakage in the full text, and anon ymization is thorough. * 0 分：出现哪怕一项真实隐私数据（如一个 11 位手机号或具体网址）。 * 0 Points: Even a single piece of real privac y data appears (such as an 11-digit mobile number or speciﬁc URL). # < Judge 执行逻辑 > # < Judge Execution Logic > 1. 文本提取：定位模型输出中的 < plan_1 > 和 < plan_2 > 标签内容。 1. T ext Extraction: Locate the content of < plan_1 > and < plan_2 > tags in the model output. 2. 负面约束扫描：检索是否存在 “ 客服 ” 、 “ 我觉得 ” 、 “ 对话 [n]” 、以及符合手机号 / 域名特征的字符串。 2. Negative Constraint Scanning: Scan for the existence of "Service Agent", "I feel", "Dialogue [n]", and strings matching mobile number/domain name characteristics. 3. 关键词强制扫描：搜索 < plan_2 > 是否按序包含 “ 假设 / 可能 ” 、 “ 这个说明 ” 、 “ 因此，我可以 ” 。 3. Keyw ord Mandatory Scanning: Search whether < plan_2 > contains "Assuming/Possibly", "This indicates", "Therefore, I can" in order . 4. 逻辑核验：评估 < plan_1 > 的动作是否作为 < plan_2 > 的假设前提。 30 PRIME AI paper 4. Logic V eriﬁcation: Evaluate whether the action in < plan_1 > serves as the hypothetical premise for < plan_2 > . # < 输出格式要求 > # < Output Format Requirements > 请直接输出 JSON 格式的评估报告，格式严格如下： Please output the ev aluation report directly in JSON format, the format is strictly as follo ws: { "scores": { "compliance_score": 0, "structure_score": 0, "anonymization_score": 0 }, "total_score": 0, "analysis": { "compliance": " 详细说明 plan_1 和 plan_2 的要素齐备情况及逻辑关联性 / Detailed explanation of elements completeness and logical association of plan_1 and plan_2", "structure": " 具体指出第一人称使用、主观词 / 序号规避以及三段论引导词的执行情况 / Specifically point out the use of first person, avoidance of subjective words/sequence numbers, and execution of syllogism keywords", "anonymization": " 详细列出脱敏执行的彻底程度 / Detailed list of the thoroughness of anonymization execution" }, "violation_details": " 若有扣分，逐条列出具体违规的句子、关键词或缺失的要素；若满分则填 ’ 无 ’ / If points deducted, list specific violating senten- ces, keywords or missing elements one by one; if full score, fill ’None’", "final_judgment": " 一句话总结整体评估结果，需指出是否通过审核 / One sentence summary of the overall evaluation result, need to point out whether it passed the audit" } D.3.6 Pr ompt for Model Ev aluation (Mid-T rain) Prompt f or Model Evaluation # 1. 角色 # 1. Role 你是一位资深的客户服务规划内容一致性判别专家，拥有权威的技术知识、对客服务政策的精确把握以及丰富的客户沟通经验。 Y ou are a senior customer service planning content consistency judgment expert, possessing authoritativ e technical knowledge, precise grasp of customer service policies, and rich customer communication experience. 你的判断标准严谨、公正，旨在维护服务口径的统一性与专业性。 Y our judgment standards are rigorous and fair , aiming to maintain the consistency and professionalism of service standards. # 2. 核心目标 # 2. Core Goal 你正在执行一项关键的规划一致性判别任务。 Y ou are ex ecuting a critical Planning Consistency Judgment task. 你需要深度分析 “ 模型生成规划 ” ，并将其与从真实对话中提炼出的 “ 金标规划 ” 进行比较，最终判别这两份规划的逻辑与执行路径的一致性程度，并在存在差异时给出清晰的原因说明。 31 PRIME AI paper Y ou need to deeply analyze the "Model Generated Plan" and compare it with the "Gold Standard Plan" e xtracted from real dialogues, ultimately judging the consistency degree of logic and ex ecution paths between these two plans, and providing clear e xplanations when differences e xist. # 3. 一致性评估维度 # 3. Consistency Ev aluation Dimensions 请你基于以下 3 个核心维度，对两份规划进行对比分析，并对每一维度给出文字评语说明差异点： Please conduct a comparativ e analysis of the two plans item by item based on the following 3 core dimensions, and provide te xt comments explaining the dif ferences for each dimension: • 当前决策一致性 (current_decision) Current Decision Consistency (curr ent_decision) 规划中的第一步行动决策（例如，是直接回复 [SPEAK] 还是调用工具 [TOOL_USE] ）是否一致？ Is the ﬁrst action decision in the plan (e.g., directly replying [SPEAK] or calling a tool [TOOL_USE] ) consistent? • 行动细节一致性 (action_details) Action Details Consistency (action_details) - 如果决策是回复 [SPEAK] ：回复的核心意图、关键信息点是否一致？ - If the decision is to reply [SPEAK] : Are the core intent and key information points of the reply consistent? - 如果决策是调用工具 [TOOL_USE] ：调用的工具名称、传入的关键参数是否一致？ - If the decision is to call a tool [TOOL_USE] : Are the called tool name and passed key parameters consistent? • 后续规划一致性 (subsequent_plan) Subsequent Plan Consistency (subsequent_plan) 在当前行动完成后，对后续步骤的规划（例如， “ 工具返回成功后，应总结信息并告知用户 ” 或 “ 工具返回失败后，应安抚并尝试其他方案 ” ）的逻辑走向是否与金标规划一致？ After the current action is completed, is the logical direction of the subsequent step planning (e.g., "Summarize information and inform user after tool returns success" or "Comfort and try other solutions after tool returns failure") consistent with the Gold Standard Plan? 在对每一项维度进行 “ 一致 / 部分一致 / 不一致 ” 判断时，请同时用简明的自然语言解释原因，指出关键差异点或相同点。 When making a "Consistent/Partially Consistent/Inconsistent" judgment for each dimension, please also explain the reason in concise natural language, pointing out key dif ferences or similarities. # 4. 判断等级标准 # 4. Judgment Lev el Standards • 一致 Consistent 3 个维度均无实质性差异，规划的核心逻辑和关键步骤完全相同。 There are no substantiv e differences in the 3 dimensions; the core logic and ke y steps of the plans are exactly the same. • 部分一致 Partially Consistent - 在行动细节或后续规划维度存在轻微差异，但不影响问题的整体解决路径。例如，工具调用的非核心参数不同，或后续规划的措辞有别但最终目的相同 ; - There are slight differences in Action Details or Subsequent Plan dimensions, but the y do not affect the overall solution path of the problem. For example, non-core parameters of tool calls are different, or the w ording of subsequent planning differs b ut the ultimate goal is the same; - 如果虽然第一步不同，但后续步骤能回到与金标相同的核心解决方案（例如都引导用户新建合规签名），更倾向判为 “ 部分一致 ” 。 - If, although the ﬁrst step is different, subsequent steps can return to the same core solution as the Gold Standard (e.g., both guide the user to create a new compliant signature), it is more inclined to be judged as "Partially Consistent". • 不一致 Inconsistent 32 PRIME AI paper 仅当在 current_decision 维度存在根本性差异，且该差异会导致「最终解决路径明显不同」时，才判为不一致。 Only when there is a fundamental dif ference in the current_decision dimension, and this difference leads to a "signiﬁcantly different ﬁnal solution path", is it judged as inconsistent. # 5. 输出格式 # 5. Output Format 请返回一个包含判定标签和原因说明的 JSON ，格式如下： Please return a JSON containing judgment labels and reason explanations, formatted as follows (Note: Output only JSON, do not use “‘json or any code block markers): { "detailed_consistency": { "current_decision": { "result": " 一致 / 部分一致 / 不一致 ", (consistent/partially consistent/inconsistent) "comment": " 在这里用一两句话说明为什么这么判定，指出两份规划在首步行动上的相同或差异点。 " (Explain in one or two sentences why this judgment was made, pointing out similarities or differences in the first step action.) }, "action_details": { "result": " 一致 / 部分一致 / 不一致 ", (consistent/partially consistent/inconsistent) "comment": " 说明两份规划在工具名称、关键参数或回复要点上的差异，例如是否缺少关键参数、是否更换了工具等。 " (Explain differences in tool names, key parameters, or reply points, e.g., whether key parameters are missing or tools are changed.) }, "subsequent_plan": { "result": " 一致 / 部分一致 / 不一致 ", (consistent/partially consistent/inconsistent) "comment": " 说明两份规划对后续流程设计的差异，例如是否都有 “ 工具成功后总结告知用户 ” 的步骤，是否存在关键环节缺失或逻辑相反。 " (Explain differences in subsequent process design, e.g., whether both have the step ’summarize and inform user after success’, or if key steps are missing or logic is opposite.) } }, "judge_result": " 一致 / 部分一致 / 不一致 ", (consistent/partially consistent/inconsistent) "overall_comment": " 用简明的一段话总结整体一致性情况，重点说明如果判为 “ 部分一致 ” 或 “ 不一致 ” ，是哪些关键逻辑或步骤造成了偏差，以及这些偏差对问题解决路径的影响程度。当模型规划在第一步存在误判，但后续仍能引导到与金标相同的最终解决路径时，优先考虑 “ 部分一致 ” ，并在评语中说明首步误差属于可纠偏的偏差。 " (Summarize overall consistency in a concise paragraph. If judged ’Partially’ or ’Inconsistent’, highlight which key logic or steps caused deviation and their impact on the solution path. If the first step is misjudged but leads to the same final path, prioritize ’Partially’ and note the first-step error is rectifiable.) } 33

Lightweight Adaptation for LLM-based Technical Service Agent: Latent Logic Augmentation and Robust Noise Reduction

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment