DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Real-world large language model deployments (e.g., conversational AI systems, code generation assistants) naturally generate abundant implicit user dissatisfaction (DSAT) signals, as users iterate toward better answers through refinements, corrections, and expressed preferences, while explicit satisfaction (SAT) feedback is scarce. Existing preference learning approaches are poorly aligned with this data profile, as they rely on costly human annotations or assume plentiful positive responses. In this paper, we introduce DRIFT (Dissatisfaction-Refined Iterative preFerence Training), which anchors training on real-world DSAT signals and samples positives dynamically from the evolving policy. Empirically, DRIFT models trained on the real-world WildFeedback dataset and the synthetic UltraFeedback dataset achieve up to +6.23% (7B) / +7.61% (14B) on WildBench Task Score and up to +8.95% (7B) / +12.29% (14B) on AlpacaEval2 win rate over base models, outperforming strong baseline methods such as iterative DPO and SPIN. At larger scales, the improvements are particularly pronounced: 14B models trained with DRIFT surpass GPT-4o-mini on WildBench. Further analysis shows that DRIFT also preserves exploratory capacity, yielding more diverse high-reward solutions rather than collapsing to narrow subsets. Theoretically, we demonstrate that this design preserves preference margins and avoids gradient degeneration. These results show that DRIFT is an effective and scalable recipe for real-world post-training that leverages the most abundant and informative signal. The code and data are available at https://github.com/cacayaya/DRIFT.git.


💡 Research Summary

The paper introduces DRIFT (Dissatisfaction‑Refined Iterative Preference Training), a novel preference‑learning framework that exploits the abundant implicit user dissatisfaction (DSAT) signals generated by deployed large language models (LLMs). In real‑world deployments, users rarely provide explicit satisfaction (SAT) feedback, but they frequently express dissatisfaction through follow‑up questions, corrections, or negative ratings. The authors argue that this asymmetry should be turned into an advantage: DSAT provides high‑quality negative supervision, while positive examples can be sampled dynamically from the current policy.

The work builds on two recent trends: (1) the shift from costly human‑annotated preference data (as used in RLHF and DPO) toward automatically mined signals from real interactions, and (2) self‑improvement methods that generate their own preference pairs (e.g., SPIN, Iterative DPO, Self‑Rewarding LMs). Existing self‑improvement approaches suffer from “gradient collapse”: as the model improves, the chosen and rejected responses become increasingly similar, weakening the preference signal and often leading to mode collapse. DRIFT avoids this by anchoring the rejected side to genuine user‑reported DSAT responses and by generating the chosen side afresh from the evolving model at each iteration, thereby preserving a non‑vanishing preference margin.
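The cancellation behind this degeneration is visible in the standard DPO gradient, reproduced here for context (standard form from the DPO literature, not quoted from the paper itself):

```latex
\nabla_\theta \mathcal{L}_{\mathrm{DPO}}
  = -\beta\, \mathbb{E}\!\left[
      \sigma\!\big(\hat r_\theta(x, y^-) - \hat r_\theta(x, y^+)\big)
      \big(\nabla_\theta \log \pi_\theta(y^+ \mid x)
         - \nabla_\theta \log \pi_\theta(y^- \mid x)\big)
    \right]
```

where r̂_θ is the implicit reward β log(π_θ/π_ref). As self-generated y⁺ and y⁻ converge, the two score terms in the bracket cancel and the update vanishes; anchoring y⁻ to a fixed, genuinely bad DSAT response keeps that difference (and hence the gradient) from collapsing.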

Methodologically, DRIFT proceeds in two stages. First, a warm-start phase trains on 491 seed pairs in which a DSAT response was later revised into a SAT response (DSAT→SAT), giving the model an initial bias toward reducing obvious failures. Second, the model enters an iterative loop: for each prompt x carrying a DSAT label, the current policy π_θk samples a fresh positive response y⁺, while the original DSAT response y⁻ is kept as the negative. The pair (x, y⁺, y⁻) is then used to minimize the standard DPO loss

L_DPO = −E_(x, y⁺, y⁻) [ log σ( β log(π_θ(y⁺|x) / π_ref(y⁺|x)) − β log(π_θ(y⁻|x) / π_ref(y⁻|x)) ) ],

where π_ref is a frozen reference policy, σ is the sigmoid function, and β controls how far the policy may drift from the reference.
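The iterative stage can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the log-probabilities are toy stand-ins for real model scores, and the loop stands in for actual sampling from the evolving policy.

```python
import math

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO objective for one (x, y+, y-) pair: -log sigmoid(beta * margin)."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return math.log(1.0 + math.exp(-margin))  # == -log(sigmoid(margin))

# Toy DRIFT-style iteration: the negative y- stays anchored to a fixed,
# user-reported DSAT response, while the positive y+ is resampled from the
# improving policy each round (modeled here as a rising log-probability).
dsat_logp = -12.0                 # fixed DSAT negative under the current policy
ref_pos, ref_neg = -8.0, -10.0    # frozen reference-policy scores
losses = []
for fresh_pos_logp in (-9.0, -8.0, -7.0):   # policy improves across iterations
    losses.append(dpo_loss(fresh_pos_logp, dsat_logp, ref_pos, ref_neg))

# Because y- remains a genuine failure, the preference margin keeps growing
# and the loss keeps shrinking, rather than degenerating as y+ and y- converge.
print([round(l, 4) for l in losses])
```

By contrast, if y⁺ and y⁻ were identical (the self-improvement failure mode), the margin would be zero and the loss a constant log 2 regardless of the policy, yielding no learning signal.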

