Direct Preference Optimization with Rating Information: Practical Algorithms and Provable Gains
The class of direct preference optimization (DPO) algorithms has emerged as a promising approach for solving the alignment problem in foundation models. These algorithms work with very limited feedback in the form of pairwise preferences and fine-tune models to align with these preferences without explicitly learning a reward model. While this form of feedback makes the data collection process easy and relatively accurate, its ambiguity about the quality of responses can have negative implications. For example, it is not clear whether a decrease (increase) in the likelihood of preferred (dispreferred) responses during the execution of these algorithms should be interpreted as a positive or negative phenomenon. In this paper, we study how to design algorithms that can leverage additional information in the form of a rating gap, which tells the learner how much better the chosen response is than the rejected one. We present new algorithms that achieve faster statistical rates than DPO in the presence of accurate rating-gap information. Moreover, we theoretically prove and empirically show that the performance of our algorithms is robust to inaccuracy in rating gaps. Finally, we demonstrate the solid performance of our methods in comparison to a number of DPO-style algorithms across a wide range of LLMs and evaluation benchmarks.
💡 Research Summary
The paper addresses a fundamental limitation of existing Direct Preference Optimization (DPO) and its variants such as Identity‑mapping Preference Optimization (IPO) for aligning large language models (LLMs). Traditional DPO relies solely on binary preference signals—knowing which of two responses is preferred—without any quantitative measure of how much better one response is than the other. This lack of granularity can lead to ambiguous learning dynamics, including likelihood displacement and biased policy updates, especially when both responses are of similar quality (high or low).
To overcome this, the authors augment the training data with a “rating gap” (Δ̂r), a scalar that quantifies the relative quality difference between the chosen and rejected responses. This rating gap can be obtained from reliable human annotators or high‑performing LLM evaluators (e.g., GPT‑4). Leveraging this additional signal, they propose three algorithms:
- Ratings Direct Preference Optimization (RDPO) – Extends the RLHF objective by adding a linear combination of the original reward r and the rating estimate r̂, weighted by a confidence parameter β₁. The resulting closed-form policy leads to a loss of the form –log σ(β·Δθ – (β/β₁)·Δ̂r), where Δθ is the log-probability ratio between preferred and dispreferred responses, and Δ̂r is the rating gap. β₁ controls how strongly the rating influences learning; smaller β₁ means higher trust in the rating.
- Ratings IPO (RIPO) – Adapts the IPO square-loss framework to incorporate rating gaps, yielding a shifted quadratic loss (β·Δθ – (β/β₁)·Δ̂r – ½)². This formulation is mathematically equivalent to a variant of Distilled-DPO but does not rely on any statistical assumption about the rating distribution.
- Maximum-Likelihood-based RDPO (ML-RDPO) – Treats preferences and ratings as jointly generated from a probabilistic model. Assuming conditional independence and that rating gaps are Gaussian with variance V, the joint log-likelihood decomposes into a DPO-style term and a squared-error term weighted by 1/V. The resulting loss is a sum of the standard DPO loss and a Distilled-DPO loss scaled by V⁻¹. V thus plays a role analogous to β₁, reflecting confidence in the rating information. Importantly, ML-RDPO can be applied even when rating and preference data are not co-located for the same prompt-response pair.
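The per-example losses above can be sketched as plain Python functions. This is a minimal illustration, not the authors' implementation: `delta_theta` stands for the log-probability ratio Δθ, `rating_gap` for Δ̂r, and the exact squared-error term inside ML-RDPO is an assumption based on the summary's description (a DPO term plus a Distilled-DPO-style term scaled by 1/V).

```python
import math

def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def rdpo_loss(delta_theta: float, rating_gap: float,
              beta: float = 0.1, beta1: float = 1.0) -> float:
    # RDPO: -log sigma(beta * Δθ - (beta / β₁) * Δ̂r)
    margin = beta * delta_theta - (beta / beta1) * rating_gap
    return -math.log(_sigmoid(margin))

def ripo_loss(delta_theta: float, rating_gap: float,
              beta: float = 0.1, beta1: float = 1.0) -> float:
    # RIPO: shifted quadratic (beta * Δθ - (beta / β₁) * Δ̂r - 1/2)^2
    shifted = beta * delta_theta - (beta / beta1) * rating_gap - 0.5
    return shifted ** 2

def ml_rdpo_loss(delta_theta: float, rating_gap: float,
                 beta: float = 0.1, var: float = 1.0) -> float:
    # ML-RDPO: standard DPO term plus a squared-error term weighted
    # by 1/V (the precise form of the second term is assumed here).
    dpo_term = -math.log(_sigmoid(beta * delta_theta))
    rating_term = (beta * delta_theta - rating_gap) ** 2 / var
    return dpo_term + rating_term
```

Note how the confidence parameters work as described: sending β₁ → ∞ removes the rating term from RDPO/RIPO, recovering the original DPO/IPO shapes, while a large variance V makes ML-RDPO collapse toward plain DPO.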
Theoretical contributions – The authors formalize an approximation error Err(r̂) = E