Aligning large language models (LLMs) to preference data typically assumes a known link function between observed preferences and latent rewards (e.g., a logistic Bradley-Terry link). Misspecification of this link can bias inferred rewards and misalign learned policies. We study preference alignment under an unknown and unrestricted link function. We show that realizability of $f$-divergence-constrained reward maximization in a policy class induces a semiparametric single-index binary choice model, where a scalar policy-dependent index captures all dependence on demonstrations and the remaining preference distribution is unrestricted. Rather than assuming this model has identifiable finite-dimensional structural parameters and estimating them, as in econometrics, we focus on policy learning with the reward function implicit, analyzing error to the optimal policy and allowing for unidentifiable nonparametric indices. We develop preference optimization algorithms robust to the unknown link and prove convergence guarantees in terms of generic function complexity measures. We demonstrate this empirically on LLM alignment. Code is available at https://github.com/causalml/spo/
Modern large language models (LLMs) are tuned using human or AI feedback (RLHF/RLAIF) to better align outputs with user preferences and safety desiderata [Christiano et al., 2017, Ziegler et al., 2019, Stiennon et al., 2020, Ouyang et al., 2022, Bai et al., 2022b,a, Nakano et al., 2021, Wu et al., 2021]. A common setup interprets pairwise preferences as discrete choice under a latent reward, which is inferred and optimized, while constraining or penalizing the deviation from a reference model. This balances quality improvements with preservation of language abilities.
Linking preferences to rewards is usually done by assuming a particular choice model, such as Bradley-Terry (logistic link) or Thurstone (probit link), so that, given demonstrations and a reward function, the distribution of preferences becomes fully specified [Rafailov et al., 2023, Zhan et al., 2023, Glaese et al., 2022, Ziegler et al., 2019, Ibarz et al., 2018]. A prominent example is Direct Preference Optimization (DPO, Rafailov et al., 2023), which uses the Bradley-Terry choice model. This, however, imposes substantial structure on choice behavior, and misspecifying the link can bias inferred rewards and misalign policy optimization [Hong et al., 2025, Xu and Kankanhalli, 2024]. Alternative approaches to alignment from preferences depart from structural/generative modeling of choice, focusing instead on loss aversion and label corruption [Ethayarajh et al., 2024, Liang et al., 2024, Kong et al., 2024].
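For concreteness, a minimal sketch (ours; the function names and toy log-probabilities are illustrative, not taken from any cited implementation) of the Bradley-Terry preference probability and the resulting DPO-style pairwise loss, which hard-codes the logistic link:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def bradley_terry_prob(r1, r0):
    # Bradley-Terry: P(y1 preferred to y0) = sigmoid(r(x,y1) - r(x,y0)),
    # i.e., the link is fixed to the logistic CDF.
    return sigmoid(r1 - r0)

def dpo_style_loss(logp_w, logp_l, logp_ref_w, logp_ref_l, beta=0.1):
    # DPO-style pairwise loss: the reward difference is replaced by beta-scaled
    # log-ratios to the reference policy, still passed through the logistic link.
    margin = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    return -np.log(sigmoid(margin))

# Toy numbers (hypothetical): if the true link is not logistic, rewards implied
# by fitting this loss can be biased.
print(bradley_terry_prob(1.0, 0.0))                 # ~0.731
print(dpo_style_loss(-1.2, -2.3, -1.5, -2.0))       # loss for one preference pair
```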
However, specifying a known reward-preference link is not actually necessary for a structural/generative interpretation (meaning one that assumes preferences optimize a random utility or equivalently that preferences are generated by some conditional probability distribution given demonstrations and reward). Letting the link be arbitrary and unknown gives rise to semiparametric discrete choice models, which have been studied extensively in the econometrics literature [Cosslett,
The target we focus on is the policy maximizing average rewards subject to a constraint on its deviation from a reference policy. This is a standard approach to aligning a model to a new task while staying close to a pre-trained policy, preserving its language ability and preventing catastrophic forgetting. While the most common divergence is Kullback-Leibler (KL), here we consider any $f$-divergence, of which KL is a special case.
We let $x \in \mathcal{X}$ denote context (e.g., user query) and $y \in \mathcal{Y}$ action (e.g., LLM response). The mean reward of taking action $y$ in context $x$ is given by the unknown reward function $r^\star(x, y)$. For simplicity we focus on a finite action space throughout the paper, $|\mathcal{Y}| < \infty$, while $\mathcal{X}$ can be general Borel. Given convex $f : \mathbb{R}_+ \to \mathbb{R}$ with $f(1) = 0$, for any two probability mass functions $p, q \in \Delta_{\mathcal{Y}}$ on $\mathcal{Y}$ we define $D_f(p \,\|\, q) = \infty$ whenever $p(y) > 0$ and $q(y) = 0$ for some $y$, and $D_f(p \,\|\, q) = \sum_{y : q(y) > 0} q(y)\, f(p(y)/q(y))$ otherwise. Henceforth we assume throughout that $f$ is twice continuously differentiable and strictly convex with derivative approaching $-\infty$ near $0$. KL is given by setting $f(u) = u \log u$. Other examples satisfying these conditions include $\alpha$-divergences (which include KL), the Jensen-Shannon divergence, and the combination of KL and $\chi^2$-divergence as used in Huang et al. [2024].
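For concreteness, a minimal sketch (ours; the function names and toy distributions are illustrative) computing $D_f(p \,\|\, q)$ for discrete distributions, with $f(u) = u \log u$ recovering KL:

```python
import numpy as np

def f_kl(u):
    # f(u) = u log u, with the convention 0 log 0 = 0; note f(1) = 0 as required.
    return np.where(u > 0, u * np.log(np.where(u > 0, u, 1.0)), 0.0)

def f_divergence(p, q, f):
    # D_f(p || q) = sum_{y: q(y) > 0} q(y) f(p(y)/q(y)),
    # and +infinity if p puts mass where q does not.
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any((p > 0) & (q == 0)):
        return np.inf
    mask = q > 0
    return float(np.sum(q[mask] * f(p[mask] / q[mask])))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(f_divergence(p, q, f_kl))      # D_f with f(u) = u log u
print(np.sum(p * np.log(p / q)))     # matches the direct KL(p || q) formula
```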
Given a reference policy $\pi_{\mathrm{ref}} : \mathcal{X} \to \Delta_{\mathcal{Y}}$, we are interested in the divergence-constrained reward-maximizing policy:
$$\pi^\star \in \operatorname*{argmax}_{\pi : \mathcal{X} \to \Delta_{\mathcal{Y}}} \; \mathbb{E}_x\Big[\textstyle\sum_{y} \pi(y \mid x)\, r^\star(x, y)\Big] \quad \text{s.t.} \quad \mathbb{E}_x\big[D_f\big(\pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)\big] \le \kappa, \qquad (1)$$
where the expectations are taken with respect to a context distribution $x \sim P_x$.
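As a sanity check on the formulation, a minimal Monte Carlo sketch (ours, assuming a small tabular setting with hypothetical rewards and policies) estimating the objective $\mathbb{E}_x[\sum_y \pi(y \mid x) r^\star(x, y)]$ and the constraint quantity $\mathbb{E}_x[D_f(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x))]$ from sampled contexts:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_kl(u):
    return np.where(u > 0, u * np.log(np.where(u > 0, u, 1.0)), 0.0)

def objective_and_constraint(pi, pi_ref, r, contexts, f=f_kl):
    # pi, pi_ref: (num_contexts, num_actions) arrays of conditional pmfs.
    # r: (num_contexts, num_actions) array of rewards r(x, y).
    # Returns Monte Carlo estimates of E_x[sum_y pi(y|x) r(x,y)] and
    # E_x[D_f(pi(.|x) || pi_ref(.|x))] over the sampled contexts.
    value = np.mean(np.sum(pi[contexts] * r[contexts], axis=1))
    ratios = pi[contexts] / pi_ref[contexts]
    divergence = np.mean(np.sum(pi_ref[contexts] * f(ratios), axis=1))
    return value, divergence

# Toy tabular problem (hypothetical): 4 contexts, 3 actions.
num_x, num_y = 4, 3
r = rng.normal(size=(num_x, num_y))
pi_ref = rng.dirichlet(np.ones(num_y), size=num_x)
pi = rng.dirichlet(np.ones(num_y), size=num_x)
contexts = rng.integers(0, num_x, size=1000)   # draws x ~ P_x

value, div = objective_and_constraint(pi, pi_ref, r, contexts)
print(f"objective ~ {value:.3f}, E_x[D_f] ~ {div:.3f}  (feasible iff <= kappa)")
```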
Combining a simple convexity argument with the results of Wang et al. [2023] yields a closed form for $\pi^\star$.
Assumption 1. Let $\omega(x) = \sum_{y \in \operatorname{argmax}_{y'} r^\star(x, y')} \pi_{\mathrm{ref}}(y \mid x)$. Suppose $\mathbb{E}_x\big[\omega(x) f(1/\omega(x))\big] > \kappa$.
Theorem 1 (Closed form for $\pi^\star$). Suppose Assumption 1 holds. Then there exist $\beta^\star > 0$ and a function $\lambda^\star : \mathcal{X} \to \mathbb{R}$ such that
$$\pi^\star(y \mid x) = \pi_{\mathrm{ref}}(y \mid x)\, (f')^{-1}\!\big(\beta^{\star -1}(r^\star(x, y) - \lambda^\star(x))\big),$$
where $\lambda^\star(x)$ normalizes $\pi^\star(\cdot \mid x)$ to sum to one for each $x$ and $\beta^\star$ makes the divergence constraint in Eq. (1) hold with equality.
Assumption 1 simply requires that a purely reward-maximizing policy is not already feasible for the divergence constraint, which would render the constraint irrelevant.
Remark 1 (The case of KL). The solution simplifies considerably for $f(u) = u \log u$, in which case $\pi^\star(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x) \exp(\beta^{\star -1} r^\star(x, y))$ for some $\beta^\star$.
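As a numerical illustration of Theorem 1 and Remark 1, a minimal sketch (ours; it fixes $\beta$ rather than solving for $\beta^\star$, and the toy rewards are hypothetical) that recovers $\pi^\star(\cdot \mid x)$ at a single context by solving for $\lambda^\star(x)$ with bisection, and checks that the KL case matches the softmax reweighting of Remark 1:

```python
import numpy as np

def fprime_inv_kl(v):
    # For f(u) = u log u: f'(u) = log(u) + 1, so (f')^{-1}(v) = exp(v - 1).
    return np.exp(v - 1.0)

def closed_form_policy(r_x, pi_ref_x, beta, fprime_inv=fprime_inv_kl,
                       lam_lo=-50.0, lam_hi=50.0, tol=1e-10):
    # Solve for lambda(x) by bisection so that
    # sum_y pi_ref(y|x) (f')^{-1}((r(x,y) - lambda) / beta) = 1.
    def total(lam):
        return np.sum(pi_ref_x * fprime_inv((r_x - lam) / beta))
    lo, hi = lam_lo, lam_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:   # total mass decreases as lambda increases
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return pi_ref_x * fprime_inv((r_x - lam) / beta)

r_x = np.array([1.0, 0.2, -0.5])        # toy rewards r*(x, .) at one context
pi_ref_x = np.array([0.5, 0.3, 0.2])    # reference policy at that context
beta = 0.5
pi_star_x = closed_form_policy(r_x, pi_ref_x, beta)

# For KL this matches Remark 1: pi*(y|x) proportional to pi_ref(y|x) exp(r/beta).
softmax = pi_ref_x * np.exp(r_x / beta)
softmax /= softmax.sum()
print(pi_star_x, softmax)
```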
Toward optimizing Eq. (1), we assume we have access to preference data consisting of $n$ tuples $(x, y_0, y_1, z) \sim P$, with $w = (x, y_0, y_1) \in \mathcal{W}$ drawn from some joint distribution whose $x$-marginal matches $P_x$ (otherwise we can use importance sampling) and $z \in \{0, 1\}$ being the indicator that $y_1$ is preferred to $y_0$. We index our data as $(w_i, z_i) = (x_i, y_{i0}, y_{i1}, z_i)$ for $i = 1, \ldots, n$ and assume they are drawn i.i.d.
We assume that preferences are drawn according to
$$z \mid x, y_0, y_1 \sim \mathrm{Bernoulli}\big(\Phi^\star(r^\star(x, y_1) - r^\star(x, y_0))\big),$$
where $\Phi^\star$ is an unknown cumulative distribution function (CDF).
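To illustrate this data-generating process, a minimal simulation sketch (ours; the reward function and the heavy-tailed Cauchy link are hypothetical choices) drawing preferences exactly as above but with a non-logistic $\Phi^\star$, i.e., the setting a fixed Bradley-Terry link would misspecify:

```python
import numpy as np

rng = np.random.default_rng(1)

def cauchy_cdf(t):
    # A symmetric but heavy-tailed link: clearly not the logistic CDF.
    return 0.5 + np.arctan(t) / np.pi

def r_star(x, y):
    # Toy reward (hypothetical): depends on context x and action y.
    return np.sin(x + y) + 0.5 * y

def simulate_preferences(n, num_y=3, link=cauchy_cdf):
    # z | x, y0, y1 ~ Bernoulli(Phi*(r*(x,y1) - r*(x,y0))), with Phi* = link.
    x = rng.uniform(0, 2 * np.pi, size=n)
    y0 = rng.integers(0, num_y, size=n)
    y1 = rng.integers(0, num_y, size=n)
    p = link(r_star(x, y1) - r_star(x, y0))
    z = rng.binomial(1, p)
    return x, y0, y1, z

x, y0, y1, z = simulate_preferences(1000)
print(z.mean())   # fraction of comparisons in which y1 is preferred
```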
The key in this paper is that we let $\Phi^\star$ be completely unknown, rather than setting it to a known function such as the sigmoid or the normal CDF. This is important since we only care to understand rewards so that we can optimize policies, rather than to fully model the preference distribution.