Generalisation of RLHF under Reward Shift and Clipped KL Regularisation


Authors: Kenton Tang, Yuzhu Chen, Fengxiang He

Kenton Tang¹   Yuzhu Chen²   Fengxiang He¹
¹University of Edinburgh   ²University of Science and Technology of China

Abstract

Alignment and adaptation in large language models heavily rely on reinforcement learning from human feedback (RLHF); yet, the theoretical understanding of its generalisability remains premature, especially when the learned reward can shift and the KL control is estimated and clipped. To address this issue, we develop generalisation theory for RLHF that explicitly accounts for (1) reward shift: reward models are trained on preference data from earlier or mixed behaviour policies, while RLHF optimises the current policy on its own rollouts; and (2) clipped KL regularisation: the KL regulariser is estimated from sampled log-probability ratios and then clipped for stabilisation, introducing an error into RLHF. We present generalisation bounds for RLHF, showing that the generalisation error stems from a sampling error over prompts and rollouts, a reward shift error, and a KL clipping error. We also discuss two special cases: (1) initialising RLHF parameters with a uniform prior over a finite space, and (2) training RLHF by stochastic gradient descent, modelled as an Ornstein-Uhlenbeck process. The theory yields practical implications for (1) the optimal KL clipping threshold, and (2) budget allocation across prompts, rollouts, and preference data.

1 INTRODUCTION

Reinforcement learning from human feedback (RLHF) has become a central method for steering large language models (LLMs) towards better reflecting human preferences [Christiano et al., 2017, Stiennon et al., 2020], task requirements [Ouyang et al., 2022, Chung et al., 2024], and safety constraints [Bai et al., 2022a,b], amongst many others. Despite the empirical success of RLHF, the theoretical understanding of its generalisability remains largely absent.
To address this issue, this paper presents generalisation bounds for RLHF. Note that we analyse the post-trained policy in deployment, rather than studying online reinforcement learning during interactions with the environment. A typical RLHF algorithm consists of two coupled components: (1) a reward model, trained from preference data and serving as a proxy for human judgement, and (2) a policy model, optimised to maximise the reward model. Most RLHF algorithms additionally employ Kullback-Leibler (KL) regularisation to keep the policy close to a reference model, typically a supervised fine-tuned (SFT) model [Ziegler et al., 2019], for improving stability and limiting distribution shift [Schulman et al., 2015]. These induce two major challenges that significantly complicate the analysis, as follows.

Reward shift. The reward model is usually trained on preference data collected from an earlier behaviour policy or a mixture of policies [Christiano et al., 2017]. However, the policy model is evaluated and optimised on rollouts drawn from the current distribution of responses. As the policy improves or drifts, it can move into regions where the reward model is less reliable, creating a feedback loop in which reward-model error is amplified during optimisation [Gao et al., 2023]. This calls for an RLHF generalisation theory that accounts for reward shift between the data used to train the reward model and the rollout distribution induced by the current policy.

Clipped KL regularisation. KL regularisation is usually assumed to be computed as a population expectation in theoretical treatments [Schulman et al., 2015]. In practice, however, the KL control is computed from sampled sequences through log-probability ratios; empirical analyses have shown that the choice of estimator and implementation details can materially affect optimisation stability [Shah et al., 2025].
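As an illustration of sample-based KL control, the per-sample log ratio averaged over samples from the current policy recovers the reference KL divergence. The sketch below is illustrative only: the two categorical policies are hypothetical stand-ins for token-level LLM policies.

```python
import math
import random

def exact_kl(p, q):
    """KL(p || q) for categorical distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def monte_carlo_kl(p, q, num_samples, seed=0):
    """Estimate KL(p || q) as the average log ratio over samples from p,
    mirroring how the KL control is computed from sampled rollouts."""
    rng = random.Random(seed)
    idx = rng.choices(range(len(p)), weights=p, k=num_samples)
    return sum(math.log(p[i] / q[i]) for i in idx) / num_samples

pi_theta = [0.7, 0.3]   # hypothetical current policy over two responses
pi_ref = [0.5, 0.5]     # hypothetical reference policy
est = monte_carlo_kl(pi_theta, pi_ref, num_samples=200_000)
# est fluctuates around exact_kl(pi_theta, pi_ref) with O(N^{-1/2}) noise
```

With heavy-tailed log ratios, such sample averages become high-variance, which is exactly the instability that motivates clipping in the sequel.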
A common stabilisation is to clip the per-sample log ratio, in order to control rare trajectories whose likelihood ratios are extreme, echoing the clipping used in PPO [Schulman et al., 2017, Lambert, 2025]. This clipped KL regularisation introduces a further error.

Motivated by these challenges, we develop generalisation theory for RLHF that explicitly accounts for both: the reward is learned and shifting, rather than given and fixed; and the KL regularisation is estimated and clipped, rather than an exact population quantity. Based on a change-of-measure decomposition and employing PAC-Bayes tools [McAllester, 1999, Seeger, 2002, Catoni, 2007], our analysis yields high-probability generalisation bounds for the learned, data-dependent policy that decompose the generalisation error into three distinct, interpretable sources: (1) a sampling error, induced by the two-stage sampling of observing finitely many prompts and estimating expectations from limited Monte Carlo rollouts; (2) a reward shift error, capturing the gap between the learned reward and the (implicit) target reward, together with the additional error incurred when the policy-driven rollout distribution differs from the reward model's training distribution; and (3) a KL clipping error, characterising the deviation introduced by the clipped KL regulariser.

A good theory has practical implications. Our theory suggests: (1) an optimal KL clipping threshold: the theory indicates that the KL log-ratio clipping threshold τ controls a bias-variance trade-off, since clipping reduces sampling noise while introducing an objective mismatch that does not vanish asymptotically.
Our theory further provides advice on how to strike a good balance; (2) budget allocation across prompts, rollouts, and preference labels: our generalisation bounds separate the impacts of prompts, rollouts per prompt, and preference labels, thereby guiding budget allocation across prompts, rollouts, and preference data collection.

2 RELATED WORK

Optimisation theory of RLHF. Efforts have been made to study RLHF theoretically from an optimisation perspective. Zhu et al. [2023] analyse RLHF based on pairwise and list-wise comparisons, and characterise how the reward model error can induce suboptimal policies, motivating conservative strategies under coverage assumptions. Similarly, Zhan et al. [2024] provide finite-sample guarantees for offline RLHF that depend on a concentrability coefficient quantifying the coverage of the target policy by the offline data. Xiong et al. [2024] establish finite-sample guarantees for KL-regularised RLHF in the offline, online, and hybrid regimes.

Reward shift. The literature has seen empirical studies on the impact of reward shift. Gleave and Irving [2022] empirically study uncertainty estimation for reward models, highlighting that reward models can be unreliable out of distribution. Gao et al. [2023] empirically characterise reward model over-optimisation by measuring how the proxy-oracle gap grows when a policy is optimised against a learned proxy reward and evaluated under a stronger oracle reward. In addition, RewardBench provides a complementary evaluation resource for quantifying reward model behaviour on challenging and out-of-distribution comparisons [Lambert et al., 2025].

Clipped KL regularisation. As an empirical work, Shah et al. [2025] provide an extensive analysis showing that several commonly used estimators for KL regularisation can produce biased gradients, which can affect optimisation and stability. Liu et al.
[2025] analyse KL regularisation implementations in RLHF and characterise when common choices are principled or biased, including the off-policy bias that arises when importance weighting is neglected.

Concurrent paper. A concurrent work, released on 23 Jan 2026, provides interesting results on the generalisation of RLHF under a linear reward model assumption, through the algorithmic stability framework [Li et al., 2026]. Our work is more general, formulated for RLHF pipelines beyond linear rewards; instead, the reward in this paper is learned from preference data and shifts with policy updates. Moreover, our paper allows the KL control to be estimated from sampled log ratios and clipped for stabilisation, while Li et al. [2026] formulate the KL penalty as an exact conditional KL divergence term in the objective, without sample-based KL estimation or clipping.

3 PRELIMINARIES

RLHF. Given a prompt x ∈ X, a policy π specifies a conditional distribution π(· | x) over responses y ∈ Y. We denote the post-trained policy by π_θ with parameter θ ∈ Θ, and denote by π_ref a fixed reference policy. Evaluation uses prompts drawn from a distribution ρ, while preference data for reward modelling may come from a different prompt distribution ρ_label because of prompt shift.

Suppose the target reward is r⋆ : X × Y → [0, 1]. A reward model is a proxy r̂_ϕ : X × Y → [0, 1] with parameter ϕ ∈ Φ; the pointwise error is e_ϕ(x, y) = r̂_ϕ(x, y) − r⋆(x, y). Training the reward model uses a data collection distribution, defined as D_train(x, y) = ρ_label(x) π_ref(y | x), where we also use π_ref as the behaviour policy for reward-data collection. In practice, this policy can be a mixture, and we write π_ref(· | x) := Σ_{m=1}^{M} c_m π^(m)(· | x), reflecting the standard practice of collecting preference rankings over diverse behaviour policy mixtures.
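As a minimal sketch of the data-collection distribution D_train(x, y) = ρ_label(x) π_ref(y | x) with a mixture behaviour policy: the categorical prompt distribution, the component policies, and the mixture weights below are all hypothetical stand-ins, not part of the paper.

```python
import random

def make_mixture(components, weights):
    """Return pi_ref(. | x): sample a component m ~ c_m, then y ~ pi^(m)(. | x)."""
    def pi_ref(x, rng):
        m = rng.choices(range(len(components)), weights=weights)[0]
        return components[m](x, rng)
    return pi_ref

def sample_d_train(rho_label, pi_ref, n, seed=0):
    """Draw (x, y) pairs from D_train(x, y) = rho_label(x) * pi_ref(y | x)."""
    rng = random.Random(seed)
    prompts, probs = zip(*rho_label.items())
    data = []
    for _ in range(n):
        x = rng.choices(prompts, weights=probs)[0]
        data.append((x, pi_ref(x, rng)))
    return data

# Toy instantiation: two behaviour policies mixed with weights (0.7, 0.3).
rho_label = {"x1": 0.5, "x2": 0.5}
p1 = lambda x, rng: x + ":a" if rng.random() < 0.9 else x + ":b"
p2 = lambda x, rng: x + ":b"
pairs = sample_d_train(rho_label, make_mixture([p1, p2], [0.7, 0.3]), 5)
```

The resulting pairs play the role of the preference-data support on which the reward model is later evaluated.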
Moreover, the policy-induced distribution is defined as D_θ(x, y) = ρ(x) π_θ(y | x). The reward model is evaluated on the training distribution by the mean-squared error L^(2)_train(ϕ), defined by

L^(2)_train(ϕ) = E_{(X,Y)∼D_train}[(r̂_ϕ(X, Y) − r⋆(X, Y))²].  (1)

This quantity is an oracle-risk term defined with respect to r⋆, and is typically not directly observable from pairwise preference labels in practice.

Clipped KL regularisation. Let β > 0 denote the regularisation strength, and let ℓ_θ(x, y) = log π_θ(y | x) − log π_ref(y | x) denote the exact log ratio. This log ratio is the per-sample quantity that appears when the KL control is implemented from sampled rollouts, which refer to response sequences generated by sampling sequentially from the policy conditioned on the prompt. Its conditional expectation exactly recovers the standard reference KL divergence (Definition 2, Appendix B) within the population objective. In particular, for every prompt x, we have E_{Y∼π_θ(·|x)}[ℓ_θ(x, Y)] = KL(π_θ(· | x) ∥ π_ref(· | x)).

In post-training, ℓ_θ(x, y) can have a large magnitude on rare samples, which can significantly increase the variance of empirical KL-related quantities and destabilise optimisation unless additional control is imposed [Shah et al., 2025, Lambert, 2025]. To stabilise the KL control while keeping the target objective explicit, a popular approach is clipping with threshold τ > 0: ℓ^τ_θ(x, y) = clip(ℓ_θ(x, y), −τ, τ) [Schulman et al., 2017, Lambert, 2025]. Correspondingly, the clipped population objective J^{r,τ}(θ) is given by

J^{r,τ}(θ) = E_{X∼ρ} E_{Y∼π_θ(·|X)}[r(X, Y) − β ℓ^τ_θ(X, Y)].  (2)

Generalisation. The population objective is J^r(θ) = E_{X∼ρ} E_{Y∼π_θ(·|X)}[r(X, Y) − β ℓ_θ(X, Y)], where ℓ_θ(x, y) = log π_θ(y | x) − log π_ref(y | x) is the exact log ratio. Evaluating a policy relies on finitely many prompts and rollouts. Let x_1, . . .
, x_n be independent prompts drawn from ρ. For each x_i, let y_{i,1}, . . . , y_{i,K} denote K independent rollouts drawn from π_θ(· | x_i). The resulting empirical objective is

Ĵ^{r,τ}_{n,K}(θ) = (1/n) Σ_{i=1}^{n} (1/K) Σ_{j=1}^{K} [r(x_i, y_{i,j}) − β ℓ^τ_θ(x_i, y_{i,j})].

For brevity, Ĵ^{ϕ,τ}_{n,K}(θ) denotes Ĵ^{r̂_ϕ,τ}_{n,K}(θ); J⋆(θ) denotes J^{r⋆}(θ); J^ϕ(θ) denotes J^{r̂_ϕ}(θ); and J^{ϕ,τ}(θ) denotes J^{r̂_ϕ,τ}(θ). Generalisability can be quantified by the generalisation error, defined as the discrepancy between the empirical and population objectives: |Ĵ^{ϕ,τ}_{n,K}(θ) − J⋆(θ)|.

4 MAIN RESULTS

This section presents our theoretical results.

4.1 DECOMPOSING GENERALISATION ERROR

We decompose the generalisation error into three components: (1) a sampling error, induced by prompts and rollouts, which is present even if the following two terms vanish; (2) a reward shift error, induced by reward shift under the same exact KL regulariser; and (3) a KL clipping error, induced by the objective mismatch created by estimating and clipping the KL penalty.

Lemma 1 (Generalisation error decomposition). Given parameters θ ∈ Θ and ϕ ∈ Φ, and clipping threshold τ > 0, we have the following decomposition:

|Ĵ^{ϕ,τ}_{n,K}(θ) − J⋆(θ)| ≤ |Ĵ^{ϕ,τ}_{n,K}(θ) − J^{ϕ,τ}(θ)|  (sampling error)
  + |J^ϕ(θ) − J⋆(θ)|  (reward shift error)
  + |J^{ϕ,τ}(θ) − J^ϕ(θ)|  (KL clipping error).  (3)

4.2 SAMPLING ERROR BOUND

We first study the sampling error |Ĵ^{ϕ,τ}_{n,K}(θ) − J^{ϕ,τ}(θ)|. We define Ĵ^{r,τ}_{n,∞}(θ) as the conditional expectation of Ĵ^{r,τ}_{n,K}(θ) given the prompts x_{1:n}. Equivalently, it is the value one would obtain by averaging infinitely many rollouts per prompt while keeping the same finite set of prompts.
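The empirical objective defined above can be computed directly from per-rollout statistics. A minimal sketch, where the reward and log-ratio arrays are hypothetical stand-ins for real rollout data:

```python
import numpy as np

def empirical_objective(rewards, log_ratios, beta, tau):
    """Compute J_hat^{r,tau}_{n,K}(theta) from an (n, K) array of rewards
    r(x_i, y_ij) and an (n, K) array of exact log ratios l_theta(x_i, y_ij).

    Each rollout contributes r - beta * clip(l, -tau, tau); the empirical
    objective averages over the K rollouts of each prompt, then over prompts.
    """
    clipped = np.clip(log_ratios, -tau, tau)   # l^tau_theta in [-tau, tau]
    per_rollout = rewards - beta * clipped     # range has length 1 + 2*beta*tau
    return float(per_rollout.mean(axis=1).mean())

# Toy data: n = 4 prompts, K = 3 rollouts per prompt (hypothetical values).
rng = np.random.default_rng(0)
rewards = rng.uniform(0.0, 1.0, size=(4, 3))
log_ratios = rng.normal(0.0, 2.0, size=(4, 3))
value = empirical_objective(rewards, log_ratios, beta=0.1, tau=1.0)
```

Note that the per-rollout contribution is bounded in [−βτ, 1 + βτ], which is the range that the sampling bounds below exploit.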
The estimator Ĵ^{ϕ,τ}_{n,K}(θ) thus has a two-stage structure: (1) prompts are sampled from ρ, leading to a deviation |Ĵ^{r,τ}_{n,∞}(θ) − J^{r,τ}(θ)|; and (2) rollouts are sampled from π_θ(· | x) conditional on each prompt, inducing a deviation |Ĵ^{r,τ}_{n,K}(θ) − Ĵ^{r,τ}_{n,∞}(θ)|. We first bound the rollout sampling error as follows.

Lemma 2 (Rollout sampling error bound). Given parameter θ ∈ Θ, reward r : X × Y → [0, 1], clipping threshold τ > 0, and confidence level δ ∈ (0, 1), with probability at least 1 − δ over rollouts, conditional on prompts x_{1:n}, we have

|Ĵ^{r,τ}_{n,K}(θ) − Ĵ^{r,τ}_{n,∞}(θ)| ≤ (1 + 2βτ) √(log(2/δ) / (2nK)).

Proof sketch. Clipping ensures that ℓ^τ_θ(x, y) ∈ [−τ, τ] by construction, which is a standard stabilisation approach in reinforcement learning [Mnih et al., 2015, Schulman et al., 2017]. Combined with the fact that the reward function satisfies r(x, y) ∈ [0, 1], each per-rollout contribution r(x, y) − β ℓ^τ_θ(x, y) lies in an interval of length 1 + 2βτ. Given the prompts x_{1:n}, the rollouts are independent across both i and j. Applying Hoeffding's inequality (Lemma 8) to the average over the nK rollout terms yields Lemma 2. Detailed proofs are in Appendix C.2.

Remark 1. Lemma 2 controls the Monte Carlo error from using finitely many rollouts per prompt. The bound decays at rate O((nK)^{−1/2}) as the number of rollouts per prompt K increases (assuming β and τ are independent of n and K). The factor 1 + 2βτ comes from the range of the per-rollout contribution. Clipping is the mechanism that makes this range finite without imposing any artificial uniform bound on the exact log ratio ℓ_θ.

Lemma 3 (Prompt sampling error bound). Under the same conditions as Lemma 2, with probability at least 1 − δ over prompts x_{1:n}, we have

|Ĵ^{r,τ}_{n,∞}(θ) − J^{r,τ}(θ)| ≤ (1 + 2βτ) √(log(2/δ) / (2n)).
Proof sketch. Treating Ĵ^{ϕ,τ}_{n,∞}(θ) as a function of the sampled prompts only, it is an average of n independent bounded terms, each term being the conditional expectation over rollouts for a fixed prompt. Applying Hoeffding's inequality again yields Lemma 3. Detailed proofs are in Appendix C.2.

Remark 2. Lemma 3 shows that the prompt sampling error decays at rate O(n^{−1/2}), and that the corresponding bound does not depend on the number of rollouts per prompt K (again assuming β and τ are independent of n and K). It isolates the deviation induced purely by finite prompt sampling. Even an arbitrarily accurate estimate of each conditional expectation over rollouts cannot compensate for having too few evaluation prompts, because the population objective is defined as an expectation over ρ.

Combining the two lemmas leads to the following bound on the sampling error.

Lemma 4 (Sampling error bound). Under the same conditions as Lemma 2, with probability at least 1 − δ over both prompts and rollouts, the sampling error satisfies

|Ĵ^{r,τ}_{n,K}(θ) − J^{r,τ}(θ)| ≤ (1 + 2βτ) (√(log(4/δ) / (2n)) + √(log(4/δ) / (2nK))).  (4)

Remark 3. In addition to the noise induced by prompts and rollouts, the bound carries the additional factor 2βτ because the clipped log ratio ranges in [−τ, τ]. Consequently, increasing τ enlarges the range of each rollout penalty term, and the resulting concentration bound is looser.

4.3 REWARD SHIFT ERROR BOUND

This subsection studies the reward shift error |J^ϕ(θ) − J⋆(θ)|. To characterise the reward shift error in the transfer from D_train to D_θ, we use a χ² coverage coefficient, defined below, based on the χ² divergence (see Definition 3 in Appendix B). The χ² coverage coefficient is standard in the literature on importance weighting and covariate shift analyses; see, e.g., Sugiyama et al. [2007], Owen [2013].

Definition 1 (χ² coverage coefficient).
Suppose that D_θ is absolutely continuous with respect to D_train. The χ² coverage coefficient is defined to be

C(θ) := √(1 + χ²(D_θ ∥ D_train)),  (5)

where χ²(·∥·) is the χ² divergence.

Remark 4. Intuitively, C(θ) measures how far the policy-induced distribution departs from the distribution used to train the reward model. It acts as an amplification factor when we upper bound the reward shift error.

Because J^ϕ(θ) and J⋆(θ) share the same exact KL regulariser, the KL penalty cancels in the difference; consequently, only the reward model error remains. Defining e_ϕ(x, y) = r̂_ϕ(x, y) − r⋆(x, y), we have J^ϕ(θ) − J⋆(θ) = E_{(X,Y)∼D_θ}[e_ϕ(X, Y)], so the problem is to control the reward model error under the deployment distribution D_θ using information available under the training distribution D_train. This step requires a coverage condition, stated below, when deriving the change-of-measure bound; it yields the same coefficient C(θ) defined in eq. (5).

Assumption 1 (Absolute continuity and finite coverage). The policy-induced distribution D_θ is absolutely continuous with respect to the reward model training distribution D_train. Moreover, the χ² divergence χ²(D_θ ∥ D_train) is finite, so the coverage coefficient C(θ) in eq. (5) is finite.

Remark 5. Assumption 1 is the standard condition that makes a change of measure from D_θ back to D_train legitimate [Sugiyama et al., 2007, Shimodaira, 2000, Precup et al., 2000]. It ensures that the density ratio dD_θ/dD_train exists and has a finite second moment, which is required for the Cauchy-Schwarz step in Lemma 5 [Owen, 2013]. The coefficient C(θ) plays the role of an amplification factor, quantifying how strongly the reward model error can be magnified when the policy visits regions that are rare under the data used for reward modelling.
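For intuition, C(θ) = √(E_{D_train}[(dD_θ/dD_train)²]) can be computed exactly for finite distributions. A sketch under that assumption; the two toy distributions are hypothetical:

```python
import math

def coverage_coefficient(d_theta, d_train):
    """Compute C(theta) = sqrt(1 + chi^2(D_theta || D_train)) for finite
    distributions given as dicts mapping outcomes to probabilities.

    Uses the identity 1 + chi^2(P || Q) = E_Q[(dP/dQ)^2] = sum_z P(z)^2 / Q(z),
    and requires absolute continuity: d_train must cover d_theta's support.
    """
    second_moment = 0.0
    for z, p in d_theta.items():
        q = d_train.get(z, 0.0)
        if q == 0.0 and p > 0.0:
            raise ValueError("D_theta is not absolutely continuous w.r.t. D_train")
        if q > 0.0:
            second_moment += p ** 2 / q
    return math.sqrt(second_moment)

# Identical distributions give C(theta) = 1: no amplification.
uniform = {"a": 0.5, "b": 0.5}
assert abs(coverage_coefficient(uniform, uniform) - 1.0) < 1e-12

# A shifted policy-induced distribution yields C(theta) > 1.
shifted = {"a": 0.9, "b": 0.1}
c = coverage_coefficient(shifted, uniform)  # sqrt(0.81/0.5 + 0.01/0.5)
```

The further D_θ drifts from D_train, the larger the multiplier applied to the reward model's training error in Lemma 5.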
Our theory also relies on the following mild assumption.

Assumption 2 (Bounded training error). The squared training error L^(2)_train(ϕ) defined in eq. (1) is bounded.

Remark 6. This assumption does not assert that the reward model is accurate everywhere; instead, it provides a baseline level of accuracy on the distribution where preference supervision is available. The coverage coefficient explains how this baseline can degrade under deployment.

We then prove the reward shift bound.

Lemma 5 (Reward shift error bound). Under Assumptions 1 and 2, we have

|J^ϕ(θ) − J⋆(θ)| ≤ C(θ) √(L^(2)_train(ϕ)).

Proof sketch. To relate |J^ϕ(θ) − J⋆(θ)| to the training distribution, we rewrite the expectation under D_θ as an importance-weighted expectation under D_train. Assumption 1 ensures that the required density ratio exists and has a finite second moment. Applying the χ² change-of-measure bound (Lemma 11 in Appendix B) yields a product of two square roots: the first factor under the square root is exactly 1 + χ²(D_θ ∥ D_train), the second moment of the density ratio under D_train, whose square root therefore produces C(θ); and the second factor under the square root is, by definition, L^(2)_train(ϕ), the second moment of the reward model error under D_train. This yields Lemma 5. Detailed proofs are in Appendix C.3.

Remark 7. Lemma 5 characterises two ingredients that play different roles: (1) the term L^(2)_train(ϕ) measures reward model error only on the reward model training distribution D_train; and (2) the coefficient C(θ) measures how far D_θ moves away from that training distribution. The lemma thus quantifies how much training error can be amplified when moving to deployment.

When prompt shifts and policy shifts are qualitatively different, it is useful to further factorise the coverage coefficient. The next lemma separates the sources of shift encountered in practice.
Lemma 6 (Coverage factorisation). Suppose ρ ≪ ρ_label and π_θ(· | x) ≪ π_ref(· | x) for any prompt x with ρ_label(x) > 0. Define

C_prompt := (E_{X∼ρ_label}[(ρ(X) / ρ_label(X))²])^{1/2},

and

C_pol(θ) := sup_{x∈X : ρ_label(x)>0} (E_{Y∼π_ref(·|x)}[(π_θ(Y | x) / π_ref(Y | x))²])^{1/2}.

If both C_prompt and C_pol(θ) are bounded, we have C(θ) ≤ C_prompt C_pol(θ).

Remark 8. Lemma 6 separates mismatch in prompts from mismatch in policies. The coefficient C_prompt depends only on the shift between ρ and ρ_label. The coefficient C_pol(θ) depends only on how far π_θ departs from π_ref on the support of ρ_label. This separation is valuable when diagnosing failures in reward modelling and post-training, because the two sources of shift have different operational causes and different mitigation strategies.

4.4 KL CLIPPING ERROR BOUND

We now bound the KL clipping error |J^{ϕ,τ}(θ) − J^ϕ(θ)|. This is the only place where the systematic mismatch created by clipping enters the analysis. Clipping is beneficial in the sampling bounds because it makes each rollout contribution bounded. Meanwhile, clipping may bias the objective used in practice, thereby introducing a systematic mismatch between the optimised objective and the target objective. To state this mismatch cleanly, we only require an integrability condition on the exact log ratio in deployment.

Assumption 3 (Integrability of exact log ratio). The exact log ratio is integrable in deployment, i.e., E_{(X,Y)∼D_θ}[|ℓ_θ(X, Y)|] < ∞.

Remark 9. Assumption 3 is mild. It allows heavy tails in ℓ_θ while still ensuring that the exact objective is well defined.
Under this assumption, clipping is analysed as an explicit bias-inducing modification of the penalty; the KL clipping error term in Lemma 7, β E_{(X,Y)∼D_θ}[|ℓ_θ(X, Y) − ℓ^τ_θ(X, Y)|], is an objective mismatch term that does not vanish asymptotically as the number of evaluation prompts or rollouts increases. The assumption is strictly weaker than assuming ℓ_θ is uniformly bounded, and it matches the intent of treating clipping as an algorithmic choice rather than as a structural property of the policy class. Similar integrability conditions are standard in analyses of truncation-based stabilisation and importance sampling; see, e.g., Ionides [2008], Owen and Zhou [2000].

We then prove the following bound on the bias induced by the surrogate.

Lemma 7 (KL clipping error bound). Under Assumption 3, we have

|J^{ϕ,τ}(θ) − J^ϕ(θ)| ≤ β E_{(X,Y)∼D_θ}[|ℓ_θ(X, Y) − ℓ^τ_θ(X, Y)|].  (6)

Proof sketch. |J^{ϕ,τ}(θ) − J^ϕ(θ)| is not an estimation error; it compares two population objectives under the same learned reward, the only difference being whether the penalty uses ℓ^τ_θ or ℓ_θ. Expanding definitions shows that the reward contributions cancel and only the penalty difference remains. Taking absolute values and applying the triangle inequality yields Lemma 7. Detailed proofs are in Appendix C.3.

Remark 10. The right-hand side of eq. (6) measures the clipping bias directly as the expected amount of truncation under the deployment distribution. This term can remain nonzero even with infinite evaluation data, which reflects the fact that clipping is an objective mismatch rather than an estimation error. It is small when the policy rarely produces extreme log ratios under D_θ, and it can be large when the policy places substantial mass in regions where the exact log ratio has heavy tails.
This is why the final bound contains a term that depends on the tail behaviour of the exact log ratio under D_θ and does not involve n or K.

4.5 FIXED-POLICY GENERALISATION BOUND

We now combine the results on the sampling error, reward shift error, and KL clipping error into a single statement for a fixed policy parameter θ, as follows.

Theorem 1 (Fixed-policy generalisation bound). Under Assumptions 1, 2, and 3, with probability at least 1 − δ over the evaluation prompts and rollouts, the following holds:

|Ĵ^{ϕ,τ}_{n,K}(θ) − J⋆(θ)| ≤ (1 + 2βτ) (√(log(4/δ) / (2n)) + √(log(4/δ) / (2nK)))  (sampling error)
  + C(θ) √(L^(2)_train(ϕ))  (reward shift error)
  + β E_{(X,Y)∼D_θ}[|ℓ_θ(X, Y) − ℓ^τ_θ(X, Y)|]  (KL clipping error).  (7)

4.6 DATA-DEPENDENT PAC-BAYES BOUND

The fixed-policy theorem treats θ as pre-fixed. In practice, θ is often selected after observing data. This section closes the gap by employing PAC-Bayes theory, which extends our analysis to data-dependent selection [McAllester, 1999, Seeger, 2002, Catoni, 2007]. Specifically, we provide a bound that holds simultaneously for all posteriors Q over Θ, at the cost of a complexity term that measures how far Q deviates from a prior P on Θ. Define

Ĵ^{ϕ,τ}_{n,K}(Q) = E_{θ∼Q}[Ĵ^{ϕ,τ}_{n,K}(θ)],  J⋆(Q) = E_{θ∼Q}[J⋆(θ)].

Then we have the following data-dependent PAC-Bayes bound.

Theorem 2 (Data-dependent generalisation bound). Let P be any prior distribution on Θ that is independent of the evaluation prompts and rollouts. For any posterior Q on Θ, suppose Assumptions 1, 2, and 3 hold for every θ in the support of Q. Then, with probability at least 1 − δ over the evaluation prompts and rollouts, the following inequality holds simultaneously for all such posteriors Q:

|Ĵ^{ϕ,τ}_{n,K}(Q) − J⋆(Q)| ≤ (1 + 2βτ) (√((KL(Q∥P) + log(8/δ)) / (2n)) + √((KL(Q∥P) + log(8/δ)) / (2nK)))  (sampling error)
  + E_{θ∼Q}[C(θ)] √(L^(2)_train(ϕ))  (reward shift error)
  + β E_{θ∼Q}[E_{(X,Y)∼D_θ}[|ℓ_θ(X, Y) − ℓ^τ_θ(X, Y)|]]  (KL clipping error).

Remark 11. Compared with Theorem 1, Theorem 2 replaces the fixed θ with an average over θ ∼ Q. The complexity term KL(Q∥P) appears only inside the sampling error, as the price paid for making the guarantee hold uniformly over data-dependent choices of Q.

5 SPECIAL CASES

The PAC-Bayes bound in Theorem 2 contains a complexity term KL(Q∥P), which measures how strongly the data-dependent posterior Q departs from the data-independent prior P. This section discusses two operational instantiations of KL(Q∥P) that are common in practice.

5.1 INITIALISATION BY UNIFORM PRIOR OVER A FINITE CANDIDATE CLASS

Let M ≥ 2 be an integer, and let θ^(1), . . . , θ^(M) ∈ Θ be a collection of candidate parameters specified independently of the evaluation sample used to construct Ĵ^{ϕ,τ}_{n,K}. Let Θ_M := {θ^(1), . . . , θ^(M)}, and restrict both P and Q to be distributions on Θ_M. Suppose the prior is uniform on Θ_M, i.e., P(θ^(m)) = 1/M for any m. This non-informative prior is standard in finite model selection [Seeger, 2002].

Corollary 1 (KL bound for uniform prior over a finite candidate class). Under the conditions above, KL(Q∥P) ≤ log M. In particular, if Q is the Dirac distribution supported on a data-selected checkpoint θ^(m̂), we have KL(Q∥P) = log M.

Remark 12. Corollary 1 yields a direct interpretation of model selection in the PAC-Bayes bound. Evaluating M fixed checkpoints and selecting one using the evaluation sample incurs an additional sampling error cost in Theorem 2, controlled by log M via the quantity KL(Q∥P).

5.2 TRAINING RLHF BY SGD AS AN ORNSTEIN-UHLENBECK PROCESS

Stochastic gradient descent (SGD) and its variants are popular optimisers [Robbins and Monro, 1951].
Suppose the parameter space is R^d for some d ∈ N. Assume the prior is Gaussian, i.e., P = N(θ_0, Λ) for some θ_0 ∈ R^d and some symmetric positive definite matrix Λ ∈ R^{d×d}. We employ a standard local diffusion approximation for constant-step-size SGD. Near a locally stable optimum, late-stage SGD iterates follow an Ornstein-Uhlenbeck (OU) process [Uhlenbeck and Ornstein, 1930]:

dθ_t = −H(θ_t − θ̂) dt + √ε Σ_g^{1/2} dW_t,

where W_t is a d-dimensional Brownian motion, H ≻ 0 is the local Hessian at θ̂, and Σ_g ≻ 0 is the local gradient-noise covariance. We make the following assumptions, standard for this local OU approximation; see Mandt et al. [2017], He et al. [2019], Chen et al. [2023].

Assumption 4. Assume the optimisation problem has a locally stable optimum θ̂ ∈ R^d; i.e., within a neighbourhood of θ̂, the objective admits a quadratic approximation with Hessian H ≻ 0, and the gradient-noise covariance is approximately constant and equal to Σ_g ≻ 0. In addition, the matrices H and Σ_g commute, i.e., H Σ_g = Σ_g H. Moreover, there exist constants 0 < m ≤ M < ∞ such that the local curvature spectrum satisfies mI ⪯ H ⪯ MI.

Under the assumption above, the OU process admits a stationary Gaussian law N(θ̂, Σ), where Σ ≻ 0 satisfies the continuous Lyapunov equation HΣ + ΣH = ε Σ_g. Accordingly, we approximate the PAC-Bayes posterior induced by late-stage SGD iterates by Q_SGD := N(θ̂, Σ).

Corollary 2 (KL bound for SGD as an Ornstein-Uhlenbeck process). In the special case above, the PAC-Bayes complexity term admits the upper bound

KL(Q_SGD ∥ P) ≤ (1/2) [(θ̂ − θ_0)^⊤ Λ^{−1} (θ̂ − θ_0) + (ε/(2m)) tr(Λ^{−1} Σ_g) − d + log det(Λ) − log det(Σ_g) − d log(ε/(2M))].  (8)

A detailed proof is given in Appendix C.7.2.

Remark 13. Corollary 2 yields a locally valid, optimiser-explicit bound for KL(Q∥P) via the stationary diffusion approximation of constant-step-size SGD.
This secondary specialisation of Theorem 2 imposes no additional structural assumptions on the main RLHF analysis. The diffusion perspective and the associated Ornstein-Uhlenbeck approximation are discussed in detail by Mandt et al. [2017].

6 PRACTICAL IMPLICATIONS

The discussion below translates our theory into concrete, practical algorithm-design recommendations.

6.1 OPTIMAL KL CLIPPING THRESHOLD

Lemma 4 includes a factor 1 + 2βτ, indicating that a smaller τ tightens the sampling deviations that arise from finite n and K. Lemma 12 in Appendix B gives the corresponding KL-specific concentration bound for the clipped log-ratio average, whose deviation also scales linearly with τ. Meanwhile, clipping changes the regularised objective and introduces a systematic mismatch that does not vanish with more evaluation samples, as formalised in Lemma 7.

Therefore, τ acts as a bias-variance trade-off hyperparameter rather than a purely stabilising tweak. Aggressive clipping reduces Monte Carlo noise but increases objective mismatch; weak clipping preserves the exact KL objective but exposes training to high-variance log-ratio estimates. For brevity, we define

α_{n,K,δ} := √(log(4/δ) / (2n)) + √(log(4/δ) / (2nK)),
T_θ(τ) := E_{(X,Y)∼D_θ}[(|ℓ_θ(X, Y)| − τ)₊].

Since |ℓ_θ − ℓ^τ_θ| = (|ℓ_θ| − τ)₊ under symmetric clipping, the τ-dependent part of eq. (7) is

B_θ(τ) := (1 + 2βτ) α_{n,K,δ} + β T_θ(τ).

Let τ⋆ be any minimiser of τ ↦ B_θ(τ) over τ ≥ 0. We have the following corollary.

Corollary 3 (Optimal KL clipping threshold).
For any parameters θ ∈ Θ and ϕ ∈ Φ, regularisation coefficient β > 0, confidence level δ ∈ (0, 1), and integers n ≥ 1 and K ≥ 1, if 2α_{n,K,δ} < 1, then τ⋆ satisfies

Pr_{(X,Y)∼D_θ}(|ℓ_θ(X, Y)| > τ⋆) ≤ 2α_{n,K,δ} ≤ Pr_{(X,Y)∼D_θ}(|ℓ_θ(X, Y)| ≥ τ⋆).

If, in addition, Pr_{(X,Y)∼D_θ}(|ℓ_θ(X, Y)| = τ⋆) = 0, we have

Pr_{(X,Y)∼D_θ}(|ℓ_θ(X, Y)| > τ⋆) = 2α_{n,K,δ},

and, equivalently, τ⋆ is the (1 − 2α_{n,K,δ})-quantile of |ℓ_θ(X, Y)| under D_θ. Otherwise, if 2α_{n,K,δ} ≥ 1, then τ⋆ = 0 is a minimiser of τ ↦ B_θ(τ) over τ ≥ 0. Detailed proofs are in Appendix C.4.

Remark 14. Corollary 3 suggests choosing τ so that the clipping fraction Pr(|ℓ_θ| > τ) matches the target level 2α_{n,K,δ}. As the evaluation budget (n or K) increases, α_{n,K,δ} decreases. Consequently, the target clipping fraction decreases and the recommended threshold τ increases. This quantile-based rule automatically relaxes clipping as Monte Carlo error diminishes.

Threshold calibration.  Practitioners often treat the clipping threshold τ as a static hyperparameter that requires manual tuning. Corollary 3 instead yields a direct, budget-aware calibration rule. Given an evaluation batch {(x_i, y_{i,j})}_{i≤n, j≤K}, compute the log-ratio magnitudes u_{i,j} := |ℓ_θ(x_i, y_{i,j})|. If 2α_{n,K,δ} ≥ 1, set τ̂ := 0. Otherwise, set τ̂ to the empirical (1 − 2α_{n,K,δ})-quantile of {u_{i,j}}. Algorithmically, this theory-guided rule balances the bias-variance trade-off by clipping approximately the top 2α_{n,K,δ} fraction of extreme log-ratios in the batch, thereby reducing reliance on heuristic hyperparameter sweeps. Theorems 1-2 treat τ as fixed; when τ is selected from the evaluation sample (e.g., by an empirical quantile rule), the resulting procedure should be viewed as a practical calibration heuristic unless additional uniformity or sample-splitting arguments are used.
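A minimal sketch of this calibration rule, assuming the log-ratios have already been collected into a flat array (the function name `calibrate_tau` and the default confidence level are illustrative choices, not part of the theory):

```python
import numpy as np

def calibrate_tau(log_ratios, n, K, delta=0.05):
    """Quantile-based KL clipping threshold following Corollary 3.

    log_ratios: log-probability ratios l_theta(x_i, y_ij) gathered from
    an evaluation batch of n prompts with K rollouts per prompt.
    """
    # Target clipping fraction 2 * alpha_{n,K,delta} from Lemma 4.
    alpha = (np.sqrt(np.log(4 / delta) / (2 * n))
             + np.sqrt(np.log(4 / delta) / (2 * n * K)))
    frac = 2 * alpha
    if frac >= 1.0:
        # Sampling noise dominates: Corollary 3 gives tau* = 0.
        return 0.0
    # tau_hat is the empirical (1 - 2*alpha)-quantile of |l_theta|.
    return float(np.quantile(np.abs(log_ratios), 1.0 - frac))
```

For instance, with n = 1000 prompts, K = 1 rollout, and δ = 0.05, roughly the top 19% of log-ratio magnitudes are clipped; as n or K grows, this fraction shrinks and τ̂ rises, consistent with Remark 14.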
6.2 BUDGET ALLOCATION ACROSS PROMPTS, ROLLOUTS, AND PREFERENCE DATA

Given a fixed computational budget, practitioners often face an allocation trade-off among prompts, rollouts per prompt, and preference data. This subsection provides theoretically grounded guidelines for this budget distribution.

6.2.1 Uniform-cost baseline

Suppose rollouts share the same cost and the sampling budget is bounded by nK ≤ B for some B > 0. Substituting n = B/K into the leading-order sampling terms of Lemma 4 reveals that the upper bound is minimised at K⋆ = 1. Therefore, under a uniform-cost model, the range-based concentration bound strongly favours allocating budget to prompt coverage rather than additional rollouts per prompt. A detailed derivation is given in Appendix C.8.

6.2.2 Prefill and decode cost model

In LLM inference, sampling costs are typically asymmetric across prompts and rollouts [Pope et al., 2023]. Evaluating a new prompt requires a forward pass over prompt tokens to construct an attention cache, whereas additional rollouts reuse this cache, primarily incurring incremental decoding costs [Kwon et al., 2023]. We model this asymmetry by separating a prefill and a decode cost, imposing the constraint B ≥ n c_prefill + nK c_decode. Substituting n = B/(c_prefill + K c_decode) into the dominant sampling structure isolates a one-dimensional objective in K. Treating K ≥ 1 as a continuous variable in the leading-order proxy from Lemma 4 yields the following optimal allocation rule.

Corollary 4 (Optimal rollout allocation). The continuous proxy minimiser over K ≥ 1 satisfies

K⋆ = max{ 1, (c_prefill / c_decode)^{2/3} }.

In practice, one may take K = ⌊K⋆⌉, and then set n by the budget constraint. Detailed proofs are in Appendix C.8.

Remark 15. The expression for K⋆ depends only on the ratio c_prefill/c_decode because the shared range multiplier 1 + 2βτ does not affect the minimiser.
The 2/3 power law implies that the optimal number of rollouts per prompt grows sublinearly with c_prefill/c_decode.

Variance-aware refinement.  The range-based sampling simplification above is conservative because it does not separate prompt-level variability from rollout-level variability. A refinement is to use a two-stage variance decomposition. Let Z denote a per-rollout contribution in the empirical objective (see the variables Z_{i,j} in the proof of Lemma 2 in Appendix C.2), and define

σ²_prompt := Var(E[Z | X]),   σ²_rollout := E[Var(Z | X)].

Corollary 5. Under the same cost constraint B ≥ n c_prefill + nK c_decode, optimising the resulting variance proxy yields an allocation rule of the form

K⋆ ≈ max{ 1, √( (c_prefill / c_decode) · (σ²_rollout / σ²_prompt) ) }.

A proof is given in Appendix C.8.

6.2.3 Preference data

Beyond prompts and rollouts, the reward shift error introduces an additional budget consideration. By Lemma 5, this term depends on the reward-model training error L^{(2)}_train(ϕ) and the coverage coefficient C(θ). Preference data collection therefore affects the bound in two ways. Increasing relevant preference data can improve reward-model fit on the training distribution, and collecting data closer to the policy-induced distribution can reduce the mismatch captured by C(θ). These observations provide guidance for preference data collection through their effect on the reward shift term, although the present analysis does not derive an explicit allocation rule in terms of the number of preference labels. This implication is most relevant when the sampling terms are no longer the dominant terms in the bound.
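The allocation rules of Corollaries 4 and 5 reduce to a few lines of arithmetic; the sketch below is illustrative (the function name, the rounding of K⋆, and the example cost values are choices, not part of the theory):

```python
import math

def rollout_allocation(budget, c_prefill, c_decode,
                       var_rollout=None, var_prompt=None):
    """Split a sampling budget B >= n*c_prefill + n*K*c_decode.

    Range-based rule (Corollary 4): K* = max(1, (c_prefill/c_decode)^(2/3)).
    Variance-aware rule (Corollary 5), used when both variance
    components are supplied:
        K* = max(1, sqrt((c_prefill/c_decode) * var_rollout/var_prompt)).
    Returns (K, n) with K rounded and n set by the budget constraint.
    """
    ratio = c_prefill / c_decode
    if var_rollout is None or var_prompt is None:
        k_star = max(1.0, ratio ** (2.0 / 3.0))
    else:
        k_star = max(1.0, math.sqrt(ratio * var_rollout / var_prompt))
    k = max(1, round(k_star))                      # K = round(K*)
    n = int(budget // (c_prefill + k * c_decode))  # n from the budget
    return k, n
```

With a prefill-to-decode cost ratio of 8, for example, the range-based rule gives K = 4, while the variance-aware rule increases K whenever rollout-level variance dominates prompt-level variance.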
7 CONCLUSIONS

Alignment and adaptation in large language models (LLMs) are now driven by reinforcement learning from human feedback (RLHF), but a rigorous theory of how RLHF generalises is still underdeveloped, particularly when the reward may shift and a KL clipping regularisation is employed. To address this gap, we develop generalisation theory for RLHF that explicitly models two key practical effects: (1) distribution shift between the data used to train the reward model and the policy-induced distribution encountered at deployment, and (2) statistical noise introduced by empirical estimation of the clipped KL regulariser. We prove high-probability generalisation bounds that decompose the generalisation error into interpretable components, including sampling error from both prompts and rollouts, reward shift error, and KL clipping error. Our theory suggests optimal KL clipping threshold rules, quantitative budget allocation guidance on prompts and rollouts, and guidance for preference data collection through the reward shift term.

References

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Danny Hernandez, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint, 2022a.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional AI: Harmlessness from AI feedback. arXiv preprint, 2022b.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Olivier Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, volume 56 of IMS Lecture Notes-Monograph Series. Institute of Mathematical Statistics, 2007. doi: 10.1214/074921707000000391.

Feng Chen, Daniel Kunin, Atsushi Yamamura, and Surya Ganguli. Stochastic collapse: How gradient noise attracts SGD dynamics towards simpler subnetworks. Advances in Neural Information Processing Systems, 36:35027-35063, 2023.

Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1-53, 2024. URL https://jmlr.org/papers/v25/23-0870.html.

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 10835-10866. PMLR, 2023. URL https://proceedings.mlr.press/v202/gao23h.html.

Adam Gleave and Geoffrey Irving. Uncertainty estimation for language reward models. arXiv preprint, 2022.

Fengxiang He, Tongliang Liu, and Dacheng Tao. Control batch size and learning rate to generalize well: Theoretical and empirical evidence. Advances in Neural Information Processing Systems, 32, 2019.

Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

Edward L. Ionides. Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295-311, 2008. doi: 10.1198/106186008X320456.

Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86, 1951. doi: 10.1214/aoms/1177729694.

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention.
In Jason Flinn, Margo I. Seltzer, Peter Druschel, Antoine Kaufmann, and Jonathan Mace, editors, Proceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, Germany, October 23-26, 2023, pages 611-626. ACM, 2023. doi: 10.1145/3600006.3613165. URL https://doi.org/10.1145/3600006.3613165.

Nathan Lambert. Reinforcement learning from human feedback, 2025. arXiv:2504.12501. RLHF Book; online version also available at https://rlhfbook.com/.

Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi. RewardBench: Evaluating reward models for language modeling. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1755-1797, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. doi: 10.18653/v1/2025.findings-naacl.96. URL https://aclanthology.org/2025.findings-naacl.96/.

Zhaochun Li, Mingyang Yi, Yue Wang, Shisheng Cui, and Yong Liu. Towards a theoretical understanding to the generalization of RLHF. arXiv preprint, 2026.

Kezhao Liu, Jason Klein Liu, Mingtao Chen, and Yiming Liu. Rethinking KL regularization in RLHF: From value estimation to gradient optimization. arXiv preprint, 2025. doi: 10.48550/arXiv.2510.01555.

Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18(134):1-35, 2017.

David A. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory (COLT), pages 164-170, 1999. doi: 10.1145/307400.307435.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015. doi: 10.1038/nature14236.

Kevin P. Murphy.
Probabilistic Machine Learning: An Introduction. MIT Press, Cambridge, MA, 2022. URL https://probml.github.io/pml-book/book1.html.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022.

Art Owen and Yi Zhou. Safe and effective importance sampling. Journal of the American Statistical Association, 95(449):135-143, 2000. doi: 10.1080/01621459.2000.10473909.

Art B. Owen. Monte Carlo theory, methods and examples. Self-published, 2013. URL https://artowen.su.domains/mc/.

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. In Proceedings of Machine Learning and Systems, 2023. URL https://proceedings.mlsys.org/paper_files/paper/2023/hash/c4be71ab8d24cdfb45e3d06dbfca2780-Abstract-mlsys2023.html.

Doina Precup, Richard S. Sutton, and Satinder Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 759-766, 2000. URL https://www.incompleteideas.net/papers/PSS-00.pdf.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951. doi: 10.1214/aoms/1177729586. URL https://doi.org/10.1214/aoms/1177729586.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1889-1897. PMLR, 2015. URL https://proceedings.mlr.press/v37/schulman15.html.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint, 2017.

Matthias Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3:233-269, 2002.

Vedant Shah, Johan Obando-Ceron, Vineet Jain, Brian Bartoldson, Bhavya Kailkhura, Sarthak Mittal, Glen Berseth, Pablo Samuel Castro, Yoshua Bengio, Nikolay Malkin, Moksh Jain, Siddarth Venkatraman, and Aaron Courville. A comedy of estimators: On KL regularization in RL training of LLMs. arXiv preprint, 2025.

Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, 2000. doi: 10.1016/S0378-3758(00)00115-4.

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, 2020.

Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985-1005, 2007.

Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, NY, 2009. ISBN 978-0-387-79051-0. doi: 10.1007/b13794.

G. E. Uhlenbeck and L. S. Ornstein. On the theory of the Brownian motion. Physical Review, 36(5):823-841, 1930. doi: 10.1103/PhysRev.36.823.

Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang. Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 54715-54754. PMLR, 2024.
URL https://proceedings.mlr.press/v235/xiong24a.html.

Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, and Wen Sun. Provable offline preference-based reinforcement learning. In The Twelfth International Conference on Learning Representations (ICLR). OpenReview.net, 2024. URL https://openreview.net/forum?id=tVMPfEGT2w.

Banghua Zhu, Michael Jordan, and Jiantao Jiao. Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 43037-43067. PMLR, 2023. URL https://proceedings.mlr.press/v202/zhu23f.html.

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul F. Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint, 2019.

Generalisation of RLHF under Reward Shift and Clipped KL Regularisation (Supplementary Material)

Kenton Tang 1  Yuzhu Chen 2  Fengxiang He 1
1 University of Edinburgh  2 University of Science and Technology of China

A NOTATION

Table 1: Notation

Symbol    Meaning
X, Y    Prompt space and response space.
(x, y)    A prompt-response pair.
ρ    Prompt distribution used for post-training / evaluation.
ρ_label    Prompt distribution used for collecting preference data (reward modelling).
π(· | x)    A policy: conditional distribution over responses given prompt x.
π_θ    Post-trained policy, parameterised by θ.
Θ    Policy parameter space.
π_ref    Reference policy (typically an SFT model).
θ    Parameters of the policy π_θ.
Φ    Reward-model parameter space.
ϕ    Parameters of the learned reward model r̂_ϕ.
r⋆: X × Y → [0, 1]    Target (oracle) reward function.
r̂_ϕ: X × Y → [0, 1]    Learned reward model with parameters ϕ.
e_ϕ(x, y)    Reward-model error, typically e_ϕ(x, y) = r̂_ϕ(x, y) − r⋆(x, y).
D_train    Joint distribution for reward-model training, e.g. D_train(x, y) = ρ_label(x) π_ref(y | x).
D_θ    Policy-induced joint distribution, D_θ(x, y) = ρ(x) π_θ(y | x).
L^{(2)}_train(ϕ)    Reward-model MSE on D_train: E_{(X,Y)∼D_train}[e_ϕ(X, Y)²].
χ²(D_θ ∥ D_train)    Chi-square divergence measuring coverage / shift from D_train to D_θ.
C(θ)    Coverage coefficient, typically C(θ) = √(1 + χ²(D_θ ∥ D_train)).
C_prompt    Prompt-shift component of coverage (in a factorisation of C(θ)).
C_pol(θ)    Policy-shift component of coverage (in a factorisation of C(θ)).
β > 0    KL-regularisation strength (penalty coefficient).
ℓ_θ(x, y)    Log-ratio, ℓ_θ(x, y) = log π_θ(y | x) − log π_ref(y | x).
τ > 0    Clipping threshold for log-ratios.
ℓ^τ_θ(x, y)    Clipped log-ratio, ℓ^τ_θ(x, y) = clip(ℓ_θ(x, y), −τ, τ).
KL(π_θ(· | x) ∥ π_ref(· | x))    Reference KL at prompt x (population expectation of ℓ_θ(x, Y) under Y ∼ π_θ(· | x)).
J_r(θ)    Population objective under reward r: E_{X∼ρ, Y∼π_θ(·|X)}[r(X, Y) − β ℓ_θ(X, Y)].
J_{r,τ}(θ)    Clipped population objective: replace ℓ_θ by ℓ^τ_θ in J_r(θ).
J⋆(θ)    Target objective, typically J⋆(θ) = J_{r⋆}(θ).
J_ϕ(θ)    Learned-reward objective, typically J_ϕ(θ) = J_{r̂_ϕ}(θ).
J_{ϕ,τ}(θ)    Learned-reward clipped objective, typically J_{ϕ,τ}(θ) = J_{r̂_ϕ,τ}(θ).
Ĵ^{r,τ}_{n,K}(θ)    Empirical objective using n prompts and K rollouts per prompt (reward r, clipping τ).
Ĵ^{r,τ}_{n,∞}(θ)    Conditional (infinite-rollout) analogue: expectation over rollouts given the n sampled prompts.
n    Number of sampled prompts.
K    Number of rollouts per prompt.
P    Prior distribution over Θ (PAC-Bayes).
Q    Posterior distribution over Θ (PAC-Bayes).
KL(Q ∥ P)    PAC-Bayes complexity term.
δ ∈ (0, 1)    Confidence parameter for high-probability bounds.
B DEFINITIONS AND LEMMAS

Definition 2 (KL divergence [Kullback and Leibler, 1951]). Suppose that P is absolutely continuous with respect to Q. The KL divergence is defined by

KL(P ∥ Q) := ∫ p(x) log(p(x)/q(x)) dx.

Lemma 8 (Hoeffding's inequality [Hoeffding, 1963]). Let Z_1, …, Z_N be independent random variables. Assume there exist constants a ≤ b such that a ≤ Z_i ≤ b almost surely for every i. Then, for any δ ∈ (0, 1), with probability at least 1 − δ,

|(1/N) Σ_{i=1}^N Z_i − E[(1/N) Σ_{i=1}^N Z_i]| ≤ (b − a) √(log(2/δ)/(2N)).

Lemma 9 (Hoeffding's lemma [Boucheron et al., 2013]). Let Z be a random variable and assume a ≤ Z ≤ b almost surely. Then, for any λ ∈ ℝ,

E[exp(λ(Z − E[Z]))] ≤ exp(λ²(b − a)²/8).

Lemma 10 (Change of measure [Catoni, 2007]). Let P and Q be distributions on Θ such that KL(Q ∥ P) < ∞. Let F: Θ → ℝ satisfy E_{θ∼P}[exp(F(θ))] < ∞. Then,

E_{θ∼Q}[F(θ)] ≤ KL(Q ∥ P) + log E_{θ∼P}[exp(F(θ))].

Proof. Let p and q denote densities of P and Q with respect to a common reference measure. By definition, KL(Q ∥ P) = E_Q[log(q/p)]. Start from the identity E_Q[F] = E_Q[log(e^F)]. Insert the density ratio p/q inside the logarithm:

E_Q[F] = E_Q[log(e^F p/q)] + E_Q[log(q/p)].

The second term is exactly KL(Q ∥ P). For the first term, Jensen's inequality gives

E_Q[log(e^F p/q)] ≤ log E_Q[e^F p/q] = log E_P[e^F].

Substituting these two relations into the previous display yields E_Q[F] ≤ KL(Q ∥ P) + log E_P[e^F], which is the claimed inequality.

Definition 3 (χ² divergence [Tsybakov, 2009]). Suppose that D_θ is absolutely continuous with respect to D_train. The χ² divergence is defined by

χ²(D_θ ∥ D_train) := E_{(X,Y)∼D_train}[ (D_θ(X, Y)/D_train(X, Y) − 1)² ].

Lemma 11 (χ² change of measure). Let P and Q be distributions on a common space and assume Q ≪ P. Let w = dQ/dP and assume χ²(Q ∥ P) < ∞.
If f satisfies E_{Z∼P}[f(Z)²] < ∞, we have

|E_{Z∼Q}[f(Z)]| ≤ √(1 + χ²(Q ∥ P)) √(E_{Z∼P}[f(Z)²]).

Proof. Because Q ≪ P, the density ratio w = dQ/dP exists and the expectation under Q can be written as E_{Z∼Q}[f(Z)] = E_{Z∼P}[w(Z) f(Z)]. Applying Cauchy-Schwarz to the right-hand side gives

|E_P[w f]| ≤ √(E_P[w²]) √(E_P[f²]).

It remains to express E_P[w²] in terms of χ²(Q ∥ P). By definition, χ²(Q ∥ P) = E_P[(w − 1)²] = E_P[w²] − 2 E_P[w] + 1. Also E_P[w] = 1, since w = dQ/dP integrates to 1 under P. Substituting E_P[w] = 1 into the previous identity yields E_P[w²] = 1 + χ²(Q ∥ P). Plugging this into the Cauchy-Schwarz bound gives

|E_{Z∼Q}[f(Z)]| ≤ √(1 + χ²(Q ∥ P)) √(E_{Z∼P}[f(Z)²]),

which completes the proof.

Lemma 12 (Monte Carlo estimation of the clipped log-ratio). Under the same conditions as Lemma 2, with probability at least 1 − δ over the evaluation prompts and rollouts,

|κ̂^τ_{n,K}(θ) − κ^τ(θ)| ≤ 2τ ( √(log(4/δ)/(2n)) + √(log(4/δ)/(2nK)) ).   (9)

Lemma 13 (KL divergence between Gaussian distributions [Murphy, 2022]). Let Q = N(μ_Q, Σ_Q) and P = N(μ_P, Σ_P) be Gaussian distributions on ℝ^d, where Σ_Q ≻ 0 and Σ_P ≻ 0. Then,

KL(Q ∥ P) = (1/2) [ tr(Σ_P^{−1} Σ_Q) + (μ_Q − μ_P)^⊤ Σ_P^{−1} (μ_Q − μ_P) − d + log(det(Σ_P)/det(Σ_Q)) ].   (10)

C PROOFS

C.1 ERROR DECOMPOSITION

Proof of Lemma 1. Let θ ∈ Θ and ϕ ∈ Φ be arbitrary, and let τ > 0 be an arbitrary clipping threshold. The argument is a purely algebraic decomposition in which two intermediate population objectives are inserted between the empirical surrogate objective and the target objective. Consider the difference Ĵ^{ϕ,τ}_{n,K}(θ) − J⋆(θ). Add and subtract the intermediate quantities J_{ϕ,τ}(θ) and J_ϕ(θ) to obtain

Ĵ^{ϕ,τ}_{n,K}(θ) − J⋆(θ) = [Ĵ^{ϕ,τ}_{n,K}(θ) − J_{ϕ,τ}(θ)] + [J_{ϕ,τ}(θ) − J_ϕ(θ)] + [J_ϕ(θ) − J⋆(θ)].
Taking absolute values and applying the triangle inequality gives

|Ĵ^{ϕ,τ}_{n,K}(θ) − J⋆(θ)| ≤ |Ĵ^{ϕ,τ}_{n,K}(θ) − J_{ϕ,τ}(θ)| + |J_{ϕ,τ}(θ) − J_ϕ(θ)| + |J_ϕ(θ) − J⋆(θ)|.

This is exactly the inequality stated in Lemma 1.

C.2 STATISTICAL ERROR

Proof of Lemma 2. Let θ ∈ Θ be an arbitrary policy parameter. Let r: X × Y → [0, 1] be an arbitrary reward function, let τ > 0 be an arbitrary clipping threshold, and let δ ∈ (0, 1) be an arbitrary confidence level. The goal is to control the Monte Carlo deviation arising from drawing only K rollouts per prompt, while conditioning on the realised prompts. Let x_1, …, x_n denote the realised prompts. For each i ∈ {1, …, n} and each rollout index j ∈ {1, …, K}, define the per-rollout contribution

Z_{i,j} := r(x_i, y_{i,j}) − β ℓ^τ_θ(x_i, y_{i,j}).

By the definition of the empirical objective, one can rewrite

Ĵ^{r,τ}_{n,K}(θ) = (1/(nK)) Σ_{i=1}^n Σ_{j=1}^K Z_{i,j}.

Next, define the conditional expectation of the empirical objective given the prompts. For each fixed prompt x_i, conditional on x_i the rollout y_{i,j} is distributed as π_θ(· | x_i), hence

E[Z_{i,j} | x_i] = E_{Y∼π_θ(·|x_i)}[r(x_i, Y)] − β E_{Y∼π_θ(·|x_i)}[ℓ^τ_θ(x_i, Y)].

Averaging these conditional expectations over i yields the infinite-rollout analogue

Ĵ^{r,τ}_{n,∞}(θ) := (1/n) Σ_{i=1}^n E[Z_{i,1} | x_i].

By construction, E[Ĵ^{r,τ}_{n,K}(θ) | x_{1:n}] = Ĵ^{r,τ}_{n,∞}(θ). To apply Hoeffding's inequality, it remains to verify a uniform bound on each Z_{i,j}. Because r(x_i, y_{i,j}) ∈ [0, 1] and ℓ^τ_θ(x_i, y_{i,j}) ∈ [−τ, τ], it follows that −βτ ≤ Z_{i,j} ≤ 1 + βτ, so the interval width is 1 + 2βτ. Conditional on the prompts x_{1:n}, the rollouts are independent across all index pairs (i, j). Therefore the collection {Z_{i,j}}_{i≤n, j≤K} is independent conditional on x_{1:n}.
Applying Lemma 8 to the average of these nK bounded independent random variables, with failure probability δ, gives that with probability at least 1 − δ over the rollouts conditional on x_{1:n},

|Ĵ^{r,τ}_{n,K}(θ) − E[Ĵ^{r,τ}_{n,K}(θ) | x_{1:n}]| ≤ (1 + 2βτ) √(log(2/δ)/(2nK)).

Replacing the conditional expectation by Ĵ^{r,τ}_{n,∞}(θ) yields

|Ĵ^{r,τ}_{n,K}(θ) − Ĵ^{r,τ}_{n,∞}(θ)| ≤ (1 + 2βτ) √(log(2/δ)/(2nK)),

which is the conclusion of Lemma 2.

Proof of Lemma 3. Let θ ∈ Θ be an arbitrary policy parameter. Let r: X × Y → [0, 1] be an arbitrary reward function, let τ > 0 be an arbitrary clipping threshold, and let δ ∈ (0, 1) be an arbitrary confidence level. This lemma controls the deviation due only to sampling finitely many prompts, after taking the conditional expectation over rollouts. Define, for each prompt x ∈ X,

g^{r,τ}_θ(x) = E_{Y∼π_θ(·|x)}[r(x, Y)] − β E_{Y∼π_θ(·|x)}[ℓ^τ_θ(x, Y)].

Because r(·, ·) ∈ [0, 1] and ℓ^τ_θ(·, ·) ∈ [−τ, τ] pointwise, the first expectation lies in [0, 1] and the second expectation lies in [−τ, τ]. Consequently, for every x, −βτ ≤ g^{r,τ}_θ(x) ≤ 1 + βτ, so the interval width is again 1 + 2βτ. By definition,

Ĵ^{r,τ}_{n,∞}(θ) = (1/n) Σ_{i=1}^n g^{r,τ}_θ(x_i),   J_{r,τ}(θ) = E_{X∼ρ}[g^{r,τ}_θ(X)].

Since x_1, …, x_n are independent draws from ρ, the sequence g^{r,τ}_θ(x_1), …, g^{r,τ}_θ(x_n) consists of i.i.d. random variables bounded in an interval of width 1 + 2βτ. Applying Lemma 8 with N = n and failure probability δ yields that with probability at least 1 − δ over the prompts,

|Ĵ^{r,τ}_{n,∞}(θ) − J_{r,τ}(θ)| ≤ (1 + 2βτ) √(log(2/δ)/(2n)).

This is precisely the statement of Lemma 3.

Proof of Lemma 4. Let θ ∈ Θ be an arbitrary policy parameter. Let r: X × Y → [0, 1] be an arbitrary reward function, let τ > 0 be an arbitrary clipping threshold, and let δ ∈ (0, 1) be an arbitrary confidence level.
The proof combines the two previous concentration statements by enforcing that they hold on a common high-probability event, and then applying a triangle inequality. Define the rollout concentration event

E_roll := { |Ĵ^{r,τ}_{n,K}(θ) − Ĵ^{r,τ}_{n,∞}(θ)| ≤ (1 + 2βτ) √(log(4/δ)/(2nK)) }.

Lemma 2 applied with confidence parameter δ/2 implies that, conditional on x_{1:n}, Pr(E_roll | x_{1:n}) ≥ 1 − δ/2. Define the prompt concentration event

E_prompt := { |Ĵ^{r,τ}_{n,∞}(θ) − J_{r,τ}(θ)| ≤ (1 + 2βτ) √(log(4/δ)/(2n)) }.

Lemma 3 applied with confidence parameter δ/2 yields Pr(E_prompt) ≥ 1 − δ/2. Let E_stat := E_roll ∩ E_prompt. By the union bound, Pr(E_stat) ≥ 1 − δ. Assume that E_stat holds. Then the triangle inequality gives

|Ĵ^{r,τ}_{n,K}(θ) − J_{r,τ}(θ)| ≤ |Ĵ^{r,τ}_{n,K}(θ) − Ĵ^{r,τ}_{n,∞}(θ)| + |Ĵ^{r,τ}_{n,∞}(θ) − J_{r,τ}(θ)| ≤ (1 + 2βτ) ( √(log(4/δ)/(2n)) + √(log(4/δ)/(2nK)) ),

which is exactly the inequality claimed in Lemma 4.

Proof of Lemma 12. Let θ ∈ Θ be an arbitrary policy parameter, let τ > 0 be an arbitrary clipping threshold, and let δ ∈ (0, 1) be an arbitrary confidence level. Recall that x_1, …, x_n are independent draws from ρ and that, conditional on each x_i, the rollouts y_{i,1}, …, y_{i,K} are independent draws from π_θ(· | x_i). Define the per-rollout clipped log-ratio Z_{i,j} := ℓ^τ_θ(x_i, y_{i,j}), so that, by the definition of κ̂^τ_{n,K}(θ),

κ̂^τ_{n,K}(θ) = (1/(nK)) Σ_{i=1}^n Σ_{j=1}^K Z_{i,j}.

Because ℓ^τ_θ(x, y) = clip(ℓ_θ(x, y), −τ, τ) by definition, it follows that Z_{i,j} ∈ [−τ, τ] almost surely for all (i, j), and therefore each Z_{i,j} is bounded in an interval of width 2τ. To make the two-stage sampling structure explicit, introduce the conditional infinite-rollout analogue

κ̂^τ_{n,∞}(θ) := (1/n) Σ_{i=1}^n E[Z_{i,1} | x_i] = (1/n) Σ_{i=1}^n E_{Y∼π_θ(·|x_i)}[ℓ^τ_θ(x_i, Y)].
By construction, conditional on the realised prompts x_{1:n}, the random variables {Z_{i,j}}_{i≤n, j≤K} are independent, and moreover E[κ̂^τ_{n,K}(θ) | x_{1:n}] = κ̂^τ_{n,∞}(θ). Applying Lemma 8 to the average of the nK bounded independent random variables {Z_{i,j}}, conditional on x_{1:n} and with failure probability δ/2, yields that with probability at least 1 − δ/2 over the rollouts conditional on x_{1:n},

|κ̂^τ_{n,K}(θ) − κ̂^τ_{n,∞}(θ)| ≤ 2τ √(log(4/δ)/(2nK)).

It remains to control the deviation due to sampling only finitely many prompts. Define the prompt-level functional h^τ_θ(x) := E_{Y∼π_θ(·|x)}[ℓ^τ_θ(x, Y)]. Since ℓ^τ_θ(x, Y) ∈ [−τ, τ] almost surely under Y ∼ π_θ(· | x), it follows that h^τ_θ(x) ∈ [−τ, τ] for every x, and thus h^τ_θ(X) is bounded in an interval of width 2τ when X ∼ ρ. By the definition of κ̂^τ_{n,∞}(θ),

κ̂^τ_{n,∞}(θ) = (1/n) Σ_{i=1}^n h^τ_θ(x_i).

Moreover, by the definition of D_θ(x, y) = ρ(x) π_θ(y | x), the clipped population average can be written as

κ^τ(θ) = E_{(X,Y)∼D_θ}[ℓ^τ_θ(X, Y)] = E_{X∼ρ}[h^τ_θ(X)].

Since x_1, …, x_n are independent draws from ρ, the sequence h^τ_θ(x_1), …, h^τ_θ(x_n) consists of i.i.d. random variables bounded in an interval of width 2τ. Applying Lemma 8 with N = n and failure probability δ/2 yields that with probability at least 1 − δ/2 over the prompts,

|κ̂^τ_{n,∞}(θ) − κ^τ(θ)| ≤ 2τ √(log(4/δ)/(2n)).

Finally, consider the event on which both of the preceding inequalities hold. By the union bound, this event has probability at least 1 − δ over the joint draw of prompts and rollouts. On this event, the triangle inequality implies

|κ̂^τ_{n,K}(θ) − κ^τ(θ)| ≤ |κ̂^τ_{n,K}(θ) − κ̂^τ_{n,∞}(θ)| + |κ̂^τ_{n,∞}(θ) − κ^τ(θ)| ≤ 2τ ( √(log(4/δ)/(2n)) + √(log(4/δ)/(2nK)) ),

which is exactly the claimed bound in (9).
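As an illustrative numerical check of the bound in (9), one can simulate the two-stage estimator. The Gaussian prompt and policy distributions below are hypothetical stand-ins for ρ, π_θ, and π_ref, chosen so that ℓ_θ has a closed form (a mean shift of 0.3 gives ℓ_θ(x, y) = 0.3(y − x) − 0.045, with KL = 0.045):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, tau, delta = 200, 5, 1.0, 0.05

# Prompts x ~ rho shift both policy means; pi_theta = N(x + 0.3, 1)
# and pi_ref = N(x, 1), so l_theta(x, y) = 0.3 * (y - x) - 0.045.
x = rng.normal(size=n)                          # n prompts
y = x[:, None] + 0.3 + rng.normal(size=(n, K))  # K rollouts per prompt
log_ratio = 0.3 * (y - x[:, None]) - 0.045
clipped = np.clip(log_ratio, -tau, tau)

kappa_hat = clipped.mean()                      # two-stage estimator
# Population value kappa^tau(theta), estimated with a large sample:
# y - x ~ N(0.3, 1) regardless of the prompt in this toy setting.
y_big = rng.normal(loc=0.3, size=1_000_000)
kappa = np.clip(0.3 * y_big - 0.045, -tau, tau).mean()

# Deviation bound from Lemma 12, eq. (9).
bound = 2 * tau * (np.sqrt(np.log(4 / delta) / (2 * n))
                   + np.sqrt(np.log(4 / delta) / (2 * n * K)))
assert abs(kappa_hat - kappa) <= bound
```

In this toy setting the realised deviation is roughly an order of magnitude below the bound, reflecting that Hoeffding-type range bounds are conservative when the log-ratio variance is small relative to 2τ.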
C.3 REWARD SHIFT AND SURROGATE BIAS

Proof of Lemma 5. Let $\theta \in \Theta$ and $\phi \in \Phi$ be arbitrary parameters. The proof begins by expressing the objective gap as an expectation of reward-model error under the deployment distribution, and then transfers this expectation back to the reward-model training distribution via a density ratio. By definition,
$$J_\phi(\theta) = \mathbb{E}_{X \sim \rho}\,\mathbb{E}_{Y \sim \pi_\theta(\cdot \mid X)}[\hat r_\phi(X,Y)] - \beta\,\mathbb{E}_{X \sim \rho}\,\mathrm{KL}(\pi_\theta(\cdot \mid X) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid X)),$$
and
$$J_\star(\theta) = \mathbb{E}_{X \sim \rho}\,\mathbb{E}_{Y \sim \pi_\theta(\cdot \mid X)}[r_\star(X,Y)] - \beta\,\mathbb{E}_{X \sim \rho}\,\mathrm{KL}(\pi_\theta(\cdot \mid X) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid X)).$$
The KL regularisation terms coincide, so they cancel after subtraction, giving
$$J_\phi(\theta) - J_\star(\theta) = \mathbb{E}_{X \sim \rho}\,\mathbb{E}_{Y \sim \pi_\theta(\cdot \mid X)}[\hat r_\phi(X,Y) - r_\star(X,Y)].$$
Introduce the pointwise reward-model error $e_\phi(x,y) = \hat r_\phi(x,y) - r_\star(x,y)$. Using the joint distribution $D_\theta(x,y) = \rho(x)\pi_\theta(y \mid x)$, the preceding display can be rewritten as
$$J_\phi(\theta) - J_\star(\theta) = \mathbb{E}_{(X,Y) \sim D_\theta}[e_\phi(X,Y)].$$
Assume that $D_\theta \ll D_{\mathrm{train}}$ and define the density ratio
$$w_\theta(x,y) := \frac{D_\theta(x,y)}{D_{\mathrm{train}}(x,y)}.$$
Then the expectation under $D_\theta$ can be written under $D_{\mathrm{train}}$ as
$$\mathbb{E}_{(X,Y) \sim D_\theta}[e_\phi(X,Y)] = \mathbb{E}_{(X,Y) \sim D_{\mathrm{train}}}[w_\theta(X,Y)\, e_\phi(X,Y)].$$
Applying the Cauchy–Schwarz inequality yields
$$\left|\mathbb{E}_{D_{\mathrm{train}}}[w_\theta e_\phi]\right| \le \sqrt{\mathbb{E}_{D_{\mathrm{train}}}[w_\theta^2]}\,\sqrt{\mathbb{E}_{D_{\mathrm{train}}}[e_\phi^2]}.$$
The second factor is exactly $\sqrt{L^{(2)}_{\mathrm{train}}(\phi)}$ by the definition of $L^{(2)}_{\mathrm{train}}(\phi)$. For the first factor, note that $\mathbb{E}_{D_{\mathrm{train}}}[w_\theta] = 1$ and
$$\chi^2(D_\theta \,\|\, D_{\mathrm{train}}) = \mathbb{E}_{D_{\mathrm{train}}}\left[(w_\theta - 1)^2\right] = \mathbb{E}_{D_{\mathrm{train}}}[w_\theta^2] - 1.$$
Consequently, $\mathbb{E}_{D_{\mathrm{train}}}[w_\theta^2] = 1 + \chi^2(D_\theta \,\|\, D_{\mathrm{train}})$. Substituting these identities into the Cauchy–Schwarz bound gives
$$|J_\phi(\theta) - J_\star(\theta)| \le \sqrt{1 + \chi^2(D_\theta \,\|\, D_{\mathrm{train}})}\,\sqrt{L^{(2)}_{\mathrm{train}}(\phi)}.$$
By the definition of $C(\theta)$ in eq. (5), this is $|J_\phi(\theta) - J_\star(\theta)| \le C(\theta)\sqrt{L^{(2)}_{\mathrm{train}}(\phi)}$, which is the statement of Lemma 5.

Proof of Lemma 6. Let $\theta \in \Theta$ be arbitrary.
Assume that $\rho \ll \rho_{\mathrm{label}}$ and that $\pi_\theta(\cdot \mid x) \ll \pi_{\mathrm{ref}}(\cdot \mid x)$ for every $x$ with $\rho_{\mathrm{label}}(x) > 0$. Under these conditions, $D_\theta \ll D_{\mathrm{train}}$ holds and the density ratio $w_\theta(x,y) := D_\theta(x,y)/D_{\mathrm{train}}(x,y)$ is well defined on the support of $D_{\mathrm{train}}$. By definition,
$$C(\theta)^2 = 1 + \chi^2(D_\theta \,\|\, D_{\mathrm{train}}) = \mathbb{E}_{(X,Y) \sim D_{\mathrm{train}}}\left[w_\theta(X,Y)^2\right].$$
Using $D_{\mathrm{train}}(x,y) = \rho_{\mathrm{label}}(x)\,\pi_{\mathrm{ref}}(y \mid x)$ and $D_\theta(x,y) = \rho(x)\,\pi_\theta(y \mid x)$, one obtains the factorisation
$$w_\theta(x,y) = \frac{\rho(x)}{\rho_{\mathrm{label}}(x)} \cdot \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}.$$
Substituting this expression into the definition of $C(\theta)^2$ and taking expectation under $D_{\mathrm{train}}$ yields
$$C(\theta)^2 = \mathbb{E}_{X \sim \rho_{\mathrm{label}}}\left[\left(\frac{\rho(X)}{\rho_{\mathrm{label}}(X)}\right)^2 \mathbb{E}_{Y \sim \pi_{\mathrm{ref}}(\cdot \mid X)}\left[\left(\frac{\pi_\theta(Y \mid X)}{\pi_{\mathrm{ref}}(Y \mid X)}\right)^2\right]\right].$$
By the definition of $C_{\mathrm{pol}}(\theta)$, the inner expectation is bounded above by $C_{\mathrm{pol}}(\theta)^2$ for each $x$ in the support of $\rho_{\mathrm{label}}$. Therefore,
$$C(\theta)^2 \le C_{\mathrm{pol}}(\theta)^2\, \mathbb{E}_{X \sim \rho_{\mathrm{label}}}\left[\left(\frac{\rho(X)}{\rho_{\mathrm{label}}(X)}\right)^2\right] = C_{\mathrm{pol}}(\theta)^2\, C_{\mathrm{prompt}}^2.$$
Taking square roots yields $C(\theta) \le C_{\mathrm{prompt}}\, C_{\mathrm{pol}}(\theta)$.

Proof of Lemma 7. Let $\theta \in \Theta$ and $\phi \in \Phi$ be arbitrary parameters, and let $\tau > 0$ be an arbitrary clipping threshold. The argument is an identity at the level of population objectives, followed by a standard absolute-value bound. By definition of the clipped objective,
$$J_{\phi,\tau}(\theta) = \mathbb{E}_{X \sim \rho}\,\mathbb{E}_{Y \sim \pi_\theta(\cdot \mid X)}[\hat r_\phi(X,Y)] - \beta\,\mathbb{E}_{X \sim \rho}\,\mathbb{E}_{Y \sim \pi_\theta(\cdot \mid X)}[\ell^\tau_\theta(X,Y)].$$
Using $D_\theta(x,y) = \rho(x)\pi_\theta(y \mid x)$, this can be written as
$$J_{\phi,\tau}(\theta) = \mathbb{E}_{(X,Y) \sim D_\theta}[\hat r_\phi(X,Y)] - \beta\,\mathbb{E}_{(X,Y) \sim D_\theta}[\ell^\tau_\theta(X,Y)].$$
For the exact objective, recall that $\mathrm{KL}(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)) = \mathbb{E}_{Y \sim \pi_\theta(\cdot \mid x)}[\ell_\theta(x, Y)]$. Substituting this identity into the definition of $J_\phi(\theta)$ yields
$$J_\phi(\theta) = \mathbb{E}_{(X,Y) \sim D_\theta}[\hat r_\phi(X,Y)] - \beta\,\mathbb{E}_{(X,Y) \sim D_\theta}[\ell_\theta(X,Y)].$$
Subtracting the two displays gives the exact identity
$$J_{\phi,\tau}(\theta) - J_\phi(\theta) = \beta\,\mathbb{E}_{(X,Y) \sim D_\theta}\left[\ell_\theta(X,Y) - \ell^\tau_\theta(X,Y)\right].$$
Taking absolute values and using $|\mathbb{E}[U]| \le \mathbb{E}[|U|]$ yields
$$|J_{\phi,\tau}(\theta) - J_\phi(\theta)| \le \beta\,\mathbb{E}_{(X,Y) \sim D_\theta}\left[\left|\ell_\theta(X,Y) - \ell^\tau_\theta(X,Y)\right|\right],$$
which is precisely the inequality asserted in Lemma 7.

C.4 UNIFIED FIXED-POLICY BOUND

Proof of Theorem 1. Let $\theta \in \Theta$ and $\phi \in \Phi$ be arbitrary, and let $\tau > 0$ and $\delta \in (0,1)$ be arbitrary. Assume the conditions stated in Theorem 1, so that Lemmas 4, 5, and 7 are applicable. Lemma 1 provides the deterministic decomposition
$$|\hat J^{\phi,\tau}_{n,K}(\theta) - J_\star(\theta)| \le |\hat J^{\phi,\tau}_{n,K}(\theta) - J_{\phi,\tau}(\theta)| + |J_{\phi,\tau}(\theta) - J_\phi(\theta)| + |J_\phi(\theta) - J_\star(\theta)|.$$
To control the first term, apply Lemma 4 with $r = \hat r_\phi$. With probability at least $1-\delta$ over the evaluation prompts and rollouts,
$$|\hat J^{\phi,\tau}_{n,K}(\theta) - J_{\phi,\tau}(\theta)| \le (1+2\beta\tau)\left(\sqrt{\frac{\log(4/\delta)}{2n}} + \sqrt{\frac{\log(4/\delta)}{2nK}}\right).$$
The remaining two terms are controlled deterministically. Lemma 7 gives
$$|J_{\phi,\tau}(\theta) - J_\phi(\theta)| \le \beta\,\mathbb{E}_{(X,Y) \sim D_\theta}\left[|\ell_\theta(X,Y) - \ell^\tau_\theta(X,Y)|\right],$$
and Lemma 5 gives $|J_\phi(\theta) - J_\star(\theta)| \le C(\theta)\sqrt{L^{(2)}_{\mathrm{train}}(\phi)}$. Substituting these three bounds into the decomposition yields the inequality stated in Theorem 1.

Proof of Corollary 3. Let $\theta \in \Theta$ be an arbitrary policy parameter, let $\phi \in \Phi$ be an arbitrary reward-model parameter, let $\beta > 0$ be an arbitrary regularisation coefficient, let $\delta \in (0,1)$ be an arbitrary confidence level, and let $n \ge 1$ and $K \ge 1$ be arbitrary integers. Define
$$\alpha_{n,K,\delta} := \sqrt{\frac{\log(4/\delta)}{2n}} + \sqrt{\frac{\log(4/\delta)}{2nK}}, \qquad B_\theta(\tau) := (1+2\beta\tau)\,\alpha_{n,K,\delta} + \beta\, T_\theta(\tau),$$
where $T_\theta(\tau) := \mathbb{E}_{(X,Y) \sim D_\theta}\left[(|\ell_\theta(X,Y)| - \tau)_+\right]$. Let $(X,Y) \sim D_\theta$ and define the nonnegative random variable $Z := |\ell_\theta(X,Y)|$.
With this notation, $T_\theta(\tau) = \mathbb{E}[(Z-\tau)_+]$, so the function of interest can be written as
$$B_\theta(\tau) = (1+2\beta\tau)\,\alpha_{n,K,\delta} + \beta\,\mathbb{E}[(Z-\tau)_+].$$
The next step is to relate the one-sided derivatives of $\tau \mapsto \mathbb{E}[(Z-\tau)_+]$ to the tail probabilities of $Z$. For every $z \ge 0$ and every $\tau \ge 0$, the identity
$$(z-\tau)_+ = \int_\tau^\infty \mathbf{1}\{z > t\}\,dt$$
holds, because the integrand equals $1$ precisely on the interval $t \in [\tau, z)$ when $z > \tau$, and otherwise it is identically zero. Applying this identity with $z = Z$ and using Tonelli's theorem, which is applicable because the integrand is nonnegative, yields the representation
$$\mathbb{E}[(Z-\tau)_+] = \int_\tau^\infty \Pr(Z > t)\,dt.$$
Let $\tau \ge 0$ and $h > 0$. Using the integral representation at $\tau$ and at $\tau + h$ gives
$$\mathbb{E}[(Z-(\tau+h))_+] - \mathbb{E}[(Z-\tau)_+] = -\int_\tau^{\tau+h} \Pr(Z > t)\,dt.$$
Since the function $t \mapsto \Pr(Z > t)$ is nonincreasing,
$$h\,\Pr(Z > \tau+h) \le \int_\tau^{\tau+h} \Pr(Z > t)\,dt \le h\,\Pr(Z > \tau).$$
Dividing by $h$ and combining with the previous display yields
$$-\Pr(Z > \tau) \le \frac{\mathbb{E}[(Z-(\tau+h))_+] - \mathbb{E}[(Z-\tau)_+]}{h} \le -\Pr(Z > \tau+h).$$
Letting $h \downarrow 0$ and using the monotone convergence $\Pr(Z > \tau+h) \to \Pr(Z > \tau)$ yields the right-derivative identity
$$\frac{d}{d\tau^+}\,\mathbb{E}[(Z-\tau)_+] = -\Pr(Z > \tau).$$
Now let $\tau > 0$ and $h \in (0, \tau)$. Using the integral representation at $\tau$ and at $\tau - h$ gives
$$\mathbb{E}[(Z-\tau)_+] - \mathbb{E}[(Z-(\tau-h))_+] = -\int_{\tau-h}^{\tau} \Pr(Z > t)\,dt.$$
Since $t \mapsto \Pr(Z > t)$ is nonincreasing,
$$h\,\Pr(Z > \tau) \le \int_{\tau-h}^{\tau} \Pr(Z > t)\,dt \le h\,\Pr(Z > \tau-h).$$
Dividing by $h$ and combining with the previous display yields
$$-\Pr(Z > \tau-h) \le \frac{\mathbb{E}[(Z-\tau)_+] - \mathbb{E}[(Z-(\tau-h))_+]}{h} \le -\Pr(Z > \tau).$$
Letting $h \downarrow 0$ and using the monotone convergence $\Pr(Z > \tau-h) \to \Pr(Z \ge \tau)$ yields the left-derivative identity
$$\frac{d}{d\tau^-}\,\mathbb{E}[(Z-\tau)_+] = -\Pr(Z \ge \tau).$$
It now follows that $B_\theta$ has one-sided derivatives for every $\tau \ge 0$, and these derivatives satisfy
$$B'_\theta(\tau^+) = 2\beta\,\alpha_{n,K,\delta} - \beta\,\Pr(Z > \tau), \qquad B'_\theta(\tau^-) = 2\beta\,\alpha_{n,K,\delta} - \beta\,\Pr(Z \ge \tau)$$
for every $\tau > 0$. Let $\tau^\star$ be any minimiser of $\tau \mapsto B_\theta(\tau)$ over $\tau \ge 0$. If $\tau^\star > 0$, the minimality of $\tau^\star$ implies that the left derivative is nonpositive and the right derivative is nonnegative, so $B'_\theta((\tau^\star)^-) \le 0 \le B'_\theta((\tau^\star)^+)$. Substituting the one-sided derivative expressions yields
$$\Pr(Z > \tau^\star) \le 2\alpha_{n,K,\delta} \le \Pr(Z \ge \tau^\star).$$
If $\tau^\star = 0$, the minimality of $\tau^\star$ implies $0 \le B'_\theta(0^+)$, and therefore $\Pr(Z > 0) \le 2\alpha_{n,K,\delta}$ holds. If $2\alpha_{n,K,\delta} < 1$, the inequality $2\alpha_{n,K,\delta} \le \Pr(Z \ge 0) = 1$ holds as well, and this yields the same two-sided condition with $\tau^\star = 0$. Finally, if $2\alpha_{n,K,\delta} \ge 1$, then for every $\tau > 0$ one has
$$B'_\theta(\tau^-) = 2\beta\,\alpha_{n,K,\delta} - \beta\,\Pr(Z \ge \tau) \ge 2\beta\,\alpha_{n,K,\delta} - \beta \ge 0,$$
so $B_\theta$ is nondecreasing on $(0, \infty)$, which implies that $\tau^\star = 0$ is a minimiser over $\tau \ge 0$. Recalling that $Z = |\ell_\theta(X,Y)|$ with $(X,Y) \sim D_\theta$, the stated conditions are exactly
$$\Pr_{(X,Y) \sim D_\theta}\left(|\ell_\theta(X,Y)| > \tau^\star\right) \le 2\alpha_{n,K,\delta} \le \Pr_{(X,Y) \sim D_\theta}\left(|\ell_\theta(X,Y)| \ge \tau^\star\right),$$
and when $\Pr(Z = \tau^\star) = 0$ the two inequalities collapse to the equality $\Pr(Z > \tau^\star) = 2\alpha_{n,K,\delta}$, which is equivalent to the quantile statement.

C.5 PAC-BAYES AUXILIARY BOUNDS

Lemma 14 (PAC-Bayes bound for prompt sampling [McAllester, 1999, Seeger, 2002]). Let $P$ be a prior distribution on $\Theta$, let $\tau > 0$ and $\delta \in (0,1)$ be given, and let $r : \mathcal{X} \times \mathcal{Y} \to [0,1]$ be a given reward function. With probability at least $1-\delta$ over $x_1, \dots, x_n \sim \rho$, the following inequality holds simultaneously for all posteriors $Q$ on $\Theta$:
$$|J^{r,\tau}(Q) - \hat J^{r,\tau}_{n,\infty}(Q)| \le (1+2\beta\tau)\sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log(4/\delta)}{2n}}.$$

Proof. Let $\lambda > 0$ be arbitrary. For a given parameter value $\theta \in \Theta$, consider a single prompt draw $X \sim \rho$.
As in the prompt-sampling argument in Lemma 3, the quantity $g^{r,\tau}_\theta(X)$ lies in the interval $[-\beta\tau,\ 1+\beta\tau]$. Consequently, the centred random variable $J^{r,\tau}(\theta) - g^{r,\tau}_\theta(X)$ is almost surely bounded in an interval of width $1+2\beta\tau$. Applying Lemma 9 yields
$$\mathbb{E}_{X \sim \rho}\exp\left(\lambda\left(J^{r,\tau}(\theta) - g^{r,\tau}_\theta(X)\right)\right) \le \exp\left(\frac{\lambda^2(1+2\beta\tau)^2}{8}\right).$$
Now let $x_1, \dots, x_n$ be i.i.d. draws from $\rho$. Using independence and the definition
$$\hat J^{r,\tau}_{n,\infty}(\theta) = \frac{1}{n}\sum_{i=1}^n g^{r,\tau}_\theta(x_i),$$
it follows that
$$\mathbb{E}\exp\left(\lambda\left(J^{r,\tau}(\theta) - \hat J^{r,\tau}_{n,\infty}(\theta)\right)\right) \le \exp\left(\frac{\lambda^2(1+2\beta\tau)^2}{8n}\right).$$
Taking expectation with respect to $\theta \sim P$ and applying Markov's inequality yields that, with probability at least $1-\delta/2$ over $x_{1:n}$,
$$\mathbb{E}_{\theta \sim P}\exp\left(\lambda\left(J^{r,\tau}(\theta) - \hat J^{r,\tau}_{n,\infty}(\theta)\right)\right) \le \frac{2}{\delta}\exp\left(\frac{\lambda^2(1+2\beta\tau)^2}{8n}\right).$$
On this event, Lemma 10 can be applied with $F(\theta) = \lambda\left(J^{r,\tau}(\theta) - \hat J^{r,\tau}_{n,\infty}(\theta)\right)$. For every posterior $Q$ on $\Theta$, this gives
$$\lambda\left(J^{r,\tau}(Q) - \hat J^{r,\tau}_{n,\infty}(Q)\right) \le \mathrm{KL}(Q \,\|\, P) + \log\frac{2}{\delta} + \frac{\lambda^2(1+2\beta\tau)^2}{8n}.$$
Optimising over $\lambda > 0$ yields the one-sided bound
$$J^{r,\tau}(Q) - \hat J^{r,\tau}_{n,\infty}(Q) \le (1+2\beta\tau)\sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log(2/\delta)}{2n}}.$$
Applying the same argument to the opposite deviation $\hat J^{r,\tau}_{n,\infty}(Q) - J^{r,\tau}(Q)$ and taking a union bound yields the stated two-sided inequality with $\log(4/\delta)$.

Lemma 15 (PAC-Bayes bound for rollout sampling [Catoni, 2007]). Let $P$ be a prior distribution on $\Theta$, let $\tau > 0$ and $\delta \in (0,1)$ be given, and let $r : \mathcal{X} \times \mathcal{Y} \to [0,1]$ be a given reward function. With probability at least $1-\delta$ over the rollouts conditional on $x_{1:n}$, the following inequality holds simultaneously for all posteriors $Q$ on $\Theta$:
$$|\hat J^{r,\tau}_{n,\infty}(Q) - \hat J^{r,\tau}_{n,K}(Q)| \le (1+2\beta\tau)\sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log(4/\delta)}{2nK}}.$$

Proof. Condition on the realised prompts $x_{1:n}$, and let $\lambda > 0$ be arbitrary. For each index pair $(i,j)$, define
$$Z_{i,j}(\theta) := r(x_i, y_{i,j}) - \beta\,\ell^\tau_\theta(x_i, y_{i,j}).$$
For every $\theta \in \Theta$, the bounds $r \in [0,1]$ and $\ell^\tau_\theta \in [-\tau, \tau]$ imply $-\beta\tau \le Z_{i,j}(\theta) \le 1+\beta\tau$. Conditional on $(x_{1:n}, \theta)$, the rollouts are independent across all pairs $(i,j)$. Define the deviation $\Delta(\theta) := \hat J^{r,\tau}_{n,\infty}(\theta) - \hat J^{r,\tau}_{n,K}(\theta)$. By construction, $\hat J^{r,\tau}_{n,K}(\theta)$ is the average of the $nK$ random variables $Z_{i,j}(\theta)$, and $\hat J^{r,\tau}_{n,\infty}(\theta)$ is their conditional expectation given $x_{1:n}$. Applying Lemma 9 to the average of bounded independent terms yields
$$\mathbb{E}\left[\exp\left(\lambda\,\Delta(\theta)\right) \,\middle|\, x_{1:n}, \theta\right] \le \exp\left(\frac{\lambda^2(1+2\beta\tau)^2}{8nK}\right).$$
Taking expectation over $\theta \sim P$ and applying Markov's inequality implies that, with probability at least $1-\delta/2$ over rollouts conditional on $x_{1:n}$,
$$\mathbb{E}_{\theta \sim P}\left[\exp\left(\lambda\,\Delta(\theta)\right) \,\middle|\, x_{1:n}\right] \le \frac{2}{\delta}\exp\left(\frac{\lambda^2(1+2\beta\tau)^2}{8nK}\right).$$
On this event, Lemma 10 applied with $F(\theta) = \lambda\,\Delta(\theta)$ yields that, for every posterior $Q$,
$$\lambda\left(\hat J^{r,\tau}_{n,\infty}(Q) - \hat J^{r,\tau}_{n,K}(Q)\right) \le \mathrm{KL}(Q \,\|\, P) + \log\frac{2}{\delta} + \frac{\lambda^2(1+2\beta\tau)^2}{8nK}.$$
Optimising over $\lambda > 0$ gives
$$\hat J^{r,\tau}_{n,\infty}(Q) - \hat J^{r,\tau}_{n,K}(Q) \le (1+2\beta\tau)\sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log(2/\delta)}{2nK}}.$$
Applying the same argument to the deviation $-\Delta(\theta)$ and taking a union bound yields the stated two-sided inequality with $\log(4/\delta)$.

C.6 PAC-BAYES MAIN BOUND

Proof of Theorem 2. Let $\phi \in \Phi$ be arbitrary, and let $\tau > 0$ and $\delta \in (0,1)$ be given. Let $P$ denote the prior that appears in Theorem 2. The proof proceeds by combining two PAC-Bayes concentration inequalities with the deterministic reward-shift and clipping-bias bounds, and then substituting these ingredients into the same three-term decomposition used in the fixed-policy case. Apply Lemma 15 with reward $r = \hat r_\phi$ and confidence level $\delta/2$. Apply Lemma 14 with reward $r = \hat r_\phi$ and confidence level $\delta/2$. By a union bound, with probability at least $1-\delta$ over prompts and rollouts, both inequalities hold simultaneously for all posteriors $Q$ on $\Theta$.
On this event, for every posterior $Q$,
$$|\hat J^{\phi,\tau}_{n,K}(Q) - \hat J^{\phi,\tau}_{n,\infty}(Q)| \le (1+2\beta\tau)\sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log(8/\delta)}{2nK}}, \qquad |\hat J^{\phi,\tau}_{n,\infty}(Q) - J_{\phi,\tau}(Q)| \le (1+2\beta\tau)\sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log(8/\delta)}{2n}}.$$
Combining these two bounds via the triangle inequality yields
$$|\hat J^{\phi,\tau}_{n,K}(Q) - J_{\phi,\tau}(Q)| \le (1+2\beta\tau)\left(\sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log(8/\delta)}{2n}} + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log(8/\delta)}{2nK}}\right).$$
The remaining two contributions follow by averaging the pointwise bounds over $\theta \sim Q$. Taking expectation in Lemma 7 yields
$$|J_{\phi,\tau}(Q) - J_\phi(Q)| \le \beta\,\mathbb{E}_{\theta \sim Q}\left[\mathbb{E}_{(X,Y) \sim D_\theta}\left[|\ell_\theta(X,Y) - \ell^\tau_\theta(X,Y)|\right]\right].$$
Taking expectation in Lemma 5 yields
$$|J_\phi(Q) - J_\star(Q)| \le \mathbb{E}_{\theta \sim Q}[C(\theta)]\,\sqrt{L^{(2)}_{\mathrm{train}}(\phi)}.$$
Finally, apply the same add-and-subtract decomposition used in Lemma 1 directly to $\hat J^{\phi,\tau}_{n,K}(Q) - J_\star(Q)$, and substitute the three bounds established above to obtain the stated inequality. On the same event of probability at least $1-\delta$, this gives the inequality stated in Theorem 2, and the statement holds simultaneously for all posteriors $Q$ because the concentration step was uniform over $Q$.

C.7 PROOFS FOR PAC-BAYES SPECIAL CASES

C.7.1 Finite candidate class and checkpoint selection

Proof of Corollary 1. Let $M \ge 2$ be an integer, and let $\Theta_M = \{\theta^{(1)}, \dots, \theta^{(M)}\}$ be the finite set of candidate parameters described in the statement of the corollary. Let $P$ denote the uniform distribution on $\Theta_M$, so that $P(\theta^{(m)}) = 1/M$ holds for every $m \in \{1, \dots, M\}$. Let $Q$ be an arbitrary distribution supported on the same finite set $\Theta_M$. For each $m \in \{1, \dots, M\}$, define
$$p_m := P(\theta^{(m)}) = \frac{1}{M}, \qquad q_m := Q(\theta^{(m)}),$$
so that $q_m \ge 0$ for every $m$ and $\sum_{m=1}^M q_m = 1$ by the definition of a probability mass function. By the definition of the Kullback–Leibler divergence on a finite set,
$$\mathrm{KL}(Q \,\|\, P) = \sum_{m=1}^M q_m \log\frac{q_m}{p_m}.$$
Substituting the identity $p_m = 1/M$ into the preceding display yields
$$\mathrm{KL}(Q \,\|\, P) = \sum_{m=1}^M q_m \log(q_m M) = \log M + \sum_{m=1}^M q_m \log q_m,$$
where the final equality follows because $\sum_{m=1}^M q_m = 1$ allows the factor $\log M$ to be separated from the summation. It therefore remains to control the quantity $\sum_{m=1}^M q_m \log q_m$. For every index $m \in \{1, \dots, M\}$, the probability value $q_m$ lies in $[0, 1]$, so $\log q_m \le 0$ whenever $q_m > 0$, which implies $q_m \log q_m \le 0$ whenever $q_m > 0$. When $q_m = 0$, the contribution $q_m \log q_m$ is interpreted as $0$, consistent with the limiting identity $\lim_{t \downarrow 0} t \log t = 0$. Consequently, every term in the sum $\sum_{m=1}^M q_m \log q_m$ is at most $0$, and hence
$$\sum_{m=1}^M q_m \log q_m \le 0.$$
Substituting this inequality into the identity above gives
$$\mathrm{KL}(Q \,\|\, P) = \log M + \sum_{m=1}^M q_m \log q_m \le \log M.$$
Finally, consider the special case in which $Q$ is the Dirac distribution concentrated on a single element $\theta^{(\hat m)} \in \Theta_M$. In that case $q_{\hat m} = 1$ and $q_m = 0$ for all $m \neq \hat m$. Substituting these values into the definition $\mathrm{KL}(Q \,\|\, P) = \sum_{m=1}^M q_m \log\frac{q_m}{p_m}$ shows that the only nonzero contribution is the term indexed by $\hat m$, and therefore
$$\mathrm{KL}(Q \,\|\, P) = 1 \cdot \log\frac{1}{1/M} = \log M.$$
This proves the final statement of the corollary.

C.7.2 OU–SGD special case for the PAC-Bayes complexity term

Lemma 16 (Bounds for the stationary covariance in the OU approximation). Let $H \in \mathbb{R}^{d \times d}$ be symmetric and positive definite, let $\Sigma_g \in \mathbb{R}^{d \times d}$ be symmetric and positive definite, and let $\varepsilon > 0$. Assume that $\Sigma \in \mathbb{R}^{d \times d}$ is symmetric and satisfies the matrix equation $H\Sigma + \Sigma H = \varepsilon\,\Sigma_g$. Assume also that $H$ and $\Sigma_g$ commute, meaning that $H\Sigma_g = \Sigma_g H$ holds. Assume finally that there exist constants $0 < m \le M < \infty$ such that $mI \preceq H \preceq MI$. Then $\Sigma$ satisfies the two-sided bound
$$\frac{\varepsilon}{2M}\,\Sigma_g \preceq \Sigma \preceq \frac{\varepsilon}{2m}\,\Sigma_g. \tag{11}$$

Proof.
Throughout the proof, for symmetric matrices $A$ and $B$, the notation $A \preceq B$ means that $v^\top A v \le v^\top B v$ holds for every vector $v \in \mathbb{R}^d$. This definition is convenient because it reduces the verification of a matrix inequality to the verification of an ordinary inequality that holds uniformly over all vectors. Define the matrix-valued function
$$F(t) := e^{-tH}\,\Sigma\,e^{-tH}, \qquad t \ge 0.$$
Since $H$ is symmetric, the matrix exponential $e^{-tH}$ is well defined for every $t \ge 0$, and the map $t \mapsto F(t)$ is differentiable. Differentiating and using the product rule yields
$$\frac{d}{dt}F(t) = (-He^{-tH})\,\Sigma\, e^{-tH} + e^{-tH}\,\Sigma\,(-He^{-tH}) = -e^{-tH}(H\Sigma + \Sigma H)e^{-tH}.$$
Substituting the identity $H\Sigma + \Sigma H = \varepsilon\,\Sigma_g$ gives
$$\frac{d}{dt}F(t) = -\varepsilon\, e^{-tH}\,\Sigma_g\, e^{-tH}.$$
Integrating the preceding identity from $0$ to $T$ gives
$$F(T) - F(0) = -\varepsilon \int_0^T e^{-tH}\,\Sigma_g\, e^{-tH}\,dt.$$
Since $F(0) = \Sigma$, rearranging yields
$$\Sigma = F(T) + \varepsilon \int_0^T e^{-tH}\,\Sigma_g\, e^{-tH}\,dt.$$
Because $H$ is positive definite, there exists a constant $m_0 > 0$ such that $H \succeq m_0 I$, and therefore the operator norm satisfies $\|e^{-tH}\|_2 \le e^{-tm_0}$ for every $t \ge 0$. This inequality implies
$$\|F(T)\|_2 = \|e^{-TH}\,\Sigma\, e^{-TH}\|_2 \le \|e^{-TH}\|_2^2\,\|\Sigma\|_2 \le e^{-2Tm_0}\,\|\Sigma\|_2,$$
so $F(T)$ converges to the zero matrix as $T \to \infty$. Taking the limit $T \to \infty$ yields the integral identity
$$\Sigma = \varepsilon \int_0^\infty e^{-tH}\,\Sigma_g\, e^{-tH}\,dt.$$
It remains to compare $e^{-tH}\,\Sigma_g\, e^{-tH}$ to scalar multiples of $\Sigma_g$ in the Loewner order. The assumption $mI \preceq H \preceq MI$ means that every eigenvalue of $H$ lies in the interval $[m, M]$. Consequently, every eigenvalue of $e^{-2tH}$ lies in the interval $[e^{-2tM}, e^{-2tm}]$, which implies the inequalities
$$e^{-2tM} I \preceq e^{-2tH} \preceq e^{-2tm} I \qquad \text{for every } t \ge 0.$$
The commutativity condition $H\Sigma_g = \Sigma_g H$ implies that $\Sigma_g$ commutes with the matrix exponential $e^{-tH}$ for every $t \ge 0$. Therefore,
$$e^{-tH}\,\Sigma_g\, e^{-tH} = \Sigma_g\, e^{-tH} e^{-tH} = \Sigma_g\, e^{-2tH}.$$
Since $\Sigma_g \succ 0$, the matrix square root $\Sigma_g^{1/2}$ exists and is symmetric and positive definite. Applying the congruence transformation with $\Sigma_g^{1/2}$ to the Loewner inequalities above yields
$$\Sigma_g^{1/2}\left(e^{-2tM} I\right)\Sigma_g^{1/2} \preceq \Sigma_g^{1/2}\, e^{-2tH}\,\Sigma_g^{1/2} \preceq \Sigma_g^{1/2}\left(e^{-2tm} I\right)\Sigma_g^{1/2}$$
for every $t \ge 0$. Using $\Sigma_g^{1/2}\, I\, \Sigma_g^{1/2} = \Sigma_g$ and extracting the scalar factors in the two outer terms gives
$$e^{-2tM}\,\Sigma_g \preceq \Sigma_g^{1/2}\, e^{-2tH}\,\Sigma_g^{1/2} \preceq e^{-2tm}\,\Sigma_g$$
for every $t \ge 0$. The commutativity condition implies that $\Sigma_g^{1/2}$ commutes with $e^{-tH}$, and therefore also with $e^{-2tH}$, which yields
$$\Sigma_g^{1/2}\, e^{-2tH}\,\Sigma_g^{1/2} = e^{-2tH}\,\Sigma_g = e^{-tH}\,\Sigma_g\, e^{-tH}.$$
Substituting this identity into the preceding display yields
$$e^{-2tM}\,\Sigma_g \preceq e^{-tH}\,\Sigma_g\, e^{-tH} \preceq e^{-2tm}\,\Sigma_g \qquad \text{for every } t \ge 0.$$
Substituting these two bounds into the integral representation of $\Sigma$ yields
$$\varepsilon \int_0^\infty e^{-2tM}\,\Sigma_g\,dt \preceq \Sigma \preceq \varepsilon \int_0^\infty e^{-2tm}\,\Sigma_g\,dt.$$
Evaluating the scalar integrals gives
$$\varepsilon \int_0^\infty e^{-2tM}\,dt = \frac{\varepsilon}{2M}, \qquad \varepsilon \int_0^\infty e^{-2tm}\,dt = \frac{\varepsilon}{2m},$$
and substituting these values proves eq. (11).

Proof of Corollary 2. Assume the parameter space is $\mathbb{R}^d$ and the prior is $P = \mathcal{N}(\theta_0, \Lambda)$ with $\Lambda \succ 0$. Furthermore, assume the posterior induced by SGD with constant step size $\varepsilon > 0$ is approximated by the stationary Ornstein–Uhlenbeck law $Q_{\mathrm{SGD}} = \mathcal{N}(\hat\theta, \Sigma)$. By the local quadratic approximation of the objective, the covariance $\Sigma$ satisfies the continuous Lyapunov equation $H\Sigma + \Sigma H = \varepsilon\,\Sigma_g$, where $\Sigma_g \succ 0$ is the gradient noise covariance and $H \succ 0$ is the objective Hessian at the optimum $\hat\theta$. We assume that $H$ and $\Sigma_g$ commute, and that $H$ is symmetric and satisfies $mI \preceq H \preceq MI$ for some constants $0 < m \le M < \infty$. Apply Lemma 13 with $\mu_Q = \hat\theta$, $\Sigma_Q = \Sigma$, $\mu_P = \theta_0$, and $\Sigma_P = \Lambda$. This yields
$$\mathrm{KL}(Q_{\mathrm{SGD}} \,\|\, P) = \frac{1}{2}\left(\mathrm{tr}(\Lambda^{-1}\Sigma) + (\hat\theta - \theta_0)^\top \Lambda^{-1}(\hat\theta - \theta_0) - d + \log\frac{\det(\Lambda)}{\det(\Sigma)}\right). \tag{12}$$
The remaining task is to upper bound the trace term and the logarithmic determinant ratio in a way that makes the dependence on $\varepsilon$, $\Sigma_g$, and the constants $m$ and $M$ explicit. First, apply Lemma 16, which gives $\Sigma \preceq \frac{\varepsilon}{2m}\Sigma_g$. Since $\Lambda^{-1} \succ 0$, this inequality implies $\Lambda^{-1/2}\Sigma\Lambda^{-1/2} \preceq \frac{\varepsilon}{2m}\Lambda^{-1/2}\Sigma_g\Lambda^{-1/2}$, and taking traces yields
$$\mathrm{tr}(\Lambda^{-1}\Sigma) = \mathrm{tr}(\Lambda^{-1/2}\Sigma\Lambda^{-1/2}) \le \frac{\varepsilon}{2m}\,\mathrm{tr}(\Lambda^{-1/2}\Sigma_g\Lambda^{-1/2}) = \frac{\varepsilon}{2m}\,\mathrm{tr}(\Lambda^{-1}\Sigma_g).$$
Second, apply Lemma 16 again, which also gives $\Sigma \succeq \frac{\varepsilon}{2M}\Sigma_g$. This inequality implies that the eigenvalues of $\Sigma$ dominate the eigenvalues of $\frac{\varepsilon}{2M}\Sigma_g$ when both collections are arranged in nondecreasing order, and therefore the product of the eigenvalues of $\Sigma$ is at least the product of the eigenvalues of $\frac{\varepsilon}{2M}\Sigma_g$. Consequently,
$$\det(\Sigma) \ge \det\left(\frac{\varepsilon}{2M}\Sigma_g\right) = \left(\frac{\varepsilon}{2M}\right)^d \det(\Sigma_g),$$
where the last equality uses the basic scaling rule for determinants. Taking logarithms and rearranging yields
$$\log\frac{\det(\Lambda)}{\det(\Sigma)} \le \log\det(\Lambda) - \log\det(\Sigma_g) - d\log\left(\frac{\varepsilon}{2M}\right).$$
Substituting the preceding two bounds into eq. (12) yields the claimed inequality, eq. (8), and this completes the proof.

C.8 BUDGET ALLOCATION

Derivation of the uniform-cost baseline $K^\star = 1$. Assume that the sampling budget satisfies $nK \le B$ for some $B > 0$, and that each rollout has the same cost, so the constraint depends only on the product $nK$. Consider the leading-order sampling structure in Lemma 4 and ignore multiplicative constants that do not depend on $K$. The resulting proxy has the form
$$\frac{1}{\sqrt{n}} + \frac{1}{\sqrt{nK}}.$$
Under the constraint $nK \le B$, one may take $n = B/K$ without loss of generality when minimising the proxy over $K \ge 1$. Substituting $n = B/K$ yields
$$\frac{1}{\sqrt{n}} + \frac{1}{\sqrt{nK}} = \sqrt{\frac{K}{B}} + \frac{1}{\sqrt{B}}.$$
The second term does not depend on $K$, and the first term is strictly increasing in $K$ for $K \ge 1$. Therefore the proxy is minimised by the smallest admissible value of $K$, which is $K^\star = 1$.

Proof of Corollary 4.
Let $B > 0$, $c_{\mathrm{prefill}} > 0$, and $c_{\mathrm{decode}} > 0$ be given. Assume the budget constraint $B \ge n\,c_{\mathrm{prefill}} + nK\,c_{\mathrm{decode}}$, and consider the leading-order sampling structure induced by Lemma 4. As in the statement, treat $K$ as a continuous variable with $K \ge 1$ and ignore multiplicative constants and logarithmic factors that do not depend on $K$. The sampling proxy can be written in the form
$$E(n, K) = \frac{1}{\sqrt{n}} + \frac{1}{\sqrt{nK}}.$$
Under the constraint, the choice
$$n = \frac{B}{c_{\mathrm{prefill}} + K\,c_{\mathrm{decode}}}$$
saturates the budget and maximises $n$ for a given $K$, hence it minimises $E(n, K)$ for that $K$. Substituting this expression for $n$ gives an objective that depends only on $K$:
$$E(K) = \sqrt{\frac{c_{\mathrm{prefill}} + K\,c_{\mathrm{decode}}}{B}}\left(1 + \frac{1}{\sqrt{K}}\right).$$
Since $B$ is constant, minimising $E(K)$ over $K \ge 1$ is equivalent to minimising the squared objective
$$F(K) := \left(c_{\mathrm{prefill}} + K\,c_{\mathrm{decode}}\right)\left(1 + \frac{1}{\sqrt{K}}\right)^2.$$
Expanding the square gives
$$F(K) = \left(c_{\mathrm{prefill}} + K\,c_{\mathrm{decode}}\right)\left(1 + \frac{2}{\sqrt{K}} + \frac{1}{K}\right) = \left(c_{\mathrm{prefill}} + K\,c_{\mathrm{decode}}\right) + 2\left(c_{\mathrm{prefill}} + K\,c_{\mathrm{decode}}\right)K^{-1/2} + \left(c_{\mathrm{prefill}} + K\,c_{\mathrm{decode}}\right)K^{-1}.$$
Differentiating term by term yields
$$F'(K) = c_{\mathrm{decode}} + 2\left(c_{\mathrm{decode}}K^{-1/2} - \tfrac{1}{2}\left(c_{\mathrm{prefill}} + K\,c_{\mathrm{decode}}\right)K^{-3/2}\right) + \left(c_{\mathrm{decode}}K^{-1} - \left(c_{\mathrm{prefill}} + K\,c_{\mathrm{decode}}\right)K^{-2}\right).$$
Simplifying this expression gives
$$F'(K) = c_{\mathrm{decode}} + c_{\mathrm{decode}}K^{-1/2} - c_{\mathrm{prefill}}K^{-3/2} - c_{\mathrm{prefill}}K^{-2}.$$
Multiplying by $K^2$ yields an equivalent first-order condition:
$$K^2 F'(K) = c_{\mathrm{decode}}K^2 + c_{\mathrm{decode}}K^{3/2} - c_{\mathrm{prefill}}K^{1/2} - c_{\mathrm{prefill}}.$$
Let $u = \sqrt{K}$, so that $K = u^2$ and $K^{3/2} = u^3$. The condition $F'(K) = 0$ is equivalent to
$$c_{\mathrm{decode}}u^4 + c_{\mathrm{decode}}u^3 - c_{\mathrm{prefill}}u - c_{\mathrm{prefill}} = 0,$$
and the polynomial factors as
$$c_{\mathrm{decode}}u^3(u+1) - c_{\mathrm{prefill}}(u+1) = (u+1)\left(c_{\mathrm{decode}}u^3 - c_{\mathrm{prefill}}\right).$$
Since $u = \sqrt{K} \ge 1$, one has $u + 1 > 0$, so any interior stationary point satisfies $c_{\mathrm{decode}}u^3 = c_{\mathrm{prefill}}$. Therefore
$$u = \left(\frac{c_{\mathrm{prefill}}}{c_{\mathrm{decode}}}\right)^{1/3}, \qquad K = u^2 = \left(\frac{c_{\mathrm{prefill}}}{c_{\mathrm{decode}}}\right)^{2/3}.$$
This is the interior stationary point. Because the optimisation domain is $K \ge 1$, the continuous proxy minimiser is
$$K^\star = \max\left\{1,\ \left(\frac{c_{\mathrm{prefill}}}{c_{\mathrm{decode}}}\right)^{2/3}\right\}.$$

Proof of Corollary 5. Let $Z$ denote the per-rollout contribution used in the empirical objective. Assume that the estimator averages $Z$ over $n$ independent prompts and $K$ independent rollouts per prompt. Define the two-stage variance quantities
$$\sigma^2_{\mathrm{prompt}} := \mathrm{Var}\left(\mathbb{E}[Z \mid X]\right), \qquad \sigma^2_{\mathrm{rollout}} := \mathbb{E}\left[\mathrm{Var}(Z \mid X)\right].$$
The standard variance decomposition for a two-stage average yields the proxy
$$V(n, K) = \frac{\sigma^2_{\mathrm{prompt}}}{n} + \frac{\sigma^2_{\mathrm{rollout}}}{nK}.$$
Assume the budget constraint $B \ge n\,c_{\mathrm{prefill}} + nK\,c_{\mathrm{decode}}$, and substitute the saturated choice $n = B/(c_{\mathrm{prefill}} + K\,c_{\mathrm{decode}})$. This yields
$$V(K) = \frac{c_{\mathrm{prefill}} + K\,c_{\mathrm{decode}}}{B}\left(\sigma^2_{\mathrm{prompt}} + \frac{\sigma^2_{\mathrm{rollout}}}{K}\right).$$
Since $B$ is constant, minimising $V(K)$ over $K \ge 1$ is equivalent to minimising
$$G(K) := \left(c_{\mathrm{prefill}} + K\,c_{\mathrm{decode}}\right)\left(\sigma^2_{\mathrm{prompt}} + \frac{\sigma^2_{\mathrm{rollout}}}{K}\right).$$
Expanding gives
$$G(K) = c_{\mathrm{prefill}}\,\sigma^2_{\mathrm{prompt}} + \frac{c_{\mathrm{prefill}}\,\sigma^2_{\mathrm{rollout}}}{K} + c_{\mathrm{decode}}\,K\,\sigma^2_{\mathrm{prompt}} + c_{\mathrm{decode}}\,\sigma^2_{\mathrm{rollout}}.$$
Differentiating yields
$$G'(K) = -\frac{c_{\mathrm{prefill}}\,\sigma^2_{\mathrm{rollout}}}{K^2} + c_{\mathrm{decode}}\,\sigma^2_{\mathrm{prompt}}.$$
Setting $G'(K) = 0$ gives $c_{\mathrm{decode}}\,\sigma^2_{\mathrm{prompt}} = c_{\mathrm{prefill}}\,\sigma^2_{\mathrm{rollout}}/K^2$, which implies
$$K^2 = \frac{c_{\mathrm{prefill}}}{c_{\mathrm{decode}}} \cdot \frac{\sigma^2_{\mathrm{rollout}}}{\sigma^2_{\mathrm{prompt}}}.$$
Taking square roots yields the interior stationary point
$$K = \sqrt{\frac{c_{\mathrm{prefill}}}{c_{\mathrm{decode}}} \cdot \frac{\sigma^2_{\mathrm{rollout}}}{\sigma^2_{\mathrm{prompt}}}}.$$
Because the optimisation domain is $K \ge 1$, the continuous proxy minimiser is
$$K^\star = \max\left\{1,\ \sqrt{\frac{c_{\mathrm{prefill}}}{c_{\mathrm{decode}}} \cdot \frac{\sigma^2_{\mathrm{rollout}}}{\sigma^2_{\mathrm{prompt}}}}\right\}.$$
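The closed forms for $K^\star$ in Corollaries 4 and 5 can be sanity-checked by minimising the cost-constrained proxies directly. The sketch below uses illustrative values ($c_{\mathrm{prefill}} = 8$, $c_{\mathrm{decode}} = 1$, $\sigma^2_{\mathrm{prompt}} = 1$, $\sigma^2_{\mathrm{rollout}} = 4$ are assumptions for this example, not values from the paper) and compares a grid minimiser of $F$ and $G$ against the stated formulas.

```python
import math

# Illustrative costs and variances (assumptions, chosen so both optima are interior).
c_pre, c_dec = 8.0, 1.0
s2_prompt, s2_rollout = 1.0, 4.0

def F(K):
    """Squared sampling proxy of Corollary 4: (c_pre + K*c_dec) * (1 + 1/sqrt(K))^2."""
    return (c_pre + K * c_dec) * (1 + 1 / math.sqrt(K)) ** 2

def G(K):
    """Variance proxy of Corollary 5: (c_pre + K*c_dec) * (s2_prompt + s2_rollout/K)."""
    return (c_pre + K * c_dec) * (s2_prompt + s2_rollout / K)

# Grid search over K in [1, 21] with step 0.001.
grid = [1 + 0.001 * i for i in range(20001)]
K_F = min(grid, key=F)
K_G = min(grid, key=G)

# Closed-form minimisers from the corollaries.
K_F_theory = max(1.0, (c_pre / c_dec) ** (2 / 3))                         # = 4.0
K_G_theory = max(1.0, math.sqrt(c_pre / c_dec * s2_rollout / s2_prompt))  # = sqrt(32)

print(K_F, K_F_theory, K_G, K_G_theory)
```

With these numbers the grid minimisers land at $K \approx 4.0$ and $K \approx 5.657$, matching $(c_{\mathrm{prefill}}/c_{\mathrm{decode}})^{2/3} = 4$ and $\sqrt{32} \approx 5.657$ to within the grid resolution.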
