Tangent Space Fine-Tuning for Directional Preference Alignment in Large Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Our goal is to enable large language models (LLMs) to balance multiple human preference dimensions, such as helpfulness, safety, and verbosity, through principled and controllable alignment. Existing preference optimization methods, including Direct Preference Optimization (DPO), collapse feedback into a single scalar reward, fixing one balance among objectives and preventing traversal of the Pareto front. Recent work by Ortiz-Jimenez et al. (2023) showed that fine-tuning can be viewed in a model’s tangent space, where linearized updates act as additive vectors that can be composed to jointly perform well on multiple tasks. Building on this formulation, we extend the idea to preference alignment and propose Tangent-Space Direct Preference Optimization (TS-DPO), which performs DPO within this locally linear regime to learn per-objective update directions. These directions can be linearly combined at inference to generate user-specified behaviors without additional optimization. Evaluated on the helpfulness–verbosity trade-off using the HelpSteer and UltraFeedback datasets, TS-DPO achieves broader Pareto-optimal coverage and smoother preference control than scalarized DPO. Canonical Correlation Analysis (CCA) further shows that tangent-space training amplifies canonical directions aligned with distinct preferences, improving disentanglement.


💡 Research Summary

This paper addresses the challenge of simultaneously satisfying multiple human preference dimensions—such as helpfulness, safety, and verbosity—in large language models (LLMs). Conventional Direct Preference Optimization (DPO) collapses pairwise human feedback into a single scalar reward, yielding only one point on the Pareto frontier and requiring retraining whenever a different trade‑off is desired. Inspired by Ortiz‑Jimenez et al.’s (2023) observation that fine‑tuning updates behave linearly in a model’s tangent space, the authors propose Tangent‑Space Direct Preference Optimization (TS‑DPO).

TS‑DPO freezes a pretrained instruction‑tuned model (θ₀) and approximates the effect of any parameter change Δθ with a first‑order Taylor expansion: f(x;θ₀+Δθ) ≈ f(x;θ₀) + J_{θ₀}(x)·Δθ, where J_{θ₀}(x) is the Jacobian of the model output with respect to its parameters. Within this linearized regime, the standard DPO loss is applied, but only the tangent‑space update vectors (dparams) are optimized. Crucially, separate update vectors τ_help and τ_verb are learned, one per preference axis (helpfulness and verbosity). At inference time the model can be re‑parameterized as θ(λ) = θ₀ + λ₁τ_help + λ₂τ_verb, where λ₁ and λ₂ are user‑specified scalars. This makes the model’s behavior continuously controllable without any additional fine‑tuning, reward modeling, or separate checkpoints.
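The additivity that makes this re-parameterization work can be illustrated with a toy numpy sketch (the `tanh` model, finite-difference Jacobian, and τ vectors below are hypothetical stand-ins, not the paper's setup):

```python
import numpy as np

# Toy scalar model f(x; θ) = tanh(θ·x), a stand-in for an LLM output.
def f(x, theta):
    return np.tanh(theta @ x)

def jacobian(x, theta, eps=1e-6):
    """Finite-difference Jacobian of f w.r.t. θ at θ₀ (the paper uses exact JVPs)."""
    J = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta); d[i] = eps
        J[i] = (f(x, theta + d) - f(x, theta - d)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
x = rng.normal(size=4)
theta0 = rng.normal(size=4)           # frozen base parameters θ₀
tau_help = 0.01 * rng.normal(size=4)  # per-objective update directions
tau_verb = 0.01 * rng.normal(size=4)  # (learned via DPO in the actual method)

J = jacobian(x, theta0)
base = f(x, theta0)

def f_linearized(lam1, lam2):
    # First-order Taylor expansion: f(x; θ₀+Δθ) ≈ f(x; θ₀) + J·Δθ
    delta = lam1 * tau_help + lam2 * tau_verb
    return base + J @ delta

# In the tangent space, the λ-weighted combination is exactly the sum of
# each objective's individual linear contribution.
combined = f_linearized(0.7, 0.3)
separate = base + 0.7 * (J @ tau_help) + 0.3 * (J @ tau_verb)
assert np.isclose(combined, separate)
```

Because the linearized model is literally affine in Δθ, sweeping (λ₁, λ₂) traces a continuum of behaviors with no further optimization.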

Implementation uses PyTorch’s functorch library: make_functional_with_buffers extracts the frozen base parameters, and a Jacobian‑vector product (JVP) efficiently computes the linear contribution of Δθ during each forward pass. Only the last 16 transformer layers and the language‑model head are made trainable, keeping the number of optimized parameters modest.
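A minimal sketch of this linearized forward pass, using `torch.func` (the modern replacement for the `functorch` API the paper uses); the tiny model and "final layer only" split are illustrative, not the paper's Llama-3.2-1B configuration:

```python
import torch
from torch import nn
from torch.func import functional_call, jvp

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.Tanh(), nn.Linear(8, 4))

# Frozen base parameters θ₀ (the paper extracts these with
# make_functional_with_buffers).
params0 = {k: v.detach() for k, v in model.named_parameters()}

# Tangent-space update Δθ: nonzero only for the final layer, a stand-in for
# training just the last 16 transformer layers + LM head. In TS-DPO these
# entries are the optimized dparams.
dparams = {k: (1e-3 * torch.randn_like(v) if k.startswith("2.")
               else torch.zeros_like(v))
           for k, v in params0.items()}

x = torch.randn(3, 8)

# A single JVP returns f(x; θ₀) and J_{θ₀}(x)·Δθ together; their sum is the
# linearized output used in the DPO loss.
out0, jvp_out = jvp(lambda p: functional_call(model, p, (x,)),
                    (params0,), (dparams,))
linearized = out0 + jvp_out
```

The JVP avoids ever materializing the full Jacobian: one forward-mode pass per batch suffices, which is what keeps the overhead modest.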

Experiments are conducted on Llama‑3.2‑1B‑Instruct as the base model. Helpfulness data come from UltraFeedback, while verbosity data are drawn from HelpSteer2. For each axis, 6 000 pairwise preference examples are used for training and 2 000 for validation. TS‑DPO is compared against two baselines: (1) DPO‑Mixed, which scalarizes both datasets into a single reward and thus learns only one trade‑off; and (2) Task‑Vector DPO, which trains two independent DPO models and later linearly combines their parameter deltas. Evaluation consists of (i) pairwise preference accuracy on held‑out DPO datasets (measuring latent utility) and (ii) reward‑model scoring of free‑form generations (measuring surface behavior). By sweeping λ₁ and λ₂ across convex (λ₁+λ₂=1), affine (λ₁ fixed, λ₂ varied), and extrapolated (λ₂ up to 5) regimes, Pareto frontiers are plotted for each method.
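The three λ-sweep regimes can be sketched as simple grids (the grid resolution and the fixed λ₁ value below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Convex regime: λ₁ + λ₂ = 1, trading helpfulness against verbosity.
convex = [(l1, 1.0 - l1) for l1 in np.linspace(0.0, 1.0, 11)]

# Affine regime: λ₁ held fixed (0.5 here, hypothetical), λ₂ varied.
affine = [(0.5, l2) for l2 in np.linspace(0.0, 2.0, 11)]

# Extrapolated regime: λ₂ pushed beyond the training range, up to 5.
extrapolated = [(0.5, l2) for l2 in np.linspace(0.0, 5.0, 11)]

# Each (λ₁, λ₂) pair re-parameterizes the model as
# θ(λ) = θ₀ + λ₁ τ_help + λ₂ τ_verb, which is then scored on held-out
# preference accuracy and reward-model evaluations to plot the frontier.
```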

Results show that TS‑DPO consistently dominates the scalarized baseline: for any given level of verbosity, TS‑DPO achieves higher helpfulness, and the frontier is smoother, reflecting the additive nature of tangent‑space updates. The Task‑Vector baseline also benefits from linear combination but lags behind TS‑DPO because its updates are learned in the full non‑linear space, leading to less precise composability. Canonical Correlation Analysis (CCA) reveals that τ_help and τ_verb align with distinct canonical directions in the model’s internal representations, confirming that the learned updates remain disentangled and interpretable.
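The disentanglement analysis rests on standard CCA. A hedged numpy sketch of SVD-based canonical correlations, applied to toy "representations" with one shared latent (this is the generic technique, not the paper's exact pipeline):

```python
import numpy as np

def cca_correlations(X, Y, eps=1e-8):
    """Canonical correlations between two feature matrices (rows = samples),
    via whitening each view and taking the SVD of the cross-covariance."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)

    def inv_sqrt(C):
        # Inverse matrix square root of a covariance via eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

    n = len(X)
    Cxx = X.T @ X / (n - 1)
    Cyy = Y.T @ Y / (n - 1)
    Cxy = X.T @ Y / (n - 1)
    M = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance = canonical correlations.
    return np.linalg.svd(M, compute_uv=False)

rng = np.random.default_rng(1)
# Toy hidden representations: the first coordinate of each view is driven by a
# shared latent z; the remaining coordinates are independent noise.
z = rng.normal(size=(500, 1))
X = np.hstack([z, rng.normal(size=(500, 3))])
Y = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 3))])
corrs = cca_correlations(X, Y)  # one strong canonical direction, rest near zero
```

In the paper's analysis, a spectrum like this, with a few dominant canonical directions per preference axis, is what indicates that τ_help and τ_verb occupy distinct, interpretable subspaces.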

Training overhead is modest: the extra JVP operation adds roughly one hour to a one‑epoch run on a V100 GPU, while on an H100 the entire TS‑DPO training finishes in about 15 minutes, demonstrating practical scalability.

In summary, TS‑DPO introduces a principled, efficient framework for multi‑objective preference alignment by learning separate tangent‑space update vectors for each objective and linearly combining them at inference. This approach eliminates the need for repeated fine‑tuning, provides smooth and interpretable control over trade‑offs, and empirically yields broader Pareto coverage than existing scalarized or task‑vector methods. Future work may extend the method to additional dimensions (e.g., safety, creativity), explore higher‑order approximations, or integrate meta‑learning to further enhance modularity and control.

