Learning to Rank Caption Chains for Video-Text Alignment
Direct preference optimization (DPO) is an effective technique for training language models to favor preferred over dispreferred responses. However, this binary “winner-takes-all” approach is suboptimal for vision-language models, whose response quality depends heavily on visual content. In particular, a response may still be faithful to the visual inputs even if it is less preferable than an alternative. The standard Bradley-Terry DPO formulation lacks this nuance, upweighting winning responses without regard for whether the “losing” response still maintains high visual fidelity. In this work, we investigate ranking optimization as an alternative that more precisely captures how faithful each response is to the visual inputs. We focus on video-text alignment using detailed video captions, proposing a method to generate challenging, totally ordered caption chains at scale through repeated caption degradation. Our results show that ranking optimization outperforms binary DPO for long-form content generation and assessment, and, importantly, that these approaches require finetuning of the vision encoder to be effective, challenging the view of DPO as a purely language-reweighting process.
💡 Research Summary
The paper tackles a fundamental limitation of Direct Preference Optimization (DPO) when applied to vision‑language models (VLMs) for video‑text alignment. Traditional DPO adopts a binary “winner‑takes‑all” formulation based on the Bradley‑Terry model, which only distinguishes a preferred response from a non‑preferred one. In video‑centric tasks, a response that is less preferred can still be highly faithful to the visual input, and the binary loss fails to capture this nuance, leading to sub‑optimal learning of fine‑grained visual details.
To address this, the authors propose a ranking‑based DPO that leverages the Plackett‑Luce model, which generalizes Bradley‑Terry from pairwise comparisons to full rankings. By training on an entire ordered list of captions, the model can assign appropriate reward to each caption according to its relative visual fidelity, rather than treating all “losers” as equally bad.
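To make the generalization concrete, here is a minimal sketch of the two likelihoods. Under Plackett-Luce, a full ranking is scored by repeatedly picking the top remaining item with softmax probability; with exactly two items this reduces to the Bradley-Terry pairwise probability. The function names and toy scores are illustrative, not from the paper:

```python
import math

def plackett_luce_log_likelihood(scores):
    """Log-probability of the ranking scores[0] > scores[1] > ... > scores[-1]
    under the Plackett-Luce model: at each step, the top remaining item is
    selected with softmax probability over all remaining items."""
    log_p = 0.0
    for i in range(len(scores) - 1):  # the last remaining item is chosen with prob. 1
        log_denom = math.log(sum(math.exp(s) for s in scores[i:]))
        log_p += scores[i] - log_denom
    return log_p

def bradley_terry_log_likelihood(winner, loser):
    """Bradley-Terry is the two-item special case: P(w > l) = sigmoid(w - l)."""
    return -math.log1p(math.exp(-(winner - loser)))

# With exactly two items, Plackett-Luce reduces to Bradley-Terry.
assert abs(plackett_luce_log_likelihood([1.2, 0.3])
           - bradley_terry_log_likelihood(1.2, 0.3)) < 1e-12

# A ranking that matches the scores is more likely than its reverse.
assert plackett_luce_log_likelihood([2.0, 1.0, 0.0]) > \
       plackett_luce_log_likelihood([0.0, 1.0, 2.0])
```

This is why a full chain carries more signal than a pair: the Plackett-Luce objective penalizes every misordering within the chain, rather than treating all non-winning captions as interchangeable losers.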
A key technical contribution is the automatic generation of totally ordered caption chains (RCC – Ranked Caption Chain). Starting from a high‑quality, detailed video caption, a large language model (Claude 3.7‑Sonnet) repeatedly introduces visually‑grounded errors selected from a predefined taxonomy (object omission, attribute mis‑labeling, relational errors, temporal inconsistencies, etc.). Each mutation preserves the structure and previously introduced errors, yielding a chain c₁ ≻ c₂ ≻ … ≻ cₙ where the rank directly corresponds to the number of errors. This process ensures (1) a clear total order, (2) minimal linguistic variation across the chain, (3) controllable error distribution, and (4) scalability, as the LLM can generate millions of such chains without human annotation.
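The degradation loop described above can be sketched as follows. The error taxonomy names and the `degrade` callback (which would wrap the LLM call to Claude 3.7 Sonnet in the paper's pipeline) are assumptions for illustration; only the orchestration logic reflects the described procedure:

```python
import random

# Illustrative error taxonomy based on the paper's description; the exact
# categories and prompts used in the actual pipeline are assumptions here.
ERROR_TYPES = [
    "object_omission",
    "attribute_mislabel",
    "relational_error",
    "temporal_inconsistency",
]

def build_caption_chain(caption, degrade, n=4, seed=0):
    """Build a totally ordered chain c1 > c2 > ... > cn by repeatedly applying
    `degrade(caption, error_type)` (e.g., an LLM call) to the previous caption,
    so that the caption at rank i carries exactly i-1 accumulated errors."""
    rng = random.Random(seed)
    chain = [caption]
    for _ in range(n - 1):
        # Each mutation preserves structure and earlier errors while adding
        # one new visually grounded error, guaranteeing a strict total order
        # by error count.
        caption = degrade(caption, rng.choice(ERROR_TYPES))
        chain.append(caption)
    return chain

# Toy stand-in for the LLM degradation call, just to show the data flow.
toy_degrade = lambda c, e: c + f" [{e}]"
chain = build_caption_chain("A dog catches a red frisbee in a park.", toy_degrade, n=3)
assert len(chain) == 3 and chain[0] == "A dog catches a red frisbee in a park."
```

Because each step only appends one new error to the previous caption, the total order falls out of the construction itself, with no need for a separate ranking model or human annotation pass.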
The generated RCC datasets (RCC‑PVD and RCC‑MSR) are used to train two state‑of‑the‑art VLMs—Perception‑LM‑1B and Qwen2.5‑VL‑3B—via LoRA adapters. Three training regimes are compared: (i) binary Bradley‑Terry DPO using only the top two captions of each chain, (ii) Multi‑Preference Optimization (MPO), which treats the top caption as winner and all others as losers, and (iii) the proposed Plackett‑Luce ranking DPO that consumes the full chain. All models are fine‑tuned for 1,000 steps (learning rate 1e‑6) on a single node with eight A100 GPUs.
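The three regimes can be contrasted as loss functions over the implicit DPO rewards r_i = β(log π_θ(c_i) − log π_ref(c_i)) of each caption in a chain. This is a minimal sketch under that standard DPO parameterization; the toy reward values are illustrative, not results from the paper:

```python
import math

def log_sigmoid(x):
    return -math.log1p(math.exp(-x))

def binary_dpo_loss(r):
    """(i) Bradley-Terry DPO: uses only the top two captions of the chain."""
    return -log_sigmoid(r[0] - r[1])

def mpo_loss(r):
    """(ii) MPO: the top caption wins against every other caption,
    treating all non-winners as equally bad losers."""
    return -sum(log_sigmoid(r[0] - ri) for ri in r[1:]) / (len(r) - 1)

def ranking_dpo_loss(r):
    """(iii) Plackett-Luce ranking DPO: negative log-likelihood of the
    full ordering r[0] > r[1] > ... > r[-1], so every adjacent
    misordering in the chain contributes to the gradient."""
    loss = 0.0
    for i in range(len(r) - 1):
        loss -= r[i] - math.log(sum(math.exp(x) for x in r[i:]))
    return loss

# Toy rewards ordered by visual fidelity (most faithful caption first).
r = [1.5, 0.9, 0.2, -0.4]
# A chain ranked consistently with the rewards is cheaper than its reverse.
assert ranking_dpo_loss(r) < ranking_dpo_loss(list(reversed(r)))
```

Note that the binary loss ignores r[2:] entirely, and MPO scores all losers only against the winner; only the Plackett-Luce objective distinguishes a nearly faithful second-place caption from a heavily degraded last-place one.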
Evaluation spans three benchmark families: (a) detailed video captioning on MSR‑VTT, PVD, VDC, and ARGUS, (b) a custom long‑form multiple‑choice QA set designed to probe fine‑grained visual reasoning, and (c) a caption‑matching task (TempCompass). Human‑in‑the‑loop LLM‑as‑Judge scores (Relevance, Descriptiveness, Temporal Consistency, Fluency) show that ranking DPO consistently outperforms binary DPO and MPO across all metrics. Notably, the gains are most pronounced for temporal consistency and descriptiveness, indicating better capture of fine visual details.
A critical finding is that the performance boost from ranking DPO only materializes when the vision encoder is jointly fine‑tuned with the language model. Freezing the visual backbone and only updating the LLM yields marginal improvements, suggesting that DPO is not merely a language‑reweighting technique but also a driver for learning richer visual representations when the visual encoder is allowed to adapt.
The authors also discuss limitations: the error taxonomy is hand‑crafted, potentially missing complex multi‑error scenarios that occur in real videos; the chain generation currently maintains caption length, which may not reflect natural variations where more detail is added rather than only degraded; and the approach assumes access to a powerful LLM for chain creation, which may be a bottleneck for some research groups.
Future directions include expanding the error taxonomy, allowing dynamic length adjustments, incorporating human‑in‑the‑loop feedback to refine rankings, and exploring semi‑supervised or self‑training regimes that blend synthetic RCC data with real human‑ranked examples.
In summary, this work introduces a novel, scalable method for producing fully ordered caption chains and demonstrates that training VLMs with a Plackett‑Luce‑based ranking objective yields superior video‑text alignment, especially for long‑form, detail‑rich generation tasks. It challenges the prevailing view of DPO as a purely language‑centric fine‑tuning method and highlights the importance of jointly adapting visual encoders to fully exploit ranking‑based preference learning.