Towards End-to-End Alignment of User Satisfaction via Questionnaire in Video Recommendation
Short-video recommender systems typically optimize ranking models using dense user behavioral signals, such as clicks and watch time. However, these signals are only indirect proxies of user satisfaction and often suffer from noise and bias. Recently, explicit satisfaction feedback collected through questionnaires has emerged as a high-quality source of direct alignment supervision, but it is extremely sparse and easily overwhelmed by abundant behavioral data, making it difficult to incorporate into online recommendation models. To address these challenges, we propose a novel framework for End-to-End Alignment of user Satisfaction via Questionnaire, named EASQ, which enables real-time alignment of ranking models with true user satisfaction. Specifically, we first construct an independent parameter pathway for sparse questionnaire signals by combining a multi-task architecture with a lightweight LoRA module. The multi-task design separates sparse satisfaction supervision from dense behavioral signals, preventing the former from being overwhelmed. The LoRA module pre-injects these preferences in a parameter-isolated manner, ensuring backbone stability while optimizing user satisfaction. Furthermore, we employ a DPO-based optimization objective tailored for online learning, which aligns the main model's outputs with sparse satisfaction signals in real time. This design enables end-to-end online learning, allowing the model to continuously adapt to new questionnaire feedback while maintaining the stability and effectiveness of the backbone. Extensive offline experiments and large-scale online A/B tests demonstrate that EASQ consistently improves user satisfaction metrics across multiple scenarios. EASQ has been deployed in a production short-video recommendation system, delivering significant and stable business gains.
💡 Research Summary
The paper addresses a fundamental challenge in short‑video recommendation: traditional ranking models are trained on dense behavioral signals such as clicks, watch time, likes, and skips, which are only indirect proxies for true user satisfaction. While explicit satisfaction feedback collected via questionnaires provides high‑quality supervision, it is extremely sparse (≈0.5 % of video views trigger a questionnaire, with less than 2 % response rate) and can easily be drowned out by the massive volume of behavioral data during model training.
To solve this, the authors propose EASQ (End‑to‑End Alignment of user Satisfaction via Questionnaire), a framework that enables real‑time alignment of ranking models with explicit satisfaction signals in an online learning setting. The core ideas are:
- Decoupled Multi‑Task Architecture – The model contains two expert networks: a main task network that learns from abundant behavioral logs, and a satisfaction‑alignment task network that learns exclusively from questionnaire responses. By separating the two tasks at the upper layers, the sparse questionnaire supervision is prevented from being overwhelmed.
- Lightweight LoRA Injection – A Low‑Rank Adaptation (LoRA) module is inserted at the embedding stage of the backbone (a Transformer‑based encoder). LoRA adds a small set of low‑rank weight matrices that are trained only on questionnaire data, while the original backbone weights remain frozen. This early‑stage injection allows satisfaction information to influence item representations without destabilizing the main model.
- Online‑Tailored Direct Preference Optimization (DPO) – Inspired by recent preference‑learning work in large language models, the authors adopt a DPO loss that directly optimizes the policy using pairwise preference data derived from questionnaires (e.g., "Satisfied" > "Unsatisfied"). Unlike standard RLHF, which relies on a fixed reference model, EASQ treats the current main model as the reference, enabling continual online updates. The loss is:
L_DPO = −log σ( β [ log( π_θ(y_w | x) / π_ref(y_w | x) ) − log( π_θ(y_l | x) / π_ref(y_l | x) ) ] )

where y_w and y_l denote the preferred ("Satisfied") and dispreferred ("Unsatisfied") responses for context x, σ is the sigmoid function, β is a temperature hyperparameter, π_θ is the policy being optimized, and π_ref is the reference policy (here, the current main model).
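The first two components above can be sketched together: a frozen embedding backbone with a trainable low-rank adapter, feeding two decoupled task heads. This is a minimal PyTorch sketch under stated assumptions; the class names, head structure, and dimensions are illustrative and not taken from the paper, which uses a Transformer-based encoder rather than the toy heads shown here.

```python
import torch
import torch.nn as nn

class LoRAEmbedding(nn.Module):
    """Frozen embedding table plus a low-rank adapter (trained only on questionnaire data)."""
    def __init__(self, num_items: int, dim: int, rank: int = 4):
        super().__init__()
        self.base = nn.Embedding(num_items, dim)
        self.base.weight.requires_grad_(False)        # backbone weights stay frozen
        self.lora_A = nn.Embedding(num_items, rank)   # low-rank factors (trainable)
        self.lora_B = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.lora_B.weight)            # adapter starts at zero: no initial drift

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        # Frozen base representation + parameter-isolated satisfaction correction.
        return self.base(item_ids) + self.lora_B(self.lora_A(item_ids))

class EASQRankerSketch(nn.Module):
    """Shared LoRA-injected embeddings feeding two decoupled expert heads:
    a main head for dense behavioral signals and a satisfaction-alignment head
    supervised only by sparse questionnaire responses."""
    def __init__(self, num_items: int, dim: int = 16):
        super().__init__()
        self.embed = LoRAEmbedding(num_items, dim)
        self.main_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.satisfaction_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, item_ids: torch.Tensor):
        h = self.embed(item_ids)
        return self.main_head(h).squeeze(-1), self.satisfaction_head(h).squeeze(-1)
```

Because `lora_B` is zero-initialized, the adapter contributes nothing at the start, so the backbone's behavior is unchanged until questionnaire gradients begin flowing through the low-rank path.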
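The DPO objective above can be written as a short PyTorch loss function. This is a sketch, not the paper's implementation: how log-probabilities are obtained from ranking scores is left out, and detaching the current main model's log-probabilities to serve as π_ref is this sketch's reading of the paper's "current model as reference" design.

```python
import torch
import torch.nn.functional as F

def online_dpo_loss(policy_logp_w: torch.Tensor,
                    policy_logp_l: torch.Tensor,
                    ref_logp_w: torch.Tensor,
                    ref_logp_l: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """L_DPO = -log sigmoid(beta * [(logp_w - ref_w) - (logp_l - ref_l)]).

    policy_*: log-probs under the policy being optimized (pi_theta).
    ref_*:    log-probs under the reference; detached here because the
              reference is the current main model, updated continually online.
    """
    margin = beta * ((policy_logp_w - ref_logp_w.detach())
                     - (policy_logp_l - ref_logp_l.detach()))
    return -F.logsigmoid(margin).mean()
```

When the policy exactly matches the reference the margin is zero and the loss is log 2; widening the gap between the "Satisfied" and "Unsatisfied" responses relative to the reference drives the loss down, which is the behavior the pairwise questionnaire supervision is meant to induce.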