On the Reliability of User-Centric Evaluation of Conversational Recommender Systems

User-centric evaluation has become a key paradigm for assessing Conversational Recommender Systems (CRS), aiming to capture subjective qualities such as satisfaction, trust, and rapport. To enable scalable evaluation, recent work increasingly relies on third-party annotations of static dialogue logs by crowd workers or large language models. However, the reliability of this practice remains largely unexamined. In this paper, we present a large-scale empirical study investigating the reliability and structure of user-centric CRS evaluation on static dialogue transcripts. We collected 1,053 annotations from 124 crowd workers on 200 ReDial dialogues using the 18-dimensional CRS-Que framework. Using random-effects reliability models and correlation analysis, we quantify the stability of individual dimensions and their interdependencies. Our results show that utilitarian and outcome-oriented dimensions such as accuracy, usefulness, and satisfaction achieve moderate reliability under aggregation, whereas socially grounded constructs such as humanness and rapport are substantially less reliable. Furthermore, many dimensions collapse into a single global quality signal, revealing a strong halo effect in third-party judgments. These findings challenge the validity of single-annotator and LLM-based evaluation protocols and motivate the need for multi-rater aggregation and dimension reduction in offline CRS evaluation.


💡 Research Summary

This paper investigates the reliability of user‑centric evaluation of Conversational Recommender Systems (CRS) when judgments are made on static dialogue transcripts by third‑party annotators. The authors focus on the 18‑dimensional CRS‑Que questionnaire, which operationalises subjective aspects such as accuracy, usefulness, satisfaction, trust, humanness, and rapport. To answer two research questions—(RQ1) how reliably individual dimensions can be assessed, and (RQ2) how the dimensions relate to each other—the authors conduct a large‑scale crowdsourcing study.

Data and Annotation Procedure
They sample 200 dialogues from the ReDial movie‑recommendation corpus and recruit 124 native‑English crowd workers via Prolific. Each worker rates ten dialogues (nine random, one low‑quality “quasi‑gold” control) on a 5‑point Likert scale for all 18 CRS‑Que items, adopting the perspective of the information seeker. Quality control includes two explicit attention checks, a quasi‑gold filter, and removal of submissions with unrealistic completion times. After filtering, 117 high‑quality workers remain, providing a total of 1,053 ratings (average 5.27 ratings per dialogue).
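The reported counts can be reproduced with a quick sanity check, assuming the quasi-gold control dialogue is excluded from the analysis set so that each retained worker contributes nine scored dialogues:

```python
# Consistency check on the reported annotation counts
# (117 retained workers, 9 scored dialogues each, 200 dialogues).
workers = 117
dialogues_per_worker = 9       # 10 rated minus 1 quasi-gold control
n_dialogues = 200

total = workers * dialogues_per_worker
print(total)                   # 1053 ratings
print(total / n_dialogues)     # 5.265, i.e. ~5.27 ratings per dialogue
```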

Statistical Framework
The authors first perform an a‑priori power analysis, showing that with N = 200 dialogues and k ≈ 5 raters per dialogue they have 80 % power to detect an ICC as low as 0.06. For reliability they compute:

  1. One‑way random‑effects ICC(1) and ICC(1,k) – measuring absolute agreement without accounting for systematic rater bias.
  2. Crossed random‑effects model – including random intercepts for both dialogues and raters, yielding two reliability metrics:
    * Rel_single_dial = σ²_dialogue / (σ²_dialogue + σ²_rater + σ²_resid) – reliability of a single annotator after controlling for rater bias.
    * Rel(k)_dial = σ²_dialogue / (σ²_dialogue + (σ²_rater + σ²_resid) / k) – reliability of the mean rating averaged over k raters (k ≈ 5.14), where the rater and residual variance shrink by a factor of k.
  3. Krippendorff’s α (ordinal) – a rank‑based agreement statistic robust to scale differences.

Interpretation follows conventional thresholds: ICC < 0.5 is poor, 0.5–0.75 moderate, and > 0.75 good; a Krippendorff's α above 0.67 supports only tentative conclusions.
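The two crossed-model reliability quantities above reduce to simple ratios of variance components. A minimal sketch, using illustrative made-up variance values rather than the paper's estimates:

```python
# Reliability from variance components of a crossed random-effects model
# (random intercepts for dialogues and raters). Values are illustrative.

def rel_single(var_dialogue: float, var_rater: float, var_resid: float) -> float:
    """Reliability of a single annotator's rating, controlling for rater bias."""
    return var_dialogue / (var_dialogue + var_rater + var_resid)

def rel_k(var_dialogue: float, var_rater: float, var_resid: float, k: float) -> float:
    """Reliability of the mean over k raters: averaging shrinks the
    rater and residual variance by a factor of k (Spearman-Brown style)."""
    return var_dialogue / (var_dialogue + (var_rater + var_resid) / k)

# Illustrative components for a dimension with single-rater reliability 0.25:
v_dial, v_rater, v_resid = 0.25, 0.15, 0.60
print(round(rel_single(v_dial, v_rater, v_resid), 2))   # 0.25
print(round(rel_k(v_dial, v_rater, v_resid, 5.14), 2))  # 0.63
```

This makes the paper's aggregation argument concrete: a dimension with poor single-rater reliability can still reach moderate reliability once roughly five ratings are averaged.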

Results – RQ1 (Reliability)
Table 1 (reproduced in the paper) shows that utilitarian dimensions such as Accuracy, Satisfaction, Perceived Usefulness, CUI Understanding, Perceived Ease of Use, Trust Confidence, and CUI Attentiveness achieve Rel(k)_dial values between 0.60 and 0.69 (moderate under the thresholds above) and Rel_single_dial around 0.25–0.30. Socially grounded constructs—CUI Humanness, CUI Rapport, Interaction Adequacy—have Rel(k)_dial ≈ 0.40–0.48 and Rel_single_dial as low as 0.12. ICC(1) is essentially zero for most dimensions, indicating that raw absolute agreement is dominated by rater‑specific leniency or severity. Krippendorff’s α ranges from 0.41 (CUI Humanness) to 0.69 (Accuracy), mirroring the pattern of Rel_single_dial and suggesting that raters agree more on relative ordering than on absolute scores.

Results – RQ2 (Structure)
Spearman correlation analysis across dialogues reveals a dense, highly positive correlation matrix. Hierarchical clustering shows that most dimensions collapse into a single large cluster, evidencing a strong halo effect: a dialogue judged positively on one dimension tends to receive high scores on many others, regardless of theoretical distinctiveness. Transparency is the only dimension with a slightly weaker connection, but still part of the overall cluster.
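The RQ2 analysis pipeline can be sketched as follows. The data here are synthetic stand-ins generated with a single shared "global quality" factor, which induces a halo-like correlation pattern similar to the one the paper reports; the factor strengths and clustering threshold are illustrative choices, not the paper's settings:

```python
# Sketch: Spearman correlations between per-dialogue dimension scores,
# then hierarchical clustering of the dimensions on 1 - rho distances.
import numpy as np
from scipy.stats import spearmanr
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
n_dialogues, n_dims = 200, 6
# One shared latent factor plus dimension-specific noise -> halo effect.
global_quality = rng.normal(size=(n_dialogues, 1))
scores = global_quality + 0.5 * rng.normal(size=(n_dialogues, n_dims))

rho, _ = spearmanr(scores)            # n_dims x n_dims correlation matrix
dist = 1.0 - rho                      # correlation -> dissimilarity
iu = np.triu_indices(n_dims, k=1)     # condensed form for linkage()
Z = linkage(dist[iu], method="average")
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)                         # most dimensions land in one cluster
```

With a dominant shared factor, pairwise dissimilarities stay small and nearly all dimensions merge into a single cluster, mirroring the collapse into one global quality signal described above.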

Interpretation and Implications
The findings challenge two implicit assumptions prevalent in current CRS evaluation practice: (1) that external annotators can reliably infer subjective user experiences from static logs, and (2) that the multi‑dimensional questionnaire captures orthogonal constructs. The study demonstrates that while outcome‑oriented dimensions are moderately reliable when multiple raters are aggregated, socially grounded dimensions are not. Moreover, the pervasive halo effect suggests that treating each dimension independently may inflate perceived system quality and obscure specific weaknesses.

Practical recommendations include:

  • Multi‑rater aggregation – at least five independent annotations per dialogue are needed to achieve acceptable reliability for most utilitarian dimensions.
  • Dimension reduction – given the high inter‑correlations, applying factor analysis or principal component analysis can identify a smaller set of latent factors (e.g., “effectiveness”, “trust”, “social presence”) for more parsimonious reporting.
  • Re‑design of social constructs – to improve reliability, future questionnaires might incorporate cues that are observable in static transcripts (e.g., explicit empathy statements, turn‑taking patterns) or supplement with interaction logs that capture non‑verbal signals.
  • Caution with LLM‑based automatic evaluators – since LLMs are typically trained on human annotations, any systematic bias or low reliability in the human data will propagate to the model, potentially leading to over‑optimistic automatic scores.
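The dimension-reduction recommendation can be illustrated with a small PCA sketch (computed via an eigendecomposition of the covariance matrix). The scores below are synthetic stand-ins built around one shared "halo" factor, not the paper's ratings; the noise scale is an arbitrary choice:

```python
# Sketch: how much variance a single global component absorbs when
# ratings are dominated by a shared halo factor. Synthetic data only.
import numpy as np

rng = np.random.default_rng(1)
n_dialogues, n_dims = 200, 18
quality = rng.normal(size=(n_dialogues, 1))         # shared "halo" factor
scores = quality + 0.6 * rng.normal(size=(n_dialogues, n_dims))

X = scores - scores.mean(axis=0)                    # center each dimension
cov = X.T @ X / (n_dialogues - 1)
eigvals = np.linalg.eigvalsh(cov)[::-1]             # descending eigenvalues
ratio = eigvals / eigvals.sum()
print(f"PC1 explains {ratio[0]:.0%} of the variance")
```

A first component that explains most of the variance is exactly the signature of the halo effect the paper reports, and it is what would justify reporting a few latent factors instead of 18 separate dimensions.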

Limitations and Future Work
The study is confined to movie‑recommendation dialogues; other domains (music, e‑commerce) may exhibit different reliability patterns. The “information seeker” perspective may not capture the full spectrum of user roles (e.g., expert vs. novice). The paper does not directly compare human annotations with LLM‑generated scores, leaving open the question of how much LLMs can mitigate or exacerbate the identified reliability issues.

Conclusion
User‑centric evaluation of CRS on static dialogue logs is feasible for utilitarian, outcome‑oriented dimensions when multiple annotators are used, but it is unreliable for socially grounded constructs. The strong inter‑dimensional correlations indicate a halo effect that can mislead researchers who treat each dimension as independent. To obtain trustworthy offline evaluations, the community should adopt multi‑rater aggregation, consider dimensionality reduction, and, where possible, complement static‑log assessments with real‑time user interaction data.
