A benchmark for joint dialogue satisfaction, emotion recognition, and emotion state transition prediction

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv paper.

User satisfaction is critical to enterprises: it directly reflects users' subjective evaluation of a service or product, and it influences customer loyalty and long-term revenue. Monitoring user emotions during interactions helps predict and improve satisfaction. However, Chinese datasets for this task are scarce, and user emotions are dynamic: a single-turn dialogue cannot track emotional changes across multiple turns, which limits satisfaction prediction. To address this, we construct a multi-task, multi-label Chinese dialogue dataset that supports satisfaction recognition, emotion recognition, and emotion state transition prediction, providing a new resource for studying emotion and satisfaction in dialogue systems.


💡 Research Summary

This paper introduces a large‑scale Chinese multi‑task dialogue benchmark that simultaneously addresses user satisfaction prediction, emotion recognition, and emotion state transition prediction in customer‑service conversations. Recognizing that existing datasets are predominantly English, single‑turn, and lack dynamic emotional annotations, the authors construct a novel corpus comprising 90,000 complete sessions (1,240,327 turns, 1,590,895 user utterances) covering five typical telecom service categories. Each user utterance is annotated with one of seven fine‑grained emotions (Worry, Anger, Insult, Disappointment, Anxiety, Gratitude, No Emotion), nine possible emotion‑state transitions (e.g., Neutral→Negative, Negative→Positive), and a satisfaction label (Satisfied, Dissatisfied, Neutral). Annotation proceeds in three stages with rigorous cross‑verification and senior reviewer oversight; an automated script maps fine‑grained emotions to satisfaction polarity, followed by manual correction.
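The automated mapping from fine-grained emotions to a satisfaction label could look like the sketch below. The mapping table and the priority heuristic (any negative emotion outweighs gratitude, which outweighs neutral) are assumptions for illustration; the paper's actual rules are not given in this summary, and its pipeline additionally applies manual correction.

```python
# Hypothetical mapping from the seven fine-grained emotion labels to a
# satisfaction polarity; the exact rules used by the authors are assumed here.
EMOTION_TO_SATISFACTION = {
    "Worry": "Dissatisfied",
    "Anger": "Dissatisfied",
    "Insult": "Dissatisfied",
    "Disappointment": "Dissatisfied",
    "Anxiety": "Dissatisfied",
    "Gratitude": "Satisfied",
    "No Emotion": "Neutral",
}

def map_emotions_to_satisfaction(emotions):
    """Map a user's per-utterance emotion labels to one satisfaction label
    using a simple priority heuristic: Dissatisfied > Satisfied > Neutral."""
    labels = {EMOTION_TO_SATISFACTION[e] for e in emotions}
    for label in ("Dissatisfied", "Satisfied", "Neutral"):
        if label in labels:
            return label
    return "Neutral"
```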

Statistical analysis reveals a heavily imbalanced distribution: “No Emotion” accounts for 96.3 % of emotion labels, and the majority of transitions are Neutral→Neutral (≈80 %). This mirrors real‑world service calls where users primarily seek factual information rather than express affect. To mitigate imbalance, the authors collapse emotions into three polarity categories (Positive, Neutral, Negative) for downstream modeling.
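Collapsing the seven emotions into three polarities also explains the nine-way transition schema: three polarities yield 3 × 3 = 9 possible transitions between consecutive user turns. A minimal sketch, assuming the grouping implied by the label names (the authors' exact grouping is an assumption):

```python
# Assumed collapse of the seven fine-grained emotions into three polarities.
POLARITY = {
    "Worry": "Negative", "Anger": "Negative", "Insult": "Negative",
    "Disappointment": "Negative", "Anxiety": "Negative",
    "Gratitude": "Positive", "No Emotion": "Neutral",
}

def transitions(emotion_sequence):
    """Derive transition labels (e.g. 'Neutral→Negative') from the
    polarities of consecutive user turns."""
    polarities = [POLARITY[e] for e in emotion_sequence]
    return [f"{a}→{b}" for a, b in zip(polarities, polarities[1:])]
```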

For benchmarking, eight recent large language models (LLMs) – including Baichuan2‑7B, GLM4‑9B, Deepseek, Mistral‑7B, TeleChat2‑7B, Qwen‑7B, LLaMa2‑7B, and LLaMa3‑8B – and two traditional satisfaction classifiers are fine‑tuned in a unified multi‑head architecture. The input concatenates the full dialogue context with the current user turn, enabling the model to jointly predict emotion, transition, and satisfaction. Evaluation uses accuracy, precision, recall, and F1 across all three tasks.
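The input construction and one of the reported metrics can be sketched in plain Python. The `[SEP]` separator, function names, and the macro-averaged F1 helper below are illustrative assumptions, not the paper's actual implementation:

```python
def build_input(context_turns, current_user_turn, sep="[SEP]"):
    """Concatenate the full dialogue context with the current user turn,
    the input format assumed for the unified multi-head setup."""
    return sep.join(list(context_turns) + [current_user_turn])

def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight,
    a common choice under heavy label imbalance."""
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)
```

Macro averaging matters here because "No Emotion" dominates the label distribution; a micro average would be driven almost entirely by the majority class.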

Results demonstrate that incorporating emotion and transition signals yields consistent gains over models trained solely on satisfaction. Average F1 improvements range from 3 to 5 percentage points, with emotion‑transition information contributing the most to satisfaction accuracy. Among LLMs, LLaMa2‑7B and LLaMa3‑8B achieve the highest overall performance, while Baichuan2‑7B and TeleChat2‑7B also perform competitively. The study confirms that dynamic emotional cues are valuable auxiliary features for predicting user satisfaction in multi‑turn dialogues.

The authors discuss limitations, notably the residual label imbalance and the relatively coarse nine‑category transition schema, which may not capture more nuanced affective trajectories. Future work is suggested to integrate multimodal signals (speech prosody, visual cues) and finer‑grained emotion taxonomies, potentially enhancing model robustness and applicability to real‑time customer‑service monitoring systems. By releasing the dataset and benchmark results, the paper provides a foundational resource for Chinese dialogue research, encouraging further exploration of joint emotion‑satisfaction modeling with advanced LLMs.

