DIAL-SUMMER: A Structured Evaluation Framework of Hierarchical Errors in Dialogue Summaries
Dialogues are a predominant mode of human communication, and automatically generated summaries of them are immensely helpful (e.g., to revisit key points discussed in a meeting, or to review conversations between customer agents and product users). Prior work on dialogue summary evaluation largely ignores the complexities specific to this task: (i) a shift in structure, from multiple speakers discussing information in a scattered fashion across several turns to a summary’s sentences, and (ii) a shift in narration viewpoint, from the speakers’ first/second-person narration to standardized third-person narration in the summary. In this work, we introduce our framework DIAL-SUMMER to address the above. We propose DIAL-SUMMER’s taxonomy of errors to comprehensively evaluate dialogue summaries at two hierarchical levels: DIALOGUE-LEVEL, which focuses on the broader speakers/turns, and WITHIN-TURN-LEVEL, which focuses on the information discussed inside a turn. We then present DIAL-SUMMER’s dataset, composed of dialogue summaries manually annotated with our taxonomy’s fine-grained errors. We conduct empirical analyses of these annotated errors and observe interesting trends (e.g., turns occurring in the middle of the dialogue are the most frequently missed in the summary, while extrinsic hallucinations largely occur at the end of the summary). We also conduct experiments on LLM-Judges’ capability at detecting these errors, through which we demonstrate the challenging nature of our dataset, the robustness of our taxonomy, and the need for future work to enhance LLMs’ performance on this task. Code and inference dataset coming soon.
💡 Research Summary
The paper introduces DIAL‑SUMMER, a structured evaluation framework specifically designed for dialogue summarization, a task that poses unique challenges compared to traditional single‑document summarization. Two fundamental issues are highlighted: (i) the structural shift from a multi‑speaker, multi‑turn dialogue to a compact set of summary sentences, and (ii) the narration‑viewpoint shift from first/second‑person utterances in the dialogue to third‑person narration in the summary. To address these, the authors propose a hierarchical error taxonomy that operates at two levels.
At the Dialogue‑Level, six novel macro‑errors capture problems that affect the overall integrity of the conversation representation: Wrong Turn Sequence, Missed Turn, Speaker Misattribution, Speaker Identity Bias, Viewpoint Distortion, and Extrinsic Conversation (insertion of content not present in the source). At the Within‑Turn Level, five micro‑errors (adapted from prior work) focus on the semantic fidelity of individual turns: Wrong Linking, Changed Meaning, Extrinsic Context, Missed Conversation, and a refined version of hallucination categories. Each error type is precisely defined with illustrative examples, ensuring mutual exclusivity and reducing the ambiguity that plagued earlier taxonomies (e.g., the “contradiction” label in Tang et al., 2024).
To validate the taxonomy, the authors construct an inference dataset of 192 multi‑turn dialogues drawn from the Anthropic‑Test collection, each paired with a human‑written summary. Three trained annotators label every summary with the fine‑grained error types, the exact sentences where they occur, and free‑form rationales. The resulting dataset contains 1,842 annotated error instances, providing a rich resource for both error analysis and model evaluation.
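The annotation scheme described above (error type, affected sentences, rationale) can be pictured as a simple record per error instance. The following is a minimal sketch; the field names are illustrative assumptions, not the released dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ErrorAnnotation:
    """One annotated error instance (hypothetical layout)."""
    dialogue_id: str
    annotator_id: str
    error_type: str               # a fine-grained category from the taxonomy
    level: str                    # "dialogue" or "within-turn"
    summary_sentences: list[int]  # indices of summary sentences exhibiting the error
    rationale: str                # annotator's free-form explanation

# Example record: a Missed Turn error flagged on summary sentence 1
ann = ErrorAnnotation("d001", "a2", "Missed Turn", "dialogue", [1],
                      "The summary omits the turn where the agent confirms the refund.")
```

A flat record like this supports both the positional analyses and the LLM-Judge comparisons discussed below, since each error carries its location alongside its category.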
Statistical analysis reveals systematic patterns: turns located in the middle of dialogues are most frequently omitted, while extrinsic hallucinations tend to appear toward the end of summaries. These trends suggest that current summarization models capture opening information well but struggle to maintain a coherent representation of the entire dialogue flow.
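The positional trends above (middle turns missed, hallucinations clustering at the end) come from bucketing each error by its relative position. A minimal sketch of such an analysis, assuming each error is recorded as an (index, total-units) pair:

```python
from collections import Counter

def positional_histogram(errors, n_bins=3):
    """Bucket errors into begin/middle/end by relative position.

    `errors` is a list of (position_index, total_units) pairs, e.g. the index
    of a missed turn paired with the dialogue's turn count, or the index of a
    hallucinated sentence paired with the summary's sentence count.
    """
    labels = ["begin", "middle", "end"]
    counts = Counter()
    for idx, total in errors:
        rel = idx / max(total - 1, 1)               # relative position in [0, 1]
        counts[labels[min(int(rel * n_bins), n_bins - 1)]] += 1
    return dict(counts)

# Hypothetical missed-turn records: (turn index, number of turns in dialogue)
missed = [(4, 10), (5, 10), (6, 12), (0, 8), (11, 12)]
print(positional_histogram(missed))  # → {'middle': 3, 'begin': 1, 'end': 1}
```

Normalizing by dialogue (or summary) length is what makes positions comparable across conversations of different sizes.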
The paper further investigates whether large language models (LLMs) can serve as “LLM‑Judges” for automatic error detection. Using few‑shot prompts, models such as GPT‑4, Claude, and Llama‑2 are tasked with identifying the taxonomy’s error categories in the annotated summaries. Overall accuracy hovers around 60–70%, with structural errors (Wrong Turn Sequence, Missed Turn) detected at ≈80%, while more subtle narration‑related errors (Viewpoint Distortion, Speaker Identity Bias) fall below 40%. Notably, explicitly providing the taxonomy in the prompt improves performance by 5–7 percentage points, demonstrating the practical utility of a well‑defined error schema for guiding LLM evaluators.
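The taxonomy-in-prompt condition can be sketched as a prompt builder that optionally injects the category lists before the dialogue and summary. The category names follow the paper; the prompt wording and function interface are illustrative assumptions, not the authors' actual prompts.

```python
DIALOGUE_LEVEL = [
    "Wrong Turn Sequence", "Missed Turn", "Speaker Misattribution",
    "Speaker Identity Bias", "Viewpoint Distortion", "Extrinsic Conversation",
]
WITHIN_TURN_LEVEL = [
    "Wrong Linking", "Changed Meaning", "Extrinsic Context", "Missed Conversation",
]

def build_judge_prompt(dialogue: str, summary: str, with_taxonomy: bool = True) -> str:
    """Assemble an LLM-Judge prompt, optionally including the error schema."""
    parts = ["You are evaluating a dialogue summary for errors."]
    if with_taxonomy:  # the condition reported to add 5-7 accuracy points
        parts.append("Dialogue-level errors: " + ", ".join(DIALOGUE_LEVEL))
        parts.append("Within-turn errors: " + ", ".join(WITHIN_TURN_LEVEL))
    parts.append("Dialogue:\n" + dialogue)
    parts.append("Summary:\n" + summary)
    parts.append("List each error as: <category> | <summary sentence> | <rationale>")
    return "\n\n".join(parts)
```

Comparing model outputs under `with_taxonomy=True` versus `False` on the same annotated summaries is one way to isolate the contribution of the schema itself.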
Contributions of the work are fourfold: (1) a novel hierarchical error taxonomy tailored to dialogue summarization, (2) a publicly released, human‑annotated dataset of dialogue‑summary pairs with fine‑grained error labels, (3) comprehensive empirical analyses of error distributions and their positional tendencies, and (4) an evaluation of LLM‑Judge capabilities, highlighting both their current limitations and the benefits of taxonomy‑aware prompting.
Limitations include the modest size and domain concentration of the dataset (primarily customer‑agent and casual conversations), and the strictness of the “Missed Conversation” definition, which may lead to inter‑annotator variance. Moreover, the LLM‑Judge experiments reveal that current models are not yet reliable enough for high‑stakes applications, especially for nuanced viewpoint errors.
Future research directions suggested by the authors encompass scaling the dataset across diverse domains and languages, developing semi‑supervised methods for efficient annotation, integrating error detection with automatic error‑repair models, and designing hybrid human‑LLM evaluation pipelines. By providing a clear, extensible taxonomy and a benchmark dataset, DIAL‑SUMMER lays the groundwork for more precise, interpretable, and ultimately higher‑quality dialogue summarization systems.