Dutch Metaphor Extraction from Cancer Patients' Interviews and Forum Data using LLMs and Human in the Loop
Metaphors and metaphorical language (MLs) play an important role in healthcare communication between clinicians, patients, and patients’ family members. In this work, we focus on Dutch-language data from cancer patients. We extract metaphors used by patients from two data sources: (1) cancer patient storytelling interview data and (2) online forum data, including patients’ posts, comments, and questions to professionals. We investigate how current state-of-the-art large language models (LLMs) perform on this task by exploring different prompting strategies such as chain-of-thought reasoning, few-shot learning, and self-prompting. With a human-in-the-loop setup, we verify the extracted metaphors and compile the outputs into a corpus named HealthQuote.NL. We believe the extracted metaphors can support better patient care, such as shared decision-making, improved communication between patients and clinicians, and enhanced patient health literacy. They can also inform the design of personalized care pathways. We share prompts and related resources at https://github.com/4dpicture/HealthQuote.NL
💡 Research Summary
The paper presents the first systematic study of automated Dutch‑language metaphor extraction from cancer patient narratives, combining large language models (LLMs) with a human‑in‑the‑loop (HITL) validation pipeline. Two distinct corpora are used: (1) transcribed interviews with 13 cancer patients, their significant others, and interviewers (total 13 documents, up to 13,777 words each) and (2) a large collection of Dutch‑language forum posts from kanker.nl, comprising 15,653 blog entries, 17,290 comments, 2,246 group discussions, 5,777 “ask a professional” questions, and 10,134 reactions across breast, prostate, and melanoma cancer types.
The authors explore a suite of open‑source LLMs of varying sizes and domains (Qwen‑3 8B, Gemma‑3 12B/27B, Llama‑3.1 8B, Mistral 7B, DeepSeek‑R1 8B, Meditron 7B, MedLlama2 7B) and evaluate three prompting strategies: a simple Instruction Prompt (IP), Refined Prompt version 1 (RP‑v1), and Refined Prompt version 2 (RP‑v2). RP‑v1 incorporates an “expert linguist” persona, few‑shot examples, chain‑of‑thought (CoT) reasoning, an external automatic verification checklist (requiring the model to locate the exact sentence, speaker role, and section), and a structured output schema that classifies each metaphor by form (word/phrase/sentence), source domain (e.g., violence, journey, nature), and communicative function (e.g., explanation, coping, empowerment). RP‑v2 adds the full English Metaphor Menu (17 categories) to the prompt to test whether an explicit taxonomy improves extraction.
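To make the RP‑v1 design concrete, the sketch below assembles a structured prompt from the components the summary lists (persona, few‑shot examples, CoT instruction, output schema). The persona wording, the example metaphor, and the exact schema keys are illustrative assumptions, not the authors' released prompts.

```python
# Illustrative sketch of an RP-v1-style structured prompt. The persona text,
# few-shot example, and JSON schema keys are placeholders, not the paper's
# exact prompt wording.
import json

FEW_SHOT = [
    {
        "text": "De chemo was een lange reis door een donkere tunnel.",
        "metaphor": "een lange reis door een donkere tunnel",
        "form": "phrase",
        "source_domain": "journey",
        "function": "coping",
    },
]

def build_rp_v1_prompt(document: str) -> str:
    """Assemble persona + few-shot examples + CoT instruction + output schema."""
    parts = [
        "You are an expert linguist specializing in Dutch metaphor analysis.",
        "Examples of annotated metaphors:",
        json.dumps(FEW_SHOT, ensure_ascii=False, indent=2),
        "Think step by step: first locate the exact sentence, then identify "
        "the speaker role and section, then decide whether the expression "
        "maps one conceptual domain onto another.",
        "Return a JSON list where each item has the keys: metaphor, sentence, "
        "speaker_role, form (word/phrase/sentence), source_domain, function.",
        "Text to analyse:",
        document,
    ]
    return "\n\n".join(parts)
```

The same builder could be reused for RP‑v2 by appending a taxonomy section to `parts`.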
Automatic verification is performed by a separate script that checks whether the model’s output can be traced back to the original text. Candidates that pass this check are then reviewed by three native‑Dutch PhD‑level annotators with expertise in computational linguistics and health communication. Annotators assess each candidate on three criteria: (1) Faithfulness – the metaphor must be explicitly present in the source text, (2) Metaphoricity – the expression must involve a genuine cross‑domain conceptual mapping rather than a literal term, idiom, or conventional phrase, and (3) Contextual Appropriateness – the metaphor must convey the intended meaning within its original context. Disagreements are resolved through discussion to reach consensus.
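The verification script itself is not described in detail; a minimal version, assuming it simply checks that both the extracted metaphor and its reported sentence occur verbatim in the source text (after whitespace and case normalization), might look like this:

```python
# Minimal sketch of an automatic verification check, assuming it tests
# whether the model's output can be traced back verbatim to the source text.
# The normalization choices here are assumptions, not the authors' script.
import re

def normalize(s: str) -> str:
    """Lowercase and collapse whitespace so minor formatting differences
    do not cause false rejections."""
    return re.sub(r"\s+", " ", s.lower()).strip()

def passes_verification(candidate: str, sentence: str, source_text: str) -> bool:
    """A candidate passes only if both the extracted metaphor and the
    sentence it reportedly came from appear in the source document."""
    src = normalize(source_text)
    return normalize(candidate) in src and normalize(sentence) in src
```

Candidates that fail this check (e.g., hallucinated metaphors) are filtered out before human review.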
Quantitative results show that prompting strategy dramatically affects precision. The simple Instruction Prompt generated 72 candidate metaphors, of which 41 were validated (56.9 % precision). RP‑v1 generated 38 candidates with 24 validated (63.2 % precision), the highest among all settings, demonstrating that structured prompting, CoT reasoning, and explicit extraction constraints reduce hallucinations and idiom confusion. RP‑v2, despite producing many more candidates (174), yielded only 24 validated metaphors (13.8 % precision), indicating that providing the full English Metaphor Menu introduced substantial noise and over‑interpretation, likely due to cross‑lingual bias.
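The reported precision figures follow directly from the candidate and validation counts above:

```python
# Precision = validated metaphors / candidate metaphors, per prompting strategy,
# using the counts reported in the summary.
def precision(validated: int, candidates: int) -> float:
    return 100.0 * validated / candidates

results = {
    "IP":    (41, 72),   # 56.9 %
    "RP-v1": (24, 38),   # 63.2 %
    "RP-v2": (24, 174),  # 13.8 %
}
for name, (ok, total) in results.items():
    print(f"{name}: {precision(ok, total):.1f} % precision")
```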
Error analysis identifies three dominant failure modes: (i) Hallucination – the model invents metaphors not present in the source; (ii) Idiom Confusion – Dutch idiomatic expressions are mistakenly labeled as metaphors; (iii) Abstraction – the model extracts overly generic descriptions rather than the concrete metaphorical phrasing. RP‑v1 mitigates these errors, whereas RP‑v2’s broader knowledge injection exacerbates them.
The final curated dataset, HealthQuote.NL, contains 130 validated Dutch metaphors drawn from both interview and forum sources. Each entry includes the original Dutch metaphor, its English translation, the full sentence context, speaker role (patient, significant other, or interviewer), source domain classification, and functional label.
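A dataset entry could be modeled as a simple record type; the field names below are assumptions derived from the fields the summary lists, not the released schema:

```python
# Hypothetical record type for one HealthQuote.NL entry; field names are
# assumptions based on the fields described in the summary, not the
# dataset's actual schema.
from dataclasses import dataclass

@dataclass
class HealthQuoteEntry:
    metaphor_nl: str    # original Dutch metaphor
    metaphor_en: str    # English translation
    sentence: str       # full sentence context
    speaker_role: str   # "patient", "significant other", or "interviewer"
    source_domain: str  # e.g. "journey", "violence", "nature"
    function: str       # e.g. "explanation", "coping", "empowerment"
```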
The authors acknowledge several limitations: the interview corpus is small (13 documents), limiting statistical generalizability; only open‑source LLMs were evaluated, so performance relative to state‑of‑the‑art commercial models (e.g., GPT‑4, Claude) remains unknown; human validation is resource‑intensive, raising scalability concerns; and the functional taxonomy of metaphors is relatively coarse, lacking detailed guidance for clinical implementation.
Future work is proposed along four axes: (1) expanding the corpus to include more patients, cancer types, and multilingual data; (2) benchmarking against commercial LLMs and exploring fine‑tuning on domain‑specific corpora; (3) developing semi‑automated or crowdsourced validation pipelines to reduce annotation cost; and (4) building downstream applications such as metaphor‑aware decision‑support tools, patient‑education materials, or personalized communication aids that leverage the extracted metaphors to improve shared decision‑making and health literacy.
In summary, this study demonstrates that carefully engineered prompts combined with a robust HITL framework can harness open‑source LLMs to extract clinically relevant metaphors from Dutch cancer patient narratives with respectable precision (63.2 %). The publicly released prompts, code, and the HealthQuote.NL dataset provide a valuable resource for researchers in computational linguistics, health communication, and AI‑driven patient‑centered care, and set a methodological foundation for extending metaphor extraction to other languages and medical domains.