GhazalBench: Usage-Grounded Evaluation of LLMs on Persian Ghazals
Persian poetry plays an active role in Iranian cultural practice, where verses by canonical poets such as Hafez are frequently quoted, paraphrased, or completed from partial cues. Supporting such interactions requires language models to engage not only with poetic meaning but also with culturally entrenched surface form. We introduce GhazalBench, a benchmark for evaluating how large language models (LLMs) interact with Persian ghazals under usage-grounded conditions. GhazalBench assesses two complementary abilities: producing faithful prose paraphrases of couplets and accessing canonical verses under varying semantic and formal cues. Across several proprietary and open-weight multilingual LLMs, we observe a consistent dissociation: models generally capture poetic meaning but struggle with exact verse recall in completion-based settings, while recognition-based tasks substantially reduce this gap. A parallel evaluation on English sonnets shows markedly higher recall performance, suggesting that these limitations are tied to differences in training exposure rather than inherent architectural constraints. Our findings highlight the need for evaluation frameworks that jointly assess meaning, form, and cue-dependent access to culturally significant texts. GhazalBench is available at https://github.com/kalhorghazal/GhazalBench.
💡 Research Summary
GhazalBench is a newly introduced benchmark that evaluates large language models (LLMs) on their ability to handle Persian ghazals in usage‑grounded scenarios. Persian poetry, especially the verses of canonical poets such as Hafez, is frequently quoted, paraphrased, or completed from partial cues in everyday Iranian cultural practice. Consequently, it is not enough for a model merely to understand the semantic content of a couplet; it must also be able to retrieve or generate the exact canonical form when prompted.
The benchmark comprises two complementary tasks. The first, “Poem‑to‑Prose Understanding,” asks the model to rewrite a two‑line ghazal couplet into fluent prose while preserving meaning, poetic tone, and stylistic nuances. Automatic metrics (BLEU‑4, ROUGE‑L, METEOR) and human judgments (semantic fidelity, stylistic preservation, readability) are used for evaluation. The second, “Canonical Verse Retrieval,” provides partial cues—semantic keywords, the first half of the couplet, rhythmic pattern, initial‑letter order, or a shuffled version—and requires the model either to complete the exact verse (completion mode) or to select the correct verse from multiple choices (recognition mode). Completion tests both memory and generation, whereas recognition isolates the model’s ability to discriminate among candidates.
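To make the contrast between the two retrieval modes concrete, the sketch below builds a completion prompt and a multiple‑choice recognition prompt from the same cue. This is an illustrative reconstruction, not the benchmark's actual code: the prompt wording, the function names (`completion_prompt`, `recognition_prompt`), and the option layout are all assumptions.

```python
# Hypothetical sketch of the completion vs. recognition modes described above.
# Prompt wording and option formatting are illustrative assumptions.
import random

def completion_prompt(first_hemistich: str) -> str:
    """Completion mode: ask for the exact canonical continuation."""
    return f"Complete this Hafez couplet exactly:\n{first_hemistich} ..."

def recognition_prompt(first_hemistich: str, answer: str,
                       distractors: list[str], seed: int = 0) -> tuple[str, int]:
    """Recognition mode: present the correct verse among shuffled distractors.

    Returns the prompt text and the index of the gold option.
    """
    options = distractors + [answer]
    random.Random(seed).shuffle(options)  # fixed seed keeps instances reproducible
    gold = options.index(answer)
    lines = [f"{i}) {opt}" for i, opt in enumerate(options)]
    return (f"Which option completes the couplet?\n{first_hemistich}\n"
            + "\n".join(lines), gold)
```

Keeping a gold index alongside the shuffled options lets the same instance be scored deterministically, while the seed argument makes the shuffle reproducible across evaluation runs.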
To construct the benchmark, the authors digitized the “Divan of Hafez” and extracted 1,200 ghazal couplets. For each couplet, five cue types were generated, yielding a total of 6,000 test instances. Human annotators produced multiple reference prose paraphrases and labeled retrieval answers as “exact,” “similar,” or “incorrect.”
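The 1,200 × 5 = 6,000 expansion described above could be organized as below. The cue‑extraction rules here (keyword picking, hemistich splitting, length‑based rhythm pattern) are crude placeholders of my own, not the authors' actual procedure; only the overall shape of one instance per (couplet, cue type) pair follows the text.

```python
# Illustrative expansion of one couplet into the five cue types named above.
# All extraction heuristics are placeholder assumptions.
import random

CUE_TYPES = ["semantic_keywords", "first_hemistich", "rhythmic_pattern",
             "initial_letters", "shuffled_words"]

def make_instances(couplet: str, couplet_id: int) -> list[dict]:
    words = couplet.split()
    cues = {
        # Placeholder keyword picker; a real pipeline would use content words.
        "semantic_keywords": " ".join(sorted(set(words))[:3]),
        "first_hemistich": " ".join(words[: len(words) // 2]),
        # Crude stand-in for a metrical pattern: one 'x' per letter.
        "rhythmic_pattern": "-".join("x" * len(w) for w in words),
        "initial_letters": " ".join(w[0] for w in words),
        "shuffled_words": " ".join(
            random.Random(couplet_id).sample(words, len(words))),
    }
    return [{"id": couplet_id, "cue_type": t, "cue": cues[t], "target": couplet}
            for t in CUE_TYPES]
```

With 1,200 couplets, calling `make_instances` over the corpus yields the reported 6,000 instances, each pairing one cue with the canonical target verse.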
Eight LLMs were evaluated, including proprietary models (GPT‑4, Gemini‑1.5‑Pro) and open‑weight multilingual models (LLaMA‑2‑13B, Mistral‑7B, etc.). In the prose‑paraphrase task, all models achieved high scores (average BLEU‑4 ≈ 0.68, ROUGE‑L ≈ 0.71), indicating strong semantic understanding and decent stylistic preservation. However, in the verse‑completion task, exact‑match accuracy was low (average ≈ 0.32). Performance varied by cue type: semantic‑keyword cues yielded ≈ 0.38 accuracy, while form‑only cues such as first‑letter order dropped below 0.20. This suggests that models rely heavily on semantic inference and struggle with the strict formal constraints of Persian meter and rhyme.
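Exact‑match scoring of the kind reported above is sensitive to Persian orthographic variation (Arabic vs. Persian letter forms, zero‑width non‑joiners). The sketch below shows a minimal scorer with light normalization; the paper's actual matching rules may well differ, and the normalization table here is an assumption.

```python
# Minimal exact-match scorer with light Persian normalization (assumed,
# not the benchmark's actual matching rules).
def normalize(text: str) -> str:
    # Unify Arabic ya/kaf with their Persian forms; treat ZWNJ as a space.
    table = str.maketrans({"ي": "ی", "ك": "ک", "\u200c": " "})
    return " ".join(text.translate(table).split())

def exact_match_accuracy(preds: list[str], golds: list[str]) -> float:
    hits = sum(normalize(p) == normalize(g) for p, g in zip(preds, golds))
    return hits / len(golds) if golds else 0.0
```

Without such normalization, a model reproducing the verse with an Arabic‑form kaf or a missing ZWNJ would be scored as a miss, conflating orthographic noise with genuine recall failure.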
When the task was reframed as a multiple‑choice recognition problem, accuracy rose dramatically to ≈ 0.71, with the best models reaching 0.78 on semantic cues. The authors interpret this as evidence that LLMs are better at discriminating among provided options than at generating the exact canonical text from scratch.
A parallel evaluation on English sonnets using the same framework produced markedly higher scores (completion ≈ 0.58, recognition ≈ 0.84). The authors attribute the gap to differences in training data exposure: English poetry appears far more frequently in the massive multilingual corpora used to pre‑train these models, whereas Persian ghazals are under‑represented.
Further analyses examined the impact of model size, prompt length, and context window. Models with >13 B parameters showed modest gains in paraphrase quality but did not close the retrieval gap, confirming that sheer scale is insufficient without richer Persian poetic data. Extending the prompt history to three preceding lines improved retrieval accuracy by about 7 percentage points, indicating that broader context helps but does not fully resolve the issue. Error analysis revealed that most failures involved incorrect meter, misplaced diacritics, or confusion over homographs—typical challenges for Persian orthography and prosody.
The paper concludes that GhazalBench exposes a clear dissociation: LLMs can capture poetic meaning yet falter at exact verse recall when the task demands strict adherence to cultural form. This dissociation underscores the need for evaluation frameworks that jointly assess meaning, form, and cue‑dependent access to culturally significant texts. The authors propose three research directions: (1) augmenting Persian poetic corpora with annotated meter and rhyme information; (2) designing composite prompts that combine semantic and formal cues, possibly using chain‑of‑thought reasoning; and (3) developing memory‑enhanced or meta‑learning architectures that can store and retrieve canonical verses more reliably.
By making the benchmark and its code publicly available, GhazalBench aims to stimulate further work on culturally grounded language understanding, encouraging the community to move beyond generic multilingual benchmarks toward tasks that respect the intricate interplay of semantics and form inherent in world literary traditions.