CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts

HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts. Building on the HIPE-2020 and HIPE-2022 campaigns, it extends the series toward semantic relation extraction by targeting the identification of person–place associations across multiple languages and time periods. Systems must classify relations of two types, at ("Has the person ever been at this place?") and isAt ("Is the person located at this place around publication time?"), which requires reasoning over temporal and geographical cues. The lab introduces a three-fold evaluation profile that jointly assesses accuracy, computational efficiency, and domain generalization. By linking relation extraction to large-scale historical data processing, HIPE-2026 aims to support downstream applications in knowledge-graph construction, historical biography reconstruction, and spatial analysis in digital humanities.


💡 Research Summary

The paper presents HIPE‑2026, the latest CLEF shared task in the Identifying Historical People, Places and other Entities (HIPE) series, which moves beyond the earlier focus on multilingual named‑entity recognition and linking to the more demanding problem of person‑place relation extraction from noisy historical documents. The task requires systems to classify each person–place pair mentioned in a document into two relation types: at (the person was present at the place at any time before the document's publication) and isAt (the person was present at the place in the immediate temporal vicinity of the publication date). The at relation is further divided into three labels—true (explicit evidence), probable (inferred from indirect cues), and false (no evidence)—while isAt is a binary (+/−) decision that refines the at judgment by imposing a tighter temporal window.

Two test collections are provided. Test Set A comprises multilingual newspaper articles (French, German, English, Luxembourgish) spanning roughly 200 years (19th–20th centuries) drawn from the HIPE‑2022 corpus. Surprise Test Set B contains French literary texts from the 16th–18th centuries and evaluates only the at relation, thereby probing domain generalization. A pilot annotation effort on 119 pairs showed moderate to high inter‑annotator agreement (Cohen’s κ 0.7–0.9 for at, 0.4–0.9 for isAt). Preliminary experiments with GPT‑4o achieved up to 0.8 agreement on at but much lower and more variable performance on isAt (0.2–0.7), highlighting the difficulty of fine‑grained temporal reasoning.
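Agreement figures of the kind reported above are typically computed with Cohen's κ, which corrects raw agreement for chance. The sketch below implements the standard formula; the annotation lists are invented for illustration and are not the HIPE pilot data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # observed agreement: fraction of items both annotators labeled identically
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected agreement under independence, from each annotator's label frequencies
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[l] * cb[l] for l in set(labels_a) | set(labels_b)) / n ** 2
    return (observed - expected) / (1 - expected)

# toy annotations over the three at-labels (illustrative only)
a = ["true", "true", "false", "probable", "true", "false"]
b = ["true", "probable", "false", "probable", "true", "false"]
print(cohens_kappa(a, b))  # → 0.75
```

Values in the 0.7–0.9 band, as observed for at, are conventionally read as substantial to near-perfect agreement, while the 0.4–0.7 range seen for isAt signals only moderate reliability.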

Evaluation is organized into three complementary profiles:

  1. Accuracy Profile – uses macro‑averaged recall across all labels, ensuring that rare labels (e.g., probable, false) receive equal weight despite class imbalance. For Test Set A, macro‑recall is computed separately for at and isAt and then averaged for the final ranking.
  2. Accuracy‑Efficiency Profile – combines the macro‑recall score with measures of model size, inference time, and memory consumption to produce a balanced score. This encourages participants to explore lightweight models, efficient prompting, or task‑specific classifiers rather than relying solely on massive LLMs.
  3. Generalization Profile – evaluates macro‑recall on the Surprise Set B, testing how well systems trained on newspaper data transfer to literary texts and to a different temporal domain.

The task design forces participants to handle several technical challenges. First, the need for temporal reasoning means that models must go beyond surface‑level co‑occurrence and interpret indirect cues (e.g., titles, event participation) to assign probable labels. The authors explicitly connect this to Hobbs’ abductive reasoning framework, encouraging systems that can generate minimal explanatory assumptions. Second, historical OCR output introduces orthographic variation, spelling errors, and layout artefacts, requiring robust preprocessing (language‑specific spell correction, OCR error modeling). Third, the quadratic growth of candidate pairs (O(N²) per document) demands efficient candidate filtering—such as distance‑based heuristics, pre‑scoring with smaller language models, or batch inference—to keep computational costs tractable. Fourth, multilinguality across four languages calls for cross‑lingual transfer techniques, multilingual pretrained encoders (e.g., XLM‑R, mBERT), or language‑adaptive fine‑tuning.
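The distance-based candidate filtering mentioned above can be sketched as a simple character-offset heuristic; the mention format, the threshold, and the example mentions are assumptions for illustration, not part of the task specification:

```python
def candidate_pairs(persons, places, max_dist=200):
    """Distance-based pruning: keep only person-place pairs whose mentions
    lie within max_dist characters of each other, instead of scoring the
    full O(N^2) cross-product. Mentions are (surface, char_offset) tuples;
    the 200-character window is an illustrative default."""
    pairs = []
    for p_name, p_off in persons:
        for l_name, l_off in places:
            if abs(p_off - l_off) <= max_dist:
                pairs.append((p_name, l_name))
    return pairs

# hypothetical mentions extracted from one article
persons = [("Victor Hugo", 10), ("Jules Verne", 950)]
places = [("Paris", 60), ("Nantes", 1000)]
print(candidate_pairs(persons, places))
# → [('Victor Hugo', 'Paris'), ('Jules Verne', 'Nantes')]
```

Only the surviving pairs would then be passed to the (expensive) relation classifier; a pre-scoring pass with a small language model could prune further before any large-model inference.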

The authors also provide open‑source resources: the annotated datasets, baseline implementations, and scoring scripts are released under a CC‑BY 4.0 license on GitHub and Zenodo, facilitating reproducibility and community extensions. By integrating accuracy, efficiency, and domain robustness into a single benchmark, HIPE‑2026 aligns with recent sustainability trends in NLP (e.g., SustaiNLP, EfficientQA) and offers a realistic testbed for applications such as historical knowledge‑graph construction, biographical trajectory reconstruction, and spatio‑temporal analysis in the digital humanities.

In conclusion, HIPE‑2026 pushes the frontier of relation extraction toward historically grounded, multilingual, and temporally aware settings. It invites the development of models that can reason abductively, operate efficiently on large noisy corpora, and generalize across genres and centuries—capabilities that are essential for the next generation of digital‑humanities tools. Future work is likely to focus on (1) more sophisticated candidate‑pair pruning and batch processing pipelines, (2) stronger multilingual and noise‑robust representations, and (3) explicit abductive or NLI‑based verification modules to improve probable label detection.

