LLMs as Span Annotators: A Comparative Study of LLMs and Humans

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Span annotation - annotating specific text features at the span level - can be used to evaluate texts where single-score metrics fail to provide actionable feedback. Until recently, span annotation was done by human annotators or fine-tuned models. In this paper, we study whether large language models (LLMs) can serve as an alternative to human annotators. We compare the abilities of LLMs to those of skilled human annotators on three span annotation tasks: evaluating data-to-text generation, identifying translation errors, and detecting propaganda techniques. We show that overall, LLMs have only moderate inter-annotator agreement (IAA) with human annotators. However, we demonstrate that LLMs make errors at a rate similar to that of skilled crowdworkers, and that they produce annotations at a fraction of the cost. We release a dataset of over 40k model and human span annotations for further research.


💡 Research Summary

This paper investigates whether large language models (LLMs) can serve as reliable span annotators, an alternative to human annotators, across three distinct tasks: evaluating data‑to‑text (D2T) generation, identifying errors in machine translation (MT), and detecting propaganda techniques in news articles. Span annotation differs from traditional single‑score metrics by marking the exact text spans that exhibit a particular property (e.g., an error or a rhetorical device) and assigning them a category and an optional reason. This fine‑grained feedback is more actionable and explainable, but historically required costly human labor.

The authors first formalize the span annotation problem: given a text Y, a set of categories C, annotation guidelines G, and optionally a source X, the output is a set of tuples ⟨start, end, category, reason⟩. They then describe how to automate this process with LLMs. A prompt is constructed that includes the text, the list of categories, and the guidelines, and the model is asked to return a JSON list of annotations. Constrained decoding ensures syntactic validity; for models that do not natively support structured output, the authors strip reasoning-trace markup (such as <think> tags) and extract the last valid JSON object from the response.
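This output-parsing step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation; the <think> tag name and the exact JSON shape are assumptions based on the description above.

```python
import json
import re

def parse_annotations(raw_output: str) -> list[dict]:
    """Extract the last valid JSON list of span annotations from raw
    model output, for models without native structured-output support."""
    # Strip reasoning traces such as <think>...</think> blocks first.
    text = re.sub(r"<think>.*?</think>", "", raw_output, flags=re.DOTALL)
    # Scan for candidate JSON arrays and keep the last one that parses.
    last_valid = None
    for match in re.finditer(r"\[.*?\]", text, flags=re.DOTALL):
        try:
            candidate = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        if isinstance(candidate, list):
            last_valid = candidate
    return last_valid if last_valid is not None else []
```

Each parsed element would then carry the ⟨start, end, category, reason⟩ fields described above.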

To evaluate the quality of the generated spans, the paper adopts several metrics that go beyond simple overlap counts. Pearson correlation of span counts checks whether an annotator systematically over‑ or under‑annotates. Precision, recall, and F1 are computed in both a “hard” version (requiring matching categories) and a “soft” version (ignoring categories). The γ‑score, adapted from Krippendorff’s α, measures the overall disorder of the alignment, rewarding near‑matches even when exact overlap is missing. Finally, an S∅‑score handles cases where one annotator produces no spans, rewarding perfect agreement and penalizing unnecessary annotations.
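The hard/soft distinction can be sketched as below, assuming overlap-based matching over (start, end, category) tuples; the paper's exact matching criterion (e.g., character-level overlap proportions) may differ, so this is an illustrative simplification.

```python
def spans_overlap(a: tuple, b: tuple) -> bool:
    """True if two (start, end, ...) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def span_f1(pred: list, gold: list, hard: bool = True) -> tuple:
    """Overlap-based precision, recall, and F1 between two span sets.

    hard=True additionally requires matching categories ("hard" scores);
    hard=False ignores categories ("soft" scores).
    Spans are (start, end, category) tuples.
    """
    def match(p, g):
        return spans_overlap(p, g) and (not hard or p[2] == g[2])

    # A predicted span counts as correct if it matches any gold span,
    # and a gold span counts as found if any prediction matches it.
    tp_pred = sum(any(match(p, g) for g in gold) for p in pred)
    tp_gold = sum(any(match(p, g) for p in pred) for g in gold)
    precision = tp_pred / len(pred) if pred else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, a predicted span that overlaps a gold span but carries the wrong category scores zero under the hard variant and full credit under the soft one.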

The experimental setup covers three tasks:

  1. D2T‑EVAL – The authors generate 1,200 summaries from structured data (weather, phone specs, soccer match reports) using zero‑shot LLM generation. Human annotations are collected from vetted crowdworkers on Prolific, following a qualification phase and iterative guideline refinement. LLMs (Llama 3.3 70B, DeepSeek‑R1, among others) are tested with various prompting strategies (5‑shot, chain‑of‑thought, with/without guidelines, with/without reasons). The best prompting configuration yields an average of 0.20 annotations per output and an F1 of about 0.62 (hard) on the D2T dev set.

  2. MT‑EVAL – Using the WMT 2024 general shared‑task outputs, the authors sample 2,854 translation segments across nine language pairs and three domains (news, literary, social). Professional translators have already annotated major and minor errors using the ESA protocol. LLMs are prompted to replicate this annotation. Results show an F1 around 0.58 (hard) and a γ‑score indicating moderate alignment with the professional annotations.

  3. PROPAGANDA – The dataset from Da San Martino et al. (2019) contains 914 news articles annotated for 18 propaganda techniques. No source text is needed; the task is purely intrinsic. LLMs achieve an F1 of roughly 0.55 (hard) and a lower γ‑score, reflecting the difficulty of capturing nuanced rhetorical strategies.

Across all tasks, the inter‑annotator agreement (IAA) between LLMs and humans is “moderate” but comparable to the agreement among qualified crowdworkers who passed the qualification test. Importantly, the error rate of LLMs is similar to that of skilled crowdworkers, suggesting that LLMs are not dramatically less reliable. The authors also conduct an error analysis: LLMs often miss complex logical relations, produce imprecise span boundaries, or omit the explanatory “reason” field, especially when the prompt does not explicitly request it. Prompt engineering proves crucial; adding few‑shot examples or a chain‑of‑thought component consistently improves both the number of annotated spans and their correctness.

Cost analysis reveals that LLM annotation costs are a fraction of human costs (approximately $0.02 per annotated output versus typical crowdworker rates), making LLMs attractive for large‑scale annotation campaigns. The paper concludes by releasing a dataset of over 40k span annotations (both human and model) along with reasoning traces, providing a valuable benchmark for future research.
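To illustrate how such a per-output figure arises from token-level API pricing, here is a small sketch; the token counts and per-million-token prices in the example are hypothetical placeholders, not figures from the paper.

```python
def cost_per_output(prompt_tokens: int, completion_tokens: int,
                    price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost in dollars of annotating one output, given token counts and
    API prices quoted per million input/output tokens."""
    return (prompt_tokens * price_in_per_m
            + completion_tokens * price_out_per_m) / 1_000_000

# Hypothetical example: a 2,000-token prompt (text + guidelines) and a
# 300-token JSON annotation, at $0.50/M input and $1.50/M output tokens.
cost = cost_per_output(2000, 300, 0.50, 1.50)  # $0.00145 per output
```

Even with generous token budgets, such per-output costs sit well below typical per-item crowdworker payments, which is the comparison the paper's cost analysis draws.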

Key contributions include:

  • Demonstrating that LLMs, when guided by well‑structured prompts and detailed guidelines, can produce useful span annotations across diverse tasks.
  • Quantifying LLM‑human agreement and showing it matches that of vetted crowdworkers.
  • Providing a thorough error analysis that highlights current limitations and points to future improvements (e.g., better reasoning prompts, fine‑tuning for span detection).
  • Releasing a large, publicly available span‑annotation corpus to foster further work in automated, fine‑grained text evaluation.

Limitations noted are the difficulty LLMs have with overlapping or multi‑label spans, occasional inconsistency in generating reasons, and the reliance on English‑centric prompts and datasets. Future directions suggested include exploring chain‑of‑thought prompting, fine‑tuning on span‑annotation data, extending to multilingual settings, and integrating LLMs into hybrid pipelines where human oversight corrects the most ambiguous cases.

