Generation-Time vs. Post-hoc Citation: A Holistic Evaluation of LLM Attribution

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Trustworthy Large Language Models (LLMs) must cite human-verifiable sources in high-stakes domains such as healthcare, law, academia, and finance, where even small errors can have severe consequences. Practitioners and researchers face a choice: let models generate citations during decoding, or let models draft answers first and then attach appropriate citations. To clarify this choice, we introduce two paradigms: Generation-Time Citation (G-Cite), which produces the answer and citations in one pass, and Post-hoc Citation (P-Cite), which adds or verifies citations after drafting. We conduct a comprehensive evaluation of methods ranging from zero-shot prompting to advanced retrieval-augmented approaches across four popular attribution datasets and provide evidence-based recommendations that weigh trade-offs across use cases. Our results show a consistent trade-off between coverage and citation correctness, with retrieval as the main driver of attribution quality in both paradigms. P-Cite methods achieve high coverage with competitive correctness and moderate latency, whereas G-Cite methods prioritize precision at the cost of coverage and speed. We recommend a retrieval-centric, P-Cite-first approach for high-stakes applications, reserving G-Cite for precision-critical settings such as strict claim verification. Our code and human evaluation results are available at https://anonymous.4open.science/r/Citation_Paradigms-BBB5/


💡 Research Summary

This paper investigates how large language models (LLMs) should provide verifiable citations in high‑stakes domains such as healthcare, law, finance, and academia. The authors distinguish two fundamental paradigms: Generation‑Time Citation (G‑Cite), where the model inserts citation markers while generating text in a single pass, and Post‑hoc Citation (P‑Cite), where the model first drafts an answer and then adds or verifies citations in a separate step. To systematically compare these paradigms, the study adapts four widely used attribution datasets—ALCE, LongCite, REASONS, and FEVER—so that each can be evaluated under both paradigms. The datasets cover short‑ and long‑context QA, sentence‑, document‑, and claim‑level citation granularity, and span open‑domain, scientific, and fact‑verification tasks.
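To make the distinction concrete, the sketch below outlines the two pipelines in Python. The generate helper, prompt wording, and citation-marker format are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of the two citation paradigms. `generate` stands in for any
# instruction-tuned LLM call (API or local wrapper); the prompts and the [k]
# citation markers are illustrative, not the paper's exact setup.

def generate(prompt: str) -> str:
    """Placeholder for a single LLM call (e.g., LLaMA-3.1-8B-Instruct)."""
    raise NotImplementedError

def g_cite(question: str, sources: list[str]) -> str:
    """Generation-Time Citation: answer and citations produced in one pass."""
    numbered = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    prompt = (
        "Answer the question and cite sources inline as [k] while you write.\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)  # the answer already contains [k] markers

def p_cite(question: str, sources: list[str]) -> str:
    """Post-hoc Citation: draft first, then attach or verify citations per claim."""
    draft = generate(f"Question: {question}\nAnswer concisely:")
    numbered = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    prompt = (
        "Attach a citation [k] to every claim in the draft, or flag it as unsupported.\n"
        f"Sources:\n{numbered}\n\nDraft:\n{draft}\n\nCited answer:"
    )
    return generate(prompt)
```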

Eight state‑of‑the‑art methods are benchmarked, four per paradigm: (i) zero‑shot prompting, (ii) fine‑tuned models (LongCite‑8B for G‑Cite, CiteBART for P‑Cite), (iii) retrieval‑augmented generation (RAG), and (iv) advanced hybrid techniques (CoT Citation for G‑Cite, Citation‑Enhanced Generation (CEG) for P‑Cite). All experiments use the same backbone LLM (LLaMA‑3.1‑8B‑Instruct) except CiteBART, which relies on the BART architecture.
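The retrieval-augmented (RAG) variants of either paradigm differ mainly in how the candidate source pool is built before citation. The toy retriever below is a sketch under that assumption; the word-overlap scorer and top-k value are not the retrieval setup used in the paper.

```python
# Rough sketch of the retrieval step shared by the RAG variants of both
# paradigms: rank a corpus against the question and keep the top-k passages,
# which then become the source pool for the g_cite / p_cite sketches above.

def retrieve(question: str, corpus: list[str], k: int = 5) -> list[str]:
    """Toy lexical retriever: rank passages by word overlap with the question."""
    q_terms = set(question.lower().split())
    scored = sorted(
        corpus,
        key=lambda passage: len(q_terms & set(passage.lower().split())),
        reverse=True,
    )
    return scored[:k]

# RAG variant of either paradigm: cite only from the retrieved pool, e.g.
#   answer = g_cite(question, retrieve(question, corpus))   # generation-time
#   answer = p_cite(question, retrieve(question, corpus))   # post-hoc
```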

Performance is measured with five standard metrics: citation precision, recall, correctness (harmonic mean of precision and recall), coverage (fraction of ground‑truth citations present), and latency (average inference time). In addition, a human evaluation with two expert annotators assesses answer correctness (whether every claim is supported by the cited source) and citation hallucination (whether any citation points to a non‑existent or out‑of‑scope source). Inter‑annotator agreement reaches κ = 0.873, indicating reliable judgments.
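As a rough illustration of how these scores relate, the sketch below computes them under a simplified exact-match convention; the paper's scoring may instead judge support at the statement level (e.g., via entailment checks), in which case recall and coverage differ.

```python
# Rough sketch of the automatic citation metrics, assuming predicted citations
# are compared to a gold set by exact identifier match (a simplification of
# whatever matching rules the paper actually uses).

def citation_metrics(predicted: set[str], gold: set[str]) -> dict[str, float]:
    overlap = len(predicted & gold)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    # Correctness: harmonic mean of precision and recall (i.e., an F1 score).
    correctness = (
        2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    )
    # Coverage: fraction of ground-truth citations present in the output.
    # Under exact matching this coincides with recall; with statement-level
    # support judgments the two quantities generally differ.
    coverage = overlap / len(gold) if gold else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "correctness": correctness,
        "coverage": coverage,
    }

# Example: two of three gold citations recovered, one spurious citation added;
# every metric comes out to roughly 0.67 here.
print(citation_metrics({"doc1", "doc2", "doc7"}, {"doc1", "doc2", "doc3"}))
```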

Key findings:

  1. Retrieval augmentation is the dominant factor for both paradigms. Moving from zero‑shot to RAG yields the largest gains across all datasets, improving G‑Cite correctness on FEVER by ~50 percentage points and boosting coverage on LongBench‑Cite by ~47 pp.
  2. P‑Cite consistently achieves higher coverage while maintaining competitive correctness. On REASONS, P‑Cite reaches 99 % coverage versus 97 % for G‑Cite; on FEVER, P‑Cite attains 74 % coverage and 75 % correctness, whereas G‑Cite shows 27 % coverage but 94 % correctness.
  3. G‑Cite excels in precision‑critical scenarios. In fact‑verification tasks (e.g., legal policy checks), G‑Cite delivers the highest precision and correctness, making it suitable when a single claim must be rigorously validated.
  4. Advanced hybrid methods allow practitioners to trade off latency against coverage and correctness. CoT Citation (G‑Cite) and CEG (P‑Cite) can be tuned to meet specific operational constraints.
  5. Human evaluation corroborates automatic metrics: P‑Cite methods produce more correct answers (78 % vs. 69 %) and fewer citation hallucinations (37 % vs. 41 %).

Based on these results, the authors recommend a retrieval‑centric, P‑Cite‑first strategy for high‑stakes applications where comprehensive evidence is essential, and a G‑Cite approach for precision‑critical tasks such as strict claim verification. They also emphasize that investment in robust retrieval infrastructure is a prerequisite for trustworthy LLM attribution, regardless of the chosen paradigm. All code, datasets, and human evaluation annotations are released to facilitate reproducibility and future research.

