Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers
Fact verification is essential for ensuring the reliability of LLM applications. In this study, we evaluate 12 pre-trained LLMs and one specialized fact-verifier, including frontier LLMs and open-weight reasoning LLMs, using a collection of examples from 14 fact-checking benchmarks. We share three findings intended to guide future development of more robust fact verifiers. First, we highlight the importance of addressing annotation errors and ambiguity in datasets, demonstrating that approximately 16% of ambiguous or incorrectly labeled data substantially influences model rankings. Neglecting this issue may result in misleading conclusions during comparative evaluations, and we suggest using a systematic pipeline utilizing LLM-as-a-judge to help identify these issues at scale. Second, we discover that frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance. We therefore recommend that future studies include comparisons with these simple yet highly effective baselines. Lastly, despite their effectiveness, frontier LLMs incur substantial costs, motivating the development of small, fine-tuned fact verifiers. We show that these small models still have room for improvement, particularly on instances that require complex reasoning. Encouragingly, we demonstrate that augmenting training with synthetic multi-hop reasoning data significantly enhances their capabilities in such instances. We release our code, model, and dataset at https://github.com/just1nseo/verifying-the-verifiers.
💡 Research Summary
The paper conducts a comprehensive evaluation of fact‑verification models by examining both the quality of benchmark data and the cost‑effectiveness of the models themselves. The authors first assemble a balanced collection of 1,749 instances drawn from 14 publicly available fact‑verification benchmarks covering a wide range of domains (news, scientific claims, legal texts, etc.). To ensure that the evaluation data are reliable, they apply a two‑stage filtering pipeline: (1) removal of unverifiable statements and trivial verbatim matches using heuristic rules and n‑gram overlap, which eliminates roughly 45% of the raw samples; (2) a novel “LLM‑as‑a‑judge” system that queries four frontier large language models (o3‑mini, GPT‑4o, Gemini 2.0 Flash, Llama 3.1 405B FP8) with zero‑shot prompts to flag potential label mismatches. The 344 flagged candidates are then examined by three specialized judges that assess the completeness, logical coherence, and faithfulness of the model‑generated rationales. Only examples that receive unanimous positive judgments are retained, dramatically reducing the human workload to about 20% of the original set.

Manual inspection of the remaining cases reveals that 6.7% of the original labels are outright wrong, while 9.1% are ambiguous (e.g., contextual, linguistic, knowledge‑level, or numerical ambiguities). The authors split the refined data into CLEAR‑FACTS (cleaned, corrected examples) and GRAY‑FACTS (the ambiguous subset) for downstream analysis.
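The two filtering stages described above (flagging label mismatches, then requiring unanimous positive verdicts from rationale judges) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `judges` and `rationale_judges` callables are hypothetical stand-ins for zero-shot API calls to the four frontier models and the three specialized judges.

```python
def flag_label_mismatches(instances, judges):
    """Stage 2a: query each judge model and flag any instance where
    at least one judge's predicted label disagrees with the gold label.

    `instances`: dicts with 'claim', 'evidence', 'label'.
    `judges`: callables mapping (claim, evidence) -> predicted label
    (hypothetical stand-ins for zero-shot LLM calls).
    """
    flagged = []
    for inst in instances:
        preds = [judge(inst["claim"], inst["evidence"]) for judge in judges]
        if any(p != inst["label"] for p in preds):
            flagged.append(inst)
    return flagged


def keep_unanimous(flagged, rationale_judges):
    """Stage 2b: keep only flagged candidates that receive unanimous
    positive verdicts from the rationale judges (completeness,
    coherence, faithfulness); the rest go back to human review."""
    return [inst for inst in flagged
            if all(judge(inst) for judge in rationale_judges)]
```

Only the unanimously flagged subset then needs manual inspection, which is how the pipeline cuts the human workload to roughly a fifth of the original set.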
Next, the paper evaluates 12 pre‑trained LLMs—including Llama 3.1 8B/70B, GPT‑4o, Claude, and the cutting‑edge “o1” model—alongside a specialized, fine‑tuned fact‑verifier called MiniCheck (7B parameters). Evaluation uses macro‑F1 to mitigate label imbalance, and for three‑way classification datasets the “contradictory” class is merged into “not attributable”. The authors explore both zero‑shot prompting and few‑shot in‑context learning (typically 4–8 examples). Across virtually all models, few‑shot prompting yields a 3–5 percentage‑point boost over zero‑shot, and the few‑shot version of o1 consistently achieves the highest macro‑F1 across the entire benchmark suite. This finding underscores that a simple few‑shot baseline, often omitted in prior work, can outperform more elaborate fine‑tuning approaches.
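The evaluation metric described above is straightforward to reproduce. The sketch below shows the label merge ("contradictory" folded into "not attributable") followed by a plain macro-F1: an unweighted mean of per-class F1 scores, so rare classes are not swamped by frequent ones. Label strings are assumptions based on the summary's wording; in practice one would use `sklearn.metrics.f1_score(..., average="macro")`.

```python
def merge_label(label):
    # Fold the three-way scheme into binary, per the paper's setup:
    # "contradictory" is treated as "not attributable".
    return "not attributable" if label == "contradictory" else label


def macro_f1(gold, pred):
    """Macro-F1 after label merging: per-class F1 averaged with equal
    weight per class, which mitigates label imbalance."""
    gold = [merge_label(g) for g in gold]
    pred = [merge_label(p) for p in pred]
    classes = sorted(set(gold) | set(pred))
    f1s = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```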
However, the superior performance of frontier LLMs comes with substantial computational and monetary costs, making them impractical for large‑scale fact‑verification pipelines (e.g., reward modeling for LLM alignment). Consequently, the authors focus on the smaller MiniCheck model. While MiniCheck is cost‑effective, it lags behind large models on tasks that require multi‑hop reasoning, such as CoverBench and HoVer, where its macro‑F1 drops below 15 points. To address this gap, the authors devise a synthetic multi‑hop data generation algorithm. Starting from existing fact‑verification pairs, they automatically insert intermediate reasoning steps and link them to create multi‑hop instances. MiniCheck is then fine‑tuned on a mixture of the original data and the synthetic multi‑hop set. Experiments show that this augmentation improves MiniCheck’s performance on the challenging multi‑hop benchmarks by an average of 7.2 percentage points, without harming its results on the other, simpler datasets.
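The augmentation idea can be illustrated with a simplified sketch: single-hop (evidence, claim, label) examples are chained into synthetic multi-hop instances whose label is attributable only if every hop is. This is a toy composition under assumed field names; the paper's actual procedure generates and links intermediate reasoning steps with an LLM rather than naively concatenating claims.

```python
import random


def compose_multihop(pairs, hops=2, seed=0):
    """Chain single-hop examples into synthetic multi-hop instances.

    Evidence passages are concatenated, claims are joined
    conjunctively, and the composed label is "attributable" only if
    every constituent hop is. A simplified stand-in for the paper's
    LLM-driven multi-hop generation.
    """
    rng = random.Random(seed)  # fixed seed for reproducible chains
    synthetic = []
    for _ in range(len(pairs) // hops):
        chain = rng.sample(pairs, hops)
        synthetic.append({
            "evidence": " ".join(p["evidence"] for p in chain),
            "claim": " and ".join(p["claim"] for p in chain),
            "label": ("attributable"
                      if all(p["label"] == "attributable" for p in chain)
                      else "not attributable"),
        })
    return synthetic
```

Fine-tuning on a mixture of the original data and such composed instances is what closes part of the multi-hop gap without degrading the simpler benchmarks.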
The paper’s contributions can be summarized as threefold: (1) a scalable, LLM‑as‑a‑judge pipeline for detecting and correcting label errors and ambiguities, which reveals that roughly 16% of benchmark instances can materially affect model rankings; (2) an empirical demonstration that few‑shot in‑context prompting is a strong, under‑reported baseline for fact verification; and (3) a practical recipe for boosting lightweight fact‑verifiers through synthetic multi‑hop reasoning data, narrowing the performance gap with expensive frontier models. The authors release code, models, and the refined CLEAR‑FACTS and GRAY‑FACTS datasets, providing a valuable resource for future research on reliable, cost‑efficient fact verification.