GLEAN: Grounded Lightweight Evaluation Anchors for Contamination-Aware Tabular Reasoning
Tabular reasoning benchmarks mix semantic inference, numerical computation, and brittle table formatting, yet evaluations for small models remain vulnerable to contamination, dataset artifacts, and retrieval failures. We propose GLEAN, a lightweight evaluation protocol that integrates contamination-aware probes, weak-supervision governance, retrieval-reasoning diagnostics, and structured error attribution under tight hardware constraints. We evaluate across TabFact, WTQ via Squall, TableBench, RobuT, and SciTab under a 16GB GPU budget. Using Squall gold SQL as an executable anchor (95.2% execution), GLEAN assigns a deterministic error taxonomy (L0-L4 plus L0.5 context miss) and reveals a stable error-mode separation: TAPEX errors skew toward grounding (L3) while TAPAS errors skew toward hallucination/abstention (L2/L0). We validate evidence-row heuristics against SQL-derived rows on simple queries (0.62 precision / 0.71 recall; hybrid recall 0.81) and show that retrieval Recall@K can saturate even when end-to-end EM/F1 remains limited, motivating attribution beyond raw recall. We release a modular framework with audits and sensitivity checks to make small-model tabular evaluation more contamination-aware and diagnostic.
💡 Research Summary
The paper introduces GLEAN, a lightweight, contamination‑aware evaluation protocol designed for tabular reasoning tasks under strict hardware constraints (single 16 GB GPU). Modern TableQA benchmarks combine natural‑language understanding, numeric computation, and fragile table formatting, but existing evaluation pipelines are vulnerable to data leakage, dataset shortcuts, and conflated retrieval‑reasoning failures—issues that become especially pronounced for small models with limited context windows. GLEAN addresses these gaps through four tightly coupled components.
First, a suite of low‑cost contamination probes (canary insertion, n‑gram overlap, entity swaps, schema renaming, and value counterfactuals) is applied to test whether performance gains stem from memorization rather than genuine reasoning. Human verification on a thousand transformed examples shows that paraphrase and schema swaps preserve label semantics, while counterfactual swaps often break them, so the latter are treated as stress tests and reported as delta‑metrics relative to the unperturbed split.
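A minimal sketch of one such probe, the n‑gram overlap check, is shown below. This is an illustrative reconstruction, not the authors' released code; the helper names and the n‑gram size are assumptions.

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in a text."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_rate(eval_examples, pretrain_index, n=8):
    """Fraction of eval examples sharing at least one n-gram with a
    pre-built set of n-grams from a suspected contamination source."""
    hits = sum(1 for ex in eval_examples if ngrams(ex, n) & pretrain_index)
    return hits / max(len(eval_examples), 1)
```

A high overlap rate flags examples whose surface text may have been memorized, which is then cross-checked against the entity-swap and counterfactual delta-metrics.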
Second, the protocol adopts weak‑supervision governance inspired by Snorkel and WRENCH. Programmatic labeling functions (implemented in pure Python/regex) are audited for coverage, conflict rate, abstention rate, and intrinsic accuracy without ever training on LF outputs, thus providing a transparent measure of label quality and dataset bias.
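The audit described above can be sketched as follows. The labeling functions here are hypothetical examples (the paper's actual LFs are not specified in this summary); only the audit statistics mirror the protocol.

```python
import re

ABSTAIN = -1

# Hypothetical LFs over (claim, table_text) pairs -- illustrative only.
def lf_negation(claim, table_text):
    return 0 if re.search(r"\b(not|never|no)\b", claim.lower()) else ABSTAIN

def lf_number_match(claim, table_text):
    nums = re.findall(r"\d+(?:\.\d+)?", claim)
    return 1 if nums and all(n in table_text for n in nums) else ABSTAIN

def audit(lfs, examples, gold):
    """Coverage, abstention, conflict rate, and per-LF accuracy on
    non-abstained votes; the LF outputs are never used for training."""
    votes = [[lf(c, t) for lf in lfs] for c, t in examples]
    n = len(examples)
    stats = {}
    for j, lf in enumerate(lfs):
        fired = [(v[j], g) for v, g in zip(votes, gold) if v[j] != ABSTAIN]
        stats[lf.__name__] = {
            "coverage": len(fired) / n,
            "abstain": 1 - len(fired) / n,
            "accuracy": sum(p == g for p, g in fired) / len(fired) if fired else None,
        }
    # Conflict: fraction of examples where two firing LFs disagree.
    conflicts = sum(1 for v in votes if len({x for x in v if x != ABSTAIN}) > 1)
    stats["conflict_rate"] = conflicts / n
    return stats
```

Reporting these four statistics per LF is what makes the governance auditable: a high-coverage, high-conflict LF signals an ambiguous labeling rule rather than a dataset property.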
Third, GLEAN explicitly disentangles retrieval from reasoning. Row‑selection is performed using a variety of sparse (TF‑IDF, BM25, BM25F), dense (BGE, E5, DPR), hybrid, and reranking strategies. Recall@K is measured for K ∈ {1,2,5,10}. If no evidence row is retrieved, the example receives an L0.5 “context‑miss” label. Retrieved rows are then fed to the downstream QA model under strict token budgets (512, 1024, or 2048 tokens). Columns are further pruned to a maximum of 16 based on question‑token overlap.
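The Recall@K measurement and the L0.5 assignment can be sketched with a minimal scorer (an illustrative reconstruction assuming row indices as identifiers, not the released code):

```python
def recall_at_k(ranked_rows, gold_rows, k):
    """1.0 if any gold evidence row appears in the top-k retrieved rows."""
    return float(bool(set(ranked_rows[:k]) & set(gold_rows)))

def diagnose_retrieval(ranked_rows, gold_rows, ks=(1, 2, 5, 10)):
    scores = {f"recall@{k}": recall_at_k(ranked_rows, gold_rows, k) for k in ks}
    # L0.5 context miss: no gold evidence row retrieved at the largest K,
    # so the downstream QA model never sees the needed context.
    scores["L0.5_context_miss"] = scores[f"recall@{max(ks)}"] == 0.0
    return scores
```

Because the L0.5 label is assigned before the QA model runs, retrieval failures are never misattributed to reasoning.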
Fourth, GLEAN anchors error attribution in executable SQL using the Squall augmentation of WikiTableQuestions, which supplies gold SQL queries and token‑level alignments. Executing the gold SQL in SQLite succeeds on 95.2 % of queries, providing a reliable oracle answer. Errors are then classified into a deterministic taxonomy: L0 (no answer), L0.5 (context miss), L1 (execution error), L2 (hallucination—answer not in table when gold is table‑grounded), L3 (grounding error—answer is a table cell but mismatches gold), and L4 (calculation/logic error when both answer and gold are non‑table values). This SQL‑anchored attribution separates grounding failures from pure computational mistakes, a distinction that heuristic string‑matching cannot guarantee.
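The deterministic taxonomy can be sketched as a cascade of checks. The priority order of the checks and the simplified inputs (a set of cell strings, boolean flags for SQL execution and context hit) are assumptions made for illustration:

```python
def classify_error(pred, gold, table_cells, exec_ok, context_hit):
    """Assign one taxonomy label per wrong prediction.
    table_cells: set of cell strings; exec_ok: gold SQL executed;
    context_hit: at least one evidence row reached the model."""
    if pred == gold:
        return "correct"
    if pred is None or pred == "":
        return "L0"    # no answer
    if not context_hit:
        return "L0.5"  # context miss
    if not exec_ok:
        return "L1"    # execution error (no reliable oracle)
    gold_in_table = gold in table_cells
    pred_in_table = pred in table_cells
    if gold_in_table and not pred_in_table:
        return "L2"    # hallucination: answer not in table, gold is
    if pred_in_table:
        return "L3"    # grounding error: wrong table cell
    return "L4"        # calculation/logic error on derived values
```

Because every branch is a set-membership or equality test against the executed gold SQL, two runs of the attribution always agree, which heuristic string matching cannot guarantee.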
The authors evaluate GLEAN across five benchmarks: TabFact (fact verification), WTQ via Squall, TableBench, RobuT (robustness perturbations), and SciTab (scientific verification). Models include TAPAS (base and large), TAPEX, DeBERTa‑v3, and an open‑weight Qwen2.5‑3B model equipped with program‑of‑thought (PoT) execution. All experiments respect the 16 GB GPU budget, with retrieval and reasoning components run on a single consumer GPU.
Key findings:
- The artifact detector (logistic regression on surface/metadata features) achieves only chance‑level performance (≈0.52 accuracy), indicating that the datasets contain minimal shortcut signals. DeBERTa‑v3 modestly outperforms it (≈0.57).
- On the robustness suite RobuT, TAPEX’s EM drops from 0.471 to 0.343 (ΔF1 = ‑0.128) and TAPAS from 0.259 to 0.182 (ΔF1 = ‑0.082), with row‑perturbations causing the largest degradation, confirming brittleness to lexical and structural changes even when the underlying tables remain unchanged.
- SciTab proves challenging: DeBERTa‑v3 reaches only 0.30 accuracy and macro‑F1 = 0.197, reflecting the difficulty of scientific table verification under lightweight settings.
- Serialization robustness (six formats: markdown, CSV, TSV, JSON, HTML, key‑value) shows near‑zero EM for Qwen2.5‑3B‑Instruct across all formats, with F1 varying by only 0.0049 — suggesting that, in this open‑weight prompting setting, format sensitivity is minor relative to the model's overall weakness on the task.
- TableQA baselines on WTQ via Squall and TableBench confirm TAPEX as the strongest fine‑tuned model (EM = 0.435, F1 = 0.467), while TAPAS‑large offers modest gains over TAPAS‑base. The PoT‑augmented Qwen2.5‑3B achieves high execution rates (96 % on a 200‑sample subset), reaching EM/F1 of 0.665/0.697 on Squall‑200 but only 0.260/0.274 on TableBench‑200.
- Error taxonomy analysis reveals distinct failure modes: TAPEX errors are dominated by L3 grounding errors (≈45 % of its mistakes), indicating that row selection and grounding are its primary bottleneck. TAPAS, conversely, exhibits a higher proportion of L2 hallucination and L0 no‑answer errors (≈30 % each), suggesting a tendency to generate answers outside the table or to abstain.
- Importantly, high Recall@K (often > 0.9) does not guarantee high end‑to‑end EM/F1; many examples retrieve the correct evidence row yet still fail due to reasoning errors, underscoring the need for attribution beyond raw recall.
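The serialization comparison from the findings above can be sketched as a single helper that renders one table several ways before prompting (an illustrative reconstruction; HTML is omitted for brevity, and the format names are assumptions):

```python
import csv
import io
import json

def serialize(header, rows, fmt):
    """Render a small table in one of several common serializations."""
    if fmt == "markdown":
        lines = ["| " + " | ".join(header) + " |",
                 "| " + " | ".join("---" for _ in header) + " |"]
        lines += ["| " + " | ".join(map(str, r)) + " |" for r in rows]
        return "\n".join(lines)
    if fmt in ("csv", "tsv"):
        buf = io.StringIO()
        w = csv.writer(buf, delimiter="," if fmt == "csv" else "\t")
        w.writerow(header)
        w.writerows(rows)
        return buf.getvalue().strip()
    if fmt == "json":
        return json.dumps([dict(zip(header, r)) for r in rows])
    if fmt == "kv":  # key-value: "col: val; col: val" per row
        return "\n".join(
            "; ".join(f"{h}: {v}" for h, v in zip(header, r)) for r in rows
        )
    raise ValueError(f"unknown format: {fmt}")
```

Holding the question and decoding parameters fixed while only `fmt` varies isolates format sensitivity, which is how a per‑format F1 spread as small as 0.0049 becomes interpretable.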
GLEAN’s modular codebase, contamination probes, and sensitivity‑analysis scripts are released publicly, enabling future researchers to replicate the diagnostics, plug in new retrieval or reasoning components, and extend the protocol to multi‑table or tool‑augmented settings. The authors argue that GLEAN fills a critical gap: a lightweight yet rigorous evaluation framework that can be applied to small models without requiring massive compute, while providing fine‑grained, executable‑grounded error attribution. Potential extensions include richer evidence‑row detection (e.g., using neural alignment), integration of uncertainty estimation for generated SQL, and adaptation to emerging tool‑use paradigms in LLM‑driven table reasoning.