Table-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

As Large Language Models (LLMs) evolve into autonomous agents, the capability to faithfully translate natural language into rigorous structured formats (essential for tool invocation) and to convert complex tabular information into machine-readable specifications has become paramount. However, current evaluations lack effective methodologies for measuring this structural fidelity without costly human intervention, because traditional text metrics fail to detect semantic drift in code-like outputs. This paper proposes Table-BiEval, a human-free, self-supervised evaluation framework for quantitatively assessing LLM performance. By leveraging deterministic Intermediate Representations, the framework calculates Content Semantic Accuracy and Normalized Tree Edit Distance to decouple structure from content. Using this framework, the authors empirically evaluate 15 state-of-the-art LLMs across two topological dimensions: hierarchical structures and flat tables. The results reveal substantial variability, showing that mid-sized models can surprisingly outperform larger counterparts in structural efficiency and confirming that deep recursive nesting remains a universal bottleneck for current architectures.


💡 Research Summary

Table‑BiEval introduces a fully self‑supervised, dual‑track evaluation framework designed to quantify the ability of large language models (LLMs) to translate natural language into rigorous structured formats and tabular specifications without any human annotation. The authors identify a critical gap in current evaluation practices: traditional text‑based metrics such as BLEU, ROUGE, or BERTScore are ill‑suited for code‑like outputs because they cannot detect semantic drift or structural errors that are pivotal when LLMs generate JSON, XML, CSV, HTML, Markdown, LaTeX, or flattened JSON/XML lists for tool invocation or data pipelines.

The core of Table‑BiEval is the use of deterministic Intermediate Representations (IRs). A reference structured sample is first parsed into an "original IR" (IRₒ). From it, an automatically generated natural‑language description simulates user intent; this description is fed to the target LLM, which produces a structured output that is in turn parsed into a "generated IR" (IR_g). By aligning IRₒ and IR_g, the framework computes two complementary metrics:

  1. Content Semantic Accuracy (CSA) – measures the proportion of meaning units (key‑value pairs, cell contents, etc.) that exactly match between IRₒ and IR_g. Unlike token‑level overlap scores, CSA directly captures semantic fidelity, making it sensitive to subtle meaning shifts in code‑like generations.

  2. Normalized Tree Edit Distance (NTED) – quantifies structural deviation by calculating the minimum edit cost (insert, delete, substitute) required to transform the tree of IR_g into that of IRₒ, then normalizing by tree size. NTED reflects how parsimoniously a model preserves hierarchical relationships, especially depth and recursive dependencies.
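To make the two metrics concrete, the sketch below treats an IR as nested Python dicts and lists. It is an illustration under assumptions, not the paper's reference implementation: CSA is computed over exact (path, value) matches, and the edit cost is a simplified node-level variant rather than the full Zhang-Shasha tree edit distance; all function names here are invented for the example.

```python
def flatten(ir, path=()):
    """Flatten a nested dict/list IR into (path, value) meaning units."""
    if isinstance(ir, dict):
        units = {}
        for k, v in ir.items():
            units.update(flatten(v, path + (k,)))
        return units
    if isinstance(ir, list):
        units = {}
        for i, v in enumerate(ir):
            units.update(flatten(v, path + (i,)))
        return units
    return {path: ir}

def csa(ir_o, ir_g):
    """Content Semantic Accuracy: fraction of reference meaning units
    reproduced exactly (same path, same value) in the generated IR."""
    ref, gen = flatten(ir_o), flatten(ir_g)
    if not ref:
        return 1.0
    return sum(1 for p, v in ref.items() if gen.get(p) == v) / len(ref)

def tree_size(ir):
    """Number of nodes in the IR tree (containers count as one node each)."""
    if isinstance(ir, dict):
        return 1 + sum(tree_size(v) for v in ir.values())
    if isinstance(ir, list):
        return 1 + sum(tree_size(v) for v in ir)
    return 1

def edit_cost(a, b):
    """Simplified node-level edit cost (not full Zhang-Shasha): children
    are paired by key/position, and unmatched subtrees are charged one
    unit per node (insert/delete); differing leaves count as substitutions."""
    if isinstance(a, dict) and isinstance(b, dict):
        cost = 0
        for k in set(a) | set(b):
            if k in a and k in b:
                cost += edit_cost(a[k], b[k])
            else:
                cost += tree_size(a.get(k, b.get(k)))
        return cost
    if isinstance(a, list) and isinstance(b, list):
        cost = sum(edit_cost(x, y) for x, y in zip(a, b))
        longer = a if len(a) > len(b) else b
        return cost + sum(tree_size(extra) for extra in longer[min(len(a), len(b)):])
    if a == b:
        return 0
    return max(tree_size(a), tree_size(b))  # substitute the whole subtree

def nted(ir_o, ir_g):
    """Normalized Tree Edit Distance: edit cost over the larger tree's size."""
    return edit_cost(ir_o, ir_g) / max(tree_size(ir_o), tree_size(ir_g), 1)
```

Note how the two scores diverge by design: changing one leaf value lowers CSA but barely moves NTED, while reshuffling the hierarchy inflates NTED even when every value survives.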

Table‑BiEval splits evaluation into two tracks:

  • Structure Eval focuses on hierarchical data (JSON, XML). It stresses deep nesting, recursive references, and long‑range logical dependencies.

  • Table Eval covers six flat‑or‑semi‑structured formats—CSV, HTML, Markdown, LaTeX, flattened JSON, and flattened XML lists—testing row/column alignment, cell merging, multi‑header hierarchies, and spatial consistency.
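To illustrate how a flat format on the Table Eval track can be reduced to a comparable IR, the toy parser below turns a Markdown pipe table into a list of row dicts keyed by header. It is a minimal sketch assuming well-formed input; the framework's actual format-specific parsers would additionally need to handle escapes, merged cells, and multi-row headers, which this example does not.

```python
def parse_markdown_table(text):
    """Parse a well-formed Markdown pipe table into a flat IR:
    one dict per body row, keyed by the header cells."""
    rows = [line.strip() for line in text.strip().splitlines()]
    cells = [[c.strip() for c in row.strip('|').split('|')] for row in rows]
    header = cells[0]
    body = cells[2:]  # skip the |---|---| separator row
    return [dict(zip(header, row)) for row in body]
```

Once CSV, HTML, LaTeX, and the other formats are normalized into the same row-dict shape, a single pair of metrics can score all six without format-specific comparison logic.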

The evaluation pipeline consists of five fully automated stages: (1) raw LLM output → (2) IR parsing, (3) prompt generation → (4) target LLM generation → (5) re‑parsing and metric computation. Because all steps are deterministic and require no external judges, the framework scales to large batches at negligible cost.
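The five stages above can be sketched as a single function for one sample. The `describe`, `generate`, and `score` callables are hypothetical stand-ins for the prompt builder, the target LLM, and the CSA/NTED metrics; JSON parsing stands in for the format-specific IR parsers.

```python
import json

def evaluate_sample(raw_output, describe, generate, score):
    """Run the five automated stages for one sample (a sketch; every
    callable name here is an assumption, not the paper's API)."""
    ir_o = json.loads(raw_output)      # (1)->(2) parse raw output into IR_o
    prompt = describe(ir_o)            # (3) generate a description of user intent
    structured = generate(prompt)      # (4) target LLM produces structured output
    ir_g = json.loads(structured)      # (5) re-parse into IR_g
    return score(ir_o, ir_g)           # deterministic metric computation
```

Because every stage is a deterministic function of its input (given a fixed decoding strategy for the target LLM), batches of samples can be scored in parallel with no human in the loop.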

Empirically, the authors benchmark 15 state‑of‑the‑art LLMs—including GPT‑4, Claude‑2, LLaMA‑2‑70B, Falcon‑40B, and several open‑source models—across eight tasks per track (e.g., nested JSON creation, complex multi‑header tables, hierarchical configuration files). Key findings:

  • Size does not guarantee structural efficiency. Mid‑scale models (e.g., LLaMA‑2‑13B) often achieve lower NTED and higher CSA than larger counterparts, indicating that parameter count alone does not predict the ability to maintain clean hierarchical structures.

  • Recursive depth is a universal bottleneck. As nesting depth exceeds four levels, NTED spikes dramatically for all models and CSA drops, revealing a fundamental limitation of current transformer architectures in preserving deep logical chains.

  • Table complexity exposes a structure‑content trade‑off. Simple row‑column alignment tasks yield high CSA (>90%) and low NTED, but tasks requiring merged cells, spanning headers, or heterogeneous column types see CSA fall below 60%, while NTED rises, highlighting that richer tabular schemas strain both semantic and structural capacities.

  • Self‑supervision yields massive cost savings. Compared with human‑in‑the‑loop evaluations, Table‑BiEval reduces annotation expense by over 95% while delivering fine‑grained, reproducible scores that can be fed back into model fine‑tuning loops.

The paper situates Table‑BiEval against prior work: earlier Text‑to‑Table efforts (e.g., Wu et al., 2022) limited themselves to flat grids and relied on rule‑based parsers; code‑generation evaluation often uses AST similarity or execution‑based correctness, which are not directly applicable to non‑code tabular data. By unifying tree‑edit distance with semantic matching across both hierarchical and tabular domains, Table‑BiEval offers a universal, post‑hoc evaluation that does not depend on external execution environments.

Limitations are acknowledged. IR parsing requires format‑specific parsers, so extending to novel or highly irregular formats incurs engineering effort. The framework currently cannot handle multimodal inputs such as image‑based tables or handwritten scans. NTED, while capturing structural deviation, does not weight errors by downstream business impact; a minor misplaced node may be more costly than a large subtree error in certain applications.

Future directions include (1) integrating multimodal OCR pipelines to evaluate image‑derived tables, (2) developing adaptive weighting schemes that combine CSA and NTED into a single “structural fidelity score” tailored to specific use‑cases, and (3) exploring architectural modifications (e.g., hierarchical attention, recurrence‑enhanced transformers) that directly address the observed depth bottleneck.

In summary, Table‑BiEval provides a rigorous, scalable, and human‑free methodology for assessing the structural fidelity of LLM‑generated data. Its dual‑track design, deterministic IR foundation, and novel metrics enable researchers and practitioners to pinpoint weaknesses in hierarchical reasoning and tabular reasoning, guide model selection, and drive targeted improvements for LLMs destined for autonomous agent roles, tool‑calling pipelines, and enterprise data automation.

