DataCross: A Unified Benchmark and Agent Framework for Cross-Modal Heterogeneous Data Analysis
In real-world data science and enterprise decision-making, critical information is often fragmented across directly queryable structured sources (e.g., SQL, CSV) and “zombie data” locked in unstructured visual documents (e.g., scanned reports, invoice images). Existing data-analytics agents are predominantly limited to structured data and fail to activate or correlate this high-value visual information, leaving a significant gap relative to industrial needs. To bridge this gap, we introduce DataCross, a novel benchmark and collaborative agent framework for unified, insight-driven analysis across heterogeneous data modalities. DataCrossBench comprises 200 end-to-end analysis tasks spanning finance, healthcare, and other domains. It is constructed via a human-in-the-loop reverse-synthesis pipeline that ensures realistic complexity, cross-source dependency, and verifiable ground truth. The benchmark categorizes tasks into three difficulty tiers to evaluate agents’ capabilities in visual table extraction, cross-modal alignment, and multi-step joint reasoning. We also propose the DataCrossAgent framework, inspired by the “divide-and-conquer” workflow of human analysts. It employs specialized sub-agents, each an expert on a specific data source, coordinated through a structured workflow of Intra-source Deep Exploration, Key Source Identification, and Contextual Cross-pollination. A novel reReAct mechanism enables robust code generation and debugging for factual verification. Experimental results show that DataCrossAgent achieves a 29.7% improvement in factuality over GPT-4o and exhibits superior robustness on high-difficulty tasks, effectively activating fragmented “zombie data” for insightful, cross-modal analysis.
💡 Research Summary
The paper addresses a critical gap in enterprise data analytics: the prevalence of “zombie data,” i.e., valuable information locked in scanned reports, invoices, screenshots, and other visual documents that is inaccessible to conventional data-analysis agents, which operate only on structured sources such as SQL tables, CSV files, or JSON objects. Existing benchmarks and agents focus on single-modality tasks (text-to-SQL, table QA, or multimodal retrieval) and therefore cannot evaluate the end-to-end capability required to extract, align, and reason over heterogeneous data sources.
To fill this void, the authors introduce two contributions: (1) DataCrossBench, a new benchmark consisting of 200 carefully crafted end‑to‑end analysis tasks spanning finance, healthcare, information technology, and retail. Each task requires gathering evidence from multiple structured files and, for a substantial subset, from images containing tables or charts. The benchmark is built through a human‑in‑the‑loop reverse‑synthesis pipeline: domain experts first define realistic analytical goals and target insights; an LLM then programmatically generates the underlying structured artifacts (SQL schemas, CSV tables, JSON logs) that support those insights; finally, visual documents are synthesized and validated. Strict constraints on file count, record volume (>3,000 rows per file), and cross‑file foreign‑key relationships ensure ecological validity. A two‑stage quality control process—automated sanity checks followed by double‑blind expert verification—guarantees that each task truly requires multi‑source reasoning and that ground‑truth answers are mathematically verifiable.
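The paper does not publish validation code, but the stated constraints (more than 3,000 rows per file and resolvable cross-file foreign-key relationships) could be sketched as simple sanity checks; the function names below are illustrative, not from the paper:

```python
def check_row_count(rows, min_rows=3000):
    """Constraint 1 (hypothetical check): each generated file must
    exceed the minimum record volume stated in the benchmark spec."""
    return len(rows) > min_rows

def check_foreign_keys(child_rows, fk_col, parent_keys):
    """Constraint 2 (hypothetical check): every foreign-key value in a
    child file must resolve to a key present in the referenced file,
    so that cross-file joins required by the tasks actually succeed."""
    return all(row[fk_col] in parent_keys for row in child_rows)
```

Checks like these would belong to the automated-sanity-check stage, before the double-blind expert verification.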
(2) DataCrossAgent, an agentic framework inspired by the “divide‑and‑conquer” workflow of human analysts. Rather than a monolithic LLM attempting to process all inputs, DataCrossAgent orchestrates a set of specialized sub‑agents, each expert in a particular modality: SQL execution, CSV profiling, JSON parsing, OCR‑based table extraction, and textual report analysis. The workflow proceeds in three stages:
- Intra‑source Deep Exploration – each sub‑agent exhaustively profiles its assigned source, extracting metadata, schema, and candidate evidence.
- Key Source Identification – a central planner LLM scores the relevance of each source to the analytical goal, selecting the most informative ones.
- Contextual Cross‑pollination – the selected key sources are used as anchors; other sub‑agents reinterpret their data in the context of these anchors, performing entity normalization, unit conversion, and schema alignment across modalities.
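The three stages above can be sketched as a minimal coordination loop. This is an illustrative reconstruction, not the authors' implementation: `SubAgent`, `identify_key_sources`, and `cross_pollinate` are assumed names, and the LLM planner's relevance scoring is stood in for by simple field overlap:

```python
class SubAgent:
    """One sub-agent per data source; Stage 1 profiles its assigned data."""
    def __init__(self, source_name, records):
        self.source_name = source_name
        self.records = records

    def explore(self):
        # Intra-source Deep Exploration: extract schema-level metadata
        # and basic statistics from the assigned source.
        fields = sorted(self.records[0]) if self.records else []
        return {"source": self.source_name, "fields": fields,
                "n_records": len(self.records)}

def identify_key_sources(profiles, goal_fields, top_k=1):
    # Key Source Identification: rank sources by relevance to the goal.
    # (The real system uses a central planner LLM; set overlap between
    # profiled fields and goal fields stands in for its scoring here.)
    def score(profile):
        return len(set(profile["fields"]) & set(goal_fields))
    return sorted(profiles, key=score, reverse=True)[:top_k]

def cross_pollinate(anchor_profile, other_profile):
    # Contextual Cross-pollination: reinterpret another source against
    # the anchor; here, schema alignment via shared join-key candidates.
    return sorted(set(anchor_profile["fields"]) & set(other_profile["fields"]))
```

In the full framework, the cross-pollination stage would also perform entity normalization and unit conversion, which require the planner LLM rather than pure set operations.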
A novel recursive Reasoning‑Act (reReAct) mechanism underpins the entire process. Code (Python or SQL) generated to answer the query is executed; any runtime errors trigger an automatic debugging loop where the LLM receives error traces, revises the code, and re‑executes until a successful result is obtained. This iterative cycle grounds the final narrative in concrete execution results, dramatically improving factuality.
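The execute-and-debug cycle can be sketched as follows; `re_react` and the `generate_code` callback are assumed names (in the real system the callback is an LLM call that conditions on the error trace):

```python
import traceback

def re_react(generate_code, max_iters=5):
    """Hypothetical sketch of the reReAct loop: generate analysis code,
    execute it, and on failure feed the full error trace back so the
    next generation can revise the code."""
    feedback = None
    for _ in range(max_iters):
        code = generate_code(feedback)   # LLM call in the real system
        namespace = {}
        try:
            exec(code, namespace)        # run the generated analysis code
            return namespace.get("result")  # grounded execution result
        except Exception:
            feedback = traceback.format_exc()  # trace guides the revision
    raise RuntimeError("reReAct: no successful execution after max_iters")
```

Returning only values produced by actual execution, rather than model-asserted numbers, is what grounds the final narrative in verifiable results.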
Experimental evaluation compares DataCrossAgent against the state-of-the-art GPT-4o on the full benchmark. Results show a 29.7% absolute improvement in factuality (the weighted component of the overall score). On “hard”-tier tasks involving visual table extraction, DataCrossAgent achieves 93% extraction accuracy and an 88% success rate in aligning extracted tables with structured records. Moreover, the multi-agent architecture yields a 2.1× speedup over a single-agent baseline and reduces the number of debugging iterations by 45%. Scores improve along all four evaluation dimensions (Factuality, Completeness, Logic, Insightfulness), with the most pronounced gains in factuality and insightfulness, confirming that the system can not only retrieve correct numbers but also generate higher-level, business-relevant conclusions.
The paper’s significance lies in (a) defining a realistic, reproducible benchmark that explicitly requires cross‑modal evidence integration, and (b) providing a concrete, modular agent design that can be extended to additional modalities or domain‑specific tools. Limitations include reliance on the quality of OCR/table extraction models for low‑resolution or heavily corrupted scans, and the presence of LLM‑based evaluation components that may introduce bias. Future work is suggested on robust visual preprocessing, expanding domain coverage, and tighter human‑in‑the‑loop evaluation loops.
In summary, DataCrossBench and DataCrossAgent together constitute the first comprehensive solution for unified heterogeneous data analysis, demonstrating that coordinated multi‑agent reasoning with recursive code verification can effectively “activate” zombie data and deliver factually grounded, insightful analytics in complex, real‑world settings.